Author

Upa Gupta

ORCID Identifier(s)

0000-0002-5383-2076

Graduation Semester and Year

2016

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Leonidas Fegaras

Abstract

Distributed frameworks, such as MapReduce and Spark, have been developed by industry and research groups to analyze the vast amount of data that is being generated on a daily basis. Many graphs of interest, such as the Web graph and Social Networks, increase their size daily at an unprecedented scale and rate. To cope with this vast amount of data, researchers have been using distributed processing frameworks to analyze these graphs extensively. Most of these graph algorithms are iterative in nature, requiring repetitive distributed jobs. This dissertation presents a new design pattern for a family of iterative graph algorithms for the distributed framework. Our method is to separate the immutable graph topology from the graph analysis results. Each compute node participating in the graph analysis task reads the same graph partition at each iteration step, which is made local to the node, but it also reads all the current analysis results from the distributed file system (DFS). These results are correlated with the local graph partition using a merge-join and the new improved analysis results associated with only the nodes in the graph partition are generated and dumped to the DFS. Our algorithm requires one job for pre-processing the graph and the repetition of one map-based job for the actual analysis. Unfortunately, in most of these iterative algorithms, such as for Page-Rank, if the graph is modified with the addition or deletion of edges or vertices, the Page-Rank has to be recomputed from scratch. We improved our previous design approach and to handle continuous updates, an update function collects the changes to the graph and applies them to the graph partitions in a streaming fashion. Once the changes are made, the iterative algorithm is resumed to process the new updated data. Since a large part of the graph analysis task has already been completed on the existing data, the new updates require fewer iterations to compute the new graph analysis results as the iterative algorithm will converge faster.

Keywords

Distributed computing, Apache Hadoop, Apache Spark, Big graph analysis, Incremental graph, PageRank

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington

Share

COinS