Graduation Semester and Year
2016
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Leonidas Fegaras
Abstract
Distributed frameworks, such as MapReduce and Spark, have been developed by industry and research groups to analyze the vast amount of data that is being generated on a daily basis. Many graphs of interest, such as the Web graph and Social Networks, increase their size daily at an unprecedented scale and rate. To cope with this vast amount of data, researchers have been using distributed processing frameworks to analyze these graphs extensively. Most of these graph algorithms are iterative in nature, requiring repetitive distributed jobs. This dissertation presents a new design pattern for a family of iterative graph algorithms for the distributed framework. Our method is to separate the immutable graph topology from the graph analysis results. Each compute node participating in the graph analysis task reads the same graph partition at each iteration step, which is made local to the node, but it also reads all the current analysis results from the distributed file system (DFS). These results are correlated with the local graph partition using a merge-join and the new improved analysis results associated with only the nodes in the graph partition are generated and dumped to the DFS. Our algorithm requires one job for pre-processing the graph and the repetition of one map-based job for the actual analysis. Unfortunately, in most of these iterative algorithms, such as for Page-Rank, if the graph is modified with the addition or deletion of edges or vertices, the Page-Rank has to be recomputed from scratch. We improved our previous design approach and to handle continuous updates, an update function collects the changes to the graph and applies them to the graph partitions in a streaming fashion. Once the changes are made, the iterative algorithm is resumed to process the new updated data. Since a large part of the graph analysis task has already been completed on the existing data, the new updates require fewer iterations to compute the new graph analysis results as the iterative algorithm will converge faster.
Keywords
Distributed computing, Apache Hadoop, Apache Spark, Big graph analysis, Incremental graph, PageRank
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.
Recommended Citation
Gupta, Upa, "Distributed Data Intensive Computing on Graph" (2016). Computer Science and Engineering Dissertations. 267.
https://mavmatrix.uta.edu/cse_dissertations/267
Comments
Degree granted by The University of Texas at Arlington