Author

Kiran Bolaj

ORCID Identifier(s)

0009-0008-1383-9338

Graduation Semester and Year

2023

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

Sharma Chakravarthy

Abstract

Graph mining analyzes the real-world graphs for finding core substructures in chemical compounds (e.g., Benzene), identify the structure that occurs frequently in a given graph or forest. These identified structures are important as they reveal an inherent feature or property in the given graph or forest. Substructures represent interesting and repeating patterns found within an application, offering insights into hidden regularities. Therefore, the process of finding these interesting and frequent patterns in an unsupervised manner is known as substructure discovery. SUBDUE was the first main-memory algorithm developed for substructure discovery. Since then, for scalability, the algorithm has been extended to Database approaches and more recently to Map/Reduce framework to exploit distributed processing. Graph sizes have been increasing due to the advent of Internet and Social networks applications. To model complex data sets (sets with multiple types of entities and relationships), Multilayer networks, or MLNs, have been Proposed. MLNs have also been shown to be superior as compared to simple and attribute graphs for modeling complex data. MLNs are classified into three different types: Homogeneous (HoMLNs), Heterogeneous (HeMLNs), and Hybrid (HyMLNs). Homogeneous MLNS have same type of entities in each layer but differ in connectivity as it represents a unique relationship. Heterogeneous MLNs have different types of entities in each layer and also differ in connectivity and have explicit interlayer edges that connects nodes between the layers. Hybrid MLNs are the combination of Homogeneous and Heterogeneous MLNs. Algorithms have been developed for substructure discovery in HoMLNs, which differ from earlier approaches of aggregating MLN layers to a single graph and using the single graph algorithms. Drawing inspiration from the earlier work of finding substructures in Homogeneous Multilayer Networks, this thesis focuses on substructure discovery in Heterogeneous Multilayer Networks without aggregating the layers into a single graph. We use the decoupling-based approach by processing each layer separately/independently which can be done in parallel, if necessary, to find the substructures and then composing independently generated layer substructures after each iteration with interlayer edges to get the substructures across HeMLN for that iteration. The algorithm is implemented using the Map/Reduce paradigm. The main focus of this dissertation is the correctness of the algorithm by verifying with ground truth for which we use the SUBDUE algorithm. Another focus is the scalability of the approach to handle arbitrary MLN sizes. For this, we perform extensive experimental analysis on large synthetic data sets (generated by Subgen) and real-world datasets to analyze the speedup over diverse graph characteristics.

Keywords

Substructure discovery, Heterogeneous multilayer networks, Hadoop, Map/reduce, Canonical substructure, Canonical instance, Beam, Minimum description length (MDL)

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington

Share

COinS