Graduation Semester and Year
2023
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
Upendranath Chakravarthy
Abstract
Substructure discovery is a process in data analysis and data mining that involves identifying and extracting meaningful patterns, structures, or components within a larger dataset. These substructures can be of various types, such as frequent patterns, motifs, or any other relevant features within the data. The growth of the internet and the proliferation of mobile devices have led to the generation of enormous amounts of data. Companies like Facebook and Twitter can generate large datasets from user interactions on their websites, such as connections between users and user generated content. Moreover, advances in processing power and storage capacity have made it possible to collect and store these large amounts of data, known as big data. The present research in big data focuses on developing new techniques and technologies for modeling, processing, and analyzing large and complex datasets. MLNs (Multilayer Networks) are more effective at modeling complex, diverse data, as compared to traditional data modeling techniques such as transactions or graphs (simple or attributed). MLNs consist of multiple layers, each layer representing a different feature of the dataset. For example, in a social network, each layer can represent a different type of connection between users, such as friendship, work, or family. We can also model the same set of users across different social networks, with each network as its own layer making the data easier to understand and comprehend. Substructure discovery in Multilayer Networks typically requires aggregation and integration of information from multiple layers, as substructures can extend beyond individual layers. Existing algorithms designed for single and attributed graphs and do not work natively on MLNs. As MLNs are a relatively new representation model, there is a need to develop algorithms that are specifically designed for the analysis of MLNs or to develop frameworks that can apply existing algorithms to MLNs while preserving the structure and semantics of the data. Currently available algorithms for graph mining are for single graphs or forests. They are memory based, disk based or use a database system approach. These algorithms have been extended to work on the MapReduce framework to improve scalability and have the capacity to process increasingly large datasets. But most of these need to aggregate the multilayer network to a single graph using projection or other types of aggregation (e.g., union). This thesis presents two algorithms for computing substructures in homogeneous MLNs (where each layer has the same set of nodes but different connectivity) without layer aggregation, utilizing the decoupling approach. The first approach involves applying composition \textit{after each iteration} of the substructure discovery algorithm. In this method, composition is carried out frequently, with the aim of attaining the same accuracy as ground truth. In the alternative approach, composition is only applied once at the end of the substructure discovery process. This approach is more computationally efficient, as it reduces the overhead associated with frequent composition. This approach is a trade-off between accuracy and efficiency. One of the goals for this dual approach is to analyze and understand this trade-off. In both approaches, the iterative substructure discovery process remains the same. We begin with individual layers and identify substructures within each layer. Subsequently, the generated results from each layer are used to compose substructures that span across multiple layers. We use a decoupling approach (divide and conquer paradigm) for MLN substructure discovery to accommodate layers of arbitrary size. We employ the MapReduce framework to leverage distributed data processing for substructure discovery. This allows us to achieve better scalability and response time. The decoupling approach can be applied to handle multiple layers. Mining substructure in this way posed several challenges which we had to address: i) finding layer-wise substructures, ii) composing the layer-wise results to find substructures across layers, iii) maximizing performance by exploiting parallelism as much as possible, and iv) verifying the correctness of results. For verification, the proposed algorithms were applied to synthetic (Subgen) and real-world datasets like Amazon and DBLP.
Keywords
Graphs, Data mining, Multilayer networks, Distributed computing
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.
Recommended Citation
Singh, Arshdeep, "HoMLN-SD: Substructure Discovery In Homogeneous Multilayer Networks" (2023). Computer Science and Engineering Theses. 522.
https://mavmatrix.uta.edu/cse_theses/522
Comments
Degree granted by The University of Texas at Arlington