ORCID Identifier(s)

0000-0002-8103-0310

Graduation Semester and Year

2020

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

Sharma Chakravarthy

Abstract

The utility and widespread use of Relational Database Management Systems(RDBMSs) comes not only from its simple, easy-to-understand data model (a relation or a set) but mainly from the ability to write non-procedural queries and their optimization by the system. Queries produce exact answers that match the contents of the database. Query processing of RDBMSs has been researched for more than 4 decades and includes extensions to more complex analysis on data warehouses. In contrast, search has not been addressed by RDBMSs. As the use of other other data types (key-value store, column-store, and graphs to name a few) are becoming popular for modeling to match the data set characteristics, query processing and optimization are becoming important again. The approaches used in RDBMSs, such as cost-based, I/O focused may not be applicable in the same way to new models and queries. Hence, new approaches need to be developed that are suited for the data model used and the expressiveness of the queries to be supported. This thesis addresses query processing of large graphs (or forest) and develops algorithms for query processing as well as develops heuristics for improving the response time using graph characteristics. Although search (unlike RDBMS) has received a lot of attention for graphs, query processing, in contrast, has received very little attention. With the advent of large social networks and other large graphs(e.g., freebase, knowledge and entity graphs), querying to understand the data set and retrieve relevant/exact information becoming critical. This thesis builds on the previous work at the Information Technology laboratory at UTA (IT Lab) to scale query processing to arbitrary-size graphs (or forests) and to exploit parallelism as much as possible. Partitioning (a form of divide and conquer) and Map/Reduce (for parallel processing) are used as basic ingredients for scalability. Partitioning a graph for query processing and computing all answers poses a number of challenges: i) partitioning schemes, ii) scheduling or choosing which partition or partitions to schedule for processing, iii) developing heuristics for reducing the total response time exploiting query and graph characteristics, and iv) importantly,correctness of results. This thesis address all of the above challenges using the map/reduce framework.The choice of map/reduce framework allows us to make partitions based on available resources and optimize parallelism based on the number of partitions to schedule at a time. We use a partitioning strategy that has been shown to be good for substructure discovery. We develop a number of heuristics that are based on query and graph characteristics. The query itself is expressed as a graph without having to cast in some other language. Relational comparison operators, Boolean operators, wild cards, and union queries are supported. There is no restriction on node and edge labels, and uniquely labeled multiple edges are supported. Extensive experimental analysis of the approach (partitioning sizes, algorithm, and heuristics) using large data sets (real-world and synthetic) are shown for speedup, scalability, and efficacy of the heuristics proposed.

Keywords

Graph Query Processing, Graph mining, MapReduce, Big data

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington

29126-2.zip (5290 kB)

Share

COinS
 
 

To view the content in your browser, please download Adobe Reader or, alternately,
you may Download the file to your hard drive.

NOTE: The latest versions of Adobe Reader do not support viewing PDF files within Firefox on Mac OS and if you are using a modern (Intel) Mac, there is no official plugin for viewing PDF files within the browser window.