Graduation Semester and Year

2021

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Leonidas Fegaras

Abstract

Most programs written to operate on data are usually expressed in terms of array operations in sequential loops. However, these programs do not scale to large amount of data generated by scientific experiments and industrial and commercial markets. Given the success of machine learning algorithms on large amount of data and the recent shift of industries to data-driven decision making, the data scientists who are not familiar with Big Data frameworks have to rewrite the sequential programs to distributed data-parallel programs by hand. We present a novel framework, called SQLgen, that automatically translates sequential loops to distributed data-parallel programs. SQLgen translates array-based loops to Spark SQL programs. At first, it translates the input programs to monoid comprehensions, which is the formal framework of our translation model. Then, it translates the comprehensions to Spark SQL programs. We chose Spark SQL because, unlike the Spark Core API programs, Spark SQL programs are optimized by the query optimizer Catalyst. We compare the performance of our generated programs with that of related work on real-world problems and show significant performance gain (up to 78x) while keeping performance close to hand-written Spark SQL programs. Linear algebra operations, such as matrix multiplication, play a significant role in many machine learning algorithms. The performance of these operations is highly dependent on the storage implementations of these matrices. For example, in a distributed system, computations on block matrices are significantly faster than the computations in the coordinate format in terms of computation and communication cost. Hence, instead of coordinate arrays, these operations can be implemented in block matrices, which are distributed collections of non-overlapping dense/sparse ar- rays. Moreover, many graph algorithms can be expressed as repetitive computations that resemble matrix multiplication in which the addition and multiplication operations have been replaced with generalized operations that form an algebraic structure known as a semiring. Similar to matrix multiplication, these graph algorithms can be implemented in a distributed system using block arrays. We present a novel framework OSQLgen that automatically parallelizes array-based loop programs containing linear algebra operations and graph programs that are equivalent to a semiring to distributed data-parallel programs on block arrays. We compare the performance of OSQLgen with GraphX, GraphFrames, MLlib, and hand-written Spark SQL programs on coordinate and block arrays on various real-world problems. On certain problems, OSQLgen is up to 36x faster than GraphX, 25x faster than GraphFrames, 3x faster than MLlib, and 99x faster than hand-written Spark SQL programs on coordinate arrays, giving performance close to that of hand-written Spark SQL programs on block arrays.

Keywords

Big data, Arrays, SQL, Graph processing, Machine learning, Spark

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington

Share

COinS