ORCID Identifier(s)

0000-0002-0092-9886

Graduation Semester and Year

2020

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Gautam Das

Abstract

The exponential rise in data, along with its heterogeneity and complexity, helped individuals, businesses, hospitals, enterprises and even governments to thrive on data-driven decision making. However, as the size and complexity of the data surged, challenges in searching for similar objects within databases (proximity search) compounded. As is well known, proximity search is the key and successful method used in Information Retrieval (IR) in vast databases including, Genomics Databases, Image Databases, Video Databases, Text databases, etc. The objective is to retrieve contents from the database, similar to a given object in the database. This dissertation revisits a suite of popular and fundamental proximity problems (such as, KNN, clustering, minimum spanning tree) that repeatedly perform distance computations (sometimes by making calls to a third party distance oracle, such as Google Maps or Bing Maps) to compare the distance between a pair of objects during their execution. The chief effort here is to design principled solutions to minimize distance computations for such problems in general metric spaces, especially for the scenarios where calling an expensive oracle to resolve unknown distances are the dominant cost of the algorithms for these problems. The fundamental understanding of the structure of the proximity problem solutions reveals the underlying dependency on repetitive conditional statements predicated on the distance between pairs of objects. Thus, Direct Feasibility Test is designed to study how distance comparisons between two different pairs of objects could be modelled as a system of linear inequalities that assists in saving distance computations. Direct Feasibility Test offers the maximal savings for proximity problem solutions and to the best of the knowledge is the first attempt to provide a theoretical understanding of the problem. Furthermore, the study also offers an alternative formalism with the goal of computing distance bounds. This work also develops a suite of graph-based algorithms that trade-off between running time and tightness of the produced bounds, whilst producing identical and exact solutions to these problems. A comparison of these designed solutions conceptually and empirically concerning a broad range of existing works is also presented in this work. The work also presents a comprehensive set of experimental results using multiple large scale real-world datasets and a suite of popular proximity algorithms to demonstrate the effectiveness of the proposed approach. To the best of the knowledge, the work is one of the first that takes a systematic stride to minimize repeated distance computation costs inside proximity problems. Finally, a deep understanding of the challenges in large scale signal reconstruction is also addressed in this dissertation as a practical solution to large scale signal reconstruction problem.

Keywords

Proximity, Metric, Database

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington

Share

COinS