ORCID Identifier(s)

0000-0003-1954-8431

Graduation Semester and Year

2018

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Chengkai Li

Abstract

Many real-world applications analyze data to find objects that "stand out" with regard to various contexts and ways of valuing the objects. Examples of such applications include vendors recommending products to potential customers, social networks improving content selection for users, and Google Scholar notifying users of newly published articles that match their profiles. Journalists, likewise, identify conditions that substantiate the significance of an event or the interestingness of an object; such events can be retrieved from stock data, weather data, and criminal records, and beyond journalists they convey significant information to financial analysts, scientists, and citizens.

These application needs can be modeled as contextual and reverse Pareto-optimality queries. Given a set of objects and a way of valuing them, a Pareto-optimality query returns the objects that are not worse than any other object in the set; these Pareto-optimal objects form the Pareto frontier. A "contextual" Pareto-optimal object stands out against the other objects in a context, and a reverse Pareto-optimality query finds the contexts in which a given object belongs to the Pareto frontier. We formalize the real-world problem of finding outstanding objects as finding Pareto frontiers. In an ever-growing database, a brute-force approach to answering contextual and reverse Pareto-optimality queries would compute the Pareto frontier in every context, with regard to each existing object, separately. To avoid this cost, we pursue object reduction, context pruning, and shared computation across ways of valuing. In this dissertation, we study the efficient evaluation of contextual and reverse Pareto-optimality queries.

We first study these queries in the context of modeling and satisfying user preferences. Specifically, we study the problem of continuous object dissemination: given a large number of users and continuously arriving new objects, deliver each incoming object to all users who prefer it. Many real-world applications analyze users' preferences for effective object dissemination, and for continuously arriving objects, finding the users who prefer a new object in a timely manner is challenging. We consider an append-only table of objects with multiple attributes, where a user's preferences on individual attributes are modeled as strict partial orders; an object is preferred by a user if it belongs to the Pareto frontier with respect to the user's partial orders. Since different users' preferences can be similar, we design algorithms that find the target users of a new object by sharing computation across similar preferences. To find users with similar preferences, we study the novel problem of clustering users' preferences represented as partial orders, and we present an approximate solution that is more efficient than the exact one while ensuring sufficient accuracy. Furthermore, we extend the algorithms to operate under the semantics of a sliding window. We present results from comprehensive experiments evaluating the efficiency and effectiveness of the proposed techniques.
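For intuition only, the following is a minimal sketch of the preference model, assuming each attribute preference is supplied as a transitively closed set of (preferred, less preferred) value pairs; the attribute names, toy preferences, and function names are hypothetical and do not reflect the dissertation's algorithms.

# Illustrative sketch (hypothetical names, not the proposed algorithms):
# decide whether a newly arrived object joins a user's Pareto frontier when
# the user's preference on each attribute is a strict partial order.

def dominates(obj_a, obj_b, preferences):
    """obj_a dominates obj_b for this user if it is equal or preferred on every
    attribute and strictly preferred on at least one."""
    strictly_better = False
    for attr, order in preferences.items():
        a, b = obj_a[attr], obj_b[attr]
        if a == b:
            continue
        if (a, b) in order:        # a is strictly preferred to b on this attribute
            strictly_better = True
        else:                      # b is preferred, or the two values are incomparable
            return False
    return strictly_better

def prefers(new_obj, existing_objs, preferences):
    """The user prefers the new object iff no existing object dominates it."""
    return not any(dominates(old, new_obj, preferences) for old in existing_objs)

# Toy example: car listings with two attributes and one user's partial orders.
prefs = {
    "make":  {("Toyota", "Honda"), ("Toyota", "Ford"), ("Honda", "Ford")},
    "color": {("blue", "red")},    # blue preferred to red; other colors incomparable
}
existing = [{"make": "Toyota", "color": "red"},
            {"make": "Honda",  "color": "blue"}]
print(prefers({"make": "Toyota", "color": "blue"}, existing, prefs))   # True

A brute-force disseminator would repeat this check for every user whenever an object arrives; the techniques summarized above avoid that by clustering similar partial orders and sharing dominance computation across users.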
We further study efficient evaluation of contextual and reverse Pareto-optimality queries in the form of finding new, prominent situational facts, which are emerging statements about objects that stand out within certain contexts. Many such facts are newsworthy, e.g., an athlete's outstanding performance in a game or a viral video's impressive popularity, and automated identification of these facts assists journalists in reporting. More specifically, we consider an ever-growing table of objects with dimension and measure attributes. A situational fact is a "contextual" skyline tuple that stands out against historical tuples in a context specified by a conjunctive constraint on the dimension attributes, while the tuples are compared on a subset of the measure attributes. New tuples are constantly added to the table, reflecting events happening in the real world. Our goal is to discover the constraint-measure pairs that qualify a new tuple as a contextual skyline tuple, and to discover them quickly, before the event becomes yesterday's news. A brute-force approach requires exhaustive comparison of the new tuple with every existing tuple, under every constraint, and in every measure subspace. We design algorithms that address these challenges using three corresponding ideas: tuple reduction, constraint pruning, and sharing computation across measure subspaces. We further extend the algorithms to allow for updates to the data, and we adopt a simple prominence measure to rank the large number of discovered facts. Experiments over three real datasets validate the effectiveness and efficiency of our techniques. Finally, we present FactWatcher, a system that assists journalists in identifying data-backed, attention-seizing facts that serve as leads to news stories.
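For intuition, the brute-force baseline described above can be sketched as follows, assuming larger measure values are better; the dimension and measure attribute names and the function names are hypothetical. The proposed algorithms avoid this exhaustive enumeration through tuple reduction, constraint pruning, and shared computation across measure subspaces.

# Illustrative brute-force sketch (hypothetical names, not the proposed algorithms):
# enumerate the (constraint, measure subspace) pairs under which a new tuple
# is a contextual skyline tuple.
from itertools import combinations

def dominates_on(t_a, t_b, measures):
    """t_a dominates t_b if it is at least as large on every measure and strictly
    larger on at least one (larger values assumed better)."""
    return (all(t_a[m] >= t_b[m] for m in measures)
            and any(t_a[m] > t_b[m] for m in measures))

def situational_facts(new_tuple, history, dims, measures):
    for k in range(len(dims) + 1):
        for dim_subset in combinations(dims, k):
            # Conjunctive constraint formed from the new tuple's own dimension values.
            constraint = {d: new_tuple[d] for d in dim_subset}
            context = [t for t in history
                       if all(t[d] == v for d, v in constraint.items())]
            for m in range(1, len(measures) + 1):
                for subspace in combinations(measures, m):
                    if not any(dominates_on(t, new_tuple, subspace) for t in context):
                        yield constraint, subspace

# Toy example: box-score rows with two dimension and two measure attributes.
history = [{"team": "DAL", "season": 2013, "points": 30, "assists": 12},
           {"team": "DAL", "season": 2013, "points": 25, "assists": 15}]
new_row = {"team": "DAL", "season": 2013, "points": 28, "assists": 14}
for constraint, subspace in situational_facts(new_row, history,
                                              ["team", "season"],
                                              ["points", "assists"]):
    print(constraint, subspace)   # e.g., {} ('points', 'assists'), ...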

Keywords

Pareto-optimality queries

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
