Graduation Semester and Year
2017
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Chengkai Li
Abstract
This thesis studies the problem of finding facts from semi-structured and structured data. The amount of data in our world is exploding, and the proliferation of data is making them increasingly inaccessible. It is now more challenging than ever how to efficiently identify useful information where a vast amount of data is available. This thesis first studies the problem of finding facts in semi-structured data, specifically, in knowledge graphs. We built Maverick, a general, extensible framework that discovers exceptional facts about entities in knowledge graphs. We model an exceptional fact about an entity of interest as a context-subspace pair, in which a subspace is a set of attributes and a context is defined by a graph query pattern of which the entity is a match. The entity is exceptional among the entities in the context, with regard to the subspace. The search spaces of both patterns and subspaces are exponentially large. Maverick conducts beam search on the patterns which uses a match-based pattern construction method to evade the evaluation of invalid patterns. It applies two heuristics to select promising patterns to form the beam in each iteration. Maverick traverses and prunes the subspaces organized as a set enumeration tree by exploiting the upper bound properties of exceptionality scoring functions. Maverick demonstrated substantial performance improvement of the proposed framework over the baselines as well as its effectiveness in discovering exceptional facts. This thesis further investigates the problem of finding facts in structured data. In particular, our objective is to find top-k prominent streaks in multi-dimensional multi-sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (small) values, e.g., consecutive games of outstanding performance in sports. To efficiently discover prominent streaks, we exploited the properties of local prominent streaks (LPS), which is a superset of prominent streaks. We showed that LPS-based algorithms exhibited orders of magnitude performance improvement against the baseline method. This thesis also presents a fact-finding system FactWatcher, which helps journalists identify data-backed, attention-seizing facts which serve as leads to news stories. FactWatcher discovers three types of facts, including situational facts, one-of-the-few facts, and prominent streaks, through a unified suite of data model, algorithm framework, and fact ranking measure. Furthermore, FactWatcher provides multiple features in striving for an end-to-end system, including fact ranking, fact-to-statement translation and keyword-based fact search.
Keywords
Computational journalism, Fact-finding, Knowledge graph mining, Sequence data mining, Interestingness ranking
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.
Recommended Citation
Zhang, Gensheng, "FRAMEWORKS, ALGORITHMS, AND SYSTEMS FOR EFFICIENT DISCOVERY OF DATA-BACKED FACTS" (2017). Computer Science and Engineering Dissertations. 279.
https://mavmatrix.uta.edu/cse_dissertations/279
Comments
Degree granted by The University of Texas at Arlington