Graduation Semester and Year

2016

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Gautam Das

Abstract

An online community network links entities (e.g., users, products) with various relationships (e.g., friendship, co-purchase) and make such information available for access through a web interface. There are numerous such networks on the web, ranging from Twitter, Instagram, which links users as ``followers-followees'', to amazon.com, ebay.com, which links products with relationships such as ``also buy''. Online community networks often feature a web interface that only allows local-neighborhood queries - i.e., given a entity (e.g., users, products) of the community network as input, the system only returns the immediate neighbors of the entity. Further, the rate limit constraint restricts the number of queries/API calls that could be issued per day over the online community networks. These restrictives makes it extremely difficult for a third party to crawl all data from a community network, as a complete crawl requires as many queries as the number of entities in the network. Moreover, the web interfaces of these networks often support features such as keyword search and ``get-neighbors'' - so a visitor can quickly find entities (e.g., users/products) of interest. Nonetheless, the interface is usually too restrictive to answer complex queries such as (1) find 100 Twitter users from California with at least 100 followers or (2) find 100 books with at least 200 5-star reviews at amazon.com. These restrictions prevent scientists and third parties with limited resources from performing novel analytics tasks over these data. On the other hand, the available content in an online community network may help users in decision making. For example, detailed reviews are critical for activities such as buying an expensive digital SLR camera, reserving a vacation package, etc. Since writing a detailed review for a product (or, a service) is usually time-consuming and may not offer any incentive, the number of useful reviews available in the Web is far from many. The corpus of reviews for making informed decisions also suffers from spam and misleading content, typographical and grammatical errors, etc. In this proposal, we present efficient techniques for exploratory data analysis over online community networks. This is achieved by developing novel algorithms that allows a user to overcome some of the restrictions enforced by web interfaces. There are three main contributions. First, we introduce a novel, general purpose, technique for faster sampling of nodes over an online social network. Specifically, unlike traditional random walks which wait for the convergence of sampling distribution to a predetermined target distribution - a waiting process that incurs a high query cost - we develop WALK-ESTIMATE, which starts with a much shorter random walk, and then proactively estimate the sampling probability for the node taken before using acceptance-rejection sampling to adjust the sampling probability to the predetermined target distribution. Second, we introduce the novel problem of answering complex queries that involve non-searchable attributes through the web interface of an online community network. We propose a unified approach that (approximately) transforms the complex query into a small number of supported queries based on a strategic query-selection process. The third contribution of my dissertation is the development of new techniques that engage the lurkers (i.e., people who read reviews but never take time and effort to write one) to participate and write online reviews by systematically simplifying the reviewing task. Given a user and an item that she wants to review, the task is to identify the top-k meaningful phrases (i.e., tags) from the set of all tags (i.e., available user feedback for items) that, when advised, would help her review an item easily. Enabling such automatic review system raise unexpected challenges that we tackle through the design of exact and approximate algorithms. For all the problems, we provide rigorous theoretical analysis and extensive experiments over synthetic and real-world popular online community networks.

Keywords

Online community networks, Data analytics, Social network analysis

Disciplines

Computer Sciences | Physical Sciences and Mathematics

License

This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.

Comments

Degree granted by The University of Texas at Arlington

Recommended Citation

Nazi, Azade, "Exploratory Data Analysis over Online Community Networks" (2016). Computer Science and Engineering Dissertations - Archive. 319.
https://mavmatrix.uta.edu/cse_dissertations/319

26376-2.zip (2322 kB)

Download

Included in

Computer Sciences Commons

COinS

Computer Science and Engineering Dissertations - Archive

Exploratory Data Analysis over Online Community Networks