Graduation Semester and Year
2007
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
Gautam Das
Abstract
A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden back-end database to provide answers to user queries. Generating a uniform random sample of this hidden database using only the publicly available interface gives us access to the underlying data distribution. In this thesis, we propose a random walk scheme over the query space provided by the interface to sample such databases. We discuss variants in which the query space is viewed under either a fixed or a random ordering of attributes. We also propose techniques to further improve the sample quality by using a probabilistic rejection-based approach, and we conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Dasgupta, Arjun, "A Random Walk Approach To Sampling Hidden Databases" (2007). Computer Science and Engineering Theses. 319.
https://mavmatrix.uta.edu/cse_theses/319
Comments
Degree granted by The University of Texas at Arlington