Graduation Semester and Year
2012
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
Chengkai Li
Abstract
We are surrounded by data in various forms such as instant messages, Twitter tweets, Facebook status updates, news, media, blogs and much more. Extracting meaning from such a massive collection of unstructured data would lead to interesting stories. Examples of such stories can be ``\emph{Who was the most popular actor in a particular month}''or ``\emph{Which diseases were people most concerned about in year 2008}''. In this thesis, we propose to discover popular entities mentioned in blog articles based on the concept of prominent streak. Given a sequence of values for a named entity (e.g., a person, a place, etc.), where each value is the occurrence frequency of the entity in blog articles during a corresponding period of time, a prominent streak is a long consecutive subsequence of only large (small) values. Whether a streak is prominent also depends on how it fares against streaks for comparable entities. Using the distributed data processing framework Mapreduce, particularly Hadoop which is one of its open-source implementations, we find entity occurrences in a set of blog articles with a trie-based data structure. Prominent streak discovery algorithms are applied over the detected sequences of entities occurrences to derive interesting stories. Our experiments and evaluation are done over the ICWSM'09 Spinn3r blog dataset, which contains over 44 million blog articles for the months of August and September in 2008.
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.
Recommended Citation
Philip, Jijo John, "Prominent Streaks Discovery On Blog Articles" (2012). Computer Science and Engineering Theses. 207.
https://mavmatrix.uta.edu/cse_theses/207
Comments
Degree granted by The University of Texas at Arlington