Graduation Semester and Year

2006

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

Alp Y. Aslandogan

Abstract

We address the problem of irrelevant results for short queries on Web search engines. Short queries do not provide enough context to disambiguate the possible meanings of the search terms, so the result set contains irrelevant pages that the user must filter out by navigating and sometimes examining them. First, we predict the potential concept topics, that is, the domains to which the search terms may belong. This prediction is based on word occurrences and relationships observed in the various domains (categories) of a corpus. Next, we expand the search terms in each of the predicted domains in parallel. We then submit separate queries, each specialized for one domain, to a general-purpose search engine. The user is presented with search results categorized under the predicted domains. The theoretical foundations of our approach include concept identification in the form of associated terms through Latent Semantic Indexing (in particular the WordSpace model), the one-sense-per-collocation and one-domain-per-discourse assumptions, and sense disambiguation through sufficient context. User evaluations indicate that our approach helps users avoid examining irrelevant Web search results, especially for shorter queries. Another contribution of our work is a Web-based corpus of documents with sufficiently rich collections in multiple subject categories. We also created a mapping between these subject categories, taken from the Open Directory Project, and the domains of WordNet Domains.
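The pipeline summarized in the abstract (predict domains, expand the query within each domain, issue one query per domain) can be illustrated with a minimal sketch. This is not the thesis implementation: plain co-occurrence counts stand in for Latent Semantic Indexing and the WordSpace model, and the toy corpus, the domain names, and the functions `predict_domains`, `expand_terms`, and `domain_queries` are all hypothetical names introduced here for illustration. The sketch only builds the per-domain query strings; submitting them to a search engine is left out.

```python
from collections import Counter

# Hypothetical toy corpus: domain name -> list of tokenized documents.
CORPUS = {
    "zoology": [
        ["jaguar", "cat", "jungle", "prey"],
        ["jaguar", "cat", "habitat"],
    ],
    "automobiles": [
        ["jaguar", "car", "engine", "luxury"],
        ["jaguar", "car", "dealer"],
    ],
    "cooking": [
        ["recipe", "oven", "flour"],
    ],
}


def predict_domains(terms, corpus):
    """Return the domains in which every query term occurs, most frequent first."""
    scores = {}
    for domain, docs in corpus.items():
        counts = Counter(tok for doc in docs for tok in doc)
        if all(counts[t] > 0 for t in terms):
            scores[domain] = sum(counts[t] for t in terms)
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def expand_terms(terms, docs, k=1):
    """Pick the k words that co-occur most often with the query terms.

    Assumption: raw co-occurrence counts replace the LSI/WordSpace
    associations described in the abstract.
    """
    co = Counter()
    for doc in docs:
        if any(t in doc for t in terms):
            co.update(tok for tok in doc if tok not in terms)
    ranked = sorted(co.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def domain_queries(query, corpus, k=1):
    """Build one expanded query string per predicted domain."""
    terms = query.lower().split()
    return {
        domain: " ".join(terms + expand_terms(terms, corpus[domain], k))
        for domain in predict_domains(terms, corpus)
    }


if __name__ == "__main__":
    for domain, q in domain_queries("jaguar", CORPUS).items():
        print(f"{domain}: {q}")
```

In this toy setting the ambiguous one-word query "jaguar" yields one specialized query per matching domain, and the results of each query would then be shown under that domain's heading.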

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
