ORCID Identifier(s)

0000-0001-9188-4048

Graduation Semester and Year

2020

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Ramez Elmasri

Abstract

Natural Language Processing (NLP) is the use of computers for the analysis of text. The approaches used to solve problems in NLP have transitioned from human tagging to statistical methods, and many current approaches make use of techniques from Machine Learning (ML) and Artificial Intelligence (AI). Text-based problems in NLP include question answering, summarization, and text-based recommendation systems. The amount of available text data is increasing by the day, and with this growth, research in NLP needs methods more sophisticated than traditional statistical ones. Since the advent of word vectors and advances in neural networks, present-day approaches have made use of deep learning. This research focused on identification of MultiWord Expressions (MWEs), summarization of text, Question Answering from opinionated writing, and how MWEs affect Natural Language Understanding (NLU). The MWE identification problem was solved as a classification task using a Convolutional Neural Network model. The model provided reasonable F-scores, and the neural network architecture may be used for similar tasks. In fact, the task studying the effect of MWEs on NLU used this neural network architecture. There was no significant difference between questions with MWEs and questions without MWEs. Summarization of text and Question Answering from opinionated writing were performed using unsupervised clustering-based algorithms. SIMSTER was used for summarization of text, and an extension of SIMSTER, SimsterQ, was used to answer opinionated questions. SIMSTER did not limit the number of clusters, whereas SimsterQ limited the number of clusters to ten. In each task two variants, first and med, were used. The first variant returns the first sentence from each cluster, and the med variant returns a sentence of median length from each cluster.
ROUGE scores and other benchmarks show satisfactory performance of both algorithms in their respective tasks. For the question answering task, the quality of the generated clusters was also reported. The algorithm was able to naturally select the appropriate number of clusters in more than 80% of instances. This research tackled NLP tasks that are both difficult and actively worked on by the NLP community. The neural network architecture, the two unsupervised algorithms, and their ability to provide performance comparable to or exceeding baselines are the major contributions of this research to the NLP knowledge base.
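
The two sentence-selection variants named in the abstract, first and med, can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: it assumes the clusters have already been formed and are given as lists of sentences in document order. The function names select_first and select_med are hypothetical, and two further details are assumptions made only for illustration: sentence length is counted in whitespace-separated words, and for a cluster with an even number of sentences the lower of the two middle sentences is returned.

```python
from typing import List


def select_first(clusters: List[List[str]]) -> List[str]:
    """'first' variant: take the first sentence of each non-empty cluster."""
    return [cluster[0] for cluster in clusters if cluster]


def select_med(clusters: List[List[str]]) -> List[str]:
    """'med' variant: take a sentence of median length from each non-empty cluster.

    Assumptions (not from the source): length is the number of
    whitespace-separated words, and the lower median is used when a
    cluster has an even number of sentences.
    """
    picked = []
    for cluster in clusters:
        if not cluster:
            continue
        # sorted() is stable, so equal-length sentences keep document order.
        ordered = sorted(cluster, key=lambda s: len(s.split()))
        picked.append(ordered[(len(ordered) - 1) // 2])
    return picked


# Toy clusters, invented purely for illustration.
clusters = [
    [
        "I charge it every other day without any trouble.",
        "Battery life is great.",
        "The battery lasts two days.",
    ],
    [
        "Outdoors the display is hard to read.",
        "The screen is dim.",
    ],
]

first_summary = select_first(clusters)
med_summary = select_med(clusters)
```

Each function returns one sentence per cluster, so the length of the output follows the number of clusters; under the abstract's description, that would be at most ten sentences for SimsterQ and an unbounded number for SIMSTER.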

Keywords

MWE, Summarization, Question answering

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
