Document Type
Honors Thesis
Abstract
Resource-poor, morphologically complex languages are at a disadvantage in natural language processing tasks, such as automatic text summarization and machine translation, because little quality linguistic data is available in these languages. Recently, researchers have introduced a language-independent, centroid-based method for automatic text summarization that garnered international attention for its success. This thesis explores methods for improving Rossiello et al.’s summarization approach on resource-poor, morphologically complex languages by adding preprocessing steps to the data. Stemming is shown to marginally improve ROUGE benchmark scores for summaries in German, a relatively morphologically complex language, as well as in Turkish, an agglutinative language. In addition, a manual semantic analysis of the Word2Vec models used in this approach showed improved accuracy when the models were constructed on stemmed corpora. This result has implications for research on word embeddings in low-resource and morphologically complex languages.
Publication Date
5-1-2018
Language
English
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Goss Manshack, Kalen, "IMPROVING AUTOMATIC SUMMARIZATION FOR LOW- AND MODERATE-RESOURCE, MORPHOLOGICALLY COMPLEX LANGUAGES" (2018). 2018 Spring Honors Capstone Projects. 28.
https://mavmatrix.uta.edu/honors_spring2018/28