Document Type

Honors Thesis


Resource-poor, morphologically complex languages are at a disadvantage in natural language processing tasks such as automatic text summarization and machine translation, because little high-quality linguistic data is available for them. Rossiello et al. recently introduced a language-independent, centroid-based method for automatic text summarization that garnered international attention for its success. This thesis explores ways to improve Rossiello et al.’s summarization approach on resource-poor, morphologically complex languages by adding preprocessing steps to the data. Stemming is shown to marginally improve benchmark ROUGE scores for summaries in German, a relatively morphologically complex language, as well as in Turkish, an agglutinative language. In addition, a manual semantic analysis of the Word2Vec models used in this approach showed improved accuracy when the models were built on stemmed corpora. This result has implications for research on word embeddings in low-resource and morphologically complex languages.

Publication Date





