Graduation Semester and Year
Fall 2024
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Vassilis Athitsos
Second Advisor
Gautam Das
Third Advisor
Leonidas Fegaras
Fourth Advisor
Lynn Peterson
Fifth Advisor
Laurel S. Stvan
Abstract
This thesis studies how to identify similarity between authors and how to group authors based on that similarity. To that end, it proposes concrete solutions to a series of sub-problems. The first sub-problems are how to identify a pool of candidate features for representing documents, and how to select and combine some of those features to map a document to a feature vector. A further sub-problem is how to evaluate the usefulness of such feature vectors for identifying similarity in language style. As part of addressing that sub-problem, the thesis proposes a novel method for evaluating the quality of document representations, using a dataset of annotations in which human volunteers look at triples of documents and select the most dissimilar one based on their perception of authorship. This novel annotation dataset is itself a contribution of the thesis. Finally, the thesis proposes and evaluates methods for representing authors and measuring author similarity. This work begins to identify and quantify similarities between individual language styles, called idiolects. Building on this, it presents work toward useful groupings of very similar idiolects (individual authors) and introduces the term congrualect to describe such a grouping. The proposed methods are evaluated on a well-known dataset (Amazon Web Services Customer Review Dataset) of documents with multiple known individual authors, with the goal of identifying authors whose linguistic style (idiolect) is similar. The experiments compare document and author representations built from hand-crafted features, machine-learned features, or fusion features that mix both.
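The triple-based evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not state how a document representation is scored against the human annotations, so everything here is an assumption made for illustration: the representation's "choice" for a triple is taken to be the document with the largest summed cosine distance to the other two, and quality is the fraction of triples where that choice matches the annotator's. The function names and the toy vectors are hypothetical and are not taken from the thesis.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat an all-zero vector as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def predict_odd_one_out(triple):
    """Index (0, 1, or 2) of the document farthest from the other two.

    Assumed scoring rule: largest summed cosine distance to the others.
    """
    a, b, c = triple
    d_ab = cosine_distance(a, b)
    d_ac = cosine_distance(a, c)
    d_bc = cosine_distance(b, c)
    totals = [d_ab + d_ac, d_ab + d_bc, d_ac + d_bc]
    return max(range(3), key=totals.__getitem__)


def triple_agreement(triples, human_choices):
    """Fraction of triples where the representation matches the annotator."""
    if not triples or len(triples) != len(human_choices):
        raise ValueError("need one human choice per triple, and at least one triple")
    hits = sum(
        predict_odd_one_out(t) == h for t, h in zip(triples, human_choices)
    )
    return hits / len(triples)


# Toy, made-up 2-D feature vectors; real document vectors would be much larger.
triples = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),
]
human_choices = [2, 0]  # hypothetical annotator picks, one index per triple
agreement = triple_agreement(triples, human_choices)
```

On this toy input the representation agrees with the hypothetical annotator on the first triple only, so `agreement` is 0.5. Under the same assumptions, hand-crafted, machine-learned, and fusion representations could be compared by computing this score for each on the same set of annotated triples.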
Keywords
Natural Language Processing, Data Mining, Machine Learning, Idiolect, Congrualect
Disciplines
Numerical Analysis and Scientific Computing
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Koone, Mary E. PhD, "IDENTIFICATION AND QUANTIFICATION OF AUTHORIAL STYLE SIMILARITY" (2024). Computer Science and Engineering Dissertations. 401.
https://mavmatrix.uta.edu/cse_dissertations/401