ORCID Identifier(s)

0009-0002-5282-9120

Graduation Semester and Year

Fall 2024

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Vassilis Athitsos

Second Advisor

Gautam Das

Third Advisor

Leonidas Fegaras

Fourth Advisor

Lynn Peterson

Fifth Advisor

Laurel S. Stvan

Abstract

This thesis studies how to identify similarity between authors and how to group authors based on that similarity. To solve that problem, the thesis proposes concrete solutions to a series of sub-problems. The initial sub-problems are how to identify a pool of possible features for representing documents, and how to select and combine some of those features to map a document into a feature vector. Another sub-problem is how to evaluate the usefulness of such feature vectors in identifying similarity of language style. As part of addressing that sub-problem, this thesis proposes a novel method for evaluating the quality of document representations, using a dataset of annotations in which human volunteers look at triples of documents and select the most dissimilar one based on their perception of authorship. This novel dataset of annotations is an additional contribution of the thesis. Finally, the thesis proposes and evaluates methods for representing authors and evaluating author similarity. This work begins to identify and quantify similarities between individual language styles, called idiolects. Building on this, the thesis presents work toward useful groupings of authors with very similar idiolects. The term congrualect is introduced to describe such a grouping. The proposed methods are evaluated on a well-known dataset (Amazon Web Services Customer Review Dataset) of documents with multiple known, individual authors, with the goal of identifying authors whose linguistic style (idiolect) is similar. The experiments evaluate different methods for obtaining document and author representations from hand-crafted features, machine-learned features, or fusion features that mix both.
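The triple-based evaluation described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual procedure: it assumes documents are already mapped to numeric feature vectors, that a representation "predicts" the odd one out as the document left over once the closest pair in the triple is removed, and that quality is scored as the fraction of triples where this prediction matches the human choice. The function names, the Euclidean distance, and the toy data are all hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predicted_odd_one_out(vectors, triple):
    """Return the document id in `triple` that the representation marks as most dissimilar.

    Assumption: the odd one out is the document that remains after removing
    the closest pair of the three.
    """
    a, b, c = triple
    # Map each candidate odd-one-out to the distance between the other two.
    pair_distances = {
        c: euclidean(vectors[a], vectors[b]),
        b: euclidean(vectors[a], vectors[c]),
        a: euclidean(vectors[b], vectors[c]),
    }
    return min(pair_distances, key=pair_distances.get)


def triplet_agreement(vectors, annotations):
    """Fraction of annotated triples where the prediction matches the human choice."""
    hits = sum(
        1
        for triple, human_choice in annotations
        if predicted_odd_one_out(vectors, triple) == human_choice
    )
    return hits / len(annotations)


# Hypothetical toy data: four documents with 2-dimensional feature vectors.
vectors = {
    "d1": [0.0, 0.0],
    "d2": [0.1, 0.0],
    "d3": [5.0, 5.0],
    "d4": [5.1, 5.0],
}

# Each annotation pairs a triple of document ids with the human-selected odd one out.
annotations = [
    (("d1", "d2", "d3"), "d3"),
    (("d1", "d3", "d4"), "d1"),
    (("d2", "d3", "d4"), "d3"),  # disagrees with the toy vectors on purpose
]

score = triplet_agreement(vectors, annotations)
```

Under these assumptions, a higher score means the feature vectors agree more often with human perception of authorship, so the same routine could be used to compare hand-crafted, machine-learned, and fusion representations against the same annotations.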

Keywords

Natural Language Processing, Data Mining, Machine Learning, Idiolect, Congrualect

Disciplines

Numerical Analysis and Scientific Computing

Available for download on Saturday, December 13, 2025
