The Waldo Dataset

Mary E. Koone, University of Texas at Arlington
Rosie Kallie
Vassilis Athitsos, University of Texas at Arlington
Laurel S. Stvan, University of Texas at Arlington

Abstract

Distinct from authorship attribution, the task of predicting the author of a document, we address how to estimate the similarity between the written language styles of authors. To do so, we present a dataset of metadata collected by showing human annotators three documents and asking them to identify which two were written by the same author and which was written by a different author. The dataset contains over 400 such annotations and serves as a companion to the Amazon Web Services (AWS) customer review dataset, laying the groundwork for applying crowdsourcing to other natural language processing (NLP) endeavors.
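
As an illustration of the annotation task described in the abstract, one annotation could be represented as a record holding the three document identifiers and the annotator's choice of the odd one out. The following is a minimal sketch only: the class name, the field names, and the presence of a ground-truth label are assumptions made for illustration and are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TripletAnnotation:
    """Hypothetical record for one three-document judgment (not the real schema)."""
    doc_ids: Tuple[str, str, str]  # the three documents shown to the annotator
    chosen_odd_index: int          # position (0-2) the annotator marked as the different author
    true_odd_index: int            # assumed ground-truth position of the different author

    def same_author_pair(self) -> Tuple[str, ...]:
        # The two documents the annotator judged to share an author.
        return tuple(d for i, d in enumerate(self.doc_ids)
                     if i != self.chosen_odd_index)

    def is_correct(self) -> bool:
        # True when the annotator picked the document by the different author.
        return self.chosen_odd_index == self.true_odd_index


def accuracy(annotations: List[TripletAnnotation]) -> float:
    """Fraction of annotations whose odd-one-out choice matches the assumed truth."""
    if not annotations:
        return 0.0
    return sum(a.is_correct() for a in annotations) / len(annotations)
```

Under these assumptions, the pair returned by `same_author_pair` is the annotator's implicit judgment that two documents are stylistically closer to each other than either is to the third, which is the kind of signal a style-similarity estimate could be compared against.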