The Waldo Dataset
Abstract
Rather than predicting the author of a document (authorship attribution), we address how to estimate the similarity between the writing styles of authors. To that end, we present a dataset of metadata collected by showing human annotators three documents and asking them to identify which two were written by the same author and which was written by a different author. The dataset contains over 400 such annotations and serves as a companion to the Amazon Web Services (AWS) customer review dataset, laying the groundwork for applying crowdsourcing to other natural language processing (NLP) tasks.