Document Type
Dataset
DOI
https://doi.org/10.32855/dataset.2026.02.046
Production/Collection Date
January 23, 2024 - January 31, 2026
Production/Collection Location
Arlington, Texas
Depositor
Laurel Smith Stvan
Deposit Date
March 26, 2026
Data Type
The dataset is provided as a CSV file. The text consists of metadata for each product review along with typed annotations by participants. The dataset has these headers: annotater, CIDa, review#a, CIDq, review#q, CIDb, review#b, Waldo, confidence, comments. Each row below the header row represents the data collected from one annotation exercise.

Annotator Column
The sixteen annotators are represented by capital letters, allowing researchers to group annotations by annotator without revealing the identities of study participants. Each annotator provided up to 129 annotations.

Columns Designating Reviews Viewed
In each exercise, the annotator views three reviews. Two of the reviews, designated A and B, come from the pool of reviews attributed to a single CID (customer identification) number. The third review, designated Q, comes from a different CID. The columns whose headers include a, b, or q hold either the CID or the Amazon-provided ID (review number) for each specific review used.

Waldo Column
The column headed 'Waldo' contains a symbol, A, B, or Q, designating which review the annotator chose as the odd one out. Q indicates the annotator successfully chose the review attributed to a different author. An A or B in the column indicates confusion about which review was from a different author, and records which of the two same-CID reviews was chosen.

Confidence Column
Our study had two main phases, and in the second phase we collected additional data from the annotators: we asked them to rate their confidence in their selections. This confidence level was collected using a slider and converted into a range from 0 (low confidence) to 100 (high confidence). The column labeled 'confidence' reports this number and is blank if the annotation is from the first phase of the project.

Comments Column
After making a selection, and without knowing whether the selection was correct or confused, each annotator was given an input box for free-form comments of any kind. Some annotators expressed difficulty with the decision or gave reasons for the choice made. Others chose to provide personal demographic data. Among other uses, these comments could provide insight into outlier data or inspiration for features to extract for a clustering or other machine learning model.
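As a minimal loading sketch, the Python snippet below reads the file, tallies each annotator's odd-one-out accuracy (rows where the Waldo column is Q), and averages the phase-two confidence ratings. The filename waldo_dataset.csv is a placeholder, and the column keys assume the CSV headers match the list above exactly, including the 'annotater' spelling.

import csv
from collections import defaultdict

# Placeholder filename; substitute the actual path to the downloaded dataset.
PATH = "waldo_dataset.csv"

totals = defaultdict(int)    # annotations per annotator
correct = defaultdict(int)   # correct odd-one-out picks (Waldo == 'Q')
confidences = []             # phase-two confidence ratings (0-100)

with open(PATH, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        annotator = row["annotater"]  # header spelling as documented above
        totals[annotator] += 1
        if row["Waldo"].strip().upper() == "Q":
            correct[annotator] += 1
        if row["confidence"].strip():  # blank for phase-one annotations
            confidences.append(float(row["confidence"]))

for annotator in sorted(totals):
    rate = correct[annotator] / totals[annotator]
    print(f"Annotator {annotator}: {correct[annotator]}/{totals[annotator]} correct ({rate:.0%})")

if confidences:
    print(f"Mean phase-two confidence: {sum(confidences) / len(confidences):.1f}")

Per-annotator grouping mirrors the anonymized capital-letter scheme described above, so results can be compared across participants without identifying them.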
Abstract
Distinct from the task of predicting the author of a document (authorship attribution), we address the problem of estimating the similarity between the written language styles of authors. To do so, we present a dataset of metadata derived by presenting human annotators with three documents and asking them to identify which two were written by the same author and which was written by a different author. The dataset contains over 400 such annotations, creating a companion to the Amazon Web Services (AWS) customer review dataset and laying the groundwork for crowdsourcing applications in other natural language processing (NLP) endeavors.
Disciplines
Computer Sciences | Linguistics
Publication Date
March 26, 2026
Language
English
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Koone, Mary E.; Kallie, Rosie; Athitsos, Vassilis; and Stvan, Laurel S., "The Waldo Dataset" (2026). Social Work Datasets. 3.
https://mavmatrix.uta.edu/socialwork_datasets/3