Document Type
Dataset
DOI
https://doi.org/10.32855/dataset.2026.02.046
Production/Collection Date
January 23, 2024 - January 31, 2026
Production/Collection Location
Arlington, Texas
Depositor
Laurel Smith Stvan
Deposit Date
March 26, 2026
Data Type
The dataset is provided as a CSV file. The text consists of metadata for each product review along with typed annotations by participants. The dataset has these headers: annotater, CIDa, review#a, CIDq, review#q, CIDb, review#b, Waldo, confidence, comments. Each row below the header row represents the data collected from one annotation exercise.

Annotator Column
The sixteen annotators are represented by capital letters, allowing researchers to group annotations by annotator without revealing the identities of study participants. Each annotator provided up to 129 annotations.

Columns Designating Reviews Viewed
In each exercise, the annotator views three reviews. Two of the reviews, designated A and B, come from the pool of reviews attributed to a single CID (customer identification) number. The third review, designated Q, comes from a different CID. The columns whose headers include a, b, or q hold either the CID or the Amazon-provided ID (review number) for each specific review used.

Waldo Column
The column headed 'Waldo' contains a symbol, A, B, or Q, designating which review the annotator chose as the odd one out. Q indicates the annotator successfully chose the review attributed to a different author. An A or B in the column indicates confusion about which review was from a different author, and records which of the two same-CID reviews was chosen.

Confidence Column
Our study had two main phases, and in the second phase we collected additional data from the annotators: we asked them to rate their confidence in their selections. This confidence level was collected using a slider and converted into a range from 0 (low confidence) to 100 (high confidence). The column labeled 'confidence' reports this number and is blank if the annotation is from the first phase of the project.

Comments Column
After making a selection, and without knowing whether the selection was correct or confused, each annotator was given an input box for free-form comments of any kind. Some annotators expressed difficulty with the decision or gave reasons for the choice made. Others chose to provide personal demographic data. Among other uses, these comments could provide insight into outlier data or inspiration for features to extract for a clustering or other machine learning model.
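As a minimal loading sketch, the Python snippet below reads the file, tallies each annotator's odd-one-out accuracy (rows where the Waldo column is Q), and averages the phase-two confidence ratings. The filename waldo_dataset.csv is a placeholder, and the column keys assume the CSV headers match the list above exactly, including the 'annotater' spelling.

import csv
from collections import defaultdict

# Placeholder filename; substitute the actual path to the downloaded dataset.
PATH = "waldo_dataset.csv"

totals = defaultdict(int)    # annotations per annotator
correct = defaultdict(int)   # correct odd-one-out picks (Waldo == 'Q')
confidences = []             # phase-two confidence ratings (0-100)

with open(PATH, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        annotator = row["annotater"]  # header spelling as documented above
        totals[annotator] += 1
        if row["Waldo"].strip().upper() == "Q":
            correct[annotator] += 1
        if row["confidence"].strip():  # blank for phase-one annotations
            confidences.append(float(row["confidence"]))

for annotator in sorted(totals):
    rate = correct[annotator] / totals[annotator]
    print(f"Annotator {annotator}: {correct[annotator]}/{totals[annotator]} correct ({rate:.0%})")

if confidences:
    print(f"Mean phase-two confidence: {sum(confidences) / len(confidences):.1f}")

Per-annotator grouping mirrors the anonymized capital-letter scheme described above, so results can be compared across participants without identifying them.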
Abstract
Distinct from the task of predicting the author of a document (authorship attribution), we address the problem of estimating the similarity between the written language styles of authors. To do so, we present a dataset of metadata derived by presenting human annotators with three documents and asking them to identify which two were written by the same author and which was written by a different author. The dataset contains over 400 such annotations, creating a companion to the Amazon Web Services (AWS) customer review dataset and laying the groundwork for crowdsourcing applications in other natural language processing (NLP) endeavors.
Disciplines
Computer Sciences | Linguistics
Publication Date
March 26, 2026
Language
English
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Koone, Mary E.; Kallie, Rosie; Athitsos, Vassilis; and Stvan, Laurel S., "The Waldo Dataset" (2026). Social Work Datasets. 3.
https://mavmatrix.uta.edu/socialwork_datasets/3