ORCID Identifier(s)

ORCID iD: 0009-0007-9942-9550

Graduation Semester and Year

Spring 2024

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Chengkai Li

Abstract

Recent efforts to combat misinformation have increasingly focused on automating the fact-checking process. For fact-checking systems to be automated, they need to recognize claims worth checking, match them with previously fact-checked claims, and determine their truthfulness. We refer to the task of matching unvetted claims with previously fact-checked claims as claim-matching. More precisely, the task is defined as follows: given a factual claim, identify the fact-checks from a repository that could be helpful, or partially helpful, for vetting the given claim. A solution to this task is useful in practice, since claimants often repeat the same claims even after they have been debunked.
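As an illustration of the task only (not of the ClaimMatcher model described below), the retrieval step of claim-matching can be sketched as ranking a repository of fact-checks by embedding similarity to the input claim. The embed() function here is a toy hashed bag-of-words stand-in for a real sentence encoder, and the example texts are invented:

    import numpy as np

    def embed(texts):
        # Toy stand-in for a real sentence encoder: hashed bag-of-words
        # vectors. Any Transformer-based encoder could be substituted.
        vecs = np.zeros((len(texts), 512))
        for i, text in enumerate(texts):
            for token in text.lower().split():
                vecs[i, hash(token) % 512] += 1.0
        return vecs

    def match_claim(claim, fact_checks, top_k=5):
        # Encode the unvetted claim and every fact-check in the repository.
        claim_vec = embed([claim])[0]
        fc_vecs = embed(fact_checks)
        # Cosine similarity between the claim and each fact-check.
        sims = fc_vecs @ claim_vec / (
            np.linalg.norm(fc_vecs, axis=1) * np.linalg.norm(claim_vec) + 1e-9
        )
        # The most similar fact-checks are the candidates most likely
        # to be helpful for vetting the claim.
        order = np.argsort(-sims)[:top_k]
        return [(fact_checks[i], float(sims[i])) for i in order]

    repository = [
        "Fact-check: claims that crime rates doubled last year.",
        "Fact-check: claims about record-low unemployment rates.",
    ]
    print(match_claim("Unemployment is at a record low.", repository, top_k=1))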

To help automate claim-matching, we created ClaimMatcher, a neural network model built on Transformer embeddings and trained on the ClaimPairs dataset. ClaimPairs, a benchmark dataset created using our in-house crowdsourcing data-annotation platform, consists of pairs of fact-checks and claims. The fact-checks are from PolitiFact, and the claims are from events including the United States Primary and General Presidential Election Debates and the State of the Union Addresses. The dataset also includes auxiliary information, such as the claimants and the timestamps of the claims. We employed Krippendorff's alpha to evaluate inter-annotator agreement on the ClaimPairs dataset, obtaining α = 0.773, which signifies a substantial level of agreement among annotators.
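As a hedged sketch of how such an agreement score can be computed (not necessarily the exact procedure used for ClaimPairs), the open-source krippendorff Python package takes a coders-by-units matrix of ratings; the ratings below are invented for illustration, with np.nan marking items an annotator skipped:

    import numpy as np
    import krippendorff  # pip install krippendorff

    # Rows are annotators, columns are claim/fact-check pairs;
    # values are helpfulness labels. These ratings are invented.
    ratings = np.array([
        [0, 1, 2, 2, np.nan, 1],
        [0, 1, 2, 1, 0,      1],
        [0, 2, 2, 2, 0,      np.nan],
    ])

    # An ordinal level of measurement suits ordered helpfulness levels.
    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="ordinal")
    print(f"Krippendorff's alpha = {alpha:.3f}")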

Although trained on ClaimPairs using less advanced embedding models, ClaimMatcher outperformed SentenceTransformers, a widely used state-of-the-art framework for computing the semantic similarity between texts, on both correlation and error metrics. In per-class error analysis, ClaimMatcher performed best in three of the five classification classes while remaining competitive in the rest.
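For reference, the SentenceTransformers baseline scores a claim/fact-check pair roughly as follows; the checkpoint name is one common choice rather than necessarily the model evaluated here, and the example texts are invented:

    from sentence_transformers import SentenceTransformer, util

    # One commonly used checkpoint; not necessarily the one
    # evaluated in this dissertation.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    claim = "Unemployment is at a record low."
    fact_check = "Fact-check: claims about record-low unemployment rates."

    # Encode both texts and score them with cosine similarity,
    # the standard SentenceTransformers recipe for semantic similarity.
    embeddings = model.encode([claim, fact_check], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"semantic similarity = {score:.3f}")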

Our key contributions are the formulation of the claim-matching task, the ClaimPairs benchmark dataset, and the ClaimMatcher framework. In particular, we are the first to define helpfulness levels of fact-checks for vetting claims when formulating the claim-matching task. As the only dataset built for this formulation of claim-matching, ClaimPairs can become a valuable resource for the research community. We also created a dedicated data-annotation platform to curate ClaimPairs; its design and implementation could be adapted for other annotation tasks. Finally, we conducted comprehensive experiments and case studies comparing ClaimMatcher with SentenceTransformers, and ClaimMatcher outperformed them in both correlation and error measurements.

Keywords

Claim-matching, Fact-checking, Misinformation, Data annotation, Semantic similarity, Deep learning models

Disciplines

Computer Sciences

Available for download on Thursday, May 28, 2026
