ORCID Identifier(s)


Graduation Semester and Year




Document Type


Degree Name

Master of Science in Computer Science


Computer Science and Engineering

First Advisor

Chengkai Li


Health tweet classification is the task of deciding whether a given tweet is health-related. Existing research has made significant progress in classifying tweets within specific sub-domains of health, such as mental health, COVID-19, or individual diseases, but a more comprehensive approach covering a broader range of health-related topics is still needed. This thesis addresses that need by proposing a diverse dataset that combines several existing health-related datasets, data collected through a keyword-based approach, and manually annotated data. A significant challenge for the classification task is that health-related keywords are often used figuratively or in non-health contexts. To overcome this challenge, the thesis explores Transformer-based models, namely BERT, BERTweet, RoBERTa, and DistilBERT, which can capture the contextual meaning of words, and experiments with them to assess their effectiveness in classifying health-related tweets. On the test data, BERT, DistilBERT, and RoBERTa achieved F1-scores of 0.882, 0.870, and 0.872, respectively. The highest F1-score, 0.900, was achieved by adding a BiLSTM layer to the BERTweet model and fine-tuning it on the proposed dataset and RHMD (a Reddit dataset). An ablation analysis was also conducted to show how much the BiLSTM layer and the RHMD dataset each contribute to the BERTweet model's performance on health tweet classification.
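For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-score for the health-related class can be computed from predicted and true labels. It is an illustration only: the abstract does not say which F1 variant (binary, macro, or weighted) was reported, so the binary form is an assumption here, and the function name `f1_score` and the toy labels are hypothetical rather than taken from the thesis.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1 for the `positive` class (assumed: 1 = health-related tweet)."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration (not data from the thesis).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
print(f"{f1_score(y_true, y_pred):.3f}")  # prints 0.800
```

In this toy example there are four true positives, one false positive, and one false negative, so precision and recall are both 0.8 and the F1-score is 0.8.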


Healthcare, Deep Learning, Twitter Data Analysis, Transformers, Classification


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington