ORCID Identifier(s)

0000-0002-9530-0289

Graduation Semester and Year

Spring 2026

Language

English

Document Type

Thesis

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Junzhou Huang

Second Advisor

Dajiang Zhu

Third Advisor

Jean Gao

Fourth Advisor

Meng Ye

Abstract

In the evolving field of artificial intelligence, the efficacy of deep learning models is often gated by the quality of their training and the clarity of their decision-making processes. This dissertation addresses these crucial challenges by focusing on two key areas: enhancing pre-training strategies and improving model interpretability. Our approach is twofold, integrating novel pre-training methodologies that embed domain-specific knowledge early in the model training process, and developing advanced techniques for disentangling and clarifying the decision-making mechanisms within these models. The first direction of our research employs MoDNA, a motif-oriented pre-training framework specifically designed for DNA language models. By leveraging self-supervised learning, MoDNA harnesses the vast amounts of unlabeled genomic data while infusing biological priors into the training process, significantly boosting the model’s performance on downstream regulatory tasks such as promoter prediction and transcription factor binding site identification. This method demonstrates how targeted pre-training can overcome the limitations posed by sparse labeled data in the genomics field. The second direction focuses on the interpretability of graph neural networks (GNNs), which are crucial for analyzing structured data. Here, we introduce a novel method, Interpretable Graph Neural Networks with Disentangled Subgraph (IGNN-DS), that disentangles the causal and spurious factors influencing model predictions. By formalizing the interactions among graph structures, their labels, and the derived subgraphs through a Structural Causal Model (SCM), this approach clarifies how predictions are made while enhancing the model’s robustness to distribution shifts and out-of-distribution data. Building upon this foundation, we extend the framework by dividing the SCM into two modes: Fully Informative Invariant Features (FIIF) and Partially Informative Invariant Features (PIIF). We introduce Causal Subgraphs and Information Bottlenecks (CSIB), which integrates invariance principles based on the graph information bottleneck to guide the generation of causal subgraphs, achieving superior performance in out-of-distribution scenarios. Together, these strategies make deep learning models more reliable and understandable, thereby increasing their applicability and trustworthiness in critical domains such as genomics and structural data analysis. By pushing the frontiers of pre-training and interpretability, this research sets new benchmarks for what deep learning can achieve, facilitating breakthroughs that transform both the field of machine learning and its numerous applications.

Keywords

DNA Language Models;Graph Neural Networks;Model Interpretability

Disciplines

Computer Engineering

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
