Author

Hehuan Ma

ORCID Identifier(s)

0000-0002-5971-0053

Graduation Semester and Year

2023

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Junzhou Huang

Abstract

Drug discovery has always been a crucial task for society, and molecular property prediction is one of its fundamental problems. It is responsible for identifying target properties or severe side effects, so that certain molecules can be selected as drug candidates. Traditional methods usually conduct a series of biochemical experiments to test molecular properties, which may take decades. Nowadays, this process can be accelerated by the rapid growth of deep learning methods. I present my work toward solving this critical problem by utilizing deep learning techniques. My research can be summarized in three directions: 1) designing informative and powerful molecular representation learning models; 2) exploring the imbalanced data distribution and developing corresponding techniques to further improve the prediction performance; and 3) taking advantage of unlabeled data to address the limited labeled data problem in molecular property prediction. Accurately and properly representing a molecule is central to molecular property prediction and significantly impacts the predictions. To tackle this challenge, I propose a cross-dependent graph neural network that learns and generates informative molecular representations by exploring the molecular graph structure, taking both the atom-oriented and the edge-oriented graph structures into consideration. Moreover, molecular data can be imbalanced, since some molecules are easily predicted while others are not. Therefore, I explore the data distribution and propose an attentive loss function that allows the network to learn the sample importance with respect to different molecules, which further improves the model performance. Lastly, acquiring accurate molecular property information remains a demanding task due to the labor- and resource-intensive nature of labeling.
To address this issue, effectively utilizing unlabeled data stands out as a potential solution. I propose a robust self-training strategy that includes unlabeled data to promote molecular property prediction. Furthermore, I propose a data-augmentation strategy using graph neural networks that incorporates these two methods to solve a realistic problem, drug-induced liver injury (DILI) prediction, and obtain significant improvement on this extremely small dataset, which contains only hundreds of molecules. Moreover, I propose a sequence-based multi-label learning method to improve the performance of property prediction with limited data in a semi-supervised manner.

Keywords

Deep learning models, Molecular property prediction, Graph neural networks, Drug discovery, Semi-supervised learning, Representation learning

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
