ORCID Identifier(s)

0000-0002-2575-9633

Graduation Semester and Year

2022

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

X Jean Gao

Abstract

The emergence of high-throughput sequencing technology has generated a wealth of “multi-omics” data, capturing information about different types of biomolecules at multiple levels. As large-scale genomics, transcriptomics, and proteomics data become publicly available, integrated systems analysis of these data sources has taken the front seat in deriving valuable insights for identifying cancer biomarkers or predicting interactions and functions of novel molecules such as lncRNAs. Graph representation learning is among the most promising approaches to these challenging tasks: it can improve predictions over sparsely annotated molecular entities and provide representation capacity and interpretability over heterogeneous and hierarchically structured data. This dissertation investigates novel graph machine learning approaches for biomarker discovery in microRNA co-expression graphs, functional representation of lncRNA sequences for link prediction, aggregation of heterogeneous relations to predict protein functions, and pipelines that enable reproducible graph integration of public biological databases. Prior work on multi-omics integrative analysis has had significant shortcomings in addressing the challenges posed by the heterogeneity and scale of graph-based datasets. For instance, univariate analyses that ignore the interconnectivity between the various omics cannot produce robust results when identifying biomarkers for genetically heterogeneous cancers from multi-omics data. We constructed the MicroRNA Dysregulatory Synergistic Network to extract features from aberrant microRNA-messenger RNA interactions and applied a multivariate technique that considers the grouping effect of biomarkers.
Aside from inferring gene-disease associations, we also proposed the rna2rna method to predict the regulatory interactions and functional similarities of non-coding RNAs (ncRNAs) when novel sequences have no existing annotations. By leveraging multimodal interaction, sequence, annotation, and expression data, our method can characterize the functional similarity and interaction topology of a novel ncRNA from its sequence. We then formulated a generalized algorithm named LATTE to deal with the complexity of heterogeneous networks, where multiple node types are connected in various ways. This graph neural network method is applied to the automatic protein function prediction problem in an architecture called LATTE2GO, which aggregates information from higher-order relations to extract integrated representations of protein-protein networks and the hierarchical Gene Ontology. Finally, as data integration and feature engineering are vital steps in large-scale bioinformatics projects, we developed an open-source software tool called OpenOmics. Our tool assists in systematically integrating heterogeneous multi-omics datasets and interfacing with popular public annotation and interaction databases for increased reproducibility and standardization of biomedical data integration. Performance evaluations of our proposed methods, algorithms, and tools validate their utility and effectiveness compared to existing state-of-the-art methods.

Keywords

Biomarker discovery, Functional similarity, Regulatory interactions, Non-coding RNAs, Graph embedding, Link prediction, Node classification, Graph neural networks, Heterogeneous graphs, Attention mechanism

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
