ORCID Identifier(s)

0000-0002-7446-4639

Graduation Semester and Year

2016

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Jean Gao

Abstract

Advances in Next Generation Sequencing (NGS), also known as High Throughput Sequencing (HTS), technologies allow researchers to investigate the genome, transcriptome, or epigenome of any organism from any perspective, thereby enriching biomedical data repositories for many lesser-known phenomena. One such phenomenon is the regulatory activity within the genome of non-coding RNAs (ncRNAs), the transcribed products of the long-neglected "junk DNA". While large-scale data about ncRNAs are becoming publicly available, mining them efficiently to obtain reliable answers to a few subtle questions poses computational challenges for bioinformaticians. Given that a huge number of transcript sequences are retrieved every day, how can one distinguish a coding transcript from an ncRNA transcript? Can the structural patterns of ncRNAs define their functions? Finally, given the accumulating evidence that dysregulation of ncRNAs is associated with a wide variety of human diseases, can one devise an inference engine to model the existing disease links as well as deduce unexplored associations? Most prior work on ncRNA data analysis cannot address these challenges because of the size and scope of the available datasets. In this dissertation, we present efficient in silico integrative methods for mining biomedical data to answer the aforementioned questions. We design the CNCTDiscriminator method for reliably classifying coding and non-coding RNAs coming from any part of the genome. This is achieved through an extensive feature extraction process for learning an ensemble classifier. We design an algorithm, PR2S2Clust, to characterize functional ncRNAs by considering their structural features. For this, we formulate the problem as a clustering of the structures of the patched RNA-seq read segments, which is the first of its kind in the literature.
Finally, we propose three algorithms to deal with the disease-ncRNA association inference problem. The first algorithm formulates the inference as a modified Non-negative Matrix Factorization (NMF) problem that can handle additional features of both entities. The second algorithm formulates the problem as an Inductive Matrix Completion (IMC) problem, providing a generalized feature integration platform that overcomes the cold-start issue common to most prior work, including the NMF strategy. The final algorithm, Robust Inductive Matrix Completion (RIMC), is presented to solve two major issues with the IMC formulation pertaining to data outliers and sparsity. For all the problems, we provide rigorous theoretical foundations for the proposed algorithms and conduct extensive experiments on publicly available real-world biomedical data. The performance evaluation validates the utility and effectiveness of the proposed algorithms over existing state-of-the-art methods.

Keywords

Machine learning, Data integration, Biomedical data analysis, Non-coding RNA

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
