Author

Young Bun Kim

Graduation Semester and Year

2008

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Jean Gao

Abstract

During the last decade, the advent of microarray technology has stimulated rapid research advances in bioinformatics. Microarray data pose great challenges for computational data analysis because of their high dimensionality (up to several tens of thousands of genes) and their small sample sizes. To deal with these characteristics, feature selection techniques have become necessary. While much research deals with classification methods and their application to microarray data, only a few approaches are explicitly designed to consider interactions among the investigated features. It is well known that interactions between genes or proteins are important for many biological functions; for example, signals from outside a cell are relayed to the core of the cell by protein-protein interactions of the signaling molecules. Hence, to achieve optimal classification accuracy, these interactions among features need to be taken into account. My research goal is to develop algorithms that not only effectively select the most informative features but also identify the relationships among those features.

For gene clustering, researchers have applied feature subset selection to choose a subset of genes that is common to all possible unknown classes. However, this overlooks the fact that a given set of genes may be related to only a subset of the experiments, owing to experimental design and insufficient knowledge of gene function. In this thesis, a new subspace semi-supervised clustering algorithm called EPSCMIX (Emerging Pattern Subspace Clustering by MIXture models) is designed. The algorithm is used to find gene expression patterns, which in turn could be used to predict pathological phenotypes and to identify genes that might anticipate the clinical behavior of diseases. The method is based on a feature saliency measure, the probability of feature relevance, which is estimated by an Expectation Maximization (EM) algorithm. It employs Emerging Patterns (EPs) to identify relationships among genes effectively. EPSCMIX discovers both the best number of classes and the relevant set of genes.

To address the problem of identifying informative genes from a large amount of gene expression data when no prior knowledge is available, we develop a hybrid methodology for unsupervised gene (feature) selection and sample clustering. The algorithm, PFSBEM (hybrid PCA based Feature Selection and Boost-Expectation-Maximization clustering), introduces a new PCA (principal component analysis) based feature selection within a wrapper framework. PFSBEM uses a three-step approach to feature selection and data clustering. The first step reduces the high-dimensional feature space by retrieving feature subsets that keep their original physical meaning, chosen for their capacity to reproduce the sample projections on the PCs (principal components). Each feature subset corresponds to a particular PC. The second step determines the important PCs that contribute to data clustering; a boost-EM (expectation-maximization) clustering method is developed to achieve stable data grouping. Finally, from the merged feature subsets of the important PCs, the feature subset that yields the best data clustering is selected.

Feature pattern (combination of features) identification techniques can capture more of the underlying semantics than single features. However, it is very hard to find meaningful patterns in large datasets such as microarray data because of the huge search space. Furthermore, infrequent patterns are often irrelevant or do not improve classification accuracy. To tackle these problems, we finally design a discriminative feature pattern identification system named DFPIS. Instead of simply identifying genes contributing to the network, this methodology takes gene interactions into consideration, representing them as Strong Jumping Emerging Patterns (SJEPs). Patterns that occur only infrequently are treated as irrelevant. The whole framework consists of three steps: feature (gene, protein) selection, feature pattern identification, and pattern annotation.
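As an illustration of the pattern idea mentioned above, the following is a minimal Python sketch and not the dissertation's implementation. It assumes the definition commonly used in the emerging-pattern literature: a strong jumping emerging pattern is a minimal itemset that has zero support in one class and support of at least a chosen threshold in the other. The function names, the `min_support` and `max_len` parameters, and the toy discretized expression values (`g1:high` and so on) are all hypothetical, and the brute-force enumeration is only practical for very small inputs.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def strong_jumping_eps(target, background, min_support=0.5, max_len=3):
    """Brute-force search for minimal itemsets that never occur in
    `background` and occur in at least `min_support` of `target`.
    (Assumed SJEP definition; hypothetical helper for illustration.)"""
    items = sorted(set().union(*target))
    found = []
    # Enumerate by increasing size so that minimality can be enforced
    # by skipping any superset of a pattern that was already found.
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            if any(p <= candidate for p in found):
                continue
            if (support(candidate, background) == 0.0
                    and support(candidate, target) >= min_support):
                found.append(candidate)
    return found


# Toy data: each sample is a set of discretized gene expression levels.
tumor = [
    {"g1:high", "g2:low", "g3:high"},
    {"g1:high", "g2:low", "g3:low"},
    {"g1:high", "g2:high", "g3:high"},
]
normal = [
    {"g1:high", "g2:high", "g3:low"},
    {"g1:low", "g2:low", "g3:low"},
    {"g1:low", "g2:high", "g3:low"},
]

patterns = strong_jumping_eps(tumor, normal, min_support=0.5)
for p in patterns:
    print(sorted(p), round(support(p, tumor), 2))
```

In this toy data, neither `g1:high` nor `g2:low` separates the two classes on its own, but their combination occurs only in the tumor samples, which is the kind of feature interaction that single-gene selection would miss.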

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
