Jung Hun Oh

Graduation Semester and Year




Document Type


Degree Name

Doctor of Philosophy in Computer Science


Computer Science and Engineering

First Advisor

Jean Gao


High-resolution MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight) mass spectrometry has recently shown promise as a screening tool for detecting discriminatory peptide/protein patterns. The major computational obstacle in finding such patterns is the large number of mass/charge peaks (features, biomarkers, data points) in a spectrum. To tackle this problem, we have developed methods for data preprocessing and biomarker selection. The preprocessing consists of binning, baseline correction, and normalization. For biomarker detection, we develop the Extended Markov Blanket (EMB) algorithm, which combines redundant feature removal with discriminant feature selection. The biomarker selection is coupled with a support vector machine (SVM) to predict sample classes from high-resolution proteomic profiles.

Because a disease progresses through several stages, there exist biomarkers corresponding to each stage. To deal with this multi-class problem, we propose a classification method and a feature selection method. The classification method consists of two schemes: error-correcting output coding (ECOC) and pairwise coupling (PWC). When predicting a test sample, the aggregated results of both schemes are considered. In the PWC scheme, the important features for each pair of classes are found using EMB feature selection.

To identify the molecular formulae of the biomarkers, we develop a de novo peptide sequencing method. De novo peptide sequencing, which determines the amino acid sequence of a peptide via tandem mass spectrometry (MS/MS), is increasingly used in proteomics for protein identification. Current de novo methods generally employ graph-theoretic approaches, which usually produce a large number of candidate sequences and incur heavy computational cost in trying to determine a sequence with less ambiguity. We present a novel de novo sequencing algorithm that greatly reduces the number of candidate sequences. By utilizing certain properties of the b- and y-ion series in an MS/MS spectrum, we propose a reliable two-way parallel searching algorithm to filter the peptide candidates, which are then further pruned by a screening criterion based on intensity evidence.

Linear discriminant analysis (LDA) is a traditional statistical scheme for feature reduction that has been widely used in diverse application areas. When the dimensionality exceeds the sample size, however, classical LDA faces a problem known as singularity. Since the dimensionality of mass spectrometry data is very high, the singularity problem inevitably arises. Another drawback of classical LDA is its linearity, which causes it to fail on nonlinear problems. Nonlinear LDA methods have been proposed to solve this problem, but they suffer from high computational cost. We propose a new fast kernel discriminant analysis (FKDA) method.
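The singularity problem described for classical LDA can be illustrated with a small sketch. It is not taken from the dissertation: the toy data, the function names `within_class_scatter` and `rank`, and the use of exact rational arithmetic are all assumptions made for illustration. The sketch builds the within-class scatter matrix for 4 hypothetical spectra with 6 features in 2 classes and checks its rank, which can be at most the number of samples minus the number of classes. When that bound is below the number of features, the matrix cannot be inverted, and classical LDA requires that inverse.

```python
from fractions import Fraction


def within_class_scatter(samples, labels):
    """Return S_w = sum over classes c of sum over x in c of (x - mean_c)(x - mean_c)^T."""
    d = len(samples[0])
    s_w = [[Fraction(0)] * d for _ in range(d)]
    for c in set(labels):
        members = [x for x, y in zip(samples, labels) if y == c]
        mean = [Fraction(sum(col), len(members)) for col in zip(*members)]
        for x in members:
            diff = [Fraction(v) - m for v, m in zip(x, mean)]
            for i in range(d):
                for j in range(d):
                    s_w[i][j] += diff[i] * diff[j]
    return s_w


def rank(matrix):
    """Matrix rank by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        # Find a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


# Hypothetical toy data: 4 spectra, 6 intensity features, 2 classes.
samples = [
    [1, 0, 2, 5, 3, 1],
    [2, 1, 0, 4, 3, 2],
    [7, 5, 6, 1, 0, 2],
    [8, 4, 7, 0, 1, 3],
]
labels = [0, 0, 1, 1]

s_w = within_class_scatter(samples, labels)
scatter_rank = rank(s_w)
n_features = len(samples[0])

# The rank is bounded by (samples - classes) = 2, below the 6 features,
# so the 6 x 6 scatter matrix is singular.
print(scatter_rank, n_features, scatter_rank < n_features)
```

Real mass spectra have far more features than this toy case, so the gap between the rank bound and the dimensionality is correspondingly larger.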


Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington