Graduation Semester and Year
Spring 2026
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
Jacob Luber
Second Advisor
Junzhou Huang
Third Advisor
Dajiang Zhu
Fourth Advisor
Kenny Zhu
Abstract
Protein sequencing is fundamental to understanding biological processes, disease mechanisms, and therapeutic development. Current protein sequencing techniques, which rely primarily on mass spectrometry and Edman degradation, face significant limitations in accurately identifying all amino acids within a protein, hindering comprehensive proteome analysis. Recent advances in click chemistry and bioorthogonal chemistry have enabled the identification of specific amino acids and their positions within a peptide; however, these techniques are still limited in the number of amino acids they can identify with high accuracy and specificity, resulting in partially known sequences. This thesis presents a novel computational approach that leverages protein language models to predict complete protein sequences from such partial experimental data.
We introduce a modified ProtBERT-based transformer model, fine-tuned for the downstream task of predicting masked residues in protein sequences. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify, using protein sequences from the UniRef database. This targeted masking strategy mimics the real-world constraints of click chemistry-enhanced Edman degradation platforms. We train species-specific models on two experimentally motivated sets of identifiable amino acids: a minimal set of four amino acids, K1 = {K, C, Y, M} (corresponding to an 88.5% masking rate), and an expanded set of nine amino acids, K2 = {K, C, Y, M, R, H, W, S, T} (corresponding to a 67.1% masking rate).
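The targeted masking described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy peptide, and the BERT-style `[MASK]` token are assumptions made for illustration. The masking rates quoted in the abstract are properties of the training data, not of any single sequence, so the toy rates below will differ from them.

```python
# Illustrative sketch only: names and the mask token are assumptions,
# not taken from the thesis.
K1 = frozenset("KCYM")        # minimal set of identifiable amino acids
K2 = frozenset("KCYMRHWST")   # expanded set of identifiable amino acids
MASK = "[MASK]"               # assumed BERT-style mask token


def mask_unidentified(sequence, known):
    """Keep residues in `known`; replace every other residue with MASK."""
    return [aa if aa in known else MASK for aa in sequence]


def masking_rate(sequence, known):
    """Fraction of residues that are hidden from the model."""
    if not sequence:
        return 0.0
    return sum(aa not in known for aa in sequence) / len(sequence)


example = "MKTAYIAKQR"  # toy peptide, chosen only for illustration
masked_k1 = mask_unidentified(example, K1)
masked_k2 = mask_unidentified(example, K2)
rate_k1 = masking_rate(example, K1)
rate_k2 = masking_rate(example, K2)
```

Joining the token list with spaces (for example `" ".join(masked_k1)`) gives a residue-per-token string, which is the input layout BERT-style protein tokenizers commonly expect; whether the thesis model uses that exact layout is not stated in the abstract.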
We evaluate our approach on three Escherichia bacterial species (E. coli, E. albertii, and E. fergusonii) and one phylogenetically distant species (Listeria monocytogenes). Our results demonstrate high prediction accuracy even when very few amino acids are known. With only the four known amino acids in K1, we achieve per-residue accuracy of up to 90.5%. With the nine known amino acids in K2, accuracy reaches up to 94.1%. Cross-species experiments reveal that prediction accuracy correlates with phylogenetic distance, demonstrating the model's capacity to capture evolutionary relationships. Additional experiments with intermediate amino acid sets provide prioritization guidance for future wet-lab development of click chemistry reagents.
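As a rough illustration of the per-residue metric, the sketch below scores a predicted sequence against the true one. It assumes that accuracy is computed only over positions that were masked (residues outside the known set); the abstract does not state the exact definition, and the function name and toy sequences are hypothetical.

```python
import math

K1 = frozenset("KCYM")  # known (unmasked) amino acids, as in the abstract


def unmasking_accuracy(true_seq, pred_seq, known):
    """Share of masked positions where the prediction matches the truth.

    Assumption: only residues outside `known` count toward the score.
    Returns NaN when the sequence has no masked positions.
    """
    if len(true_seq) != len(pred_seq):
        raise ValueError("sequences must have the same length")
    masked_pairs = [(t, p) for t, p in zip(true_seq, pred_seq) if t not in known]
    if not masked_pairs:
        return math.nan
    correct = sum(t == p for t, p in masked_pairs)
    return correct / len(masked_pairs)


true_seq = "MKTAYIAKQR"  # toy ground-truth peptide
pred_seq = "MKSAYLAKQR"  # toy prediction with two wrong masked residues
acc = unmasking_accuracy(true_seq, pred_seq, K1)
```

Scoring over all positions instead would inflate the figure, because the known residues are correct by construction; restricting the denominator to masked positions avoids that.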
We validate the biological relevance of our predictions through structural assessment, using AlphaFold2 to generate three-dimensional structures from both predicted and true sequences. Evaluation using template modeling score (TM-score) and the local distance difference test (lDDT) confirms that predicted sequences with unmasking accuracy above 75% maintain high structural fidelity. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis, with potential applications in proteomics, structural biology, and the development of clinically useful protein sequencing platforms for liquid biopsy diagnostics.
Keywords
protein language model, sequencing, bioinformatics
Disciplines
Bioinformatics
License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Recommended Citation
Pham, Thuong Le Hoai, "PEPTIDE SEQUENCING VIA PROTEIN LANGUAGE MODELS" (2026). Computer Science and Engineering Theses. 541.
https://mavmatrix.uta.edu/cse_theses/541