Graduation Semester and Year

Spring 2026

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

Jacob Luber

Second Advisor

Junzhou Huang

Third Advisor

Dajiang Zhu

Fourth Advisor

Kenny Zhu

Abstract

Protein sequencing is fundamental to understanding biological processes, disease mechanisms, and therapeutic developments. Current protein sequencing techniques, primarily relying on mass spectrometry and Edman degradation, face significant limitations in accurately identifying all amino acids within a protein, hindering comprehensive proteome analysis. Recent advances in click chemistry and bioorthogonal chemistry have enabled the identification of specific amino acids and their positions within a peptide; however, these techniques are still limited by the number of amino acids that can be identified with high accuracy and specificity, resulting in partially known sequences. This thesis presents a novel computational approach that leverages protein language models to predict complete protein sequences from such partial experimental data.

We introduce a modified ProtBERT-based transformer model, fine-tuned for the downstream task of predicting masked residues in protein sequences. Our method simulates partial sequencing data by selectively masking amino acids that are experimentally difficult to identify, using protein sequences from the UniRef database. This targeted masking strategy mimics the real-world constraints of click chemistry-enhanced Edman degradation platforms. We train species-specific models on two experimentally motivated sets of identifiable amino acids: a minimal set of four amino acids, K₁ = {K, C, Y, M} (corresponding to an 88.5% masking rate), and an expanded set of nine amino acids, K₂ = {K, C, Y, M, R, H, W, S, T} (corresponding to a 67.1% masking rate).
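The targeted masking described above can be illustrated with a minimal Python sketch. It is not the thesis's preprocessing code: the helper names `mask_sequence` and `masking_rate` and the toy sequence are hypothetical, and the space-separated residue format with a `[MASK]` token is assumed from the usual ProtBERT input convention.

```python
# Minimal sketch of targeted masking (hypothetical helpers, not the thesis code).
# Residues outside the identifiable set are replaced by a mask token.

K1 = frozenset("KCYM")        # minimal identifiable set from the abstract
K2 = frozenset("KCYMRHWST")   # expanded identifiable set from the abstract
MASK = "[MASK]"               # assumed ProtBERT-style mask token


def mask_sequence(seq: str, known: frozenset) -> str:
    """Return space-separated tokens, masking every residue not in `known`."""
    return " ".join(aa if aa in known else MASK for aa in seq)


def masking_rate(seq: str, known: frozenset) -> float:
    """Fraction of residues in `seq` that would be masked."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(aa not in known for aa in seq) / len(seq)


if __name__ == "__main__":
    toy = "MKTAYIAKQR"  # toy sequence for illustration only
    print(mask_sequence(toy, K1))
    print(masking_rate(toy, K1), masking_rate(toy, K2))
```

The 88.5% and 67.1% masking rates reported in the abstract are properties of the thesis's datasets; a short toy sequence such as the one above will give different rates.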

We evaluate our approach on three Escherichia bacterial species (E. coli, E. albertii, and E. fergusonii) and one phylogenetically distant species (Listeria monocytogenes). Our results demonstrate high prediction accuracy even with extremely limited known amino acids. With only four known amino acids (K₁), we achieve per-residue accuracy of up to 90.5%. With nine known amino acids (K₂), accuracy reaches up to 94.1%. Cross-species experiments reveal that prediction accuracy correlates with phylogenetic distance, demonstrating the model's capacity to capture evolutionary relationships. Additional experiments with intermediate amino acid sets provide prioritization guidance for future wet-lab development of click chemistry reagents.
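As a rough illustration of the per-residue accuracy metric, the sketch below scores a predicted sequence against the true one. It assumes accuracy is computed only over masked positions (residues outside the known set); the abstract does not state this, and the function name `masked_accuracy` and the toy sequences are hypothetical.

```python
# Minimal sketch of per-residue accuracy over masked positions.
# Assumption: only positions outside the known set are scored.

K1 = frozenset("KCYM")  # minimal identifiable set from the abstract


def masked_accuracy(true_seq: str, pred_seq: str, known: frozenset) -> float:
    """Fraction of masked positions where the prediction matches the truth."""
    if len(true_seq) != len(pred_seq):
        raise ValueError("sequences must have equal length")
    masked = [i for i, aa in enumerate(true_seq) if aa not in known]
    if not masked:
        raise ValueError("no masked positions to score")
    correct = sum(true_seq[i] == pred_seq[i] for i in masked)
    return correct / len(masked)


if __name__ == "__main__":
    true_seq = "MKTAYIAKQR"  # toy ground truth
    pred_seq = "MKSAYLAKQR"  # toy prediction with two wrong masked residues
    print(masked_accuracy(true_seq, pred_seq, K1))
```

Scoring over all positions instead would give a higher number, since known residues are trivially correct; which convention the thesis uses should be checked against the full text.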

We validate the biological relevance of our predictions through structural assessment, using AlphaFold2 to generate three-dimensional structures from both predicted and true sequences. Evaluation using template modeling score (TM-score) and the local distance difference test (lDDT) confirms that predicted sequences with unmasking accuracy above 75% maintain high structural fidelity. This integration of simulated experimental constraints with computational predictions offers a promising avenue for enhancing protein sequence analysis, with potential applications in proteomics, structural biology, and the development of clinically useful protein sequencing platforms for liquid biopsy diagnostics.

Keywords

protein language model, sequencing, bioinformatics

Disciplines

Bioinformatics

License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Recommended Citation

Pham, Thuong Le Hoai, "PEPTIDE SEQUENCING VIA PROTEIN LANGUAGE MODELS" (2026). Computer Science and Engineering Theses - Archive. 541.
https://mavmatrix.uta.edu/cse_theses/541

Included in

Bioinformatics Commons
