Graduation Semester and Year
Spring 2026
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
Dr Ming Li
Second Advisor
Dr Faysal Hossain Shezan
Third Advisor
Prof Jimmie Bud Davis
Abstract
Video conferencing has become pervasive in daily life, with users often typing sensitive information while their webcam is active. Even when the keyboard and hands are not visible, subtle vibration-induced pixel displacements may be present in the captured video and may exhibit patterns correlated with typing activity.
These signals are typically imperceptible to human observers, yet they may provide a basis for automated analysis.
This thesis focuses on the role of machine learning models in analyzing such vibration-induced visual signals. We employ a signal processing pipeline to extract compact vibration features from webcam video, represented as GFCC features, which serve as inputs for learning-based modeling. The feature extraction process is designed to provide consistent and structured representations suitable for model evaluation, rather than to optimize signal recovery.
Building on this representation, we conduct a systematic study of sequence learning architectures for modeling temporal dependencies in the extracted signals.
Specifically, we evaluate five models—Vanilla RNNs, GRUs, LSTMs, Seq2Seq with Attention, and Transformer Encoders—and compare their performance across multiple dimensions, including predictive accuracy, data efficiency under limited training conditions, and computational cost.
In addition, we investigate the robustness of these models under diverse real-world conditions, including variations in hardware configurations, environmental settings, and user typing behaviors. This analysis provides insight into how different architectures generalize across heterogeneous scenarios and how external factors influence learning performance.
Through comprehensive experiments, this thesis characterizes the strengths and limitations of different sequence learning approaches in this context, and provides practical guidance for selecting and designing models for learning-based analysis tasks involving sensitive data contexts.
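As a rough illustration of the computational-cost dimension mentioned above, the sketch below compares per-layer parameter counts for the three recurrent cell types using the standard closed-form formulas. It is not taken from the thesis: the function name `recurrent_params`, the one-bias-vector-per-gate convention, and the sizes (13 coefficients per frame, 128 hidden units) are assumptions chosen for illustration only.

```python
# Per-layer parameter counts for common recurrent cells.
# Assumption: one bias vector per gate (some frameworks use two).
GATES = {"rnn": 1, "gru": 3, "lstm": 4}


def recurrent_params(cell: str, input_dim: int, hidden_dim: int) -> int:
    """Return the trainable parameter count of one recurrent layer."""
    # Each gate has an input-to-hidden matrix, a hidden-to-hidden matrix,
    # and a bias vector.
    per_gate = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return GATES[cell] * per_gate


if __name__ == "__main__":
    # Hypothetical sizes, not values reported in the thesis.
    for cell in GATES:
        print(cell, recurrent_params(cell, input_dim=13, hidden_dim=128))
```

Under these assumptions, a GRU layer carries three times and an LSTM layer four times the parameters of a vanilla RNN layer of the same width, which is one reason the recurrent variants can differ in training and inference cost.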
Keywords
Optical Vibration Side Channel, Keystroke Inference, Sequence-to-sequence Learning, Attention Mechanism, Transformer Encoder, Recurrent Neural Network, LSTM, GRU, Rolling Shutter Temporal Sampling, Video Conference Security
Disciplines
Artificial Intelligence and Robotics | Cybersecurity | Graphics and Human Computer Interfaces | Information Security | Signal Processing
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Badgujar, Sanket Suresh, "ML-Based Keystroke Recovery via Optical-Vibration Side Channels in Video Conferencing" (2026). Computer Science and Engineering Theses. 4.
https://mavmatrix.uta.edu/cse_theses2/4
Included in
Artificial Intelligence and Robotics Commons, Cybersecurity Commons, Graphics and Human Computer Interfaces Commons, Information Security Commons, Signal Processing Commons