Graduation Semester and Year
2021
Language
English
Document Type
Thesis
Degree Name
Master of Science in Computer Science
Department
Computer Science and Engineering
First Advisor
Vassilis Athitsos
Abstract
Interaction between human beings drives improvements in science and technology. However, that interaction is limited for people who are deaf or hard-of-hearing, as they can only communicate with others who also know their sign language. With the help of recent technologies such as Deep Learning, the gap can be bridged by converting sentence-based Sign Language videos into English speech. The methods discussed in this thesis take a step toward solving that problem. There are four steps involved in converting ASL (American Sign Language) videos to English speech. Step 1 is to recognize the phrases performed by the user in the videos. Step 2 is to convert the SVO (Subject-Verb-Object) phrases in ASL gloss to English text. Step 3 is to convert the English text (graphemes) to English phonemes. Step 4 is to convert the English phonemes to a speech spectrogram. We developed the Video-to-Gloss module by constructing a sentence-based ASL dataset on top of the word-level WLASL (Word-level American Sign Language) dataset, which was used to generate random phrases from the videos. We used a 2D (2-Dimensional) human pose-based approach to extract keypoint information from the videos, and the extracted information was fed into a Seq2Seq (Sequence-to-Sequence) architecture to convert the signs in the videos into words (ASL gloss). We developed the Gloss-to-Grapheme module using the ASLG-PC12 dataset, where Attention-based Seq2Seq and Transformer architectures were used to train the models. We developed the Grapheme-to-Phoneme module using the CMUDict dataset, where the models were trained similarly to the Gloss-to-Grapheme module, i.e., using Attention-based Seq2Seq architectures. We developed the Phoneme-to-Spectrogram module using the LJSpeech dataset, where a Transformer architecture was used to train the model.
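Since three of the four modules above belong to the same attention-based Seq2Seq (encoder-decoder) family, a minimal sketch of that architecture is given below. This is an illustrative PyTorch example, not the thesis's actual implementation; the class names, layer sizes, and the simplified concatenation-based attention scoring are all assumptions introduced for clarity.

# Minimal sketch of an attention-based Seq2Seq encoder-decoder (illustrative only).
# Inputs are assumed to be integer token ids; all dimensions are placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) token ids -> encoder states for every source step
        embedded = self.embedding(src)
        outputs, hidden = self.rnn(embedded)   # outputs: (batch, src_len, hid_dim)
        return outputs, hidden

class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Concatenation-based scoring, simplified from Bahdanau-style additive attention.
        self.attn = nn.Linear(hid_dim * 2, 1)
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, hidden, encoder_outputs):
        # prev_token: (batch, 1); hidden: (1, batch, hid_dim)
        embedded = self.embedding(prev_token)                        # (batch, 1, emb_dim)
        # Score every encoder step against the current decoder state.
        dec_state = hidden[-1].unsqueeze(1).expand_as(encoder_outputs)
        scores = self.attn(torch.cat([encoder_outputs, dec_state], dim=2))
        weights = F.softmax(scores, dim=1)                           # attention over src_len
        context = (weights * encoder_outputs).sum(dim=1, keepdim=True)
        # Feed the previous token embedding plus the attention context to the decoder RNN.
        output, hidden = self.rnn(torch.cat([embedded, context], dim=2), hidden)
        logits = self.out(output.squeeze(1))                         # (batch, vocab_size)
        return logits, hidden

At inference time the decoder is run one step at a time, feeding back its own prediction until an end-of-sequence token is produced; during training, teacher forcing (feeding the ground-truth previous token) is the usual choice for this architecture family.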
Keywords
Sign language recognition, English speech synthesis, ASL translation, Seq2Seq model, Attention mechanism
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 4.0 International License.
Recommended Citation
Ganesh, Preetham, "CONTINUOUS AMERICAN SIGN LANGUAGE TRANSLATION WITH ENGLISH SPEECH SYNTHESIS USING ENCODER-DECODER APPROACH" (2021). Computer Science and Engineering Theses. 473.
https://mavmatrix.uta.edu/cse_theses/473
Comments
Degree granted by The University of Texas at Arlington