ORCID Identifier(s)

0000-0002-5655-3777

Graduation Semester and Year

2021

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

Vassilis Athitsos

Abstract

Interaction between human beings drives improvements in science and technology. However, this interaction is limited for people who are deaf or hard of hearing, as they can communicate only with others who also know their sign language. With the help of recent technologies such as deep learning, the gap can be bridged by converting sentence-based sign language videos into English speech. The methods discussed in this thesis take a step toward solving that problem. Converting ASL (American Sign Language) videos to English speech involves four steps. Step 1 is to recognize the phrases performed by the user in the videos. Step 2 is to convert the SVO (Subject-Verb-Object) phrases in ASL gloss to English text. Step 3 is to convert the English text (graphemes) to English phonemes. Step 4 is to convert the English phonemes to an English speech spectrogram.

We developed the Video-to-Gloss module by constructing a sentence-based ASL dataset from the word-level WLASL (Word-Level American Sign Language) dataset, which was used to generate random phrases from the videos. We used a 2D (two-dimensional) human-pose-based approach to extract keypoint information from the videos, and the extracted keypoints were fed into a Seq2Seq (sequence-to-sequence) architecture to convert the signs in the videos into words (ASL gloss). We developed the Gloss-to-Grapheme module using the ASLG-L12 dataset, where attention-based Seq2Seq and Transformer architectures were used to train the models. We developed the Grapheme-to-Phoneme module using the CMUDict dataset, where the models were trained similarly to the Gloss-to-Grapheme module, i.e., using attention-based Seq2Seq architectures. We developed the Phoneme-to-Spectrogram module using the LJSpeech dataset, where the Transformer architecture was used to train the model.
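
The sketch below illustrates the kind of attention-based Seq2Seq encoder-decoder described in the abstract (e.g., for the Gloss-to-Grapheme step). It is a minimal example in PyTorch, not the thesis implementation: the module names, vocabulary sizes, hidden dimensions, and the dot-product attention variant are all illustrative assumptions.

```python
# Minimal sketch of an attention-based Seq2Seq model (assumed PyTorch);
# all sizes and names are illustrative, not the thesis configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Encodes a source sequence (e.g., ASL gloss tokens) into hidden states."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        embedded = self.embedding(src)                 # (batch, src_len, emb_dim)
        outputs, hidden = self.gru(embedded)           # (batch, src_len, hid_dim)
        return outputs, hidden


class AttentionDecoder(nn.Module):
    """Decodes target tokens (e.g., English words) one step at a time,
    attending over all encoder states with dot-product attention."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim * 2, vocab_size)

    def forward(self, token, hidden, encoder_outputs):
        embedded = self.embedding(token)               # (batch, 1, emb_dim)
        output, hidden = self.gru(embedded, hidden)    # (batch, 1, hid_dim)
        # Attention scores over encoder time steps.
        scores = torch.bmm(output, encoder_outputs.transpose(1, 2))
        weights = F.softmax(scores, dim=-1)            # (batch, 1, src_len)
        context = torch.bmm(weights, encoder_outputs)  # (batch, 1, hid_dim)
        logits = self.out(torch.cat([output, context], dim=-1))
        return logits, hidden


if __name__ == "__main__":
    # Toy forward pass with made-up vocabulary sizes and greedy decoding.
    src_vocab, tgt_vocab = 1000, 1200
    encoder = Encoder(src_vocab)
    decoder = AttentionDecoder(tgt_vocab)
    src = torch.randint(0, src_vocab, (2, 7))          # 2 gloss sequences, length 7
    enc_out, hidden = encoder(src)
    token = torch.zeros(2, 1, dtype=torch.long)        # assumed <sos> id = 0
    for _ in range(5):
        logits, hidden = decoder(token, hidden, enc_out)
        token = logits.argmax(dim=-1)                  # feed prediction back in
```

The same encoder-decoder-with-attention pattern applies, with different tokenizations, to the gloss-to-grapheme and grapheme-to-phoneme steps; the video-to-gloss and phoneme-to-spectrogram modules would instead consume pose-keypoint frames or emit spectrogram frames rather than discrete tokens.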

Keywords

Sign language recognition, English speech synthesis, ASL translation, Seq2Seq model, Attention mechanism

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
