Computer Science and Engineering Theses

USE OF WORD EMBEDDING TO GENERATE SIMILAR WORDS AND MISSPELLINGS FOR TRAINING PURPOSE IN CHATBOT DEVELOPMENT

Sanjay Thapa

ORCID Identifier(s)

0000-0002-2031-1186

Graduation Semester and Year

2019

Language

English

Document Type

Thesis

Degree Name

Master of Science in Computer Science

Department

Computer Science and Engineering

First Advisor

Deokgun Park

Abstract

Advances in Natural Language Processing and Machine Learning have played a significant role in the rapid improvement of conversational Artificial Intelligence (AI). The everyday use of text-based conversational AI such as chatbots to communicate with real people across a variety of tasks has increased significantly. Chatbots are deployed on almost all popular messaging platforms and channels. The rise of chatbot development frameworks based on machine learning makes it possible to deploy chatbots easily and promptly. These frameworks use machine learning and natural language understanding (NLU) to interpret users' messages and intents and to respond appropriately to users' utterances. Since most chatbots are developed for domain-specific purposes, the performance of a chatbot is directly related to its training data. To broaden a chatbot's domain knowledge and knowledge base through training data, the chatbot needs to know words or phrases similar to those in a user's message. Furthermore, there is no guarantee that a user will spell a word correctly; in written conversation, users often misspell at least some words. Thus, to include semantically similar words and misspellings in the training data, I have used word embedding to generate misspellings and similar words. These generated similar words and misspellings will be used as training data to train the model for chatbot development.
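The general idea in the abstract, looking up a word's nearest neighbors in an embedding space and using them as extra training phrases, can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy vectors, the word list, and the helper names `cosine`, `nearest_words`, and `augment` are all assumptions invented for illustration, and a real system would load embeddings trained on a large corpus rather than hand-written numbers.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. In practice the
# vectors would come from an embedding model trained on real text, where
# misspellings and synonyms tend to land close to the intended word.
TOY_VECTORS = {
    "hello": [0.90, 0.10, 0.00],
    "helo": [0.88, 0.12, 0.02],
    "hi": [0.80, 0.20, 0.10],
    "refund": [0.00, 0.90, 0.30],
    "refnd": [0.02, 0.88, 0.31],
    "pizza": [0.10, 0.00, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word, vectors, top_n=2):
    """Return the top_n vocabulary words most similar to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]


def augment(example, target_word, vectors, top_n=2):
    """Create training variants by swapping target_word for its neighbors."""
    return [
        example.replace(target_word, alternative)
        for alternative in nearest_words(target_word, vectors, top_n)
    ]


if __name__ == "__main__":
    print(nearest_words("hello", TOY_VECTORS))
    print(augment("hello there", "hello", TOY_VECTORS))
```

With these toy vectors, the neighbors of "hello" are the misspelling "helo" and the similar word "hi", so one seed utterance yields two additional training phrases. How well this works in practice depends entirely on the quality of the embeddings used.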

Keywords

Chatbots, Conversational artificial intelligence, Machine learning, Rasa, Misspellings, Word embedding, Similar words

Disciplines

Computer Sciences | Physical Sciences and Mathematics

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Comments

Degree granted by The University of Texas at Arlington

Recommended Citation

Thapa, Sanjay, "USE OF WORD EMBEDDING TO GENERATE SIMILAR WORDS AND MISSPELLINGS FOR TRAINING PURPOSE IN CHATBOT DEVELOPMENT" (2019). Computer Science and Engineering Theses. 168.
https://mavmatrix.uta.edu/cse_theses/168
