Graduation Semester and Year

2022

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Ramez Elmasri

Abstract

Database code fragments appear in software systems that use SQL, the standard language for relational databases. Traditionally, developers bind databases as backends to software systems to support user applications. However, these bindings are low-level code implemented to persist user data, so Object Relational Mapping (ORM) frameworks were introduced to abstract database access details. Both approaches are prone to problematic database code fragments that negatively impact the quality of software systems. In the first part of the dissertation, we survey problematic database code fragments in the literature and examine antipatterns that occur in low-level database access code using SQL and in their high-level counterparts in ORM frameworks. We also study problematic database code fragments in popular software architectures such as Service Oriented Architecture (SOA), Microservice Architecture (MA), and Model View Controller (MVC). We create a novel categorization of both SQL schema and query antipatterns in terms of performance, maintainability, portability, and data integrity. In the second part of the dissertation, we create NLP patterns that support data architects when modeling and naming data element definitions. We design and develop rule-based natural language processing (NLP) techniques to automatically extract standardized data element names from data element definitions written in American English. The goal is to study how NLP techniques can improve the accuracy of extracting standardized data element names in a domain-independent context. Devising NLP patterns for natural language definitions is challenging because, unlike code, natural language is ambiguous. To achieve automated data element naming, we first identify heuristic patterns that mine noun phrases and relationships from data element definitions. Then, we use these noun phrases and relationships as input to determine the components of data element names.
The output of the patterns is reviewed by a domain expert. We apply our method to extract the five standard components of a data element name in the Railway and Transportation domains. We first achieved 80% accuracy; by improving the rules and adding a similarity function based on knowledge graphs, we raised the accuracy to 95% in our final experiments. We also introduce our tool, the Data Element Naming Automation (DENA) tool. The tool consists of four components: DENA NLP, DENA assembly, preprocessing, and duplicate checker. In the last part of the dissertation, we describe how we preprocess data element definitions and evaluate duplicate detection.
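
To make the preprocessing and duplicate-checking idea concrete, the following is a minimal sketch, not the DENA implementation. It assumes that preprocessing means lowercasing, stripping punctuation, and dropping stopwords, and it substitutes a plain Jaccard token overlap for the knowledge-graph similarity function mentioned in the abstract. The stopword list, the function names, the 0.8 threshold, and the sample definitions are all hypothetical and chosen only for illustration.

```python
import re
from itertools import combinations

# Hypothetical stopword list; an assumption made for this illustration only.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "or", "is", "that", "which"}


def preprocess(definition):
    """Lowercase the definition, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", definition.lower())
    return frozenset(t for t in tokens if t not in STOPWORDS)


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(definitions, threshold=0.8):
    """Return pairs of definition ids whose token overlap meets the threshold."""
    tokens = {name: preprocess(text) for name, text in definitions.items()}
    return [
        (x, y)
        for x, y in combinations(tokens, 2)
        if jaccard(tokens[x], tokens[y]) >= threshold
    ]


# Made-up sample definitions, not taken from the dissertation's data.
definitions = {
    "d1": "The date on which the train departs from the station.",
    "d2": "Date on which a train departs from a station",
    "d3": "The maximum speed of the locomotive in miles per hour.",
}
print(find_duplicates(definitions))
```

In this sketch, `d1` and `d2` reduce to the same token set after preprocessing and are flagged as a candidate duplicate pair, while `d3` shares no tokens with either. A token-overlap measure misses synonyms, which is one reason a knowledge-based similarity function could be preferable in practice.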

Keywords

Data management, Artificial intelligence

Disciplines

Computer Sciences | Physical Sciences and Mathematics

Comments

Degree granted by The University of Texas at Arlington
