Graduation Semester and Year
2023
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Christoph Csallner
Abstract
In several safety-critical industries, such as automotive, aerospace, healthcare, and industrial automation, MATLAB/Simulink has emerged as the de facto standard tool for system modeling and analysis, model compilation into executable code, and code deployment onto embedded hardware. Within the context of cyber-physical system (CPS) development, it is imperative both to rigorously test development tools, such as MathWorks’ Simulink, and to understand modeling practices and model evolution. The existing body of work faces limitations stemming primarily from two factors: (1) contemporary testing methodologies are often inefficient at identifying critical toolchain bugs because explicit toolchain specifications are lacking, and (2) reusable, publicly available corpora of Simulink models for research are scarce. In response to these challenges, we first pioneered the use of language models for random Simulink model generation by training and fine-tuning (large) language models such as LSTM and GPT-2 on sample Simulink models. Second, we curated SLNET, the largest collection of Simulink models, which is redistributable and contains detailed metadata. In addition, to encourage research on Simulink model evolution, we curated EvoSL, a dataset of 900+ Simulink projects with over 140k commits. Leveraging these datasets, we systematically replicated previous studies, corroborating or refuting prior findings. As a further aid to the research community, we developed ScoutSL, an open-source search engine for Simulink models. This tool simplifies sampling Simulink projects from open-source domains, addressing the limitations of popular code hosting platforms, which lack Simulink-specific filtering attributes. ScoutSL has already indexed over 100k Simulink models sourced from 18k projects.
Keywords
Cyber-physical system development, Simulink, Tool chain bugs, Deep learning, Programming language modeling, GPT-2, Mining software repositories, Open-source, Model evolution
Disciplines
Computer Sciences | Physical Sciences and Mathematics
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Recommended Citation
Shrestha, Sohil Lal, "Constructing Large Open-Source Corpora and Leveraging Language Models for Simulink Toolchain Testing and Analysis" (2023). Computer Science and Engineering Dissertations. 394.
https://mavmatrix.uta.edu/cse_dissertations/394
Comments
Degree granted by The University of Texas at Arlington