Xiaofeng Wu

Graduation Semester and Year




Document Type


Degree Name

Doctor of Philosophy in Computer Science


Department

Computer Science and Engineering

First Advisor

Hong Jiang


Abstract

This thesis addresses the challenges of utilization, efficiency, and scalability faced by deep learning systems, which are essential for high-performance training and serving of deep learning models. Deep learning systems play a critical role in developing accurate and complex models for various applications, including image recognition, natural language understanding, and speech recognition. This research focuses on understanding and developing deep learning systems that encompass data preprocessing, resource management, multi-tenancy, and distributed model training. The thesis proposes several solutions to improve the performance, scalability, and efficiency of deep learning applications. First, we introduce SwitchFlow, a scheduling framework that addresses the limitations of popular deep learning frameworks in supporting GPU sharing and multi-tasking. Second, we propose Atom, a distributed training framework for large language models that utilizes decentralized training to reduce communication costs and increase scalability. We discuss the challenges of decentralized training and present the design and implementation of Atom. Third, we introduce PerFect, a method that pre-trains the model on repeated data to improve data processing efficiency and then fine-tunes it to achieve the desired accuracy. Together, these approaches significantly improve the performance, scalability, and efficiency of deep learning applications. Specifically, SwitchFlow reduces interference and eliminates out-of-memory errors by scheduling subgraphs instead of whole computation graphs. It also allows subgraphs running on different devices to overlap with each other, leading to a more efficient execution pipeline. Atom achieves high training throughput and fault tolerance in a decentralized environment, enabling the training of massive-scale models on affordable hardware such as consumer-class GPUs and Ethernet. Finally, PerFect improves the throughput of the data preprocessing stage and achieves the desired accuracy when reusing cached data, without the need for additional hardware or third-party libraries. The proposed frameworks and solutions are evaluated using representative deep learning models, and the results demonstrate their effectiveness and scalability. Overall, this thesis contributes to the development of deep learning systems and provides practical solutions to the challenges of utilization, efficiency, and scalability, making deep learning applications more accessible and efficient for a wider range of users.


Keywords

Optimization, Resource utilization, Efficiency, Scalability, Deep learning systems


Disciplines

Computer Sciences | Physical Sciences and Mathematics


Degree granted by The University of Texas at Arlington