Graduation Semester and Year
Spring 2026
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Leonidas Fegaras
Second Advisor
Chengkai Li
Third Advisor
Upendranath Chakravarthy
Fourth Advisor
David Levine
Abstract
Current machine learning systems, such as TensorFlow and PyTorch, rely on high-performance linear algebra libraries for efficient tensor computations. Although these libraries provide numerous fine-tuned array algorithms based on well-studied data placement and communication patterns, they are hard to customize to support irregular array programs and unconventional array storage formats. This dissertation introduces TensorPlanner, a framework for constructing distributed task workflows from arbitrary tensor programs by partially evaluating these programs against the block coordinates of the tensors. In addition, it presents a novel task scheduler based on pattern-matching that assigns processes to tasks by recognizing certain patterns inside the task workflow. Although each such pattern applies to a small fixed number of tasks, when applied collectively, these patterns generate communication schemes that resemble optimal block-based algorithms, such as SUMMA. The scheduling of the task workflow is done bottom-up, guided by cost. We have implemented a high-performance, fault-tolerant execution engine using MPI and OpenMP that eagerly deletes blocks without compromising recovery, and have evaluated the performance of TensorPlanner relative to Ray, TensorFlow, and PyTorch. Building upon this foundation, this dissertation proposes STAG (Scalable Tensor Acceleration on GPUs), a general framework for accelerating distributed tensor programs on GPUs. The widespread availability of GPUs has made parallelizing scientific and machine learning computations popular in both industry and academia. However, distributing computation across multiple nodes, each equipped with multiple GPUs, is non-trivial. The benefits of using GPUs can be nullified when intermediate data needs to be frequently copied between host and GPU memory, which slows down the whole computation.
Moreover, optimizing linear algebra operations on GPUs requires significant expertise in GPU programming. STAG extends the TensorPlanner scheduler with GPU-aware, cost-based scheduling. Its code generator produces GPU code by annotating loops with OpenACC pragmas and applies loop reordering and tiling for better GPU memory access patterns. STAG's novel evaluation engine keeps tensor blocks in GPU memory and minimizes data copying between host and GPU memory to improve the performance of distributed computation. In performance evaluations, STAG demonstrates substantial gains over CPU-only systems and performs competitively with leading GPU-enabled frameworks, all while providing a more extensible programming model for large-scale tensor computations. Finally, this dissertation presents STAG-MLIR, which automatically transforms high-level loop-based tensor programs into efficient distributed GPU kernels through a novel GPU kernel generation pipeline powered by the Multi-Level Intermediate Representation (MLIR) ecosystem, without requiring manual kernel engineering. The compute-intensive portions of each task in a task workflow are translated into high-level MLIR code and progressively lowered through our custom optimization pipeline, which incorporates loop tiling, shared-memory data copies, and kernel fusion to generate efficient GPU kernels that can be retargeted to any GPU vendor. Experimental evaluations demonstrate that STAG-MLIR achieves a 2–6× performance improvement over state-of-the-art frameworks, such as Ray and Dask, for dense and sparse linear algebra workloads, a 3–4× improvement over PyTorch, and a 4–6× improvement over Ray Train for machine learning workloads, while remaining fully generalizable to arbitrary user-defined tensor computations and storage formats, validating the effectiveness of a compiler-driven, asynchronous approach for modern distributed workloads.
Keywords
Big data, Distributed systems, Code generation, Tensor programs, Machine learning, Linear algebra, GPU programming, MPI, OpenACC, MLIR
Disciplines
Computer Sciences | Databases and Information Systems
License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Recommended Citation
Khan, Tanvir Ahmed, "A PLANNER FOR ACCELERATION OF LARGE-SCALE LOOP-BASED ARRAY PROGRAMS ON HETEROGENEOUS DISTRIBUTED SYSTEMS" (2026). Computer Science and Engineering Dissertations-Archive. 432.
https://mavmatrix.uta.edu/cse_dissertations/432