Graduation Semester and Year
Spring 2026
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Leonidas Fegaras
Second Advisor
Chengkai Li
Third Advisor
Upendranath Chakravarthy
Fourth Advisor
David Levine
Abstract
Current machine learning systems, such as TensorFlow and PyTorch, rely on high-performance linear algebra libraries for efficient tensor computations. Although these libraries provide numerous fine-tuned array algorithms based on well-studied data placement and communication patterns, they are hard to customize to support irregular array programs and unconventional array storage formats. This dissertation introduces TensorPlanner, a framework for constructing distributed task workflows from arbitrary tensor programs by partially evaluating these programs against the block coordinates of the tensors. In addition, it presents a novel task scheduler based on pattern-matching that assigns processes to tasks by recognizing certain patterns inside the task workflow. Although each such pattern applies to a small fixed number of tasks, when applied collectively, these patterns generate communication schemes that resemble optimal block-based algorithms, such as SUMMA. The scheduling of the task workflow is done bottom-up, guided by cost. We have implemented a high-performance, fault-tolerant execution engine using MPI and OpenMP that eagerly deletes blocks without compromising recovery, and have evaluated the performance of TensorPlanner relative to Ray, TensorFlow, and PyTorch. Building upon this foundation, this dissertation proposes STAG (Scalable Tensor Acceleration on GPUs), a general framework for accelerating distributed tensor programs on GPUs. The widespread availability of GPUs has made parallelizing scientific and machine learning computations popular in both industry and academia. However, distributing computation across multiple nodes, each equipped with multiple GPUs, is non-trivial. The benefits of using GPUs can be nullified when intermediate data needs to be frequently copied between host and GPU memory, which slows down the whole computation.
Moreover, optimizing linear algebra operations on GPUs requires significant expertise in GPU programming. STAG extends the TensorPlanner scheduler with GPU-aware, cost-based scheduling. Its code generator produces GPU code by annotating loops with OpenACC pragmas and applies loop reordering and tiling for better GPU memory access patterns. STAG's novel evaluation engine keeps tensor blocks in GPU memory and minimizes data copying between host and GPU memory to improve the performance of distributed computation. In performance evaluations, STAG demonstrates substantial gains over CPU-only systems and performs competitively with leading GPU-enabled frameworks, all while providing a more extensible programming model for large-scale tensor computations. Finally, this dissertation presents STAG-MLIR, which automatically transforms high-level loop-based tensor programs into efficient distributed GPU kernels through a novel GPU kernel generation pipeline powered by the Multi-Level Intermediate Representation (MLIR) ecosystem, without requiring manual kernel engineering. The compute-intensive portions of each task in a task workflow are translated into high-level MLIR code and progressively lowered through our custom optimization pipeline, which incorporates loop tiling, shared-memory data copies, and kernel fusion to generate efficient GPU kernels that can be retargeted to any GPU vendor. Experimental evaluations demonstrate that STAG-MLIR achieves a 2–6× performance improvement over state-of-the-art frameworks, such as Ray and Dask, for dense and sparse linear algebra workloads, a 3–4× improvement over PyTorch, and a 4–6× improvement over Ray Train for machine learning workloads, while remaining fully generalizable to arbitrary user-defined tensor computations and storage formats, validating the effectiveness of a compiler-driven, asynchronous approach for modern distributed workloads.
Keywords
Big data, Distributed systems, Code generation, Tensor programs, Machine learning, Linear algebra, GPU programming, MPI, OpenACC, MLIR
Disciplines
Computer Sciences | Databases and Information Systems
License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Recommended Citation
Khan, Tanvir Ahmed, "A PLANNER FOR ACCELERATION OF LARGE-SCALE LOOP-BASED ARRAY PROGRAMS ON HETEROGENEOUS DISTRIBUTED SYSTEMS" (2026). Computer Science and Engineering Dissertations-Archive. 432.
https://mavmatrix.uta.edu/cse_dissertations/432