ORCID Identifier(s)

0009-0000-6550-7484

Graduation Semester and Year

Spring 2026

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computer Science

Department

Computer Science and Engineering

First Advisor

Jia Rao

Second Advisor

Song Jiang

Third Advisor

Dajiang Zhu

Fourth Advisor

Junzhou Huang

Abstract

The rapid scaling of artificial intelligence workloads has shifted the dominant performance bottleneck of modern computing systems from compute to memory. Graph neural networks (GNNs) issue increasingly irregular memory accesses, while large language models (LLMs) issue increasingly large ones; in both cases, the relative scaling of memory bandwidth and capacity continues to lag behind the scaling of compute. Consequently, naively executing these workloads on commodity GPUs results in stalled streaming multiprocessors, exhausted high-bandwidth memory (HBM), and serving stacks that incur PCIe transfers on the critical path. This dissertation argues that the efficient scaling of attention-based AI workloads requires the joint optimization of algorithms, data layout, and system architecture, and develops three concrete systems that span the spectrum from intra-kernel data layout to cluster-scale memory disaggregation.

First, we present MEGA, a more efficient graph attention mechanism that accelerates GNN training on GPUs. MEGA reorganizes the input graph into a path-based, diagonal-oriented adjacency representation through a Weisfeiler–Lehman-isomorphism-preserving traversal, replacing the irregular gather and scatter accesses of conventional graph attention with coalesced, diagonal accesses that map readily onto GPU tensor pipelines. It complements this layout with an adaptive diagonal attention kernel that dynamically adjusts the attention window to local graph density. MEGA achieves up to 3× training speedup while preserving model accuracy.

Second, we present BEYOND, a hybrid CPU–GPU attention mechanism that extends the operational scope of attention beyond GPU memory. BEYOND combines a locality-aware key–value (KV) cache manager with a head-granular sparse attention kernel executed on the CPU, while dense attention over recent KV blocks is executed on the GPU; the two partial outputs are subsequently fused through a lossless log-sum-exp scheme. On commodity GPU hardware, BEYOND scales LLM inference to long contexts and delivers up to 5.6× throughput improvement over FlexGen with no measurable degradation in accuracy, thereby demonstrating that the practical scope of attention need not be bounded by GPU memory alone.
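The lossless log-sum-exp fusion described above admits a compact numerical sketch. The snippet below is illustrative only, not BEYOND's actual API: function names, shapes, and the NumPy setting are our assumptions. Each partition (e.g., the CPU-side sparse heads and the GPU-side dense blocks) computes a softmax-weighted partial output together with the log-sum-exp of its unnormalized scores; the merge step then recovers exactly the attention output over the union of the two key sets.

```python
import numpy as np

def partial_attention(q, K, V):
    """Attention over one key/value partition.

    Returns the partial output and the log-sum-exp of the
    unnormalized scores, computed in a numerically stable way.
    """
    scores = K @ q / np.sqrt(q.shape[0])      # (n,) scaled dot products
    m = scores.max()
    w = np.exp(scores - m)                    # shifted for stability
    lse = m + np.log(w.sum())                 # log of the softmax denominator
    out = (w / w.sum()) @ V                   # softmax-weighted value average
    return out, lse

def merge(o1, lse1, o2, lse2):
    """Losslessly fuse two partial attention outputs computed over
    disjoint key sets, reweighting each by its share of the total
    softmax mass."""
    lse = np.logaddexp(lse1, lse2)
    return np.exp(lse1 - lse) * o1 + np.exp(lse2 - lse) * o2

# Demo: fusing attention over two halves of the keys reproduces
# attention over all keys.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
o1, l1 = partial_attention(q, K[:10], V[:10])
o2, l2 = partial_attention(q, K[10:], V[10:])
full, _ = partial_attention(q, K, V)
assert np.allclose(merge(o1, l1, o2, l2), full)
```

The key design point is that carrying the scalar log-sum-exp alongside each partial output is all that is needed to renormalize the two softmax-weighted averages into a single exact result, so the CPU and GPU partitions never need to exchange raw scores.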

Third, we present LMC-CXL, a CXL-enabled multi-host KV cache sharing layer that extends the modern LLM serving stack to operate over a CXL shared-memory pool. LMC-CXL introduces an exclusive-ownership coherence protocol with shadow-page registration, a runtime-pluggable ownership-negotiation mechanism for load balancing, and a conservative, quorum-based failure-recovery procedure, thereby supporting both peer-to-peer prefix reuse and prefill–decode disaggregation. In comparison with a tuned RDMA-based baseline employing NIXL and UCX over an InfiniBand HDR fabric, LMC-CXL removes the network interface card from the data path entirely and consistently reduces per-request transfer cost, time-to-first-token, and tail latency under production-representative multi-turn, retrieval-augmented generation (RAG), and prefix-heavy workloads.

Taken together, these three contributions trace a progression from intra-kernel data layout (MEGA), through the single-node memory hierarchy (BEYOND), to cluster-scale memory disaggregation (LMC-CXL). They demonstrate that memory, rather than compute, constitutes the primary scaling bottleneck of modern AI systems, and that this bottleneck can be addressed through the joint design of algorithms, data layout, and system architecture.

Keywords

LLM inference, KV cache, hybrid attention, graph attention, GNN, CXL, memory disaggregation, PCIe, GPU, distributed systems

Disciplines

Computer and Systems Architecture | Computer Engineering

License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.
