Graduation Semester and Year
Spring 2026
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Jia Rao
Second Advisor
Song Jiang
Third Advisor
Dajiang Zhu
Fourth Advisor
Junzhou Huang
Abstract
The rapid scaling of artificial intelligence workloads has shifted the dominant performance bottleneck of modern computing systems from compute to memory. Graph neural networks (GNNs) issue increasingly irregular memory accesses, while large language models (LLMs) demand ever-larger memory footprints; in both cases, memory bandwidth and capacity scale more slowly than compute. Consequently, naively executing these workloads on commodity GPUs results in stalled streaming multiprocessors, exhausted high-bandwidth memory (HBM), and serving stacks that incur PCIe transfers on the critical path. This dissertation argues that the efficient scaling of attention-based AI workloads requires the joint optimization of algorithms, data layout, and system architecture, and develops three concrete systems that span the spectrum from intra-kernel data layout to cluster-scale memory disaggregation.
First, we present MEGA, a more efficient graph attention mechanism that accelerates GNN execution on GPUs. MEGA reorganizes the input graph into a path-based, diagonal-oriented adjacency representation through a Weisfeiler–Lehman-isomorphism-preserving traversal, replacing the irregular gather and scatter accesses of conventional graph attention with coalesced, diagonal accesses that map readily onto GPU tensor pipelines. An adaptive diagonal attention kernel complements this layout by dynamically adjusting the attention window to local graph density. MEGA achieves up to a 3× training speedup while preserving model accuracy.
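To illustrate the layout idea only (this is not MEGA's GPU kernel), the sketch below assumes the nodes have already been reordered so that each node's neighbors fall within a diagonal band of the adjacency matrix; attention can then be computed over dense, contiguous slices rather than through irregular gathers. The names order and bandwidth are stand-ins for the traversal-derived permutation and the adaptive window.

    import numpy as np

    def banded_graph_attention(h, order, bandwidth):
        """Illustrative banded attention over reordered nodes (a sketch, not MEGA's kernel)."""
        hp = h[order]                                  # permute features so neighbors sit near the diagonal
        n, d = hp.shape
        out = np.zeros_like(hp)
        for i in range(n):
            lo, hi = max(0, i - bandwidth), min(n, i + bandwidth + 1)
            scores = hp[lo:hi] @ hp[i] / np.sqrt(d)    # dense, contiguous slice: coalesced reads
            w = np.exp(scores - scores.max())          # numerically stable softmax weights
            out[i] = (w / w.sum()) @ hp[lo:hi]         # weighted aggregation within the band
        return out[np.argsort(order)]                  # undo the permutation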
Second, we present BEYOND, a hybrid CPU–GPU attention mechanism that extends the operational scope of attention beyond GPU memory. BEYOND combines a locality-aware key–value (KV) cache manager with a head-granular sparse attention kernel executed on the CPU, while dense attention over recent KV blocks is executed on the GPU; the two partial outputs are subsequently fused through a lossless log-sum-exp scheme. On commodity GPU hardware, BEYOND scales LLM inference to long contexts and delivers up to 5.6× throughput improvement over FlexGen with no measurable degradation in accuracy, thereby demonstrating that the practical scope of attention need not be bounded by GPU memory alone.
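The lossless fusion step can be made concrete: if the CPU and GPU each return a locally softmax-normalized partial attention output together with the log-sum-exp of the raw scores over their own KV subset, the two partials can be merged exactly. The following is an illustrative NumPy rendering of that identity, not BEYOND's implementation.

    import numpy as np

    def lse_fuse(o_cpu, lse_cpu, o_gpu, lse_gpu):
        """Exactly merge two softmax-normalized partial attention outputs.

        o_*   : (d,) partial outputs, each normalized over its own KV subset
        lse_* : scalar log-sum-exp of the raw attention scores over that subset
        """
        m = max(lse_cpu, lse_gpu)            # shared shift for numerical stability
        w_cpu = np.exp(lse_cpu - m)          # relative softmax mass of the CPU subset
        w_gpu = np.exp(lse_gpu - m)          # relative softmax mass of the GPU subset
        return (w_cpu * o_cpu + w_gpu * o_gpu) / (w_cpu + w_gpu)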
Third, we present LMC-CXL, a CXL-enabled multi-host KV cache sharing layer that extends the modern LLM serving stack to operate over a CXL shared-memory pool. LMC-CXL introduces an exclusive-ownership coherence protocol with shadow-page registration, a runtime-pluggable ownership-negotiation mechanism for load balancing, and a conservative, quorum-based failure-recovery procedure, thereby supporting both peer-to-peer prefix reuse and prefill–decode disaggregation. In comparison with a tuned RDMA-based baseline employing NIXL and UCX over an InfiniBand HDR fabric, LMC-CXL removes the network interface card from the data path entirely and consistently reduces per-request transfer cost, time-to-first-token, and tail latency under production-representative multi-turn, retrieval-augmented generation (RAG), and prefix-heavy workloads.
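As a purely hypothetical sketch of the negotiation idea (it omits shadow-page registration and CXL hardware coherence), the toy ownership table below captures the invariants the abstract describes: a KV block is writable only by its single exclusive owner, released blocks become shareable for prefix reuse, and blocks held by a failed host are reclaimed only with a majority quorum.

    class OwnershipTable:
        """Toy exclusive-ownership table for shared KV blocks (illustrative only)."""

        def __init__(self, hosts):
            self.hosts = set(hosts)
            self.owner = {}                      # block_id -> host holding exclusive ownership

        def acquire(self, block_id, host):
            # A block may be written only by its single exclusive owner.
            if self.owner.get(block_id) in (None, host):
                self.owner[block_id] = host
                return True
            return False                         # negotiation failed; caller retries or reads a shared copy

        def release(self, block_id, host):
            if self.owner.get(block_id) == host:
                del self.owner[block_id]         # block becomes shareable for read-only prefix reuse

        def recover(self, failed_host, votes):
            # Conservative recovery: reclaim a failed host's blocks only with a majority quorum.
            if len(votes & self.hosts) > len(self.hosts) // 2:
                for block_id, host in list(self.owner.items()):
                    if host == failed_host:
                        del self.owner[block_id]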
Taken together, these three contributions trace a progression from intra-kernel data layout (MEGA), to single-node memory hierarchy (BEYOND), to cluster-scale memory disaggregation (LMC-CXL). They demonstrate that memory, rather than compute, constitutes the primary scaling bottleneck of modern AI systems, and that this bottleneck can be addressed through the joint design of algorithms, data layout, and system architecture.
Keywords
LLM inference, KV cache, hybrid attention, graph attention, GNN, CXL, memory disaggregation, PCIe, GPU, distributed systems
Disciplines
Computer and Systems Architecture | Computer Engineering
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Deng, Weishu, "Scaling LLM Inference: From Novel Attention Mechanisms to Efficient KV Cache Management" (2026). Computer Science and Engineering Dissertations. 6.
https://mavmatrix.uta.edu/cse_dissertations2/6