Graduation Semester and Year
Spring 2026
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Jia Rao
Second Advisor
Song Jiang
Third Advisor
Dajiang Zhu
Fourth Advisor
Junzhou Huang
Abstract
The rapid scaling of artificial intelligence workloads has shifted the dominant performance bottleneck of modern computing systems from compute to memory. Graph neural networks (GNNs) issue increasingly irregular memory accesses, while large language models (LLMs) demand ever-larger memory footprints; in both cases, memory bandwidth and capacity scale more slowly than compute. Consequently, naively executing these workloads on commodity GPUs results in stalled streaming multiprocessors, exhausted high-bandwidth memory (HBM), and serving stacks that incur PCIe transfers on the critical path. This dissertation argues that the efficient scaling of attention-based AI workloads requires the joint optimization of algorithms, data layout, and system architecture, and develops three concrete systems that span the spectrum from intra-kernel data layout to cluster-scale memory disaggregation.
First, we present MEGA, a more efficient graph attention mechanism that accelerates GNN execution on GPUs. MEGA reorganizes the input graph into a path-based, diagonal-oriented adjacency representation through a Weisfeiler–Lehman-isomorphism-preserving traversal, replacing the irregular gather and scatter accesses of conventional graph attention with coalesced, diagonal accesses that map readily onto GPU tensor pipelines. An adaptive diagonal attention kernel complements this layout by dynamically adjusting the attention window to local graph density. MEGA achieves up to a 3× training speedup while preserving model accuracy.
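To illustrate the layout idea only (this is not MEGA's GPU kernel), the sketch below assumes the nodes have already been reordered so that each node's neighbors fall within a diagonal band of the adjacency matrix; attention can then be computed over dense, contiguous slices rather than through irregular gathers. The names order and bandwidth are stand-ins for the traversal-derived permutation and the adaptive window.

    import numpy as np

    def banded_graph_attention(h, order, bandwidth):
        """Illustrative banded attention over reordered nodes (a sketch, not MEGA's kernel)."""
        hp = h[order]                                  # permute features so neighbors sit near the diagonal
        n, d = hp.shape
        out = np.zeros_like(hp)
        for i in range(n):
            lo, hi = max(0, i - bandwidth), min(n, i + bandwidth + 1)
            scores = hp[lo:hi] @ hp[i] / np.sqrt(d)    # dense, contiguous slice: coalesced reads
            w = np.exp(scores - scores.max())          # numerically stable softmax weights
            out[i] = (w / w.sum()) @ hp[lo:hi]         # weighted aggregation within the band
        return out[np.argsort(order)]                  # undo the permutation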
Second, we present BEYOND, a hybrid CPU–GPU attention mechanism that extends the operational scope of attention beyond GPU memory. BEYOND combines a locality-aware key–value (KV) cache manager with a head-granular sparse attention kernel executed on the CPU, while dense attention over recent KV blocks is executed on the GPU; the two partial outputs are subsequently fused through a lossless log-sum-exp scheme. On commodity GPU hardware, BEYOND scales LLM inference to long contexts and delivers up to 5.6× throughput improvement over FlexGen with no measurable degradation in accuracy, thereby demonstrating that the practical scope of attention need not be bounded by GPU memory alone.
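The lossless fusion step can be made concrete: if the CPU and GPU each return a locally softmax-normalized partial attention output together with the log-sum-exp of the raw scores over their own KV subset, the two partials can be merged exactly. The following is an illustrative NumPy rendering of that identity, not BEYOND's implementation.

    import numpy as np

    def lse_fuse(o_cpu, lse_cpu, o_gpu, lse_gpu):
        """Exactly merge two softmax-normalized partial attention outputs.

        o_*   : (d,) partial outputs, each normalized over its own KV subset
        lse_* : scalar log-sum-exp of the raw attention scores over that subset
        """
        m = max(lse_cpu, lse_gpu)            # shared shift for numerical stability
        w_cpu = np.exp(lse_cpu - m)          # relative softmax mass of the CPU subset
        w_gpu = np.exp(lse_gpu - m)          # relative softmax mass of the GPU subset
        return (w_cpu * o_cpu + w_gpu * o_gpu) / (w_cpu + w_gpu)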
Third, we present LMC-CXL, a CXL-enabled multi-host KV cache sharing layer that extends the modern LLM serving stack to operate over a CXL shared-memory pool. LMC-CXL introduces an exclusive-ownership coherence protocol with shadow-page registration, a runtime-pluggable ownership-negotiation mechanism for load balancing, and a conservative, quorum-based failure-recovery procedure, thereby supporting both peer-to-peer prefix reuse and prefill–decode disaggregation. In comparison with a tuned RDMA-based baseline employing NIXL and UCX over an InfiniBand HDR fabric, LMC-CXL removes the network interface card from the data path entirely and consistently reduces per-request transfer cost, time-to-first-token, and tail latency under production-representative multi-turn, retrieval-augmented generation (RAG), and prefix-heavy workloads.
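As a purely hypothetical sketch of the negotiation idea (it omits shadow-page registration and CXL hardware coherence), the toy ownership table below captures the invariants the abstract describes: a KV block is writable only by its single exclusive owner, released blocks become shareable for prefix reuse, and blocks held by a failed host are reclaimed only with a majority quorum.

    class OwnershipTable:
        """Toy exclusive-ownership table for shared KV blocks (illustrative only)."""

        def __init__(self, hosts):
            self.hosts = set(hosts)
            self.owner = {}                      # block_id -> host holding exclusive ownership

        def acquire(self, block_id, host):
            # A block may be written only by its single exclusive owner.
            if self.owner.get(block_id) in (None, host):
                self.owner[block_id] = host
                return True
            return False                         # negotiation failed; caller retries or reads a shared copy

        def release(self, block_id, host):
            if self.owner.get(block_id) == host:
                del self.owner[block_id]         # block becomes shareable for read-only prefix reuse

        def recover(self, failed_host, votes):
            # Conservative recovery: reclaim a failed host's blocks only with a majority quorum.
            if len(votes & self.hosts) > len(self.hosts) // 2:
                for block_id, host in list(self.owner.items()):
                    if host == failed_host:
                        del self.owner[block_id]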
Taken together, these three contributions trace a progression from intra-kernel data layout (MEGA), to single-node memory hierarchy (BEYOND), to cluster-scale memory disaggregation (LMC-CXL). They demonstrate that memory, rather than compute, constitutes the primary scaling bottleneck of modern AI systems, and that this bottleneck can be addressed through the joint design of algorithms, data layout, and system architecture.
Keywords
LLM inference, KV cache, hybrid attention, graph attention, GNN, CXL, memory disaggregation, PCIe, GPU, distributed systems
Disciplines
Computer and Systems Architecture | Computer Engineering
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Deng, Weishu, "Scaling LLM Inference: From Novel Attention Mechanisms to Efficient KV Cache Management" (2026). Computer Science and Engineering Dissertations. 6.
https://mavmatrix.uta.edu/cse_dissertations2/6