Author: Denis Avetisyan
A new quantization technique significantly reduces the memory demands of key-value caches, enabling more efficient deployment of large language models.

System-aware 4-bit quantization with block-diagonal Hadamard transforms delivers minimal performance loss for LLM serving.
Efficient large language model (LLM) serving is increasingly bottlenecked by KV-cache memory, yet existing compression techniques often clash with practical system constraints. This work, ‘SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving’, identifies that a simple token-wise INT4 quantization combined with block-diagonal Hadamard rotation consistently delivers near-lossless accuracy under realistic serving conditions. Our system-aware approach achieves this without sacrificing throughput, demonstrating that lightweight transformations can be remarkably effective when integrated directly into paged memory layouts. Could co-designing compression algorithms with system-level considerations unlock further substantial gains in LLM serving efficiency and scalability?
The Inevitable Constraint: Key-Value Cache Bottlenecks in Large Language Models
Large Language Models (LLMs) continue to impress with their capacity for complex tasks, from generating creative text formats to translating languages and answering questions informatively. However, this increasing sophistication comes at a cost, specifically in memory usage during inference. A critical component, the Key-Value Cache, stores the key and value projections of previously processed tokens so the attention mechanism can recall context without recomputing it. As models grow in scale, boasting billions of parameters, and handle increasingly lengthy sequences, the memory footprint of this cache expands dramatically. This presents a significant challenge, as the Key-Value Cache can quickly become a bottleneck, limiting throughput and hindering the model’s ability to process information in a timely manner, even with substantial computational resources available.
The escalating demands placed on Large Language Models (LLMs), driven by both increasing model size and the need to process longer sequences of text, are increasingly bottlenecked by the Key-Value Cache. This cache, essential for efficient attention, stores the keys and values associated with prior tokens in a sequence, enabling the model to recall relevant information without recomputation. Its memory footprint grows linearly with sequence length and batch size: each new token adds a fixed number of key and value entries per layer, so doubling the sequence length doubles the cache size. (It is the attention computation, not the cache, that scales quadratically with sequence length.) At long contexts and high concurrency, this growth still comes to dominate accelerator memory, and accessing and managing the ever-growing cache becomes a significant throughput limitation, slowing inference and hindering the model’s ability to respond in a timely manner. This degradation is not simply a matter of slower processing; it directly impacts the usability of LLMs in real-time applications and limits their potential for handling complex, extended dialogues or documents.
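To make this concrete, the cache footprint can be estimated directly from a model configuration. The sketch below uses assumed, Llama-7B-like numbers; the layer count, KV-head count, and head dimension are illustrative, not taken from the paper:

```python
# Back-of-envelope KV-cache size for an assumed Llama-7B-like configuration.
# Per token, each layer stores one key vector and one value vector per KV head.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # 2x accounts for storing both keys and values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# BF16 (2 bytes/element) vs INT4 (0.5 bytes/element), 32k-token context, batch of 8
bf16 = kv_cache_bytes(32, 32, 128, 32_768, 8, 2)
int4 = kv_cache_bytes(32, 32, 128, 32_768, 8, 0.5)
print(f"BF16: {bf16 / 2**30:.1f} GiB, INT4: {int4 / 2**30:.1f} GiB")
```

Doubling `seq_len` doubles the result (linear growth), and dropping from 2-byte BF16 to 4-bit storage divides it by four, which is exactly the compression ratio targeted by INT4 KV-cache quantization.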
Existing strategies for managing the Key-Value Cache bottleneck in Large Language Models often present difficult trade-offs. Methods like aggressive pruning or quantization, intended to reduce memory usage, frequently lead to a demonstrable decline in model accuracy and in the nuanced understanding of complex prompts. Conversely, techniques that preserve full precision, such as sophisticated caching algorithms or distributing the cache across multiple devices, introduce significant computational overhead, demanding more processing power and communication bandwidth. This overhead can negate the performance gains, especially when scaling to longer sequences or larger models, effectively shifting the bottleneck rather than resolving it. The challenge lies in achieving substantial memory reduction without compromising the model’s ability to generate coherent and contextually relevant outputs, a balance that remains elusive with current approaches.

Mitigating Entropy: Compression Techniques for the Key-Value Cache
Key-Value (KV) cache compression techniques address the memory demands of large language models by reducing the precision and dimensionality of stored key and value tensors. Scalar Quantization lowers the precision of individual weights, typically from float16 or bfloat16 to int8 or even lower, decreasing memory usage with a potential trade-off in accuracy. Vector Quantization groups multiple weights into vectors and represents them with a smaller number of centroids, effectively reducing the number of unique values stored. Low-Rank Decomposition, such as Singular Value Decomposition (SVD), approximates the key and value matrices with lower-rank representations, reducing the overall number of parameters required to represent the same information; this is based on the principle that many matrices have redundant or near-redundant dimensions.
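As a minimal illustration of the scalar variant, the sketch below performs symmetric per-group rounding to 4-bit integers with one shared scale per group; the group size and scale handling here are simplifying assumptions, not the paper's exact scheme:

```python
# Minimal symmetric per-group scalar quantization to INT4 (illustrative sketch).
# Each group shares one scale; values map to integers in [-8, 7], and
# dequantization multiplies back by the scale.

def quantize_int4(values):
    scale = max(abs(v) for v in values) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_int4(q, scale):
    return [x * scale for x in q]

vals = [0.12, -0.05, 0.33, -0.41, 0.07, 0.28, -0.19, 0.02]
q, s = quantize_int4(vals)
recon = dequantize_int4(q, s)
err = max(abs(a - b) for a, b in zip(vals, recon))
print(q, f"max abs error = {err:.3f}")
```

Note that a single large outlier in a group stretches the scale and coarsens the representation of every other value in that group, which is precisely the failure mode the rotation technique below is designed to mitigate.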
PagedAttention and Continuous Batching are system-level optimizations designed to improve Key-Value (KV) cache utilization. PagedAttention divides the KV cache into fixed-size pages, enabling non-contiguous memory allocation and reducing fragmentation, which is especially beneficial for variable-length sequences. Continuous Batching schedules work at iteration granularity, admitting new requests into the running batch as earlier ones complete rather than waiting for an entire batch to finish; this keeps compute units saturated and avoids holding KV-cache memory for idle slots. Together, the two techniques address inefficiencies arising from the attention mechanism’s scattered memory access patterns and contribute to lower latency and higher scalability.
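The paging idea can be sketched with a toy page table; the class, page size, and method names below are purely illustrative and not vLLM's actual implementation:

```python
# Toy sketch of PagedAttention-style bookkeeping (illustrative only):
# each sequence owns a page table mapping logical block index -> physical page,
# so the KV entries of one sequence need not be contiguous in memory.

PAGE_SIZE = 16  # tokens per page (assumed)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_tables = {}  # seq_id -> list of physical page ids

    def append_token(self, seq_id, token_pos):
        table = self.page_tables.setdefault(seq_id, [])
        if token_pos % PAGE_SIZE == 0:       # sequence crossed into a new page
            table.append(self.free_pages.pop())
        page = table[token_pos // PAGE_SIZE]
        return page, token_pos % PAGE_SIZE   # physical slot for this token's K/V

cache = PagedKVCache(num_pages=8)
slots = [cache.append_token("req-0", t) for t in range(40)]
print(len(cache.page_tables["req-0"]), "pages for 40 tokens")
```

Because allocation happens one page at a time, a sequence never reserves more memory than it has actually filled (plus at most one partial page), which is what makes high-concurrency serving with variable-length requests feasible.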
Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MHLA) are architectural modifications designed to decrease the memory requirements of the Key-Value (KV) cache during inference. Traditional multi-head attention replicates the key and value projections for each attention head, leading to a substantial KV cache size. GQA reduces this by sharing the key and value projections across a group of heads, decreasing the per-token KV footprint. MHLA further optimizes this by projecting queries into a lower-dimensional latent space before performing attention, effectively reducing the dimensionality of the key and value projections and thus the KV cache size. Both techniques maintain or improve throughput by reducing memory bandwidth requirements and enabling larger batch sizes, despite potentially introducing a slight increase in computational cost.
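A quick back-of-envelope comparison shows where GQA's savings come from; the head counts below are assumed for illustration:

```python
# Illustrative arithmetic: GQA shrinks the KV cache by sharing one K/V head
# among a group of query heads. The cache stores K/V per *KV head*, so the
# reduction factor is n_query_heads / n_kv_heads (configs below are assumed).

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x for K and V

mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2)
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8,  head_dim=128, bytes_per_elem=2)
print(f"MHA: {mha} B/token, GQA: {gqa} B/token ({mha // gqa}x smaller)")
```

Note that this architectural saving multiplies with quantization: a 4x reduction from GQA and a 4x reduction from INT4 storage compound to a 16x smaller cache than full-precision multi-head attention.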

Restoring Equilibrium: Advanced Quantization with Block-Diagonal Rotation
INT4 quantization reduces memory and bandwidth requirements by representing each stored value with a 4-bit integer, offering substantial compression. However, this reduced precision increases sensitivity to outliers in the cached key and value tensors. These outliers, unusually large or small entries, can significantly degrade model quality at inference time. The limited dynamic range of a 4-bit representation exacerbates the issue: the quantization step must stretch to cover the extremes, so common values are represented coarsely, increasing quantization error and reducing overall accuracy. Consequently, while INT4 offers significant compression gains, its susceptibility to outliers necessitates techniques that mitigate their impact and preserve model fidelity.
Block-Diagonal Rotation is a pre-quantization processing step designed to reduce the sensitivity of INT4 quantization to outliers. The technique applies a Hadamard transform independently to fixed-size blocks of each key or value vector, rotating the data into a new coordinate system. Because the transform is orthogonal, it preserves each block’s overall magnitude, leaving the model’s fundamental behavior consistent, but it spreads the energy of any outlier across all dimensions of its block, lowering the maximum absolute value the quantizer must cover. The result is a more uniform value distribution prior to quantization and higher accuracy than quantizing the raw tensors directly.
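A small numeric sketch shows the effect: rotating a block containing one outlier by an orthonormal Hadamard transform spreads the outlier's energy across the block while preserving the block's norm. The block size and values below are illustrative:

```python
# Sketch: a block-diagonal Hadamard rotation spreads an outlier's energy across
# its block before INT4 quantization (block size 4 assumed for illustration).

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def rotate_block(block):
    n = len(block)
    H = hadamard(n)
    scale = n ** -0.5  # 1/sqrt(n) keeps the transform orthonormal
    return [scale * sum(H[i][j] * block[j] for j in range(n)) for i in range(n)]

block = [0.1, -0.2, 8.0, 0.05]   # one large outlier
rotated = rotate_block(block)
print(max(abs(v) for v in block), "->", round(max(abs(v) for v in rotated), 3))
```

Because the rotation is orthogonal, applying its inverse after dequantization recovers the original coordinate system, so the transform itself loses no information; it only reshapes the distribution the quantizer sees.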
Combining Block-Diagonal Rotation with outlier removal techniques demonstrably improves the robustness of INT4 quantization. This approach addresses performance degradation typically observed when reducing precision from BF16. Specifically, the application of Block-Diagonal Rotation prior to outlier handling minimizes the impact of extreme values, resulting in a near-complete recovery of accuracy lost during quantization. Benchmarks indicate that this combined method achieves performance levels closely approximating those of the original BF16 representation, significantly reducing the trade-off between model size and accuracy.

The Manifest Impact: Empirical Results and Performance Gains
Rigorous experimentation reveals that strategically applied compression techniques dramatically reduce the memory demands of large language models. Specifically, tests conducted on models including Qwen3-4B, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 consistently demonstrate a fourfold decrease in KV Cache memory footprint. This substantial reduction is achieved without significant performance degradation, allowing for the deployment of these powerful models in resource-constrained environments and enabling the processing of longer sequences. The observed compression efficiently manages the key and value caches, which are critical for maintaining context during text generation, thereby making advanced language capabilities more accessible and scalable.
Significant gains in system throughput and responsiveness are realized through these optimizations, directly translating to an improved user experience. Experiments demonstrate an impressive increase of up to +41.4% in system throughput when applied to the Qwen3-8B model under conditions of long context and high concurrency, compared to traditional BF16 processing. Furthermore, the reduced computational load contributes to a faster time-to-first-token, meaning users receive initial responses more quickly. This combination of increased speed and efficiency allows for more fluid and interactive conversations with large language models, even under demanding workloads, effectively lowering latency and boosting overall usability.
The research demonstrates a pathway toward democratizing large language model (LLM) technology through substantial reductions in computational expense. By maintaining comparable performance levels to existing models, these innovative methods significantly lower the barriers to entry for both researchers and developers. This efficiency is achieved without sacrificing accuracy or responsiveness, enabling broader access to powerful AI capabilities on more modest hardware. The resulting cost savings not only facilitate wider adoption but also pave the way for the deployment of LLMs in resource-constrained environments, extending their potential impact across diverse applications and user bases.

Toward Sustainable Scale: Future Directions in Memory Efficiency
The pursuit of increasingly large language models (LLMs) is fundamentally constrained by memory limitations, yet ongoing research into advanced quantization techniques offers a promising path forward. These methods, such as Hessian-Aware Quantization, strategically reduce the precision with which model weights and activations are stored – effectively compressing the model’s size without substantial performance degradation. By leveraging the Hessian matrix – which describes the curvature of the loss function – these techniques identify and preserve the most critical parameters, minimizing the impact of reduced precision. This allows for significant memory savings, enabling the deployment of larger, more capable LLMs on existing hardware and broadening accessibility to advanced AI capabilities. Further refinement of these quantization strategies is poised to unlock even greater compression ratios, paving the way for the next generation of scalable and efficient language models.
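The intuition can be sketched numerically: approximate the loss increase from quantizing a parameter by the diagonal Hessian entry times the squared rounding error, then spend precision where that product is largest. Everything below (the toy weights, the Hessian estimates, the uniform quantizer) is assumed for illustration:

```python
# Toy sketch of Hessian-aware sensitivity (illustrative): approximate the loss
# increase from quantizing weight w_i as 0.5 * H_ii * (w_i - q(w_i))**2 and
# rank weights by that estimate.

weights = [0.8, -0.3, 1.2, 0.05]
hess_diag = [0.1, 5.0, 0.2, 9.0]   # assumed diagonal Hessian estimates

def quant_error_sq(w, step=0.25):
    q = round(w / step) * step     # coarse uniform quantizer
    return (w - q) ** 2

sensitivity = [0.5 * h * quant_error_sq(w) for w, h in zip(weights, hess_diag)]
# indices of weights that most deserve higher precision, most sensitive first
order = sorted(range(len(weights)), key=lambda i: -sensitivity[i])
print("most sensitive first:", order)
```

Note how the ranking follows the Hessian rather than the weight magnitude: the smallest weight here is the most sensitive because the loss surface is steepest in its direction.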
Scaling large language models to handle increasingly complex tasks demands a fundamental shift in architectural design, specifically addressing the substantial memory demands of the Key-Value Cache. This cache, essential for attention mechanisms, grows in direct proportion to context length, and at today’s sequence lengths it quickly becomes a bottleneck. Researchers are actively investigating designs that move beyond simply storing all key-value pairs; instead, they are exploring adaptive or sparse caching, where only the most relevant information is retained. Innovative approaches include hierarchical attention mechanisms that reduce cache size by processing information in stages, and techniques that dynamically prune less important key-value pairs without sacrificing performance. Successfully minimizing the Key-Value Cache footprint isn’t merely about optimization; it’s about enabling a future where LLMs can process significantly longer sequences and handle far more intricate relationships within data, unlocking previously unattainable capabilities.
The synergistic combination of memory-efficient techniques and distributed training strategies represents a pivotal step towards realizing the full capabilities of large language models. While innovations like quantization reduce the individual model’s memory footprint, distributed training allows researchers to parallelize computation across numerous devices, effectively scaling model size beyond the limitations of a single machine. This coordinated approach not only facilitates the training of significantly larger and more complex models, but also accelerates the training process itself. Consequently, LLMs can be applied to increasingly intricate real-world challenges – from drug discovery and materials science to advanced climate modeling and personalized education – problems previously intractable due to computational constraints. The convergence of these advancements promises to unlock a new era of AI-driven solutions, transforming fields reliant on complex data analysis and predictive modeling.

The pursuit of efficient large language model serving, as demonstrated in this work concerning KV cache quantization, inherently acknowledges the transient nature of technological solutions. Every compression technique and every optimization, the proposed block-diagonal Hadamard rotation included, is a temporary reprieve against the relentless march of increasing model sizes and computational demands. As Robert Tarjan observed, “Every abstraction carries the weight of the past,” and this is acutely felt in system design. The presented method isn’t a final answer, but a considered adaptation, aiming to preserve resilience through mindful compression while acknowledging that future innovations will inevitably reshape the landscape of LLM deployment. The goal isn’t immortality, but graceful aging.
The Long View
The demonstrated efficacy of block-diagonal Hadamard rotation preceding INT4 quantization offers more than merely a compression technique; it provides a lens through which to examine the inherent fragility of state within large language models. Every reduction in precision is, fundamentally, an acceptance of entropy. The question isn’t whether information is lost, but how gracefully the system degrades under controlled approximation. This work highlights that system-awareness – understanding the interplay between algorithmic compression and hardware realities – is paramount. The pursuit of ever-smaller models, without acknowledging the inevitable accumulation of these ‘graceful degradations,’ is an exercise in building structures without foundations.
Future investigations should not focus solely on achieving higher compression ratios. Instead, attention must turn to characterizing the nature of the information preserved, and the specific failure modes introduced by these approximations. A deeper exploration of the block-diagonal structure itself, its limitations and its potential for adaptation to other quantization schemes, is critical. The field currently treats quantization as a purely numerical problem; it is, more accurately, a problem of temporal resilience.
Ultimately, the true measure of success will not be in benchmarks achieved today, but in the longevity of these systems. Architecture without history-without a clear understanding of how past approximations shape present performance-is fragile and ephemeral. The long view demands a shift from optimizing for immediate gain, to engineering for sustained, predictable decay.
Original article: https://arxiv.org/pdf/2604.19157.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-22 16:29