Author: Denis Avetisyan
Researchers have developed a novel compression technique that dramatically reduces the memory footprint of key-value caches, accelerating AI model performance.

FibQuant leverages spherical-beta distribution modeling and vector quantization to achieve high compression rates for random-access KV-caches with minimal attention fidelity loss.
Long-context inference in large language models is rapidly becoming bottlenecked by KV-cache memory traffic, despite advances in scalar quantization techniques. This paper introduces FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression, a novel approach that models the underlying data distribution as a spherical-beta and leverages vector quantization to achieve substantial compression. By moving beyond scalar quantization, FibQuant demonstrably improves compression rates while maintaining attention fidelity, achieving up to 34× compression on GPT-2 with minimal loss in accuracy. Could this technique unlock new levels of efficiency for long-context inference and enable the deployment of even larger language models?
The Inevitable Compression: Addressing Memory Bottlenecks in Language Models
The escalating memory demands of large language models present a significant barrier to their wider adoption and continued development. As these models grow in parameter count and sequence length – driven by the pursuit of enhanced performance and contextual understanding – their memory footprint expands dramatically. This isn’t merely a matter of needing larger servers; the sheer volume of data required to store and process information during inference, particularly the activation states within the neural network, quickly strains available resources. Consequently, deploying these powerful models becomes increasingly expensive and challenging, limiting access for researchers and developers with constrained budgets or hardware. Ultimately, the memory bottleneck restricts the scalability of LLMs, hindering the creation of even more sophisticated and capable artificial intelligence systems and impacting their integration into real-world applications.
The computational engine driving many large language models relies heavily on the key-value (KV) cache, a mechanism that stores past computations to accelerate processing of sequential data. However, this cache exhibits a critical limitation: its size grows linearly with the input sequence length. As models tackle increasingly complex tasks demanding longer contexts, the KV cache rapidly consumes substantial memory and bandwidth. This expansion creates a significant performance bottleneck, slowing down inference and limiting the practical scalability of these models. Effectively, the ability to process longer sequences is constrained not by the model’s inherent capacity, but by the physical limitations of memory and the speed at which data can be accessed – a challenge that necessitates innovative memory management and compression strategies.
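A rough calculation makes the scaling concrete. The sketch below uses illustrative GPT-2-scale numbers rather than figures from the paper, and shows how the per-sequence KV-cache footprint grows linearly with context length.

```python
# Back-of-envelope KV-cache size for a hypothetical decoder-only model.
# All figures below are illustrative assumptions, not taken from the paper.
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_value=2):
    # Two tensors (keys and values) per layer, each of shape
    # [n_heads, seq_len, head_dim], stored e.g. in fp16 (2 bytes per value).
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value

# A GPT-2-sized configuration (12 layers, 12 heads, 64-dim heads):
for seq_len in (1_024, 8_192, 65_536):
    gib = kv_cache_bytes(12, 12, 64, seq_len) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.3f} GiB per sequence")
```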
Conventional compression methods, while aiming to reduce the memory footprint of large language models, frequently encounter a trade-off between precision and speed. Attempts to aggressively compress the key-value cache – essential for the attention mechanism – can lead to a noticeable degradation in the quality of generated text, diminishing the model's overall performance. Conversely, less aggressive compression strategies, prioritizing accuracy, often fail to sufficiently reduce memory demands, resulting in unacceptable latency for real-time applications like conversational AI or live translation. This presents a significant challenge as developers strive to deploy increasingly complex models on resource-constrained hardware, demanding innovative solutions that can balance compression ratios, computational efficiency, and, crucially, maintain the integrity of the language model's output.

FibQuant: Sculpting Data with Spherical Precision
FibQuant employs spherical vector quantization (SVQ) to achieve compression of Key-Value (KV) cache entries by representing them with a reduced set of spherical codes. This technique is predicated on the observation that KV cache vectors often exhibit a distribution concentrated around the origin, allowing for effective dimensionality reduction without significant information loss. Specifically, SVQ maps high-dimensional vectors to lower-dimensional representations by projecting them onto a sphere and then quantizing the resulting spherical coordinates. The selection of codebook vectors is optimized to minimize the reconstruction error, effectively compressing the KV cache while preserving the critical information needed for inference. This approach contrasts with traditional quantization methods by exploiting the angular distribution of the data, leading to improved compression ratios and reduced memory footprint.
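The following sketch illustrates the general shape of spherical vector quantization: each vector is split into a unit-norm direction, matched to the nearest entry in a codebook of spherical codes, and a scalar magnitude. The random codebook, dimensions, and bit budget here are illustrative assumptions, not FibQuant's actual construction.

```python
import numpy as np

# Minimal sketch of spherical vector quantization (SVQ): each vector is
# encoded as a direction (quantized against a codebook of unit vectors)
# plus a scalar magnitude. The random codebook and shapes are illustrative
# assumptions, not the construction used in the paper.
rng = np.random.default_rng(0)

def make_codebook(n_codes, dim):
    c = rng.standard_normal((n_codes, dim))
    return c / np.linalg.norm(c, axis=1, keepdims=True)  # unit-norm codes

def svq_encode(x, codebook):
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    directions = x / np.maximum(norms, 1e-12)
    # Nearest code on the sphere = largest dot product with the direction.
    idx = np.argmax(directions @ codebook.T, axis=1)
    return idx, norms.squeeze(1)

def svq_decode(idx, norms, codebook):
    return codebook[idx] * norms[:, None]

codebook = make_codebook(n_codes=256, dim=64)      # 8 bits per direction
x = rng.standard_normal((1000, 64))                # stand-in KV vectors
idx, norms = svq_encode(x, codebook)
x_hat = svq_decode(idx, norms, codebook)
print("mean cosine similarity:",
      np.mean(np.sum(x * x_hat, axis=1) /
              (np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1))))
```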
FibQuant utilizes a Spherical-Beta distribution to characterize the statistical properties of the KV cache vectors, enabling dimensionality reduction. This distribution accurately models the angular distribution of the vectors, allowing for the projection of high-dimensional vectors onto a lower-dimensional subspace while minimizing information loss. Specifically, the parameters of the Spherical-Beta law – α and β – are determined through analysis of the KV cache data, and these values define the concentration and shape of the distribution. By representing vectors based on their probability density within this distribution, FibQuant effectively compresses the cache entries, reducing memory footprint and computational cost associated with accessing and processing the KV cache during inference.
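As a rough illustration of the fitting step, the snippet below estimates α and β by maximum likelihood from a scalar statistic of stand-in cache data (here, rescaled vector norms). How FibQuant actually defines the spherical-Beta quantity is specific to the paper; this is only a sketch of parameter estimation.

```python
import numpy as np
from scipy import stats

# Illustrative parameter estimation for a Beta law. The quantity modeled by
# FibQuant's spherical-Beta distribution is paper-specific; here we simply
# take per-vector norms of stand-in KV data, rescale them into (0, 1), and
# fit alpha and beta by maximum likelihood.
rng = np.random.default_rng(0)
kv = rng.standard_normal((10_000, 64))
r = np.linalg.norm(kv, axis=1)
u = (r - r.min()) / (r.max() - r.min() + 1e-12)   # rescale to [0, 1]
u = np.clip(u, 1e-6, 1 - 1e-6)                    # keep strictly inside (0, 1)

alpha, beta, loc, scale = stats.beta.fit(u, floc=0, fscale=1)
print(f"fitted alpha = {alpha:.3f}, beta = {beta:.3f}")
```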
Fixed-rate encoding within FibQuant assigns a uniform number of bits to represent each quantized vector, irrespective of its magnitude or distribution. This approach guarantees predictable memory access patterns during inference, as the system consistently reads a fixed number of bytes for each cache entry. Variable-length encoding, while potentially offering higher compression ratios, introduces computational overhead due to the need for decoding variable-length codes, which can significantly increase latency. By prioritizing consistent access times, fixed-rate encoding minimizes these delays, crucial for real-time applications and maintaining high throughput during model inference. The elimination of variable decoding steps simplifies the memory access pipeline and contributes to a more deterministic and efficient KV cache retrieval process.
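The practical payoff is that locating any cache entry reduces to a single multiplication, as the toy example below shows; the 5-byte entry size is an arbitrary illustrative choice.

```python
# Fixed-rate encoding makes random access trivial: every cache entry occupies
# the same number of bytes, so the location of entry i is pure arithmetic.
# The 5-byte entry size below is an arbitrary illustrative choice.
BYTES_PER_ENTRY = 5          # e.g. a code index plus a quantized magnitude

def entry_offset(i: int) -> int:
    # With variable-length codes this would require scanning or an index
    # table; with a fixed rate it is a single multiply.
    return i * BYTES_PER_ENTRY

print(entry_offset(0), entry_offset(1), entry_offset(1000))  # 0 5 5000
```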

Optimizing the Quantization Landscape: Precision in Distribution
Quasi-uniform spherical point sets offer an efficient method for selecting quantization vectors by providing a distribution of points that approximates uniformity across the surface of a sphere. The Roberts-Kronecker sequence is a low-discrepancy sequence used to generate these point sets, minimizing clustering and maximizing coverage. Unlike purely random selection, this deterministic approach ensures a more even distribution of quantization vectors, reducing the potential for significant quantization error in any particular region of the input space. This is particularly beneficial in high-dimensional spaces where uniform random sampling becomes increasingly inefficient. The resulting codebook, composed of these quasi-uniformly distributed vectors, facilitates more accurate representation of the input signal with a limited number of quantization levels, improving compression ratios and signal fidelity.
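A small sketch of the idea, shown for the ordinary 2-sphere: generate the Roberts low-discrepancy sequence in the unit square and push it onto the sphere with an equal-area map. The paper's codebooks live on higher-dimensional spheres, so this is only meant to convey the construction's flavor.

```python
import numpy as np

# Quasi-uniform points on the 2-sphere from the Roberts (generalized golden
# ratio) low-discrepancy sequence, followed by an equal-area map from the
# unit square to the sphere. Higher-dimensional spheres, as used in the
# paper, require a different mapping; this is a 3-D illustration only.
def roberts_sequence(n, d=2):
    # g solves g**(d+1) = g + 1 (for d = 2 this is the "plastic number").
    g = 1.0
    for _ in range(50):
        g = (1 + g) ** (1.0 / (d + 1))
    alpha = 1.0 / g ** np.arange(1, d + 1)
    return np.mod(np.outer(np.arange(1, n + 1), alpha), 1.0)

def square_to_sphere(uv):
    z = 2.0 * uv[:, 0] - 1.0                 # equal-area in z
    phi = 2.0 * np.pi * uv[:, 1]
    r = np.sqrt(np.maximum(0.0, 1.0 - z * z))
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

codes = square_to_sphere(roberts_sequence(256))
print(codes.shape, np.allclose(np.linalg.norm(codes, axis=1), 1.0))
```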
Bennett-Gersho companding is a non-linear transformation applied to input data prior to quantization, specifically designed to improve compression ratios when the data follows a spherical-Beta distribution. This technique maps the input values to a new range where the probability density is more uniform, effectively concentrating frequently occurring values and expanding sparsely populated regions. By aligning the radius of the spherical-Beta distribution with the cached data, companding ensures that quantization levels are more efficiently allocated to the most probable input values. This reduces quantization error and minimizes information loss, leading to improved compression performance, particularly for data exhibiting a non-uniform distribution.
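For a scalar Beta-distributed value, the classical Bennett compander has a closed form: the mean-squared-error-optimal point density is proportional to the cube root of the source density, which for a Beta(a, b) source integrates to the CDF of Beta((a+2)/3, (b+2)/3). The sketch below applies that scalar compander before a uniform quantizer; the shape parameters are illustrative, and the paper's extension to the spherical-Beta setting is more involved.

```python
import numpy as np
from scipy import stats

# Classical scalar Bennett companding for a Beta-distributed value on [0, 1]:
# the MSE-optimal point density is proportional to f(x)**(1/3), which for a
# Beta(a, b) source integrates to the CDF of Beta((a+2)/3, (b+2)/3).
# The shape parameters below are illustrative, not fitted values from the paper.
a, b = 2.5, 4.0
compander = stats.beta((a + 2) / 3, (b + 2) / 3)

def compress_expand(r, n_levels=16):
    g = compander.cdf(r)                           # compand toward uniform density
    q = (np.floor(g * n_levels) + 0.5) / n_levels  # uniform mid-rise quantizer
    return compander.ppf(np.clip(q, 0.0, 1.0))     # expand back to original scale

r = stats.beta(a, b).rvs(size=10_000, random_state=0)
r_hat = compress_expand(r)
print("companded quantization MSE:", np.mean((r - r_hat) ** 2))
```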
Lloyd-Max refinement is an iterative algorithm used to optimize a quantization codebook after initial vector selection. The process minimizes the mean squared error (MSE) between the original data and its quantized representation by alternating between two steps: codebook refinement and nearest neighbor assignment. In the first step, each quantization vector is updated to be the centroid of all training data points assigned to that vector. Subsequently, each data point is reassigned to the nearest quantization vector. These two steps are repeated until the codebook converges, meaning further iterations produce negligible reductions in MSE. This iterative process effectively minimizes information loss during compression by shaping the quantization boundaries to better reflect the data distribution.
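A generic version of the refinement loop, adapted to unit-norm codes by re-projecting each updated centroid onto the sphere, might look like the following; it is a sketch of the standard algorithm, not the paper's exact procedure.

```python
import numpy as np

# Lloyd refinement of a spherical codebook: alternate nearest-code assignment
# and centroid updates, re-projecting centroids onto the unit sphere.
# Generic sketch only; the paper's refinement details may differ.
def lloyd_refine(data, codebook, n_iters=20):
    data = data / np.linalg.norm(data, axis=1, keepdims=True)
    for _ in range(n_iters):
        assign = np.argmax(data @ codebook.T, axis=1)   # nearest code by dot product
        for k in range(codebook.shape[0]):
            members = data[assign == k]
            if len(members):
                c = members.mean(axis=0)
                codebook[k] = c / (np.linalg.norm(c) + 1e-12)  # back onto the sphere
    return codebook

rng = np.random.default_rng(0)
data = rng.standard_normal((5_000, 16))
init = rng.standard_normal((64, 16))
init /= np.linalg.norm(init, axis=1, keepdims=True)
refined = lloyd_refine(data, init.copy())
print(refined.shape)
```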

The Impact of Compression: A Resilient Architecture
FibQuant presents a significant advancement in large language model (LLM) compression, consistently achieving superior ratios when contrasted with established techniques like KIVI, KVQuant, and TurboQuant. This improvement isn’t merely incremental; the methodology demonstrably reduces the memory footprint required to store and operate these increasingly complex models. By employing a novel quantization strategy, FibQuant allows for more efficient data storage and faster retrieval, directly addressing a critical bottleneck in LLM deployment. The enhanced compression facilitates the possibility of running sophisticated models on hardware with limited resources, broadening accessibility and enabling real-time performance in diverse applications.
FibQuant achieves substantial compression of the large language model KV cache without sacrificing performance, as demonstrated by rigorous evaluation using Attention Output Cosine Similarity. This metric quantifies the alignment between the model’s original and compressed outputs; FibQuant consistently scores 0.946, indicating near-perfect preservation of representational capacity even at a compression ratio of 34.1x. This high score suggests that the compressed model effectively retains its ability to generate coherent and relevant text, effectively mirroring the behavior of its full-precision counterpart despite a significant reduction in memory footprint. The result highlights a key advantage: substantial cache size reduction without a corresponding loss in the quality of generated outputs.
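For concreteness, a plausible way to compute such a metric is sketched below: take the attention outputs produced with the full-precision cache and with the compressed cache, and average their per-token cosine similarity. The exact averaging protocol used in the paper may differ.

```python
import numpy as np

# Attention-output cosine similarity between outputs computed with the
# full-precision cache and with the compressed cache. Averaging per token
# and then over tokens is an assumption about the protocol, used only to
# show what the metric measures.
def attention_cosine_similarity(full, compressed):
    num = np.sum(full * compressed, axis=-1)
    den = np.linalg.norm(full, axis=-1) * np.linalg.norm(compressed, axis=-1)
    return float(np.mean(num / np.maximum(den, 1e-12)))

rng = np.random.default_rng(0)
full = rng.standard_normal((32, 128))                 # stand-in attention outputs
compressed = full + 0.05 * rng.standard_normal(full.shape)
print(attention_cosine_similarity(full, compressed))
```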
A critical advantage of FibQuant lies in its ability to maintain the speed and responsiveness of large language models even with substantial compression. Unlike some quantization techniques that require sequential decoding of compressed data, FibQuant facilitates random access to individual cache entries. This means the model can retrieve any necessary information directly, without needing to process preceding data, a feature vital for real-time applications and interactive experiences. This direct access is achieved through a carefully designed indexing scheme that preserves the logical structure of the original cache, enabling the LLM to operate with minimal latency, even at a 34.1x compression ratio, and ensuring a fluid user experience.

Future Pathways: Scaling Resilience and Expanding Horizons
Further gains in efficiency may be realized by uniting FibQuant with Hierarchical List Decoding (HLD). HLD is a technique that reduces the computational burden of decoding by exploring a limited set of promising candidate sequences, rather than exhaustively evaluating all possibilities. When combined with FibQuant's reduced-precision representation of cached keys and values, this synergistic effect could substantially diminish encoder complexity. The pairing promises a decrease in both computational requirements and memory footprint, as the reduced-precision cache entries require less storage and processing. This integration doesn't merely offer incremental improvement; it proposes a pathway toward more scalable and resource-efficient language models, particularly crucial for deployment on edge devices or in memory-constrained environments. The anticipated outcome is a streamlined architecture capable of maintaining high performance with significantly fewer resources, thereby broadening the applicability of large language models.
The principles underpinning FibQuant, initially demonstrated with large language models, possess considerable potential beyond text-based data. Researchers anticipate that adapting the technique to process image and audio data streams could unlock significant efficiencies in these domains, particularly regarding model compression and reduced computational demands. This expansion relies on the core concept of representing data with a limited set of Fibonacci-based quantization levels, allowing for effective dimensionality reduction without substantial information loss. Applying FibQuant to image processing, for example, might involve quantizing image features or color palettes, while in audio processing, it could be used to compress audio waveforms or spectral representations. Successful implementation across these modalities would not only broaden the applicability of FibQuant but also offer a valuable comparative analysis, revealing the technique's robustness and limitations in diverse data landscapes and potentially inspiring further refinements to the core algorithm.
Further optimization of large language model performance hinges on intelligent memory management, and combining FibQuant with StreamingLLM's eviction and retention policies presents a promising pathway. StreamingLLM dynamically manages its cache by strategically discarding less frequently used tokens (eviction) while preserving crucial information (retention). Integrating FibQuant's compressed token representation with this system could dramatically reduce the memory footprint of cached tokens, allowing for a larger effective cache size without increasing hardware demands. This synergistic approach isn't merely about storing more tokens, but about storing the right tokens more efficiently, potentially mitigating performance bottlenecks associated with frequent cache misses and costly recomputations. The resulting system promises not only reduced memory usage, but also faster processing speeds and improved scalability for resource-constrained environments.
The pursuit of efficient knowledge retrieval, as demonstrated by FibQuant's approach to KV cache compression, echoes a fundamental tenet of resilient systems: adaptation in the face of increasing complexity. The method's reliance on modeling the cache distribution, specifically the spherical-beta, and subsequent vector quantization is a clever simplification, acknowledging the inherent trade-offs between precision and efficiency. As David Hilbert observed, “We must be able to answer the question: what are the ultimate elementary components of reality?” FibQuant doesn't seek to eliminate complexity, but rather to skillfully manage it through a process of informed reduction, recognizing that any simplification carries a future cost, much like a system accruing technical debt. The design suggests a system that ages gracefully, prioritizing random access and attention fidelity even under compression, a testament to thoughtful, long-term system health.
What Lies Ahead?
FibQuant represents a logical step in the inevitable compression of memory access. The KV cache, a transient record of computation, is subject to the same entropic forces as any system. This work logs a particular moment in that decay – a refinement of vector quantization tailored to the specific demands of random access. The spherical-beta distribution, while effective, is a model, and all models are, by definition, incomplete. Future iterations will likely explore more adaptive distributions, perhaps those that evolve with the cache itself, mirroring the changing landscape of active data.
The true challenge isn't merely achieving higher compression ratios, but sustaining fidelity over the cache's lifespan. Every quantization step introduces a degree of approximation, a subtle erosion of information. The paper demonstrates an impressive balance, but the question remains: how gracefully does the system age under sustained load? Further investigation should focus on the cumulative effect of these approximations, and methods for intelligently redistributing quantization error.
Ultimately, the KV cache, like all caches, is a temporary reprieve from the cost of computation. FibQuant extends that reprieve, but does not abolish the underlying physics. The timeline continues. The next advance will likely involve a rethinking of the fundamental trade-offs between compression, latency, and the very definition of “fidelity” in a probabilistic computing landscape.
Original article: https://arxiv.org/pdf/2605.11478.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/