Smarter Caching for Faster AI

Author: Denis Avetisyan


A new approach to memory management dramatically improves the speed and efficiency of large language model inference.

The system integrates a chunk-aware key-value placement plan within a paged key-value store, orchestrating recomputation of only the necessary tokens and fusing RoPE into the attention path, a design that acknowledges every architectural decision foreshadows eventual limitations within the serving stack.

MEPIC enables position-independent caching of document chunks in High-Bandwidth Memory, reducing recomputation and memory usage for large language models.

Efficiently serving large language models (LLMs) demands minimizing the substantial memory footprint of the Key-Value (KV) cache, particularly when processing lengthy prompts with repeated content. This paper introduces MEPIC: Memory Efficient Position Independent Caching for LLM Serving, a system designed to overcome limitations in existing position-independent caching approaches by enabling comprehensive chunk KV reuse across requests and batches. MEPIC achieves up to 2x HBM reduction-and 5x for long prompts-through paged storage alignment, block-level recomputation, and RoPE fusion, all without requiring model modifications. Will these techniques unlock new scalability frontiers for LLM-powered applications and democratize access to advanced AI inference?


The Inevitable Bottleneck: Memory and Compute in the Age of LLMs

The deployment of Large Language Models (LLMs) presents a substantial challenge due to their immense computational and memory demands, creating a critical bottleneck in serving these powerful AI systems. These models, often comprising billions of parameters, require considerable resources not only for initial training but also for the real-time processing of user queries. The sheer size of these models necessitates high-bandwidth memory and powerful processors, driving up infrastructure costs and limiting scalability. As models grow in complexity to achieve enhanced reasoning and generation capabilities, the resource requirements increase exponentially, posing a significant obstacle to widespread accessibility and practical deployment. This bottleneck impacts response times, restricts the number of concurrent users, and ultimately hinders the potential of LLMs to deliver value in real-world applications.

The ability of Large Language Models to effectively process information is fundamentally constrained by their handling of context length. Traditional methods, reliant on sequentially processing each token in a given input, encounter escalating computational demands and memory requirements as the input sequence grows. This limitation isn’t merely a matter of processing time; longer sequences are critical for complex reasoning tasks that require the model to synthesize information across a broad range of data. Because the cost of self-attention grows quadratically with sequence length, longer contexts bring significant latency, the delay between input and output, and ultimately hinder the model’s practical application in real-time scenarios. Consequently, research focuses on innovative approaches to manage these long sequences without sacrificing either accuracy or speed, as the true potential of LLMs remains locked behind this persistent bottleneck.

Efficiently delivering responses from Large Language Models necessitates a careful approach to Key-Value Cache management, as redundant computation represents a major performance hurdle. During inference, LLMs repeatedly access previously computed key-value pairs – the ‘keys’ and ‘values’ are per-token projections of the hidden states at each attention layer. Without caching, the model would recalculate these projections for every prior token at every step, drastically increasing latency and resource consumption. Advanced serving systems therefore prioritize caching strategies – such as retaining only the most recent tokens or employing techniques like attention slicing – to minimize recomputation. The effectiveness of these strategies directly impacts the model’s throughput and responsiveness, particularly when handling extended context lengths, where the cache can become a significant portion of total memory usage. Consequently, innovations in cache design and management are central to scaling LLM deployments and reducing serving costs.
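To make this concrete, here is a minimal Python sketch (illustrative only; it is not MEPIC’s or any serving engine’s code, and names such as decode_step and the toy projection matrices are invented for the example) of why a KV cache pays off: each decode step projects keys and values only for the newest token and reuses every previously cached entry.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size
W_k = rng.normal(size=(d, d))           # key projection (random stand-in weights)
W_v = rng.normal(size=(d, d))           # value projection

kv_cache = {"K": [], "V": []}           # grows by one entry per generated token

def decode_step(hidden_state):
    """Project K/V for the newest token only and reuse everything cached so far."""
    kv_cache["K"].append(hidden_state @ W_k)
    kv_cache["V"].append(hidden_state @ W_v)
    K = np.stack(kv_cache["K"])         # (seq_len, d): old rows come from the cache
    V = np.stack(kv_cache["V"])
    scores = K @ hidden_state / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # attention output for the new token

for _ in range(5):                      # five decode steps, each does O(1) new K/V work
    out = decode_step(rng.normal(size=d))
print("tokens cached:", len(kv_cache["K"]))
```

Without the cache, every step would re-project keys and values for the entire prefix, which is exactly the redundant work that serving systems try to eliminate.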

MEPIC maintains HBM usage and end-to-end latency comparable to or lower than CacheBlend and EPIC, even as context length increases.

Beyond Position: A New Paradigm for Caching LLM States

Position-Independent Caching addresses limitations of prior techniques by storing and retrieving data based on content rather than its sequential location within an input stream. Traditional caching methods often require exact positional matches for cache hits, restricting data reuse if the input sequence is altered or shifted. Position-Independent Caching, however, generates cache keys that are invariant to the data’s absolute position; a cached item can be successfully retrieved from any point in the sequence where the key is encountered, regardless of its initial location. This decoupling of cached data from specific positions significantly enhances the potential for cache reuse and improves overall system efficiency, particularly in scenarios involving variable-length inputs or sequence transformations.
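A content-addressed cache key captures the core idea. The sketch below is a simplified illustration, not the paper’s implementation; chunk_key and get_or_compute are hypothetical helpers, and the payload is a placeholder string standing in for real KV tensors.

```python
import hashlib

chunk_cache = {}                        # content hash -> cached KV payload

def chunk_key(token_ids):
    """Key a chunk by its content, not by where it appears in the prompt."""
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def get_or_compute(token_ids, compute_kv):
    key = chunk_key(token_ids)
    if key not in chunk_cache:          # first sighting, at any offset in any request
        chunk_cache[key] = compute_kv(token_ids)
    return chunk_cache[key]

# The same document chunk is reused even though two requests may place it
# at completely different offsets in their prompts.
doc_chunk = [101, 7, 7, 42, 9]
kv_a = get_or_compute(doc_chunk, lambda ids: f"KV for {len(ids)} tokens")
kv_b = get_or_compute(doc_chunk, lambda ids: f"KV for {len(ids)} tokens")
assert kv_a is kv_b                     # cache hit: position never entered the key
```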

Prefix Caching operates by storing and retrieving data based on exact matches of input sequence prefixes; a cached result is only valid if the input begins with a previously seen prefix. This strict prefix matching necessitates that any deviation from the expected prefix – even a single token change – results in a cache miss and requires recomputation. Consequently, Prefix Caching exhibits limited reusability, as cached data cannot be effectively leveraged when dealing with inputs that differ even slightly from previously processed sequences. This is a fundamental constraint impacting its efficiency in scenarios with variable or unpredictable input patterns.
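The contrast with the previous sketch fits in a few lines: a prefix cache keyed on exact leading tokens reuses nothing once the beginning of the input changes. This too is a toy illustration; prefix_lookup is a hypothetical helper, not an API from any serving system.

```python
prefix_cache = {}                       # exact token-prefix tuple -> cached state

def prefix_lookup(tokens):
    """Return the longest cached prefix that exactly matches the start of `tokens`."""
    for cut in range(len(tokens), 0, -1):
        hit = prefix_cache.get(tuple(tokens[:cut]))
        if hit is not None:
            return cut, hit             # tokens[:cut] are served from the cache
    return 0, None                      # nothing reusable: recompute from scratch

prefix_cache[(1, 2, 3, 4)] = "cached KV for [1, 2, 3, 4]"
print(prefix_lookup([1, 2, 3, 4, 5]))   # (4, ...): the shared prefix is reused
print(prefix_lookup([9, 2, 3, 4, 5]))   # (0, None): one leading token differs
```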

Position-Independent Caching improves performance by storing and retrieving data based on its content, rather than its location within a sequence. Traditional caching methods often require exact matches of initial sequence segments – a limitation that prevents reuse when the same data appears at different offsets. Decoupling the cached key from positional context allows the system to identify and utilize previously computed results even if they are not aligned with the beginning of the current input. This approach minimizes redundant computations and enhances overall efficiency, particularly in scenarios involving variable-length inputs or data streams where identical segments may occur non-contiguously.

Chunk-aware key-value reuse in CacheBlend, EPIC, and MEPIC reduces HBM usage and improves end-to-end latency across varying query rates.

MEPIC: A Memory-Efficient Architecture for Position-Independent Caching

MEPIC is a key-value (KV) caching system designed to optimize large language model (LLM) serving by reusing previously computed chunk-level KV states. This is achieved through Position-Independent Caching, which decouples cached KV states from specific positional encodings within the input sequence. By eliminating positional dependencies, MEPIC enables efficient reuse of KV states across diverse input prompts and varying sequence lengths, thus reducing redundant computation and lowering memory bandwidth requirements. This approach differs from traditional caching methods that are tightly coupled to input position, which limits their effectiveness in dynamic LLM serving scenarios.
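One way to picture chunk-level reuse in a paged KV store is sketched below. The page size, class name, and free-list allocator are assumptions made for illustration rather than MEPIC’s actual data structures; the point is that aligning each chunk to page boundaries lets different requests share the same resident pages.

```python
import math

PAGE_TOKENS = 16                        # tokens per KV page (illustrative value)

class PagedKVStore:
    """Toy paged KV store: each cached chunk begins on a fresh page boundary,
    so its pages can be mapped by any request that contains the same chunk."""
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.chunk_pages = {}           # chunk id -> list of page indices

    def place_chunk(self, chunk_id, num_tokens):
        if chunk_id in self.chunk_pages:        # already resident: reuse as-is
            return self.chunk_pages[chunk_id]
        needed = math.ceil(num_tokens / PAGE_TOKENS)
        pages = [self.free_pages.pop() for _ in range(needed)]
        self.chunk_pages[chunk_id] = pages
        return pages

store = PagedKVStore(num_pages=64)
print(store.place_chunk("doc-A", num_tokens=40))   # first request: allocates 3 pages
print(store.place_chunk("doc-A", num_tokens=40))   # second request: same 3 pages
```

Because the chunk starts on a page boundary, the second request maps the existing pages instead of allocating and recomputing new ones.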

MEPIC’s performance gains are directly linked to its use of High Bandwidth Memory (HBM). HBM provides substantially higher memory bandwidth than traditional GDDR memory, owing to its 3D-stacked architecture and wide memory interface. This allows MEPIC to rapidly access cached key-value (KV) pairs, minimizing data-retrieval bottlenecks during inference. High bandwidth is critical for large language model serving, where every generated token requires reading the accumulated KV cache, so faster access translates directly into reduced end-to-end processing time. By keeping cached data resident in HBM rather than spilling it to host memory or storage, MEPIC avoids the penalties of slower memory tiers, resulting in significant speedups and improved throughput.

The Chunk Cache Coordinator and Selective Recomputation modules within MEPIC collaboratively manage cached key-value (KV) pairs to minimize both memory footprint and computational expense. The Chunk Cache Coordinator is responsible for tracking KV chunk ownership and coordinating data transfers between different processing units, preventing redundant storage. Selective Recomputation identifies and recomputes only those KV pairs that are not already present in the cache or have been invalidated, rather than recomputing entire sequences. This granular approach, coupled with the Coordinator’s efficient memory management, allows MEPIC to significantly reduce HBM usage and associated compute costs without compromising performance.
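The division of labor can be sketched as follows. This is a hedged approximation of the described behavior, with invented names (ChunkCacheCoordinator, fetch, recompute_block) rather than MEPIC’s real interfaces: it tracks block residency per chunk and invokes recomputation only for blocks that are not already cached.

```python
class ChunkCacheCoordinator:
    """Toy coordinator: tracks which blocks of each chunk are resident and asks
    the model to recompute only the blocks that are missing."""
    def __init__(self):
        self.resident = {}              # (chunk_id, block_idx) -> KV block

    def fetch(self, chunk_id, num_blocks, recompute_block):
        blocks, recomputed = [], 0
        for b in range(num_blocks):
            key = (chunk_id, b)
            if key not in self.resident:        # selective recomputation
                self.resident[key] = recompute_block(chunk_id, b)
                recomputed += 1
            blocks.append(self.resident[key])
        return blocks, recomputed

coord = ChunkCacheCoordinator()
_, n = coord.fetch("doc-A", 4, lambda c, b: f"KV[{c}:{b}]")
print("first request recomputed", n, "blocks")      # 4: nothing was cached yet
_, n = coord.fetch("doc-A", 4, lambda c, b: f"KV[{c}:{b}]")
print("second request recomputed", n, "blocks")     # 0: every block was resident
```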

MEPIC utilizes a NoPE (No Position Encoding) Key-Value (KV) format which enables enhanced position independence in caching strategies for Large Language Model serving. This format facilitates up to a 2x reduction in High Bandwidth Memory (HBM) usage when applied to multi-step Retrieval-Augmented Generation (RAG) workloads. Furthermore, benchmarks demonstrate greater than a 5x reduction in HBM usage for long prompts. Crucially, these memory optimizations are achieved without compromising accuracy or end-to-end latency, with MEPIC maintaining comparable or superior performance relative to existing caching methods such as CacheBlend and EPIC.
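The idea behind a position-free KV layout can be illustrated with a standard rotate-half RoPE applied at read time. The snippet is a simplified numpy sketch, not the fused kernel described in the paper: cached keys carry no positional rotation, and the rotation for whatever offset a chunk lands at is applied when the keys are consumed.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate-half rotary position embedding applied to vectors x at `positions`."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = positions[:, None] * freqs[None, :]        # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Keys are cached WITHOUT any positional rotation (a "NoPE" layout), so the same
# bytes can serve any offset; RoPE is applied only when the keys are read.
rng = np.random.default_rng(0)
nope_keys = rng.normal(size=(8, 64))                    # one cached 8-token chunk

offset_a, offset_b = 0, 512                             # the chunk lands at two offsets
keys_a = rope(nope_keys, np.arange(8) + offset_a)       # rotated for request A
keys_b = rope(nope_keys, np.arange(8) + offset_b)       # rotated for request B
print(np.allclose(keys_a, keys_b))                      # False: one cache, two positions
```

In a real kernel this rotation would be fused into the attention computation to avoid materializing rotated copies; the sketch only shows why the same cached bytes can serve any offset.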

MEPIC introduces scheduling components that seamlessly integrate chunk-aware key-value management into vLLM by coordinating prefix and chunk handling, enforcing alignment, and managing residency across memory tiers.

From Dialogue to RAG: Expanding the Reach of Efficient Caching

The effectiveness of multi-turn agents-those capable of engaging in extended, coherent conversations-hinges significantly on efficient caching mechanisms. As conversations unfold, these agents must retain and rapidly access prior interactions to maintain contextual understanding and generate relevant responses. MEPIC exemplifies a strategy for optimizing this process, demonstrating how careful caching can dramatically reduce computational load and latency. Without such efficient memory management, each turn would require reprocessing the entire conversation history, quickly becoming impractical as the dialogue extends. This capability isn’t simply about speed; it’s about enabling a truly conversational experience where the agent demonstrates genuine awareness of what has been previously discussed, fostering more natural and productive interactions.

Retrieval-Augmented Generation (RAG) represents a significant advancement in large language model (LLM) capabilities by grounding responses in factual, external knowledge. Rather than relying solely on the parameters learned during training, RAG systems dynamically access and incorporate relevant information from sources like databases or documents. This process necessitates efficient information retrieval; the LLM first identifies pertinent data, then uses it to augment the prompt and generate a more accurate and contextually relevant output. The performance of RAG is therefore inextricably linked to the speed and precision with which information can be located and reused, making optimized retrieval a core component of this powerful approach to knowledge-intensive tasks. By seamlessly integrating retrieved knowledge, RAG overcomes limitations in LLM training data and enhances the reliability and trustworthiness of generated text.

Recent advancements in optimizing Key-Value Cache utilization are significantly boosting the performance of complex applications like multi-turn dialogue systems and Retrieval-Augmented Generation. Techniques such as MoSKA and FlashForge address inefficiencies inherent in traditional cache designs by intelligently managing key storage and retrieval. MoSKA, for instance, employs a masked self-attention mechanism to prioritize and retain the most relevant keys, while FlashForge leverages a novel data layout to accelerate access times. These optimizations are particularly impactful given the Zipfian distribution often observed in query patterns – where a small number of keys are accessed with disproportionately high frequency – enabling faster response times and reduced computational costs without sacrificing contextual understanding.

Multi-turn agents frequently exhibit a distinct pattern of information access known as Zipfian retrieval, where a small number of keys are accessed with high frequency, while the vast majority are rarely needed. This skewed distribution presents a significant challenge for caching systems, as simply storing the most recently used items proves insufficient. Efficient caching mechanisms, therefore, become critical for these agents, enabling rapid access to frequently requested data and minimizing latency. By strategically prioritizing and storing the most popular keys-those adhering to the Zipfian distribution-performance gains are substantial, as the agent spends less time retrieving information and more time processing and responding. Without such optimization, the cost of repeatedly accessing external knowledge sources would quickly become prohibitive, hindering the agent’s ability to maintain coherent and engaging conversations.
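A quick simulation illustrates why this skew is so favorable to small caches. The numbers below are synthetic (a Zipf exponent of 1, 10,000 keys, a cache holding the 100 most popular ones) and stand in for no particular workload; they simply show that a cache covering 1% of keys can serve roughly half of all requests under such a distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
num_keys, cache_slots, requests = 10_000, 100, 50_000

# Zipfian access stream: the rank-r key is requested with probability ~ 1/r.
ranks = np.arange(1, num_keys + 1)
probs = 1.0 / ranks
probs /= probs.sum()
stream = rng.choice(num_keys, size=requests, p=probs)

# Frequency-aware policy: keep only the `cache_slots` most popular keys resident.
hot_set = set(range(cache_slots))       # indices 0..99 carry the highest probabilities
hits = sum(int(key) in hot_set for key in stream)
print(f"cache holds {cache_slots / num_keys:.1%} of keys "
      f"but serves {hits / requests:.1%} of requests")
```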

Chunk-aware key-value caching strategies-CacheBlend, EPIC, and MEPIC-demonstrate reduced memory usage across multiple datasets (SQuAD, NewsQA, NarrativeQA, emrQA) during inference.

Towards Intelligent Caching: The Future of LLM Acceleration

The efficiency of large language model (LLM) serving is poised to improve through the development of caching strategies that move beyond static implementations. Future research prioritizes adaptive caches capable of dynamically adjusting to the fluctuating demands of varying workloads and the ever-increasing context lengths characteristic of modern LLMs. These systems will likely incorporate predictive algorithms to anticipate frequently accessed tokens and proactively stage them in faster memory tiers. Such an approach moves beyond simply storing recently used data; instead, the cache intelligently learns patterns in the input stream and adjusts its contents to minimize retrieval latency and maximize throughput. This dynamic optimization is crucial as LLMs continue to grow in size and complexity, demanding innovative solutions to bridge the gap between computational power and memory bandwidth.

Significant performance gains in Large Language Model (LLM) acceleration are anticipated through the synergistic combination of caching strategies with model compression and quantization techniques. Reducing the precision of model weights – a core tenet of quantization – and employing compression algorithms directly lessens the memory footprint of the LLM itself. When paired with intelligent caching, which stores frequently accessed data closer to the processing unit, this combined approach minimizes data movement and alleviates memory bottlenecks. The effect is a substantial reduction in both latency and energy consumption, as the system requires less access to slower, off-chip memory. Further research explores how adaptive quantization levels, tailored to the specific layer or token, can maximize compression without sacrificing model accuracy, creating a powerful pathway toward deploying LLMs on resource-constrained devices and scaling their performance for demanding applications.
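Below is a minimal sketch of the caching-plus-quantization interaction, assuming a simple symmetric int8 scheme with one scale per KV block; real systems use more sophisticated per-channel or per-group schemes, and the function names here are invented for illustration.

```python
import numpy as np

def quantize_block(block):
    """Symmetric int8 quantization of one KV block with a single per-block scale."""
    scale = max(np.abs(block).max() / 127.0, 1e-8)
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv_block = rng.normal(size=(16, 128)).astype(np.float32)    # one fp32 KV block

q, scale = quantize_block(kv_block)
print("bytes:", kv_block.nbytes, "->", q.nbytes)            # 8192 -> 2048 (4x smaller)
print("max abs error:", np.abs(dequantize_block(q, scale) - kv_block).max())
```

The cached block shrinks fourfold relative to fp32 while the reconstruction error stays bounded by the per-block scale.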

The pursuit of faster large language model (LLM) inference is increasingly focused on memory technology beyond traditional DRAM. Emerging persistent memory technologies, like those utilizing 3D XPoint or similar architectures, offer a compelling solution by bridging the gap between DRAM speed and storage capacity. These technologies retain data even when power is lost, enabling the creation of significantly larger caches – potentially orders of magnitude greater than currently feasible – without the constant reloading from slower storage. Such expanded caches drastically reduce latency by keeping more of the LLM’s weights and activations readily available, thereby accelerating token generation and overall throughput. Furthermore, persistent memory’s lower energy consumption compared to DRAM presents an opportunity for more sustainable and cost-effective LLM deployments, paving the way for increasingly sophisticated models and applications.

Current large language model serving infrastructure is evolving beyond uniform memory hierarchies, with systems like LMCache pioneering heterogeneous caching strategies. These approaches intelligently distribute cached data across multiple tiers of memory – from fast, yet limited, SRAM to slower, higher-capacity DRAM and even persistent storage – based on access frequency and latency requirements. By strategically placing frequently accessed tokens in faster memory and less critical data in slower tiers, LMCache and similar systems aim to maximize throughput and minimize latency without being constrained by the capacity of any single memory type. This tiered approach not only optimizes performance for individual requests but also improves overall system efficiency by reducing the demand on expensive, high-bandwidth memory, paving the way for serving even larger and more complex models with greater cost-effectiveness.
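A toy two-tier cache conveys the principle, though it is only a sketch under simplifying assumptions (LRU ordering, promotion on every access) and does not reflect LMCache’s actual policies or API; the class and method names are invented for the example.

```python
from collections import OrderedDict

class TieredCache:
    """Toy two-tier cache: a small fast tier (think HBM) backed by a larger,
    slower tier (think host DRAM or SSD). Hot entries are promoted on access;
    entries evicted from the fast tier are demoted instead of discarded."""
    def __init__(self, fast_slots, slow_slots):
        self.fast = OrderedDict()       # least-recently-used entries first
        self.slow = OrderedDict()
        self.fast_slots, self.slow_slots = fast_slots, slow_slots

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        if len(self.fast) > self.fast_slots:        # demote the coldest fast entry
            cold_key, cold_val = self.fast.popitem(last=False)
            self.slow[cold_key] = cold_val
            if len(self.slow) > self.slow_slots:
                self.slow.popitem(last=False)       # finally drop the coldest overall

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast tier"
        if key in self.slow:                        # promote on access
            value = self.slow.pop(key)
            self.put(key, value)
            return value, "promoted from slow tier"
        return None, "miss"

cache = TieredCache(fast_slots=2, slow_slots=8)
for k in ["a", "b", "c"]:
    cache.put(k, f"KV({k})")
print(cache.get("a"))                   # demoted earlier, now promoted back
```

Entries squeezed out of the fast tier are demoted rather than dropped, so a later access pays a promotion cost instead of a full recomputation.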

MEPIC, a novel caching technique, improves system throughput by enabling cross-request reuse of high-bandwidth memory during both the compile and link steps, in contrast with the fully-recompute approach, which recomputes all tokens.

The pursuit of efficient large language model serving, as demonstrated by MEPIC, reveals a fundamental truth about complex systems. It isn’t about achieving static perfection, but embracing dynamic adaptation. Long stability, often celebrated in engineering, can become the camouflage for underlying vulnerabilities, particularly in memory management. As John Maynard Keynes observed, “The difficulty lies not so much in developing new ideas as in escaping from old ones.” MEPIC’s approach to page-aligned caching and chunk reuse isn’t a final solution, but a necessary evolution, a shedding of conventional wisdom regarding KV cache locality to accommodate the ever-shifting demands of inference optimization. The system doesn’t prevent future failures; it prepares for them, evolving to absorb the inevitable shocks.

The Garden Grows

MEPIC addresses a familiar tension: the illusion of control over memory. It is tempting to view HBM as a neatly organized store, but the reality is closer to a shared garden. Efficient caching, as this work demonstrates, isn’t about preventing fragmentation, but about fostering a resilient ecosystem where fragments can be reused, forgiven, and repurposed. The paper’s focus on page-aligned chunking hints at a broader truth: architectural choices are prophecies of the shapes failures will take. What seems like optimization today-a precise alignment, a fixed page size-will inevitably become a constraint tomorrow as models and sequence lengths evolve.

The path forward isn’t simply to refine these alignments, but to explore systems that adapt to fragmentation. Imagine a caching layer that actively reshapes its own internal organization, prioritizing forgiveness over strict order. The current work rightly focuses on minimizing recomputation, but a truly robust system will accept a degree of redundancy-a gentle overgrowth-as the price of stability. Resilience lies not in isolation, but in the graceful degradation of performance when faced with inevitable entropy.

Ultimately, the challenge isn’t to build a perfect cache, but to cultivate a memory landscape that can absorb the unpredictable growth of large language models. A system isn’t a machine, it’s a garden-neglect it, and you’ll grow technical debt. The future will likely belong to those who embrace the messy, organic nature of memory, rather than attempting to impose a rigid, artificial order.


Original article: https://arxiv.org/pdf/2512.16822.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-12-22 03:03