Author: Denis Avetisyan
A new system allows multiple AI agents to maintain context and respond faster by storing key information directly on device, rather than relying solely on short-term prompts.
Researchers demonstrate a persistent, quantized KV cache for efficient multi-agent large language model inference on resource-constrained edge devices.
Deploying multi-agent large language models on resource-constrained edge devices presents a fundamental challenge: insufficient RAM to simultaneously store each agent’s key-value (KV) cache. This work, ‘Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices’, addresses this limitation with a system that persists quantized KV caches to disk, enabling direct cache restoration and drastically reducing inference latency. By employing 4-bit quantization and a dedicated block pool, the system achieves up to a 136x speedup in time-to-first-token across multiple model architectures while fitting four times more agent contexts into fixed memory. Could this approach unlock truly scalable and responsive multi-agent systems capable of operating entirely on the edge?
The Shifting Sands of Context: A Multi-Agent Challenge
The architecture of contemporary language models is rapidly evolving beyond single interactions, increasingly embedded within dynamic multi-agent systems designed for complex collaborative tasks. This shift necessitates a fundamentally different approach to processing, as models must not only generate coherent responses but also swiftly adapt to evolving conversational states and seamlessly switch between multiple, concurrent dialogues. The demand for rapid context switching and response generation isn’t merely a performance optimization; it’s a core requirement for creating truly interactive and responsive agents capable of engaging in nuanced, real-time collaborations. Consequently, existing infrastructure faces considerable strain, as the speed at which these models can access and utilize relevant information directly impacts the overall fluidity and effectiveness of the multi-agent system.
Current key-value (KV) caching strategies, while foundational for accelerating language model responses, increasingly falter when applied to the dynamic demands of multi-agent systems. These systems require constant context switching and rapid generation, a workload that exposes the limitations of traditional caching. Specifically, research indicates that a ‘cold start’ – the initial access to uncached data – can introduce substantial latency; for the Gemma 3 model with a 32K context window, this delay can reach up to 172 milliseconds. This performance bottleneck not only degrades the user experience through noticeable delays but also significantly restricts overall throughput, hindering the scalability of complex conversational applications and limiting the number of concurrent agent interactions a system can effectively manage.
As conversational systems grow in complexity, demanding more extensive context and nuanced interactions, inefficient cache management increasingly restricts scalability. The retrieval of prior conversational turns, essential for coherent responses, relies heavily on key-value caching; however, conventional approaches struggle to deliver information quickly enough to maintain a fluid user experience. When frequently accessed data isn’t readily available – a ‘cache miss’ – systems experience significant delays, impacting responsiveness and limiting the number of concurrent conversations a system can effectively handle. This bottleneck isn’t merely a performance issue; it fundamentally constrains the ability to deploy increasingly sophisticated agents and maintain engaging, real-time interactions, ultimately hindering the development of truly scalable conversational AI.
The Persistence of Memory: A Foundation for Speed
The PersistentKVCache is a disk-based storage system that retains Key-Value (KV) data between sessions. Rather than recomputing and reloading KV data after every restart, the system restores previously computed KV states directly from disk, enabling a “warm start.” This substantially reduces initial loading times, minimizes computational overhead, and accelerates the generation of initial outputs, contributing to improved overall performance and responsiveness.
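The save/restore cycle can be illustrated with a minimal sketch. The class name mirrors the paper's PersistentKVCache, but the layout and the `.npz` serialization here are assumptions for illustration; the actual system persists tensors in the safetensors format:

```python
import os
import numpy as np

class PersistentKVCache:
    """Minimal sketch: persist per-agent KV tensors to disk so a restarted
    process can restore them (warm start) instead of re-running prefill.
    The .npz serialization is illustrative only; the paper's system uses
    the safetensors format."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, agent_id):
        return os.path.join(self.cache_dir, f"{agent_id}.npz")

    def save(self, agent_id, keys, values):
        # keys/values: e.g. [num_layers, seq_len, num_heads, head_dim]
        np.savez(self._path(agent_id), keys=keys, values=values)

    def load(self, agent_id):
        # Warm start: return cached KV states if present; the caller
        # falls back to a cold prefill when this returns None.
        path = self._path(agent_id)
        if not os.path.exists(path):
            return None
        data = np.load(path)
        return data["keys"], data["values"]
```

On a cache hit, `load` replaces the entire prefill pass with a single disk read, which is where the warm-start latency savings come from.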
The PersistentKVCache utilizes BlockPool and the SafetensorsFormat to maximize storage efficiency and retrieval speed. BlockPool pre-allocates and manages memory blocks, reducing the overhead associated with frequent allocations and deallocations during KV data storage and access. SafetensorsFormat is employed as the serialization format, offering a simple and safe method for storing tensors with explicit metadata, eliminating potential security vulnerabilities and improving loading performance through a streamlined structure and reduced dependency count compared to alternatives like pickle. This combination minimizes disk I/O and memory usage, contributing to the overall speed of the caching system.
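The block-pool idea, pre-allocating storage once so that serving a request is a free-list operation rather than a fresh allocation, can be sketched as follows. Block size, dtype, and method names are assumptions, not the paper's values:

```python
import numpy as np

class BlockPool:
    """Illustrative fixed-size block pool: all KV storage is allocated up
    front, so acquiring space for an agent is a free-list pop instead of
    a runtime allocation. Parameters here are illustrative assumptions."""

    def __init__(self, num_blocks, block_tokens, head_dim, dtype=np.float16):
        # One contiguous buffer holds every block.
        self.storage = np.zeros((num_blocks, block_tokens, head_dim), dtype=dtype)
        self.free = list(range(num_blocks))

    def alloc(self):
        # O(1) acquisition; no allocator involvement at request time.
        if not self.free:
            raise MemoryError("block pool exhausted")
        return self.free.pop()

    def release(self, block_id):
        # Returning a block makes it immediately reusable by another agent.
        self.free.append(block_id)
```

Because blocks are fixed-size slices of one buffer, there is no fragmentation and no per-request allocation/deallocation overhead, which is the property the text attributes to BlockPool.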
The system utilizes DiskPersistence to store key-value (KV) data on disk, minimizing the need to reload this data upon application restart. This approach significantly reduces initialization latency and dependency on the initial model loading process. Benchmarking demonstrates a Time-to-First-Token (TTFT) of 1.3 milliseconds when utilizing a warm cache, indicating rapid data recovery and responsiveness facilitated by the persisted KV data. This performance metric is achieved by bypassing the need to recalculate or reload KV data that has already been stored on disk.
Accelerating the Inevitable: Quantization and Concurrency
Q4Quantization is employed to reduce the memory requirements of Key-Value (KV) caches during inference. This process lowers the precision of the cached tensors to 4 bits, yielding a significantly smaller memory footprint without substantial quality loss. Evaluations across various models show a minimal impact on perplexity, with changes ranging from -0.7% to +3.0%, demonstrating that the technique maintains a high level of performance while conserving memory resources.
BatchQuantizedKVCache facilitates concurrent inference by managing Key-Value (KV) caches for multiple agents in a shared manner. This is achieved through interleaved prefill and decode scheduling, where operations for different agents are alternated to maximize throughput and utilization of hardware resources. Specifically, while one agent’s prefill stage is executing, another agent can simultaneously perform a decode operation, and vice versa. This scheduling approach reduces idle time and increases the overall inference speed when processing requests from multiple agents, effectively parallelizing the computation across their respective KV caches.
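The interleaving policy can be illustrated with a toy round-robin scheduler. The data structure and the one-chunk-per-turn granularity are assumptions for illustration, not the paper's implementation:

```python
from collections import deque

def interleaved_schedule(agents):
    """Toy scheduler sketch: alternate prefill chunks and decode steps
    across agents so neither phase monopolizes the accelerator.
    `agents` maps agent id -> (remaining prefill chunks, remaining
    decode steps); this shape is an illustrative assumption."""
    queue = deque(agents.items())
    trace = []
    while queue:
        agent_id, (prefill, decode) = queue.popleft()
        if prefill > 0:
            trace.append((agent_id, "prefill"))
            prefill -= 1
        elif decode > 0:
            trace.append((agent_id, "decode"))
            decode -= 1
        if prefill > 0 or decode > 0:
            queue.append((agent_id, (prefill, decode)))  # round-robin re-entry
    return trace
```

Because the queue rotates after every unit of work, an agent mid-prefill never starves another agent's decode steps, which is the throughput property the text describes.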
The implementation is specifically optimized for Apple’s `Metal` framework and the `MLX` machine learning library, enabling substantial performance improvements on Apple silicon-based hardware. This optimization allows for the concurrent processing of up to 12 agents, each with an 8K context window, on a device equipped with 24GB of memory. This capability is achieved through efficient memory management and parallel processing techniques within the `Metal` and `MLX` ecosystems, maximizing hardware utilization for large language model inference.
The Echo of Past Turns: Optimizing Context with Injection
The technique of Cross-Phase Context Injection fundamentally alters how conversational AI systems manage information across distinct stages of interaction. Rather than recalculating attention states – the mechanism by which the model focuses on relevant parts of the conversation – with each new phase, this method intelligently reuses previously cached states. This avoids redundant computation, effectively preserving the conversational thread and reducing processing demands. By leveraging this cached information, the system maintains contextual understanding without the performance overhead of repeated calculations, leading to faster response times and improved efficiency in long-form dialogues.
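The reuse pattern amounts to restoring the earlier phase's KV states and computing states only for tokens appended since then. In this sketch, `compute_kv` stands in for the model's per-token KV projection, and all names are illustrative:

```python
import numpy as np

def prefill_with_injection(new_tokens, cached_kv, compute_kv):
    """Cross-phase reuse sketch: restore KV states cached from a prior
    phase and compute states only for the newly appended tokens.
    `compute_kv` is a placeholder for the model's KV projection."""
    new_kv = compute_kv(new_tokens)     # only the delta is computed
    if cached_kv is None:
        return new_kv                   # cold path: nothing to inject
    # Injected cache + fresh states form the full context for attention.
    return np.concatenate([cached_kv, new_kv], axis=0)
```

The cost of entering a new phase thus scales with the number of new tokens rather than the full conversation length, which is where the restoration speedups come from.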
The computational demands of maintaining context in extended dialogues are substantially lessened through a combination of techniques, notably the implementation of a PersistentKVCache. This system intelligently stores and reuses previously computed key-value pairs, effectively bypassing the need for redundant calculations as a conversation progresses. Testing reveals a dramatic improvement in efficiency; agent context restoration times are reduced by as much as 136x, enabling more fluid and responsive interactions. This optimization is particularly impactful for resource-constrained environments, paving the way for more accessible and real-time conversational AI experiences.
Recent experimentation demonstrates the feasibility of deploying sophisticated conversational AI models directly on edge devices. Utilizing techniques like `UnifiedMemory` in conjunction with optimized context management, the Gemma 3 12B model achieves on-device performance with an 8K context window while maintaining a remarkably low memory footprint of just 432MB. This advancement circumvents the typical reliance on cloud-based processing, enabling real-time, private, and uninterrupted conversational experiences even with limited network connectivity. The reduced computational demands and memory usage open possibilities for integrating advanced AI into a wider range of portable and resource-constrained devices, fostering innovation in areas like personal assistants, robotics, and accessibility tools.
The Horizon of Attention: Refinements and Future Directions
To enhance the processing of extended sequences, researchers integrated the Gemma3 model with a novel attention mechanism termed HybridAttention. This approach strategically combines the strengths of both global and SlidingWindowAttention techniques. Global attention allows the model to consider relationships between all tokens in a sequence, crucial for understanding overarching context, while SlidingWindowAttention focuses on local dependencies within a defined window. By dynamically leveraging both, HybridAttention enables Gemma3 to efficiently process longer inputs without sacrificing the ability to capture intricate relationships, ultimately improving performance on tasks requiring comprehensive contextual understanding.
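The two attention patterns differ only in their masks: a global layer attends causally to every earlier token, while a sliding-window layer restricts attention to the most recent `window` tokens. A minimal mask sketch (the per-layer assignment here is an assumption):

```python
import numpy as np

def hybrid_mask(seq_len, window, global_layer):
    """Illustrative attention mask: global layers see all earlier tokens,
    sliding-window layers only the last `window` tokens. Both are causal.
    Which layers are global is an assumption, not the model's config."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if global_layer:
        return causal
    return causal & (j > i - window)  # keep only the local window
```

Because window layers store KV entries only for the last `window` tokens, mixing them with a few global layers keeps long-context capability while sharply reducing KV-cache growth.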
A rigorous comparison was conducted between the refined model and `DeepSeekCoder`, leveraging the technique of `AsymmetricKVDimensions` to evaluate performance gains. This method, which allocates differing dimensionalities to the key and value projections in the attention mechanism, allowed for a nuanced assessment of the model’s ability to retain and utilize contextual information. Results indicate that the holistic integration of `Gemma3` with `HybridAttention`, coupled with `AsymmetricKVDimensions`, yields significant improvements over `DeepSeekCoder` in several key areas, including long-range dependency modeling and overall code generation quality. These findings highlight the efficacy of the combined approach and validate the benefits of optimizing attention mechanisms for enhanced performance in complex language tasks.
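Asymmetric KV dimensions simply mean the key projection and the value projection need not share a width, since keys only feed the score computation while values feed the output. A single-head sketch with illustrative shapes (`wq, wk: [d_model, d_k]`, `wv: [d_model, d_v]`, `wo: [d_v, d_model]`; all names are hypothetical):

```python
import numpy as np

def attention_asym(x, wq, wk, wv, wo):
    """Single-head causal attention sketch where the key width d_k
    differs from the value width d_v (asymmetric KV dimensions)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])          # scale by d_k
    # Causal mask: each position attends only to itself and earlier ones.
    scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return (w @ v) @ wo
```

Shrinking `d_k` relative to `d_v` (or vice versa) changes the per-token KV-cache cost independently for keys and values, which is why the dimensioning matters for cache-focused evaluations.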
Research is now directed towards extending these attention refinements to increasingly intricate multi-agent systems, where coordinating numerous independent entities demands robust contextual understanding and efficient information processing. Simultaneously, investigations are underway to develop novel quantization strategies, aiming to compress the model’s size and accelerate inference without sacrificing performance, a crucial step for deploying these advanced capabilities on resource-constrained devices. These parallel efforts promise to broaden the applicability of this work, facilitating its integration into real-world applications requiring both sophisticated reasoning and practical efficiency.
The pursuit of efficient multi-agent systems, as detailed in this work, echoes a fundamental truth about complex systems: optimization is rarely a final state. This research, by persisting quantized KV caches, doesn’t solve the problem of resource constraints on edge devices; it merely delays the inevitable entropy. As Robert Tarjan observed, “You can’t build systems-only grow them.” The authors haven’t constructed a static solution, but rather cultivated a method to extend the lifespan of viable inference, acknowledging the decaying nature of performance gains. This pragmatic approach, managing the rate of decay rather than attempting absolute prevention, aligns with a deeper understanding of architectural limitations.
Gardens Require Constant Tending
This work, in its careful husbandry of quantized key-value caches, reveals a truth often obscured by the pursuit of algorithmic efficiency: a system isn’t built, it’s grown. The reduction in latency and memory footprint achieved through persistent storage is not an endpoint, but a temporary reprieve. Each optimization introduces new vulnerabilities, new forms of decay. The edge device, relieved of immediate pressure, will inevitably demand more – a larger garden, more complex blooms. The question isn’t whether the cache will fail, but how it will fail, and what subtle imbalances will precipitate that failure.
The persistent memory offers forgiveness between components, a chance to recover from transient errors. But resilience isn’t about preventing storms; it’s about designing for graceful erosion. Future work must move beyond simply shrinking the cache and address the inevitable drift in calibration, the slow accumulation of errors that will render even the most carefully quantized model unreliable. A truly robust system will learn to tolerate, even embrace, its own imperfections.
The focus on multi-agent interaction introduces a further layer of complexity. Each agent, a competing ecosystem within the larger garden, introduces new dependencies and potential points of failure. The true challenge lies not in optimizing individual agents, but in fostering a symbiotic relationship – a shared understanding of limitations, and a willingness to forgive the inevitable trespasses of others. The garden will not manage itself.
Original article: https://arxiv.org/pdf/2603.04428.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/