Author: Denis Avetisyan
New research reveals a critical vulnerability in how large language models manage memory, potentially allowing subtle data corruption to alter outputs without detection.

Shared key-value cache blocks in LLM serving systems are susceptible to bit-flip attacks, but a checksum-based mitigation strategy can ensure data integrity.
While modern LLM serving systems prioritize performance through techniques like shared caching, they may inadvertently create new security vulnerabilities. This paper, ‘Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems’, demonstrates that shared prefix blocks within these systems are susceptible to silent data corruption via single-bit flips, leading to coherent but altered outputs. We find this corruption propagates selectively and accumulates persistently, creating a unique threat profile distinct from traditional weight corruption attacks. Could this vulnerability be exploited in real-world deployments, and what proactive integrity protections are necessary before such attacks are demonstrated?
Unveiling the Fragility of Accelerated Inference
The escalating demand for Large Language Model (LLM) applications necessitates a relentless pursuit of efficient serving infrastructure. Traditional methods struggle to meet the low-latency and cost requirements of real-world deployments, prompting innovation in both hardware and software. Specialized accelerators, like GPUs and increasingly, custom ASICs, are being engineered to accelerate the computationally intensive matrix multiplications at the heart of LLMs. Simultaneously, software optimizations – including quantization, pruning, and distillation – aim to reduce model size and complexity without significant performance degradation. This confluence of hardware and software advancements is critical, as the economic viability and user experience of LLM-powered services hinge on delivering rapid responses at a sustainable cost, paving the way for broader accessibility and novel applications.
To drastically reduce the computational burden of serving Large Language Models, a technique called prefix caching has emerged as a crucial optimization. This process capitalizes on the commonalities between successive requests; when multiple prompts share initial sequences – the ‘prefix’ – the model avoids redundant computations. Instead of recalculating the processing of this shared prefix for each new request, the results are stored and reused. This significantly lowers latency and cost, as the model only needs to compute the unique continuation of each prompt. Effectively, prefix caching transforms the serving process from repeatedly solving similar problems to retrieving pre-computed solutions and applying only the necessary incremental calculations, enabling faster response times and more efficient resource allocation.
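The reuse pattern described above can be illustrated with a minimal sketch. This is not the paper's or any serving system's actual implementation – the block size, `PrefixCache` class, and hash-keyed lookup are all illustrative assumptions – but it shows how two prompts sharing a prefix hit the same cached blocks:

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block (illustrative)

class PrefixCache:
    """Toy prefix cache: maps a hash of the token prefix to a cached block."""
    def __init__(self):
        self.blocks = {}   # prefix hash -> cached KV block (placeholder value)
        self.hits = 0
        self.misses = 0

    def _key(self, prefix_tokens):
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def lookup_or_compute(self, tokens, compute_block):
        """Walk the prompt block by block, reusing any block whose full prefix matches."""
        out = []
        for i in range(0, len(tokens), BLOCK_SIZE):
            prefix = tuple(tokens[: i + BLOCK_SIZE])  # key covers the entire prefix so far
            key = self._key(prefix)
            if key in self.blocks:
                self.hits += 1
            else:
                self.misses += 1
                self.blocks[key] = compute_block(prefix)
            out.append(self.blocks[key])
        return out

cache = PrefixCache()
system_prompt = list(range(32))  # a 32-token prefix shared by both requests
cache.lookup_or_compute(system_prompt + [100], lambda p: len(p))
cache.lookup_or_compute(system_prompt + [200], lambda p: len(p))
print(cache.hits, cache.misses)  # the second request reuses the two shared-prefix blocks
```

The second request recomputes only its final, divergent block; the two blocks covering the shared system prompt are served from the cache. That sharing is exactly what makes a corrupted prefix block affect every request that reuses it.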
While prefix caching dramatically accelerates large language model inference by storing and reusing previously computed prompt segments, this optimization introduces significant data vulnerabilities. The caching mechanism, designed for speed, inherently creates a storage dependency; if the cached data becomes corrupted – through hardware failure, software bugs, or even malicious interference – subsequent inferences built upon that corrupted prefix will yield inaccurate or compromised results. This poses a critical challenge, as identifying and mitigating such corruption is difficult without robust data integrity checks and redundant storage systems. Furthermore, the very nature of shared caching means a single point of failure can impact numerous users or applications, amplifying the potential for widespread errors and demanding heightened security protocols to protect the cached prefixes from unauthorized modification or deletion.
The KV Cache: A Foundation Built on Shifting Sands
The Key-Value (KV) Cache used in large language model (LLM) inference stores tensors in the Brain Floating Point 16 (BF16) format to maximize processing speed and reduce memory usage. BF16 uses only 16 bits per value – half the storage of FP32 – while retaining FP32’s 8-bit exponent, unlike FP16, which trades exponent range for mantissa precision. This reduced precision and storage density inherently increases the susceptibility of the KV Cache to bit flips – single-bit errors that alter stored values. Because each bit carries a large share of a value’s information – a flipped exponent bit can change a value by many orders of magnitude – a single bit flip directly corrupts the tensor data, potentially impacting downstream attention calculations and the LLM’s output. The dense storage also means that even a small number of bit flips can affect a substantial amount of cached data.
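A BF16 value is simply the top 16 bits of the corresponding float32, so the effect of a single exponent-bit flip is easy to demonstrate. The helper names below are illustrative, not from the paper:

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits (the BF16 bit pattern)."""
    return struct.unpack('>I', struct.pack('>f', x))[0] >> 16

def bf16_bits_to_f32(bits: int) -> float:
    """Re-expand a 16-bit BF16 pattern into a float32 value."""
    return struct.unpack('>f', struct.pack('>I', bits << 16))[0]

value = 1.0
bits = f32_to_bf16_bits(value)   # 0x3F80 for 1.0 (sign 0, exponent 127, mantissa 0)
flipped = bits ^ (1 << 13)       # flip one exponent bit

print(bf16_bits_to_f32(bits))     # 1.0
print(bf16_bits_to_f32(flipped))  # 2**-64, roughly 5.4e-20
```

One flipped bit collapses 1.0 to about 5.4e-20 – a change large enough to perturb attention scores, yet the tensor remains a perfectly valid BF16 value, so nothing downstream flags an error.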
Prefix Blocks within the Key-Value (KV) Cache represent a concentrated vulnerability point due to their function and access patterns. These blocks store prefix data shared across multiple requests, meaning a single block is repeatedly read during decoding by every request that shares the prefix. This frequent reuse multiplies the consequences of any corruption. Furthermore, the predictable nature of prefix data – the consistent repetition of initial tokens such as system prompts – allows attackers to target specific bits with a higher degree of certainty, maximizing the impact of a single bit flip. The combination of repeated access and predictable content makes Prefix Blocks significantly more susceptible to data corruption than other parts of the KV Cache.
Silent Divergence, resulting from bit flips within the KV Cache, manifests as LLM outputs that maintain grammatical correctness and apparent coherence despite containing factual errors or logical inconsistencies. This differs from typical failure modes which produce obviously incorrect or nonsensical text. The subtlety of these errors stems from the LLM continuing to generate plausible sequences based on the corrupted key-value data, effectively propagating the error throughout the output. Severity ranges from minor inaccuracies to complete fabrication of information, and the impact is amplified in tasks requiring high precision or where outputs are used for critical decision-making. Complete system failure, characterized by halting generation or producing entirely random outputs, is also possible but less frequent than the more insidious Silent Divergence scenarios.
Prefix caching, while enhancing inference speed by storing and reusing previously computed key-value tensors, introduces novel data integrity vulnerabilities. The storage of these tensors, particularly shared prefix data, creates a target for malicious bit flips. Research demonstrates that a single bit alteration within the cached key-value data can induce undetectable corruption of the LLM’s output, leading to coherent but factually incorrect responses. This differs from traditional error detection methods, as the LLM continues operation without signaling an error, making these attacks difficult to detect and potentially impactful in sensitive applications. The efficiency gains offered by prefix caching are therefore coupled with a new class of security risks centered on the integrity of the stored data.

Unearthing the Roots of Instability: Rowhammer and Beyond
Dynamic Random Access Memory (DRAM), commonly used in GPU memory, exhibits a physical vulnerability known as Rowhammer. This phenomenon occurs due to the capacitive coupling between adjacent rows within the DRAM chip. Repeatedly accessing (hammering) a specific row can induce electrical disturbances in neighboring rows, potentially causing bit flips – unintended changes in the stored data. These bit flips are not due to logical errors in the system but are a direct result of the physical properties of the DRAM itself. The effect is probabilistic and depends on factors such as DRAM density, operating temperature, and voltage levels, but has been consistently demonstrated across various DRAM generations and manufacturers.
Software Fault Injection (SFI) is a technique for intentionally introducing errors into a system to evaluate its robustness and error handling. It simulates hardware faults, such as bit flips in memory, through software means. While valuable for testing resilience – identifying how a system responds to and recovers from errors – SFI also presents a potential security concern: by deliberately inducing bit flips, an attacker could manipulate data or control program execution while bypassing standard security mechanisms. The controlled nature of SFI allows targeted error introduction, making it a precise, though potentially malicious, tool for system manipulation.
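The core of a software fault injector is small. A minimal sketch, assuming a cached block is just a writable byte buffer (the function name and seeded RNG are illustrative):

```python
import random

def inject_bit_flip(buffer: bytearray, rng: random.Random) -> int:
    """Flip one randomly chosen bit in-place, as a software fault injector would.

    Returns the flipped bit's index so an experiment can log it.
    """
    bit = rng.randrange(len(buffer) * 8)
    buffer[bit // 8] ^= 1 << (bit % 8)
    return bit

rng = random.Random(0)                 # seeded for reproducible trials
block = bytearray(32)                  # stand-in for a cached KV block (all zeros)
flipped_bit = inject_bit_flip(block, rng)
popcount = sum(bin(b).count('1') for b in block)
print(flipped_bit, popcount)           # exactly one bit now differs from the original
```

Repeating this across many trials and then replaying requests against the corrupted cache is the basic shape of the fault-injection campaigns described below.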
Persistent Accumulation within Prefix Blocks exacerbates the impact of bit flips induced by Rowhammer or software fault injection. The serving system organizes cached data into blocks, and requests sharing a common prefix – the initial portion of a prompt – reuse those same blocks. A single bit flip within a prefix block is therefore never isolated; every subsequent request using that prefix reads the corrupted data, allowing the error to persist and accumulate over time. This temporal persistence means that minor initial corruptions can escalate into larger, more impactful output deviations, significantly increasing the likelihood of system instability or malicious exploitation. Our experiments, comprising 2400 fault-injection trials and analysis of 24000 post-injection requests, confirmed this amplification of errors through prefix block accumulation.
Analysis of fault injection experiments, encompassing 2400 trials and subsequent examination of 24000 post-injection requests, demonstrates that induced bit flips exhibit selective propagation. This means corruption remains largely confined to requests utilizing the same prompt prefix as the initial, affected request. This behavior limits the scope of the error and increases the difficulty of detection, as the corruption doesn’t readily spread to unrelated operations; the temporal persistence observed reinforces this prefix-based confinement of errors.
Fortifying the Foundation: Towards Resilient Inference
Data integrity within large language model serving relies heavily on error detection mechanisms, and checksums offer a foundational approach to safeguarding stored data. Specifically, Prefix Blocks – segments of pre-computed model outputs – benefit from the implementation of algorithms like SHA-256, which generate a unique “fingerprint” of the data. This fingerprint is then stored alongside the Prefix Block; any subsequent alteration to the data will result in a mismatch between the calculated checksum and the stored value, signaling a corruption event. While relatively simple to implement, checksums provide a crucial first line of defense against both accidental data errors – arising from hardware glitches or software bugs – and malicious attempts to tamper with the model’s operation, ensuring the reliability of generated outputs.
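The verification flow described above is straightforward to sketch. This is a minimal illustration of the checksum idea using Python's standard `hashlib`, not the paper's implementation; the function names and the byte-buffer stand-in for a BF16 tensor are assumptions:

```python
import hashlib

def checksum(block: bytes) -> bytes:
    """SHA-256 fingerprint computed when a prefix block is first cached."""
    return hashlib.sha256(block).digest()

def verify(block: bytes, stored_digest: bytes) -> bool:
    """Recompute and compare before the block is reused for decoding."""
    return hashlib.sha256(block).digest() == stored_digest

block = bytearray(b'\x00' * 64)       # stand-in for cached BF16 tensor bytes
digest = checksum(bytes(block))
print(verify(bytes(block), digest))   # True: pristine block passes

block[10] ^= 0x01                     # a single silent bit flip
print(verify(bytes(block), digest))   # False: the flip is detected before reuse
```

Because any single-bit change alters the SHA-256 digest, a flip that would otherwise propagate silently is instead caught at lookup time, and the affected block can be evicted and recomputed.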
While checksums offer a foundational level of error detection within large language model serving systems, their efficacy diminishes when confronted with targeted and complex attacks. Simple checksum algorithms can be bypassed or manipulated by adversaries capable of precisely altering data without triggering detection, particularly those exploiting vulnerabilities in hardware or software. These sophisticated attacks can range from subtle bit-flip corruptions designed to influence model outputs, to more elaborate manipulations aimed at compromising the entire system. Consequently, relying solely on basic checksums creates a vulnerability that necessitates the implementation of more robust countermeasures, such as introducing randomness through techniques like cache salting, to enhance data integrity and resilience against evolving threats.
Cache Salt techniques introduce a layer of randomness into how Prefix Blocks – the foundational units of cached language model data – are stored and accessed. This approach deliberately varies the memory locations assigned to these blocks, effectively disrupting the predictability that timing attacks rely on to extract sensitive information. By obfuscating the physical arrangement of data, it becomes significantly more difficult for an attacker to correlate access times with specific data elements. Beyond security, this randomized storage also offers a degree of protection against bit flip corruption, a type of hardware error where individual bits are altered. The inherent unpredictability makes it harder for localized bit flips to predictably compromise the integrity of a specific Prefix Block, contributing to a more resilient and robust LLM serving infrastructure.
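One way to realize this idea is to mix a random salt into the key under which each Prefix Block is stored. The sketch below is an assumption about how such salting might look – the function name and salt handling are illustrative, not the paper's mechanism:

```python
import hashlib

def block_key(salt: bytes, prefix_tokens: tuple) -> str:
    """Salted lookup key for a prefix block.

    The salt shifts where identical prefixes land, so an outside observer
    cannot predict a block's identity from the prompt alone.
    """
    return hashlib.sha256(salt + repr(prefix_tokens).encode()).hexdigest()

# Same prefix, same salt: sharing still works within one deployment.
print(block_key(b'salt-A', (1, 2, 3)) == block_key(b'salt-A', (1, 2, 3)))  # True

# Same prefix, different salt: the mapping is unpredictable across deployments,
# frustrating attacks that rely on knowing which block holds which prefix.
print(block_key(b'salt-A', (1, 2, 3)) != block_key(b'salt-B', (1, 2, 3)))  # True
```

The trade-off is deliberate: within a deployment, deterministic keys preserve the performance benefit of sharing, while the secret salt denies an attacker the predictability needed to target a specific block.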
Large language model serving infrastructure can be significantly fortified against data corruption and malicious attacks through a combined approach to data integrity. A newly proposed checksum-based countermeasure demonstrably limits the scope of potential damage to a single batch of data, preventing cascading errors and maintaining system stability. Crucially, this protective measure introduces minimal performance overhead – only 0.11 tokens per second – a figure comparable to typical system variation. Rigorous testing, involving the injection of 120 simulated faults, confirmed the system’s effectiveness, achieving 100% fault detection with absolutely no false positives, thereby ensuring reliable and trustworthy LLM operation even in the face of hardware failures or software vulnerabilities.
The exploration of LLM serving systems and their vulnerabilities reveals a landscape where established architectures are not immutable truths, but rather challenges awaiting deconstruction. This work, detailing the bit-flip vulnerability in shared KV-cache blocks, echoes a sentiment articulated by Ada Lovelace: “The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform.” The engine, much like these LLM systems, operates within defined parameters. However, the discovery of silent corruption via bit-flip attacks demonstrates that even within those parameters, unexpected behaviors emerge when the underlying assumptions of data integrity are challenged. Understanding such ‘faults’ is not merely about fixing errors, but about truly knowing the limits – and potential – of the system itself.
Unraveling the Code
The demonstration of bit-flip vulnerabilities within shared KV-cache blocks isn’t merely a security concern; it’s a confirmation. It confirms the suspicion that the increasingly complex architectures powering large language models are, fundamentally, still built on a substrate susceptible to the most basic of failures. Reality, after all, is open source – it just hasn’t been fully read yet. This work peels back another layer, revealing the fragility hidden within optimization. The proposed checksum mitigation is a reasonable patch, but patches rarely address the underlying design philosophy.
Future work must move beyond simply detecting corruption. The focus should shift to preventing it at a more fundamental level. Exploring memory error correction codes tailored for the specific access patterns of LLM serving, or investigating hardware-level protections within GPUs themselves, offers a more robust, if more challenging, path. Furthermore, the silent nature of these bit-flips raises a broader question: how many other subtle corruptions are occurring undetected within these systems, shaping outputs in ways that are not attributable to the model’s parameters?
Ultimately, the pursuit of increasingly powerful models demands a concurrent pursuit of increasingly resilient infrastructure. This isn’t about building bigger firewalls; it’s about understanding the fundamental physics of computation and designing systems that gracefully degrade, rather than silently failing. The code is out there; it’s simply a matter of reverse-engineering the errors to fully comprehend the system.
Original article: https://arxiv.org/pdf/2604.17249.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-21 17:00