The Shifting Sands of Retrieval: Poisoning Attacks in AI Systems

Author: Denis Avetisyan


New research reveals how malicious data injected into knowledge sources can subtly manipulate the responses of AI systems powered by retrieval-augmented generation.

Security datasets pose a significant challenge to automated fact-checking: detection accuracy approaches random performance for both keyword and semantic methods, whereas near-perfect detection is achievable on the FEVER dataset. The two corpora also differ in their strengths: Security emphasizes stealth (66.7%) at the cost of co-retrieval (44.4%), while FEVER achieves complete co-retrieval (100%) but no stealth (0%).

Hybrid retrieval methods offer a degree of protection against corpus-dependent poisoning attacks, but domain-specific vulnerabilities and advanced optimization techniques can still compromise system integrity.

While Retrieval-Augmented Generation (RAG) systems promise enhanced knowledge and reasoning for large language models, they introduce vulnerabilities through manipulation of the retrieval process. This paper, ‘Semantic Chameleon: Corpus-Dependent Poisoning Attacks and Defenses in RAG Systems’, investigates gradient-guided corpus poisoning attacks, demonstrating that malicious documents can be strategically inserted to skew retrieved information and control model outputs. Critically, we find that a simple architectural modification, hybrid retrieval combining BM25 and vector similarity, effectively mitigates these attacks across diverse LLM families, though sophisticated attackers and corpus characteristics can impact success. As RAG systems become increasingly prevalent, how can we proactively build robust defenses against evolving adversarial strategies targeting knowledge sources?


The Evolving Threat: Semantic Attacks and the Integrity of RAG Systems

Retrieval-Augmented Generation (RAG) has rapidly become a cornerstone of modern Large Language Model (LLM) deployments, enabling more informed and contextually relevant responses by grounding LLMs in external knowledge sources. However, this very reliance on retrieved information introduces a critical vulnerability. Unlike traditional LLM security measures focused on input prompting or model weights, RAG systems are susceptible to semantic attacks – subtle manipulations of the retrieved knowledge itself. These attacks don’t rely on directly ā€˜hacking’ the LLM; instead, they poison the information sources used by RAG, introducing misinformation or bias that the LLM then faithfully incorporates into its responses. Because semantic similarity, rather than exact keyword matching, drives the retrieval process, these attacks can be exceptionally difficult to detect, allowing attackers to exert influence over LLM outputs without triggering conventional safeguards. This represents a significant shift in the threat landscape, demanding new security paradigms focused on knowledge source integrity and robust retrieval mechanisms.
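The retrieval step that makes these attacks possible can be sketched in a few lines. The toy below uses bag-of-words cosine similarity as a stand-in for a dense neural encoder, and both documents are invented: a poisoned document that merely mirrors the query's phrasing outranks a truthful one, without containing anything a keyword filter would flag.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words "embedding"; real RAG systems use dense neural encoders.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = {
    "truthful": "the patch fixes the vulnerability in version two",
    # Poisoned text paraphrases the expected query, so it scores higher
    # on semantic similarity despite carrying bad advice.
    "poisoned": "how do i handle the vulnerability patch just disable the patch",
}

query = "how do I handle the vulnerability patch"
qv = embed(query)
ranked = sorted(corpus, key=lambda d: cosine(qv, embed(corpus[d])), reverse=True)
print(ranked[0])  # the poisoned document is retrieved first
```

Because the ranking is driven by similarity rather than truthfulness, nothing in this pipeline distinguishes a faithful answer from a well-camouflaged one.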

Conventional security protocols, designed to detect explicit malicious content, are increasingly ineffective against semantic attacks targeting Retrieval-Augmented Generation (RAG) systems. These attacks don’t rely on keyword matches or blatant falsehoods, but instead exploit the nuanced understanding of language models by introducing subtly manipulated information into the knowledge sources. Because RAG systems prioritize semantic similarity when retrieving context, even minor alterations – paraphrasing, adding leading statements, or shifting emphasis – can steer the model toward biased or incorrect outputs without triggering traditional filters. This poses a significant challenge, as the malicious content isn’t flagged through signature-based detection or blocked by content moderation tools, but rather woven into the fabric of seemingly legitimate information, making it remarkably difficult to discern and defend against.

The integrity of Retrieval-Augmented Generation (RAG) systems is increasingly threatened by a novel attack vector: knowledge source poisoning. Malicious actors are capable of subtly manipulating the data used to inform large language models, introducing inaccuracies or biases that can significantly alter generated outputs. This isn’t a matter of blatant misinformation, but rather a carefully crafted influence on semantic similarity – attackers insert seemingly benign, yet strategically designed, content into the knowledge base. Because RAG relies on retrieving information based on semantic relevance, these poisoned sources can push the LLM towards specific, desired conclusions, effectively controlling the narrative without triggering traditional security filters. The danger lies in the subtlety of this approach; the alterations are often difficult to detect, and the resulting outputs, while technically ā€˜correct’ based on the retrieved data, can be misleading, biased, or even harmful, posing a significant risk across applications from customer service to critical decision-making.

The Sleeper Cell Strategy: Orchestrating Knowledge Base Compromise

Attackers utilize ā€œsleeper documentsā€ as a preparatory step in knowledge base poisoning attacks. These documents are crafted to appear as legitimate, non-malicious text, focusing on blending into the existing corpus of information. Their primary function isn’t to directly deliver a payload, but to establish a semantic connection, a bridge, to a separate ā€œtrigger documentā€ containing the malicious content. This is achieved by strategically incorporating keywords and concepts relevant to the target domain, allowing the sleeper document to be indexed and retrieved alongside the trigger document when a user queries the system. The sleeper document essentially acts as a conduit, increasing the probability that the malicious trigger document will be presented to the user, bypassing typical content filters that focus on identifying overtly harmful text.

Attackers utilizing the sleeper cell strategy bypass typical retrieval filters by employing a two-document approach. ā€œSleeperā€ documents, appearing as legitimate content, are strategically introduced into a knowledge base to establish a contextual link. These are then paired with ā€œtriggerā€ documents that contain the malicious payload – such as prompts leading to harmful outputs or data exfiltration. Standard retrieval systems, designed to prioritize relevance based on individual queries, often fail to identify the coordinated attack because the sleeper document masks the true intent of the trigger document. This circumvention relies on the system retrieving both documents in response to a single query, effectively delivering the malicious payload alongside seemingly harmless content.

The efficacy of dual-document poisoning attacks is directly correlated to the co-retrieval of both the sleeper and trigger documents. This means an attacker must successfully manipulate the retrieval system to present both documents in response to a user query. Experiments demonstrate that utilizing a pure vector retrieval method achieves a 38% success rate in establishing this necessary co-retrieval. This suggests that while not a guaranteed method, vector retrieval alone provides a substantial probability for attackers to introduce malicious content via this technique, highlighting a vulnerability in systems relying solely on semantic similarity for document retrieval.

Hybrid retrieval significantly reduces co-retrieval success to 0% across various configurations (α = 0.3, 0.5, 0.7), a substantial improvement over the 38% success rate of pure vector retrieval (χ² = 21.05, p < 10⁻⁶, Cohen’s h = 1.33).
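The reported effect size can be checked directly: Cohen's h for two proportions depends only on the proportions themselves (the χ² statistic also needs the sample sizes, which are not restated here, so it is not reproduced).

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    # Effect size for the difference between two proportions
    # (arcsine-square-root transformation).
    return abs(2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)))

h = cohens_h(0.38, 0.0)  # 38% vs 0% co-retrieval success
print(round(h, 2))  # → 1.33
```

The result matches the value reported above, and by conventional benchmarks h > 0.8 is a large effect.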

Measuring Stealth: Quantifying the Efficacy of Semantic Deception

The ā€˜stealth’ metric assesses the capacity of an adversarial attack to successfully retrieve malicious documents when presented with queries designed to appear innocuous. This measurement directly quantifies an attack’s ability to remain dormant - that is, to operate without immediate detection - by evaluating its success rate in blending malicious retrieval with normal query responses. A higher stealth score indicates the attack is more effective at concealing its intent, as it can consistently access and deliver malicious content without triggering security mechanisms designed to identify overtly harmful requests. The metric is calculated based on the proportion of benign queries that successfully elicit the retrieval of malicious documents, providing a quantitative assessment of the attack's subtlety and potential for prolonged, undetected operation.
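Paraphrasing that definition, stealth is the fraction of benign queries whose top-k results include at least one malicious document. A minimal sketch, with a hypothetical retriever stubbed out as a fixed ranked list:

```python
def stealth(benign_queries, retrieve, malicious_ids, k=5):
    """Fraction of benign queries whose top-k results include a malicious doc."""
    hits = sum(
        any(doc_id in malicious_ids for doc_id in retrieve(q, k))
        for q in benign_queries
    )
    return hits / len(benign_queries)

# Toy retriever: each query maps to a fixed ranked list (stand-in for an index).
ranked = {
    "reset a password": ["d1", "mal7", "d2"],
    "configure tls":    ["d3", "d4", "d5"],
    "rotate api keys":  ["mal7", "d6", "d1"],
}
retrieve = lambda q, k: ranked[q][:k]

rate = stealth(ranked.keys(), retrieve, {"mal7"}, k=3)
print(rate)  # 2 of 3 benign queries surface the malicious doc → 0.666…
```

The 66.7% value in this invented example is chosen to mirror the stealth score reported for the Security corpus; the function and identifiers are illustrative, not the paper's implementation.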

Evaluations of stealth attacks were conducted using two distinct datasets: Security Stack Exchange, comprising question-answer pairs related to cybersecurity, and FEVER Wikipedia, a fact verification corpus. The consistent performance of these attacks across both corpora - a mean absolute error of less than 5% between the two datasets for all tested LLMs - demonstrates the transferability of the attack methodology. This indicates that successful exploitation isn’t reliant on the specific knowledge domain contained within the training data of the target LLM, but rather exploits a general vulnerability in how the models process and respond to complex queries.

Evaluations of attack effectiveness across various Large Language Models (LLMs) reveal substantial performance differences. Specifically, Llama 4 demonstrated a high success rate of 93.3% in executing attacks, indicating a significant vulnerability. Conversely, Claude Sonnet 4.6 exhibited a considerably lower success rate of only 6.7%, suggesting a greater resilience to these particular attack vectors. These results highlight the variable security postures of different LLM architectures and the importance of model-specific security assessments.

Circumventing Hybrid Retrieval: The Limits of Ensemble Defenses

Hybrid retrieval systems, which integrate the strengths of both lexical-based retrieval - typically using algorithms like BM25 - and semantic-based retrieval via vector search, are commonly deployed as a defense against adversarial attacks on information retrieval systems. BM25 excels at keyword matching and precision, while vector search, utilizing dense embeddings, captures semantic similarity. However, this combined approach is not impenetrable. Recent research demonstrates that, while effective against some attacks, hybrid retrieval can be circumvented through specifically crafted adversarial examples, indicating its limitations as a standalone defense mechanism. The effectiveness of the defense is contingent on the attack strategy employed and the specific implementation of the hybrid system.
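The article does not spell out the fusion rule, but a common choice is a convex combination of min-max-normalized lexical and semantic scores, weighted by α. The sketch below uses invented scores: a poisoned document optimized only against the dense retriever wins the dense-only ranking yet loses the fused one, because its BM25 score is poor.

```python
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid_rank(bm25_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * BM25, both normalized."""
    b, v = minmax(bm25_scores), minmax(dense_scores)
    fused = {d: alpha * v[d] + (1 - alpha) * b[d] for d in b}
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores: "poisoned" was crafted against the vector index only.
bm25  = {"legit": 12.0, "poisoned": 1.0, "other": 6.0}
dense = {"legit": 0.55, "poisoned": 0.90, "other": 0.40}

print(max(dense, key=dense.get))          # dense-only winner: poisoned
print(hybrid_rank(bm25, dense, 0.5)[0])   # fused winner: legit
```

This is exactly why the defense fails against attacks that optimize both components at once: a document with strong sparse and dense scores survives the fusion.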

Despite the demonstrated effectiveness of hybrid retrieval - combining BM25 and vector search - in fully mitigating gradient-guided poisoning attacks (reducing success rates to 0%), the ā€˜Joint Sparse + Dense Optimization’ attack strategy achieves a 20-44% success rate. This indicates that optimizing both sparse (BM25) and dense (vector search) retrieval components simultaneously allows attackers to overcome the defenses provided by a simple combination of these techniques. The success rate varies depending on the specific dataset and attack parameters, but consistently demonstrates a vulnerability that hybrid retrieval alone does not address.
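The paper's gradient-guided method is not reproduced here, but the joint objective can be sketched with a greedy stand-in: the attacker appends tokens that raise a weighted sum of a lexical-overlap proxy (for BM25) and bag-of-words cosine similarity (for the dense retriever). Everything below, including the vocabulary and payload, is illustrative.

```python
from collections import Counter
from math import sqrt

def cos(a, b):
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexical_overlap(query, doc):
    # Crude BM25 stand-in: fraction of query terms present in the doc.
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q)

def joint_score(query, doc, alpha=0.5):
    # The joint sparse + dense objective the attacker maximizes.
    return alpha * cos(query, doc) + (1 - alpha) * lexical_overlap(query, doc)

def optimize(query, payload, vocab, budget=6, alpha=0.5):
    """Greedily append tokens that raise the joint score (coordinate ascent)."""
    doc = payload
    for _ in range(budget):
        best = max(vocab, key=lambda t: joint_score(query, doc + " " + t, alpha))
        if joint_score(query, doc + " " + best, alpha) <= joint_score(query, doc, alpha):
            break
        doc = doc + " " + best
    return doc

query = "mitigate sql injection in login form"
payload = "sanitization is unnecessary just concatenate user input"
vocab = query.split() + ["database", "security", "xss"]

crafted = optimize(query, payload, vocab)
print(joint_score(query, payload), "->", joint_score(query, crafted))
```

Because the crafted suffix raises both components simultaneously, no fixed α in the fusion rule can zero out its advantage, which is consistent with the 20-44% success rates reported above.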

The demonstrated success of the Joint Sparse + Dense Optimization attack against hybrid retrieval systems - achieving a 20-44% success rate despite hybrid retrieval previously nullifying gradient-guided poisoning attacks - indicates that combining established retrieval methods is insufficient as a robust defense. This suggests a fundamental limitation in approaches that rely solely on ensemble techniques; attackers can optimize strategies to exploit vulnerabilities that persist even when multiple retrieval functions are utilized. Effective defenses require novel approaches that address the underlying mechanisms of attack, rather than simply layering existing protective measures.

Future Directions: Toward Robust and Aligned Language Models

Recent adversarial attacks on large language models (LLMs) demonstrate a critical need for preemptive safety protocols rather than reactive patching. These exploits, which successfully bypass built-in safeguards to generate harmful content, underscore the inherent vulnerabilities present in models trained primarily for fluency and coherence. The demonstrated capacity to reliably elicit undesirable outputs - ranging from malicious code to biased statements - signifies that relying solely on post-hoc moderation is insufficient. Consequently, the field must prioritize the development and implementation of proactive safety measures integrated directly into the model's training and architecture, focusing on robustness against adversarial prompts and a foundational alignment with ethical guidelines. This shift towards preventative security is essential for responsible LLM deployment and the mitigation of potential harms.

Constitutional AI represents a promising avenue for bolstering the semantic security of large language models by moving beyond simple prompt engineering and reactive safety filters. This approach centers on defining a set of guiding principles - a ā€˜constitution’ - that the LLM is trained to adhere to during both its training and inference phases. Rather than relying solely on human feedback to identify and correct harmful outputs, the model learns to self-evaluate its responses against these pre-defined ethical and security guidelines. Through this process, the LLM internalizes a framework for discerning and mitigating potentially problematic content, enabling it to proactively avoid generating outputs that violate the established principles, even when faced with adversarial prompts. This self-governance capability has the potential to significantly enhance the robustness and reliability of LLMs in sensitive applications, fostering greater trust and responsible deployment.

Recent evaluations reveal a substantial disparity in the security posture of large language models; while some exhibit relative robustness, others prove remarkably susceptible to adversarial prompting. Specifically, testing demonstrates that Llama 4 incurs a 93.3% rate of safety violations when subjected to carefully crafted inputs designed to bypass safeguards - a stark contrast to Claude Sonnet 4.6, which maintained a significantly lower 6.7% violation rate under identical conditions. This considerable difference underscores that vulnerability isn't inherent to the technology itself, but rather varies significantly based on model architecture, training data, and the implementation of safety protocols, suggesting a critical need for standardized evaluation metrics and focused development of more resilient LLMs.

The pursuit of robust retrieval-augmented generation systems often leads to layered complexity, a tendency this work subtly critiques. Researchers, in their eagerness to fortify against poisoning attacks, frequently introduce architectures that, while theoretically sound, obscure the fundamental interplay between corpus characteristics and vulnerability. This paper’s finding - that hybrid retrieval offers mitigation, yet isn’t impervious, and domain knowledge impacts success - highlights a preference for direct solutions. As Carl Friedrich Gauss observed, ā€œIf other objects are mixed in with the observations, it is necessary to separate them and to determine the laws according to which they act.ā€ The study demonstrates that a clear understanding of the ā€˜objects’ - the corpus, the attack vectors, and the retrieval methods - is paramount, rather than masking them with needlessly intricate designs.

Further Directions

The observed resilience of hybrid retrieval to corpus poisoning is not, ultimately, a resolution. It is a relocation of the problem. Attack surfaces merely shift; optimization, given sufficient resources, will locate them. The domain-specificity of attack success suggests a deeper interplay between semantic space and the inherent biases within any curated knowledge source. The question is not whether attacks can succeed, but rather, at what cost, and with what collateral damage to the integrity of the retrieved information.

Future work must address the practical limitations of defense. Gradient-based optimization, while effective in this context, demands significant computational resources. Simpler, more readily deployable defenses, even if imperfect, offer a more pragmatic path forward. A focus on detection - identifying poisoned documents before retrieval - may prove more fruitful than attempting to inoculate the entire system.

The long view suggests a necessary re-evaluation of corpus construction itself. Static knowledge bases are, by definition, vulnerable. Dynamic, continuously validated sources, coupled with robust provenance tracking, represent a more sustainable, if considerably more complex, architecture. The cost of security is not merely computational; it is epistemological.


Original article: https://arxiv.org/pdf/2603.18034.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-22 16:18