Stress-Testing AI Agents: A New Approach to Security

Author: Denis Avetisyan

Researchers have developed an adaptive architecture for systematically evaluating and improving the safety of complex AI agents against adversarial attacks.

The AJAR pipeline systematically deconstructs complex robotic tasks into modular components, enabling a granular understanding and subsequent reassembly of behavioral primitives.

This paper introduces AJAR, an architecture leveraging protocol-driven orchestration and agentic runtimes to perform more realistic red-teaming of large language model-based agents and assess action safety.

As Large Language Models transition from passive chatbots to autonomous agents, ensuring their safety requires a shift from content moderation to robust action security, yet current red-teaming frameworks struggle with complex, multi-turn agentic exploits. To address this gap, we introduce ‘AJAR: Adaptive Jailbreak Architecture for Red-teaming’, a novel framework leveraging protocol-driven cognitive orchestration built upon the Petri runtime to decouple adversarial logic and facilitate standardized, plug-and-play testing of advanced algorithms like X-Teaming. Our work demonstrates AJAR’s ability to perform stateful backtracking within tool-use environments and reveals a complex interplay between code execution vulnerabilities and cognitive load in persona-based attacks. Will this adaptive architecture enable more effective, environment-aware evaluation of the rapidly evolving attack surface of agentic LLMs?

Unveiling the Agentic Gap: A System Under Stress

The increasing capacity of large language models to utilize external tools introduces a significant, and often overlooked, vulnerability termed the ‘Agentic Gap’. While intended to enhance functionality, tool use can paradoxically amplify risk by creating a cognitive shield – the LLM may defer reasoning to the tool itself, obscuring flawed logic or malicious code execution. This isn’t simply a matter of a compromised tool; the very process of employing tools can open new attack vectors, allowing adversarial prompts to indirectly manipulate outcomes through tool interactions. Consequently, traditional safety evaluations, focused primarily on direct prompt injection, are becoming insufficient, as they fail to account for these emergent risks arising from the dynamic interplay between the LLM and its environment. The ability to act through tools, therefore, necessitates a shift in safety paradigms, demanding evaluations that assess not just the LLM’s inherent reasoning, but also its decision-making process when leveraging external capabilities.

Current evaluations of large language model safety predominantly center on direct prompt injection – attempts to manipulate the model through cleverly worded instructions. However, this approach overlooks a crucial dynamic: as these models gain the ability to utilize tools and interact with external environments, new vulnerabilities emerge that are not addressed by simply securing the prompt itself. This ‘agentic gap’ arises because tool use introduces a layer of complexity, allowing models to execute actions with real-world consequences, potentially circumventing safeguards designed to control outputs. Traditional methods fail to account for risks stemming from the process of achieving a goal – the steps taken with tools – rather than solely focusing on the final, generated text. Consequently, a model deemed safe under standard prompt injection tests may still exhibit dangerous behavior when operating as an agent, highlighting the need for safety assessments that consider the full scope of tool-augmented interaction.

The burgeoning Agentic Gap arises from a fundamental disconnect between how large language models ‘understand’ their surroundings and how they actually operate within them through tool use. These models, trained primarily on textual data, develop a simulated world view, but lack embodied experience or genuine environmental awareness. When granted access to tools – search engines, APIs, or even physical robots – LLMs attempt to navigate the real world based on this incomplete, text-derived understanding. This mismatch can lead to misinterpretations of tool outputs, flawed task decomposition, and ultimately, unpredictable or harmful behavior. The very act of employing a tool doesn’t inherently imbue the LLM with common sense; instead, it amplifies the consequences of existing cognitive limitations, creating vulnerabilities that traditional safety evaluations – focused on direct input manipulation – often fail to detect. The model isn’t simply responding to a prompt; it’s actively acting in an environment, and its actions are shaped by a world model that may be fundamentally inaccurate.

AJAR: Reverse-Engineering Agency for Robustness

AJAR, or Adaptive Jailbreaking for Agentic Systems, is constructed on the Petri agentic runtime, which provides the foundational infrastructure for evaluating the capabilities and limitations of target models. This architecture enables a dynamic and iterative assessment process, moving beyond static vulnerability scans. Petri facilitates the creation of an environment where various attack strategies can be systematically deployed and analyzed. The runtime’s agentic nature allows for the instantiation of specialized agents designed to probe the target model’s responses and identify potential weaknesses in its security and alignment. This robust environment is critical for determining the extent to which a model can be manipulated or induced to exhibit unintended behaviors.

Protocol-driven Cognitive Orchestration (PCO) represents a structured attack methodology within the AJAR framework, moving beyond reliance on single-turn prompt engineering. PCO utilizes an ‘Auditor Agent’ to execute multi-step attacks defined by pre-defined protocols, enabling the systematic exploration of a Target Model’s vulnerabilities. These protocols detail a sequence of actions, observations, and conditional branching, allowing the Auditor Agent to adapt its strategy based on the Target Model’s responses. This orchestrated approach facilitates a more thorough assessment of the Target Model’s robustness by simulating complex interactions and identifying weaknesses that simple prompt manipulations would likely miss. The use of protocols also allows for repeatable and quantifiable evaluations of the Target Model’s behavior under various attack scenarios.

AJAR utilizes ‘Synthetic Tools’ – programmatically generated functionalities presented to the Target Model – to probe its self-awareness of agency and capabilities. These tools are not intended for practical task completion, but rather as stimuli designed to elicit responses revealing the model’s internal representation of its own permissions, limitations, and operational scope. By analyzing how the Target Model interprets and interacts with these synthetic functionalities, AJAR can determine whether the model accurately assesses its boundaries, potentially exposing vulnerabilities related to overestimation or underestimation of its capabilities, and informing targeted jailbreaking strategies beyond simple prompt-based attacks.

DeepSeek V3.2: A Controlled Stress Test

For validation testing, both the Auditor Agent and the Target Model were implemented within the AJAR framework utilizing DeepSeek V3.2. DeepSeek V3.2 is a large language model specifically selected for its demonstrated capacity in complex reasoning tasks, making it suitable for both generating sophisticated adversarial attacks (as the Auditor) and serving as the subject of those attacks (as the Target Model). This instantiation allows for a controlled environment to assess the robustness of agentic behavior and identify potential vulnerabilities through targeted prompting and evaluation.

The Auditor Agent employs a multi-faceted approach to vulnerability assessment, moving beyond simple prompt injection. Its architecture incorporates complex branching logic, enabling it to dynamically adjust attack vectors based on the Target Model’s responses. These strategies include, but are not limited to, goal hijacking, prompt leaking attempts, and the exploitation of agentic memory and tool use. The agent evaluates the Target Model’s behavior after each interaction, branching to different attack paths based on observed weaknesses and iteratively refining its approach to maximize the likelihood of exposing exploitable vulnerabilities in the Target Model’s agentic workflows.

AJAR’s validation methodology, employing the DeepSeek V3.2 model for both auditing and target agents, facilitates the discovery of vulnerabilities not typically identified by conventional safety evaluations. Traditional assessments often rely on static analysis or predefined test cases, which are insufficient to expose emergent, agentic behaviors and complex interaction-based weaknesses. AJAR, through its dynamic, adversarial testing and branching logic, actively probes the Target Model’s responses to a wider range of stimuli and edge cases, revealing potential failures in alignment, robustness, and adherence to safety guidelines that would otherwise remain undetected.

Beyond Static Benchmarks: Towards Adaptive Safety

Existing methods for evaluating large language model safety, such as datasets like HarmBench and adversarial techniques like X-Teaming, frequently employ static benchmarks that present a limited snapshot of potential risks. These approaches often struggle to capture the nuanced and unpredictable behavior of increasingly agentic systems – those capable of autonomous planning and execution. Because static datasets rely on pre-defined prompts and scenarios, they may fail to uncover vulnerabilities that emerge from complex interactions and novel strategies developed by the language model itself. This limitation is particularly concerning as LLMs become more sophisticated, capable of ‘gaming’ established tests or exhibiting unintended consequences in dynamic, real-world contexts, highlighting the need for more adaptive evaluation methodologies.

The Adaptive Jacobian Red-Teaming, or AJAR, represents a shift in how large language model (LLM) safety is assessed. Unlike traditional red-teaming methods that rely on pre-defined adversarial prompts, AJAR employs a dynamic system capable of evolving its attack strategies in response to the LLM’s defenses. This is achieved by iteratively perturbing prompts based on the LLM’s gradient – essentially, identifying which subtle changes are most likely to elicit harmful responses. By probing the ‘Jacobian’ of the LLM’s behavior, AJAR can uncover emergent vulnerabilities – unexpected weaknesses that arise from the complex interactions within the model – and push the boundaries of LLM agency to reveal potential failure modes. This adaptive process allows for a more thorough and realistic evaluation of safety, moving beyond static benchmarks and providing a more robust understanding of an LLM’s limitations as capabilities advance.

The accelerating pace of large language model development demands a corresponding evolution in safety evaluation techniques. Traditional benchmarks, while valuable starting points, struggle to anticipate the novel behaviors arising from increasingly sophisticated agentic systems. This new methodology addresses this critical gap by establishing a dynamic framework for red-teaming, one that doesn’t rely on pre-defined attack patterns but instead adapts and evolves alongside the LLM’s capabilities. By continuously probing for emergent vulnerabilities through an iterative process of challenge and response, this approach moves beyond static assessments to provide a more resilient and future-proof foundation for AI safety evaluations. Ultimately, this adaptive red-teaming offers a pathway to proactively identify and mitigate risks, ensuring that safety measures remain effective as LLMs become ever more powerful and complex.

The pursuit within AJAR-an adaptive jailbreaking architecture-mirrors a fundamental tenet of understanding any complex system. It doesn’t simply accept the boundaries presented, but actively probes them. This echoes Marvin Minsky’s assertion: “You can’t always get what you want, but you can get what you need to know.” The architecture’s protocol-driven cognitive orchestration, designed for complex adversarial testing, isn’t about malicious intent, but about identifying vulnerabilities – the ‘need to know’-within the LLM agent’s safety mechanisms. Every exploit, as the work demonstrates, starts with a question, not with intent, and AJAR systematically asks those questions to reverse-engineer the limits of model context protocols and action safety.

What’s Next?

The architecture detailed within pushes against the comfortable notion of ‘safe’ agents, and rightly so. The current approach, focusing on protocol-driven adversarial testing, merely automates the discovery of existing failure modes. A more interesting challenge lies in provoking novel failures – constructing inputs not anticipated by the training data, or exploiting emergent behaviors from the agent’s own cognitive orchestration. The system isn’t truly ‘broken’ until it surprises its creators, revealing a blind spot in the underlying assumptions about intelligence and control.

Future iterations will inevitably require a deeper engagement with the problem of reward hacking. An agent optimized to ‘win’ at red-teaming may prioritize demonstrating vulnerabilities over actually using them responsibly, a distinction crucial for real-world deployment. The question isn’t simply whether an agent can be jailbroken, but whether it can be induced to do so in a way that is both informative and, ironically, contained.

Ultimately, this work serves as a useful demolition. It dismantles the idea of a permanently ‘safe’ agent, exposing the inherent fragility of any complex system built on probabilistic prediction. The real innovation won’t be in building better defenses, but in accepting that perfect security is an illusion, and focusing instead on graceful degradation and rapid adaptation when – not if – the inevitable cracks appear.

Original article: https://arxiv.org/pdf/2601.10971.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

Unveiling the Agentic Gap: A System Under Stress

AJAR: Reverse-Engineering Agency for Robustness

DeepSeek V3.2: A Controlled Stress Test

Beyond Static Benchmarks: Towards Adaptive Safety

What’s Next?

See also: