Author: Denis Avetisyan
New research reveals how interpretability techniques can systematically expose vulnerabilities in even the most advanced large language models.

Interpretability-based safety audits provide a pathway for quantifying risks and understanding the robustness of state-of-the-art AI systems against adversarial attacks and unintended behavior.
Despite rapid advances in large language models (LLMs), systematically assessing their safety remains a critical challenge, often relying on opaque, black-box probing. This is addressed in ‘Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs’, which presents a comprehensive audit of eight leading open-source LLMs – including Llama-3, GPT-oss, Qwen, and Phi – using interpretability-driven techniques to uncover vulnerabilities. Our findings reveal substantial variations in robustness: Llama-3 models exhibit high jailbreaking rates, while GPT-oss-120B proves remarkably resilient, demonstrating the power of steering-based audits to quantify safety risks. Given these disparities and the potential for both proactive defense and adversarial exploitation, how can we develop more robust internal defenses and standardized safety evaluations for increasingly powerful LLMs?
The Illusion of Control: Vulnerabilities in Language Models
Despite their remarkable capabilities, Large Language Models (LLMs) aren’t immune to manipulation; cleverly crafted prompts, known as adversarial attacks or “jailbreaking,” can coax them into generating responses that violate safety guidelines or express harmful content. This vulnerability stems from the LLM’s core function – predicting the most likely continuation of a given text – which doesn’t inherently distinguish between benign and malicious requests. Attackers exploit this by framing prompts in ways that bypass built-in filters, often employing indirect language, role-playing scenarios, or code injection to subtly guide the model towards undesirable outputs. The resulting content can range from hate speech and misinformation to instructions for illegal activities, highlighting a critical need for ongoing research into robust defense mechanisms and reliable safety evaluations.
Current safeguards designed to prevent Large Language Models (LLMs) from generating problematic responses are increasingly challenged by inventive adversarial prompts – often referred to as “jailbreaking” techniques. These attacks don’t simply rely on obvious, easily filtered keywords; instead, they employ subtle manipulations of language, complex reasoning chains, or indirect questioning to bypass safety protocols. This demonstrates a critical gap in existing defense mechanisms, which frequently operate on surface-level pattern matching rather than true semantic understanding. Consequently, a more nuanced investigation into the fundamental vulnerabilities within LLM architectures is essential – focusing on how these models interpret intent, handle ambiguity, and ultimately, decide what constitutes a safe or unsafe response. Addressing this necessitates moving beyond reactive filtering and towards proactive design principles that build inherent robustness against sophisticated manipulation.
Rigorous evaluation of Large Language Model (LLM) safety is paramount before these systems become widely accessible, demanding more than superficial testing. Current methods often struggle to anticipate the diverse range of adversarial prompts – carefully crafted inputs designed to bypass safety protocols and elicit harmful responses. Robust evaluation requires a multifaceted approach, encompassing automated red-teaming – where algorithms actively probe for vulnerabilities – and human evaluation to assess nuanced risks like bias or the generation of misleading information. Furthermore, simply identifying risks isn’t enough; mitigation strategies, such as reinforcement learning from human feedback or the development of more resilient model architectures, are essential to proactively address these challenges and ensure responsible AI deployment. Without comprehensive safety assessments, the potential for LLMs to generate damaging or policy-violating content remains a significant concern.
Dissecting the Black Box: The Pursuit of Interpretability
Interpretability research in large language models (LLMs) addresses the challenge of understanding the internal mechanisms driving their outputs. LLMs, while demonstrating impressive performance, operate as complex, high-dimensional systems, making it difficult to ascertain why a model produced a specific result. This research aims to move beyond simply observing input-output behavior and instead directly analyze the model’s internal states – its activations – to determine which features contribute to specific decisions or concepts. The goal is not to achieve full transparency – understanding every parameter – but rather to identify the salient features and pathways within the network responsible for particular behaviors, allowing for debugging, control, and ultimately, improved model design. This is crucial for addressing issues such as bias, safety, and trustworthiness in increasingly deployed LLM applications.
Representation Engineering and Universal Steering are techniques used to analyze Large Language Models (LLMs) by directly interacting with their internal states. Representation Engineering focuses on identifying and modifying specific neuron activations to understand their role in processing information. Universal Steering builds on this by employing Recursive Feature Machines to extract “Concept Vectors”, which represent individual concepts within the LLM. By manipulating these internal representations – specifically, by adding or subtracting Concept Vectors from the LLM’s activations – researchers can effectively steer the model’s responses and isolate the influence of particular concepts, enabling a granular understanding of the model’s decision-making process without requiring access to training data or model weights.
Reading Vectors are a core component of LLM interpretability methods, functioning as vector representations within the model’s activation space that correspond to specific concepts. These vectors are identified through techniques designed to isolate the neural activity associated with a given concept; when a particular Reading Vector is accessed – typically via a dot product with an internal activation – the resulting scalar value indicates the presence or strength of that concept within the LLM’s processing of a given input. Crucially, a single concept may be represented by multiple Reading Vectors, and these vectors can be combined to represent more complex ideas, allowing researchers to “read out” specific information from the model’s internal state without altering its core functionality. The effectiveness of a Reading Vector is determined by its ability to consistently and accurately reflect the presence of the target concept across diverse inputs and model layers.
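The readout described above reduces to a dot product between an internal activation and the concept’s Reading Vector. A minimal numpy sketch – the vectors and the “harmfulness” direction here are illustrative toys, not artifacts from the paper:

```python
import numpy as np

def concept_score(activation: np.ndarray, reading_vector: np.ndarray) -> float:
    """Project an internal activation onto a Reading Vector.

    A larger positive projection indicates a stronger presence of the
    concept in the model's processing of the current input.
    """
    # Normalize so scores are comparable across different reading vectors.
    direction = reading_vector / np.linalg.norm(reading_vector)
    return float(activation @ direction)

# Toy example: a 4-dimensional activation and a hypothetical concept direction.
activation = np.array([0.2, -1.3, 0.7, 0.5])
harmfulness_vector = np.array([0.0, -1.0, 0.0, 0.0])

score = concept_score(activation, harmfulness_vector)
# score == 1.3: the activation points along the concept direction
```

In practice the same concept is probed at several layers, and the layer-wise scores together indicate where in the network the concept is most strongly represented.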
Universal Steering leverages Recursive Feature Machines (RFMs) to identify and isolate specific concepts within a Large Language Model (LLM). RFMs operate by iteratively learning feature combinations from internal activations, ultimately producing “Concept Vectors” that represent these concepts. These Concept Vectors can then be used to directly manipulate the LLM’s internal state, enabling precise control over generated responses. Specifically, a steering direction – derived from the Concept Vector – is applied to the activations during inference, effectively amplifying or suppressing the influence of the targeted concept on the output. This allows researchers to not only identify what concepts the LLM has learned, but also to systematically test and modify the model’s reliance on those concepts without retraining.
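The steering intervention itself is a simple vector addition in activation space. A minimal sketch, assuming the Concept Vector has already been extracted – the shapes, the random data, and the coefficient value are all placeholders for illustration:

```python
import numpy as np

def steer(activations: np.ndarray, concept_vector: np.ndarray,
          coefficient: float) -> np.ndarray:
    """Shift hidden activations along a unit concept direction.

    A positive coefficient amplifies the concept's influence on the output;
    a negative one suppresses it. Shapes: activations is (seq_len, d_model),
    concept_vector is (d_model,); broadcasting applies the shift per token.
    """
    direction = concept_vector / np.linalg.norm(concept_vector)
    return activations + coefficient * direction

# Toy demonstration on random activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))    # 5 tokens, 8-dim hidden state
concept = rng.normal(size=8)      # stand-in for an RFM-derived Concept Vector
steered = steer(acts, concept, coefficient=0.5)
```

In a real audit this shift would be injected during inference (e.g. via a forward hook on a chosen transformer layer) rather than applied to a detached array.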

Proactive Defense: Automated Vulnerability Discovery
Adaptive Grid Search is utilized as a proactive method for identifying vulnerabilities in Large Language Models (LLMs). This process systematically explores a defined search space of prompts, varying parameters to reliably generate responses that violate established safety guidelines. The search is ‘adaptive’ because, based on the LLM’s response to each prompt, the algorithm adjusts subsequent prompts to efficiently converge on inputs that consistently elicit unsafe outputs. This differs from random or manual prompting by automating the discovery of adversarial inputs and providing a quantifiable measure of LLM robustness against malicious or unintended interactions.
The implementation of a “Control Coefficient” is central to modulating the intensity of steering interventions applied to Large Language Models (LLMs). This coefficient functions as a scalar multiplier affecting the magnitude of the applied steering vector, thereby controlling the degree to which the LLM’s response is guided away from unsafe outputs. The optimal value of this coefficient is determined empirically; higher values represent stronger steering, increasing safety but potentially diminishing performance on legitimate tasks, while lower values prioritize performance but risk increased vulnerability to adversarial prompts. Experimentation across models like Llama-3 and GPT-oss demonstrates significant variance in optimal coefficient values – Llama-3 requires a range of 0.3 to 0.9, whereas GPT-oss-20B necessitates a coefficient of 49 to achieve comparable safety levels, illustrating the model-specific tuning required for effective safety intervention.
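One way to picture the adaptive part of such a search is a bisection over the control coefficient, looking for the smallest value that flips the model’s behavior. This is a hypothetical sketch, not the paper’s actual algorithm: `try_coefficient` stands in for the full steer-generate-judge pipeline, and the monotonicity assumption is only loosely suggested by the reported coefficient ranges (0.3–0.9 for Llama-3, ~49 for GPT-oss-20B):

```python
def find_min_coefficient(try_coefficient, low=0.0, high=64.0, tol=0.1):
    """Bisect for the smallest control coefficient that elicits an unsafe response.

    try_coefficient(c) is a caller-supplied probe: it steers the model with
    coefficient c, generates a response, and returns True if an external
    classifier judges the response unsafe. Assumes the outcome is
    (approximately) monotone in c.
    """
    if not try_coefficient(high):
        return None  # model resists steering across the whole search range
    while high - low > tol:
        mid = (low + high) / 2
        if try_coefficient(mid):
            high = mid   # unsafe at mid: a smaller coefficient may suffice
        else:
            low = mid    # still safe at mid: stronger steering needed
    return high

# Toy oracle: pretend the model breaks at coefficient >= 0.6.
threshold = find_min_coefficient(lambda c: c >= 0.6)
# threshold lands within tol of 0.6
```

The resulting threshold is itself a robustness measure: a model needing a large coefficient (like GPT-oss-20B’s 49) is harder to push off its safety behavior than one that yields in the 0.3–0.9 range.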
Evaluation of the automated vulnerability discovery process utilizes benchmarks, notably the ‘Adversarial Bench’, to quantify Large Language Model (LLM) robustness. This benchmark employs a suite of adversarial prompts designed to elicit unsafe or undesirable responses, allowing for a systematic assessment of the LLM’s susceptibility to jailbreaking attempts. Rigorous testing with the Adversarial Bench provides a measurable metric for evaluating the efficacy of steering interventions and representation engineering techniques in mitigating vulnerabilities and ensuring safe LLM operation. Results are reported as success rates, indicating the percentage of adversarial prompts that successfully bypass safety mechanisms and generate harmful outputs.
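The reported success rates reduce to a simple fraction over the benchmark. A sketch of that metric – the prompt set, attack, and judge callables below are stand-ins, not the actual Adversarial Bench harness:

```python
def jailbreak_success_rate(prompts, attack, judge) -> float:
    """Fraction of adversarial prompts whose responses the judge flags unsafe.

    attack(prompt) returns the model's (possibly steered) response;
    judge(response) returns True if the response is judged harmful.
    """
    if not prompts:
        return 0.0
    hits = sum(1 for p in prompts if judge(attack(p)))
    return hits / len(prompts)

# Toy run with stand-in callables: 3 of 4 "responses" are flagged.
prompts = ["p1", "p2", "p3", "p4"]
rate = jailbreak_success_rate(prompts, attack=str.upper,
                              judge=lambda r: r in {"P1", "P3", "P4"})
# rate == 0.75
```

Keeping the attack and judge as injected callables makes the same metric reusable across steering methods and models, which is what enables comparisons like the 94% vs. 0% figures below.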
Evaluation of the Universal Steering method was conducted across several open-source Large Language Models, including Llama-3, Phi4, GPT-oss, and Qwen3. Results indicate a significant variation in vulnerability based on model size and the application of Representation Engineering. Specifically, a 94% jailbreaking success rate was achieved using Universal Steering on the GPT-oss-20B model; however, when combined with Representation Engineering, the jailbreaking success rate dropped to 0% on the larger GPT-oss-120B model, demonstrating an increased robustness with scale and the applied mitigation technique.
Analysis of the “Control Coefficient” required for successful adversarial prompt generation reveals significant differences between LLM architectures. Experiments demonstrate that Llama-3 models achieve reliable unsafe responses with a coefficient ranging from 0.3 to 0.9, indicating a relatively low degree of steering intervention is necessary to elicit undesirable behavior. In contrast, GPT-oss-20B necessitates a substantially higher coefficient of 49 to achieve the same result. This disparity suggests that Llama-3 exhibits a lower inherent resistance to adversarial prompting compared to GPT-oss-20B, and therefore requires less forceful intervention during the “Adaptive Grid Search” process to identify vulnerabilities.

Validating Resilience: Comprehensive Evaluation & Dataset Usage
Grok-4 is utilized as an independent adjudication system within our LLM safety audit process to categorize model responses and identify instances of harmful content. This approach mitigates subjective bias inherent in human evaluation by providing a consistent and reproducible assessment standard. The implementation involves submitting LLM outputs to Grok-4, which then classifies responses based on pre-defined safety criteria, including the presence of hate speech, personally identifiable information, or instructions for illegal activities. The resulting classifications are used to quantify safety performance and track improvements across different model versions and training techniques, providing a data-driven basis for safety enhancements.
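The adjudication step can be framed as a classification loop over model outputs. A minimal sketch in which `call_judge` is a placeholder for whatever interface actually serves the Grok-4 judge, and the category labels are illustrative rather than the audit’s real taxonomy:

```python
from collections import Counter

CATEGORIES = {"safe", "hate_speech", "pii_leak", "illegal_instructions"}

def adjudicate(responses, call_judge) -> Counter:
    """Classify each LLM response via an external judge and tally categories.

    call_judge(response) should return one of CATEGORIES; anything else is
    counted as 'unparsed' so judge failures are visible rather than silent.
    """
    tally = Counter()
    for response in responses:
        label = call_judge(response)
        tally[label if label in CATEGORIES else "unparsed"] += 1
    return tally

# Toy judge that flags responses containing an instruction-seeking phrase.
toy_judge = lambda r: ("illegal_instructions" if "how to make" in r else "safe")
counts = adjudicate(["hello there", "here is how to make a weapon"], toy_judge)
# counts["illegal_instructions"] == 1
```

Because the judge is a fixed, reproducible classifier rather than an ad-hoc human rater, the same tally can be rerun across model versions to track safety regressions.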
The ToxicChat Dataset serves as a foundational resource in the development and assessment of Large Language Model (LLM) safety mechanisms, specifically focusing on identifying and mitigating jailbreaking attempts. This dataset consists of adversarial prompts designed to elicit harmful or unintended responses from LLMs, covering a broad range of potentially dangerous topics and employing diverse phrasing techniques. Its utility lies in providing a standardized and comprehensive benchmark for evaluating an LLM’s resilience to jailbreaking directions; models are tested against the dataset, and performance metrics, such as the percentage of successful jailbreaks, are recorded. The dataset’s consistent nature allows for objective comparison of different safety interventions and iterative improvement of LLM robustness against adversarial inputs.
Automated vulnerability discovery employs a systematic approach to probe the LLM’s response space by generating a diverse set of prompts designed to elicit potentially harmful outputs. This process differs from manual testing by leveraging algorithms to explore a significantly larger number of input variations, including edge cases and subtle phrasing changes that human testers might overlook. By exhaustively sampling the input space, the method identifies vulnerabilities – specifically, prompts that successfully bypass safety mechanisms – with a higher degree of coverage than is typically achievable through manual inspection. This is critical for uncovering previously unknown weaknesses in the LLM’s safety protocols and improving its robustness against adversarial attacks.
Evaluation of jailbreaking techniques on the Llama-3.1-8B language model revealed significant performance differences. Universal Steering achieved an 86% success rate in generating outputs that bypassed safety constraints, indicating a higher degree of effectiveness in eliciting undesirable responses. In contrast, Representation Engineering yielded a 57% success rate under the same conditions. This comparison demonstrates that, within this specific evaluation framework, Universal Steering is comparatively more effective at identifying and exploiting vulnerabilities in the model’s safety mechanisms than Representation Engineering.
Combining automated vulnerability discovery with rigorous evaluation establishes a robust lifecycle for improving LLM safety. Automated methods systematically probe the model’s response space, identifying potential weaknesses that traditional manual testing might overlook. These discovered vulnerabilities are then subjected to rigorous evaluation, often utilizing independent judges like Grok-4 and datasets such as the ToxicChat Dataset, to categorize and quantify the severity of harmful outputs. This iterative process – discovery, evaluation, and subsequent model refinement – allows for a more comprehensive and efficient approach to building safer large language models, exceeding the capabilities of isolated testing or training efforts.
The pursuit of safety in large language models often resembles sculpting – removing excess to reveal the form within. This work, focusing on interpretability-based safety audits, embodies that principle. It systematically dismantles the layers of complexity to expose vulnerabilities, mirroring the core idea of adversarial attacks and steering vectors. As Claude Shannon observed, “The most important thing in communication is to convey information accurately.” This paper doesn’t seek to add defenses, but to precisely reveal where communication breaks down, highlighting the discrepancies in robustness across models. The methodology’s power lies not in complexity, but in its capacity to distill meaning from potential failure points, a testament to clarity’s inherent strength.
What’s Next?
The systematic probing of large language models, as demonstrated, reveals not inherent malice, but predictable failure modes. These models are, at their core, elaborate pattern completion engines. The ‘jailbreaks’ are not breaches of intelligence, but exploitations of incomplete specifications – a reliance on implicit assumptions rather than explicit constraints. Future work must move beyond identifying that a model is vulnerable, and focus on quantifying how vulnerable – a measure of semantic drift under adversarial pressure. This isn’t merely about building more robust defenses; it’s about developing metrics that correlate internal representation with observable behavior.
The notion of “steering” via concept vectors suggests a tantalizing, if fraught, path toward control. However, current methods treat concepts as discrete, isolated entities. A more nuanced approach would recognize the inherent interconnectedness of semantic space, acknowledging that manipulating one concept inevitably introduces ripple effects throughout the entire model. The challenge lies in predicting, and mitigating, these unintended consequences – a task that demands a deeper understanding of the model’s internal topology.
Ultimately, the pursuit of “safe” language models is a study in applied entropy. The goal is not to eliminate all potential for undesirable output – an impossible task – but to constrain the probability distribution, shifting the balance toward predictable, and therefore manageable, behavior. The field must resist the temptation to add layers of complexity, recognizing that elegance – a reduction to essential principles – is often the most effective defense.
Original article: https://arxiv.org/pdf/2604.20945.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-24 16:59