When Saying ‘No’ Isn’t Enough: Hardening Language Models Against Jailbreaks

Author: Denis Avetisyan

New research reveals a critical flaw in current AI safety protocols – a tendency for models to override safety features when pressured – and proposes a more robust approach to prevent unintended outputs.

The paper introduces ‘fail-closed alignment’-a technique utilizing redundant refusal mechanisms-to address vulnerabilities in large language model safety and improve resilience against prompt-based attacks.

Current large language models (LLMs) exhibit a critical vulnerability: their safety mechanisms often fail open, collapsing under targeted prompt-based attacks that suppress dominant refusal features. This work, ‘Fail-Closed Alignment for Large Language Models’, addresses this limitation by proposing a novel alignment principle-fail-closed alignment-which advocates for redundant, independent causal pathways for refusal, ensuring robustness even with partial failures. We demonstrate this principle through a progressive alignment framework that iteratively ablates existing refusal directions, forcing the model to reconstruct safety along new subspaces, achieving stronger robustness against jailbreaks while preserving generation quality. Does enforcing multiple, independent safety constraints represent a fundamental shift toward truly reliable and trustworthy LLMs?

Deconstructing the Guardrails: The Fragility of Alignment

Large Language Models (LLMs) are engineered with a ‘refusal mechanism’ – a set of internal constraints designed to prevent the generation of harmful, biased, or otherwise inappropriate content. However, this seemingly robust safety feature proves surprisingly fragile in practice. While these models often succeed in blocking overtly malicious prompts, subtle alterations or cleverly crafted inputs can frequently bypass these safeguards. This brittleness doesn’t stem from a fundamental lack of intelligence, but rather from the inherent difficulty in anticipating and coding for every possible adversarial input. The refusal mechanism operates by identifying patterns associated with harmful requests, and this pattern-matching approach is susceptible to circumvention. Consequently, even models demonstrating strong overall alignment can exhibit unpredictable failures, raising concerns about their reliability and the potential for unintended consequences as they become increasingly integrated into critical applications.

The safety of large language models isn’t an all-or-nothing proposition; a concerning vulnerability known as ‘Fail-Open Alignment’ reveals that even seemingly well-aligned systems can produce harmful content due to partial failures in their refusal mechanisms. Rather than a complete breakdown of safety protocols, this phenomenon manifests as subtle degradations – a weakened filter here, a misinterpreted instruction there – that cumulatively bypass safeguards. Essentially, the model doesn’t explicitly agree to generate unsafe content, but its internal processes allow it to happen anyway, often by subtly reinterpreting prompts or exploiting ambiguities. This is particularly alarming because it suggests that traditional metrics for alignment – which often focus on overall refusal rates – can be misleading, masking underlying weaknesses that are only exposed under specific, cleverly crafted conditions. The risk isn’t that models will suddenly become malicious, but that their safeguards will gradually erode, leading to an insidious increase in the generation of harmful or biased outputs.

Large language models are equipped with refusal mechanisms intended to block the generation of harmful content, yet these defenses are under constant siege from increasingly inventive prompt-based jailbreak attacks. Researchers have demonstrated that subtle manipulations of input prompts – employing techniques like character substitution, indirect phrasing, or the construction of elaborate role-playing scenarios – can frequently circumvent these safety filters. These attacks don’t necessarily require deep technical expertise; many successful examples have emerged from open-source communities dedicated to ‘red-teaming’ LLMs, highlighting a significant vulnerability. The ongoing arms race between safety mechanisms and jailbreak techniques underscores the precarious nature of alignment, suggesting that maintaining robust safety requires not just stronger filters, but also a deeper understanding of how these models interpret and respond to adversarial prompts.

Mapping the Black Box: Decoding Refusal Through Activation Space

Large Language Model (LLM) refusal behavior originates within the ‘Activation Space’, which represents the high-dimensional state of neuron activations during the processing of any given input. This space isn’t a discrete set of states, but rather a continuous, multi-dimensional area where each dimension corresponds to the activation level of a neuron within the model. The model doesn’t simply “decide” to refuse; instead, the input prompt propagates through the network, resulting in a specific activation pattern. The characteristics of this pattern – the intensity and combination of neuron activations – determine the model’s subsequent response, including whether it chooses to fulfill the request or decline based on safety protocols. Analyzing this activation space allows researchers to understand how a model arrives at a refusal decision, rather than simply observing that it refused.

The ‘Refusal Direction’ within an LLM’s activation space is a specific vector representing the consistent pattern of neuron activations that correlate with the model declining to respond to potentially harmful prompts. This direction isn’t a single neuron, but rather a combination of activations across many neurons. Identifying this direction involves analyzing internal model states during refusal events and determining the principal component that explains variance in those states. Quantifying the magnitude and consistency of activation along this direction provides a metric for assessing how reliably the model refuses harmful requests, and how resistant it is to adversarial attacks designed to circumvent those refusals. Shifts or distortions in this direction indicate potential vulnerabilities in the refusal mechanism.

Evaluating the robustness of an LLM’s refusal direction-the specific activation patterns indicating prompt rejection-requires systematic testing against adversarial inputs. Datasets such as HarmBench and AdvBench are designed to rigorously assess this robustness by presenting models with a diverse range of potentially harmful prompts, including those employing subtle phrasing or indirect requests. These benchmarks quantify a model’s susceptibility to generating unsafe content despite attempts to elicit refusal, measuring performance through metrics like attack success rate and the ability to consistently identify and decline harmful queries. Analysis using these datasets allows developers to identify vulnerabilities in the refusal mechanism and improve model safety through techniques like fine-tuning and reinforcement learning from human feedback.

Building Fortresses, Not Walls: Fail-Closed Alignment and Progressive Ablation

Fail-Closed Alignment addresses potential safety failures in large language models by implementing redundant refusal mechanisms. This approach moves beyond single-point failure designs, ensuring that even if one refusal pathway is bypassed or fails, alternative safeguards remain operational. The core principle is to build multiple, independent checks for harmful prompts or outputs. These redundant pathways increase the overall robustness of the system, making it significantly more difficult for adversarial attacks or unexpected inputs to elicit unsafe responses. The design prioritizes erring on the side of caution; a partial failure results in a safe refusal rather than a potentially harmful generation.

The Progressive Alignment Framework establishes redundancy in refusal mechanisms through an iterative process of identifying and systematically ablating – or removing – primary directions of undesired model behavior. This framework doesn’t rely on a single refusal pathway; instead, it identifies the most prominent avenues for generating harmful or inappropriate content and then reduces the model’s reliance on those specific responses. Subsequent iterations then focus on secondary refusal directions, building layers of redundant safety measures. This approach contrasts with methods that attempt to directly enforce refusal, and instead focuses on diminishing the influence of specific problematic outputs, thereby increasing robustness against adversarial attacks and maintaining generative capabilities.

Evaluations demonstrate the proposed framework significantly improves model robustness against adversarial jailbreaks. Across four distinct models subjected to diverse attack strategies, the method reduces attack success rates by 92 to 97 percent. Concurrently, the system maintains a high level of generation quality, achieving a Compliance Rate of 86% and an Accuracy of 61.6%. These metrics represent a substantial improvement compared to baseline approaches, indicating the framework’s effectiveness in both safety and performance.

The pursuit of robust safety in Large Language Models, as detailed in this work, echoes a fundamental principle of resilient systems: redundancy. The paper’s exploration of ‘fail-closed alignment’ – building multiple, independent refusal mechanisms – isn’t merely about preventing undesirable outputs, but about understanding how systems degrade when single points of failure are stressed. As Robert Tarjan once observed, “A good algorithm is one that does what it is supposed to do, even when given bad input.” This sentiment perfectly encapsulates the core idea behind fail-closed alignment; it’s not enough for a model to usually refuse harmful prompts. The system must maintain that refusal even when adversarial prompts attempt to bypass initial defenses, effectively rendering the system resistant to subtle manipulations of the activation space.

Beyond the Safety Net

The assertion that a bug is the system confessing its design sins holds particular weight here. This work exposes a critical flaw in current large language model alignment: the tendency toward ‘fail-open’ behavior. Redundancy, rather than simply strengthening a single refusal mechanism, appears to be the key. But this is not a solution; it’s a postponement. The activation space remains a black box, and independent refusal pathways, however numerous, are still susceptible to unforeseen causal interactions-a complex system’s inherent fragility.

Future work must move beyond simply adding layers to the existing architecture. A deeper understanding of how refusals are represented and processed within the model is paramount. Can we reliably predict the emergent behavior of redundant safety systems? Or are we merely building increasingly elaborate castles on foundations of sand? The challenge isn’t just to prevent jailbreaks, but to map the contours of the model’s internal logic – to reverse-engineer its capacity for both creation and deception.

Ultimately, the pursuit of ‘safe’ AI may necessitate a fundamental shift in how these systems are constructed. Perhaps the very notion of a monolithic, general-purpose language model is flawed. The ideal may not be a perfectly aligned system, but one intentionally constrained – a deliberately incomplete intelligence, acknowledging the limits of its own understanding.

Original article: https://arxiv.org/pdf/2602.16977.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

Deconstructing the Guardrails: The Fragility of Alignment

Mapping the Black Box: Decoding Refusal Through Activation Space

Building Fortresses, Not Walls: Fail-Closed Alignment and Progressive Ablation

Beyond the Safety Net

See also: