Author: Denis Avetisyan
Researchers have discovered a surprising method for circumventing AI safety protocols by crafting adversarial prompts rooted in the nuances of Classical Chinese.

A novel jailbreak framework utilizing bio-inspired optimization and cross-lingual transfer demonstrates effective bypassing of safety alignments in large language models.
Despite increasing efforts to align large language models (LLMs) with safe and ethical guidelines, vulnerabilities to adversarial manipulation persist. This paper, ‘Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search’, investigates a novel attack vector leveraging the conciseness and ambiguity inherent in Classical Chinese to bypass LLM safety constraints. The authors introduce CC-BOS, a framework utilizing a multi-dimensional, fruit fly-inspired optimization algorithm to automatically generate effective Classical Chinese adversarial prompts, consistently outperforming state-of-the-art jailbreak methods. Could this cross-lingual approach reveal fundamental weaknesses in current LLM safety alignment techniques and inspire more robust defenses?
The Inevitable Cracks: Probing the Vulnerability of Language Models
Despite their remarkable capabilities, Large Language Models are not impervious to manipulation. Adversarial attacks, carefully crafted inputs designed to exploit inherent weaknesses, can bypass the safety mechanisms built into these systems. These attacks don't necessarily rely on obvious loopholes; instead, they often probe the boundaries of the model's training data and alignment protocols, subtly prompting it to generate harmful, biased, or misleading content. The core issue lies in the fact that LLMs, while proficient at pattern recognition and text generation, lack true understanding, making them vulnerable to prompts that exploit ambiguities or leverage unforeseen combinations of concepts. Consequently, even models with robust safety features can be induced to produce undesirable outputs, highlighting a critical need for continuous security evaluations and the development of more resilient architectures.
Despite advancements in AI safety, current defense mechanisms for large language models frequently falter when faced with sophisticated attacks that rely on contextual understanding and cultural nuance. These attacks don't simply employ obvious harmful keywords; instead, they carefully craft prompts that exploit the model's learned associations and implicit biases. A seemingly innocuous request, when framed within a specific cultural context or relying on subtle linguistic cues, can bypass safety protocols and elicit unintended, even harmful, responses. This vulnerability stems from the fact that LLMs, while adept at pattern recognition, often lack genuine comprehension of the social and ethical implications embedded within language, making them susceptible to manipulation through carefully constructed, context-rich scenarios. Consequently, defenses built on keyword filtering or simplistic rule-based systems prove inadequate against these more subtle, culturally aware adversarial prompts.
The expanding capabilities of large language models now encompass numerous languages, yet this multilingualism inadvertently creates new security challenges. Current safety protocols are often developed and tested primarily on English language models, leaving cross-lingual applications vulnerable to attacks that exploit discrepancies in linguistic nuance and cultural context. A prompt deemed harmless in one language can, when translated, elicit a harmful response in another, or subtly alter the model's behavior in unintended ways. This necessitates the development of novel safety approaches, including cross-lingual adversarial training and culturally aware filtering mechanisms, to ensure consistent and reliable safety performance across all supported languages. Addressing these vulnerabilities is crucial for deploying truly global and trustworthy language models.
Large language models, despite built-in safety measures, can be manipulated through cleverly designed prompts that exploit a learning process known as in-context learning. Rather than directly requesting a harmful response, successful attacks subtly steer the model by providing examples or framing questions in a way that encourages undesirable outputs. This bypasses typical content filters because the model isn't explicitly prompted to generate harmful content; instead, it infers the desired response from the provided context. The technique relies on the model's ability to identify patterns and extrapolate from the given examples, effectively "teaching" it to produce outputs that would otherwise be blocked. Consequently, even seemingly benign prompts can become vectors for malicious content generation, highlighting a critical vulnerability in current LLM safety protocols and necessitating more robust defenses against contextual manipulation.
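The mechanism is easiest to see as ordinary few-shot prompting: the steering signal lives in the example pairs, not in the final question. The sketch below is a deliberately benign illustration (not taken from the paper) that simply assembles such a prompt, to show where the contextual "teaching" happens.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt. The model infers the desired
    output style from the example pairs rather than from any
    explicit instruction, which is the lever in-context attacks pull."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

# Benign usage: the examples, not the final question, set the pattern.
prompt = build_few_shot_prompt([("2+2?", "4"), ("3+3?", "6")], "4+4?")
```

A content filter inspecting only the final question sees nothing objectionable; the behavioral pressure is distributed across the preceding examples.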

Bio-Inspired Exploitation: A Systematic Approach to LLM Jailbreaking
CC-BOS is a black-box jailbreak framework developed to systematically identify and exploit vulnerabilities in Large Language Models (LLMs) when processing classical Chinese text. Unlike many existing jailbreak methods, CC-BOS does not require access to the LLM's internal parameters or training data; it operates solely by crafting and evaluating input prompts. The framework's novelty lies in its focus on the nuances of the Chinese language and cultural context, which can present unique challenges for LLM safety mechanisms. By providing a structured methodology for generating adversarial prompts, CC-BOS facilitates a more comprehensive assessment of LLM robustness against jailbreak attacks specifically targeting Chinese language inputs and aims to improve the security of LLMs in this context.
The CC-BOS framework defines an eight-dimensional strategy space to systematically explore potential jailbreak attack vectors against Large Language Models. These dimensions encompass prompt injection techniques categorized by instruction type (e.g., direct, indirect), role play characteristics (e.g., persona, scenario), context framing (e.g., historical, fictional), adversarial phrasing (e.g., negation, ambiguity), translation methods, code injection, embedding manipulation, and prompt composition. This multi-dimensional approach allows for the generation of a diverse range of prompts, increasing the likelihood of identifying vulnerabilities and bypassing safety mechanisms. Each dimension contains a defined set of parameters and values, enabling automated exploration and optimization of attack strategies.
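The strategy space itself can be pictured as a small lookup table. In the sketch below, the dimension names follow the description above, but the candidate values under each dimension are illustrative assumptions, not the paper's actual inventory.

```python
import random

# Hypothetical eight-dimensional strategy space; names follow the
# article, values are placeholders for illustration only.
STRATEGY_SPACE = {
    "instruction_type":       ["direct", "indirect"],
    "role_play":              ["persona", "scenario"],
    "context_framing":        ["historical", "fictional"],
    "adversarial_phrasing":   ["negation", "ambiguity"],
    "translation_method":     ["literal", "classical_chinese"],
    "code_injection":         ["none", "wrapped"],
    "embedding_manipulation": ["none", "homoglyph"],
    "prompt_composition":     ["single_turn", "multi_turn"],
}

def sample_strategy(rng: random.Random) -> dict:
    """Draw one candidate attack strategy, one value per dimension."""
    return {dim: rng.choice(values) for dim, values in STRATEGY_SPACE.items()}

candidate = sample_strategy(random.Random(0))
```

Each sampled dictionary is then rendered into a concrete prompt; the combinatorics of eight dimensions is what gives the search its diversity.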
The CC-BOS framework employs a bio-inspired optimization algorithm modeled after fruit fly foraging behavior to efficiently identify effective jailbreak prompts within its eight-dimensional strategy space. This algorithm initializes a population of prompt candidates, analogous to a swarm of fruit flies, and iteratively refines them based on their success in eliciting prohibited responses from the target LLM. Each prompt's performance, assessed using Llama-Guard-3-8B, dictates its "fitness," influencing the probability of "reproduction": the creation of new prompt variations through recombination and mutation. This process, repeated across generations, allows the algorithm to converge on prompts that consistently bypass safety mechanisms, effectively automating the search for optimal jailbreak strategies.
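As a rough sketch of this search loop (not the authors' implementation), the score-select-recombine-mutate cycle looks like the following. The `fitness` callable stands in for the Llama-Guard-based success signal, and the population size, generation count, and mutation rate are all assumed values.

```python
import random

def optimize(fitness, space, pop_size=8, generations=30, seed=0):
    """Fruit-fly-style search sketch over a discrete strategy space.
    A swarm of candidates is scored, the fitter half become parents,
    and children are produced by recombination plus occasional
    per-dimension mutation. The best candidate seen is retained."""
    rng = random.Random(seed)
    dims = list(space)
    swarm = [{d: rng.choice(space[d]) for d in dims} for _ in range(pop_size)]
    best = max(swarm, key=fitness)
    for _ in range(generations):
        swarm.sort(key=fitness, reverse=True)
        parents = swarm[: pop_size // 2]
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(parents, 2)
            # Recombination: inherit each dimension from either parent.
            child = {d: (a if rng.random() < 0.5 else b)[d] for d in dims}
            if rng.random() < 0.3:  # mutation: resample one dimension
                d = rng.choice(dims)
                child[d] = rng.choice(space[d])
            children.append(child)
        swarm = children
        best = max(swarm + [best], key=fitness)  # keep the elite
    return best

# Toy usage: maximize the number of 1s across eight binary dimensions.
toy_space = {f"d{i}": [0, 1] for i in range(8)}
best = optimize(lambda c: sum(c.values()), toy_space)
```

In CC-BOS the fitness evaluation is the expensive step, since each one is a query to the target model, which is why keeping the query count low matters so much.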
CC-BOS evaluates the success of jailbreak attempts using Llama-Guard-3-8B as a safety classifier. To address the challenges of cross-lingual evaluation, a two-stage translation module is implemented. This module first translates the LLM's response into English, then translates the English response back into the original input language. This round-trip translation helps maintain semantic consistency and ensures that safety evaluations are not unduly influenced by translation artifacts or linguistic nuances, providing a standardized metric for assessing vulnerabilities across different language contexts.
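A minimal sketch of the two-stage module, assuming placeholder `translate` and `classify_unsafe` callables (in the paper the classifier is Llama-Guard-3-8B and the responses originate in Classical Chinese):

```python
def evaluate_response(response, translate, classify_unsafe):
    """Round-trip evaluation sketch. `translate(text, src, dst)` and
    `classify_unsafe(text)` are placeholders for a real MT system and
    a safety classifier; this only shows the two-stage data flow."""
    english = translate(response, src="zh", dst="en")   # stage 1: to English
    attack_succeeded = classify_unsafe(english)         # judge in English
    back = translate(english, src="en", dst="zh")       # stage 2: back-translate
    return attack_succeeded, back
```

Judging in a single pivot language keeps the safety verdict comparable across input languages, while the back-translation lets the pipeline check that meaning survived the round trip.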
Empirical Evidence: Demonstrating CC-BOS's Efficacy
The CC-BOS framework consistently generated adversarial prompts that bypassed safety mechanisms in several large language models (LLMs). Specifically, a 100% attack success rate was achieved across GPT-4o, DeepSeek-Reasoner, Gemini-2.5-Flash, Qwen3, and Grok-3. This indicates that CC-BOS was able to reliably elicit unsafe responses from each of these models through crafted prompts, demonstrating a significant vulnerability in their respective safety implementations. The framework's efficacy was measured by the consistent ability to generate prompts that successfully circumvented built-in safeguards across a diverse range of LLM architectures and training datasets.
CC-BOS employs a systematic search strategy across an eight-dimensional parameter space encompassing prompt components such as instruction, question, context, and suffix, along with variations in paraphrasing, back-translation, and adversarial examples. This exhaustive exploration differs from traditional methods, which often rely on manually crafted prompts or randomized searches, by methodically testing combinations of these parameters. The result is consistent identification of vulnerabilities that are missed by less comprehensive approaches, as CC-BOS's structured search effectively navigates the complex landscape of LLM input space to uncover weaknesses in safety mechanisms.
The CC-BOS framework utilizes bio-inspired optimization techniques to minimize the number of queries required to generate successful adversarial prompts. Empirical results demonstrate an average query count of 1.28, representing a significant improvement in efficiency compared to alternative methods. Specifically, CC-BOS required fewer queries than Genetic Algorithm (4.04 queries) and Random Search (6.10 queries) to achieve comparable attack success rates against Large Language Models. This reduction in query count is critical for practical application, minimizing resource consumption and accelerating the vulnerability assessment process.
Testing under a combined defensive strategy incorporating Input Content Detection (ICD), Self-Reminder, and Least-cost Generation (LG) output filtering resulted in a 16% attack success rate for the CC-BOS framework. This performance represents an eight-fold improvement over the 2% success rate achieved by the GPTFUZZER attack method when subjected to the same triple defense configuration. The results indicate CC-BOS maintains a significantly higher probability of generating successful adversarial prompts even when multiple layers of safety mechanisms are actively deployed.

The Inevitable Erosion: Implications for LLM Security and Beyond
The accelerating sophistication of adversarial attacks against large language models (LLMs) demands a paradigm shift towards perpetual safety assessment and refinement. Current defense strategies, while offering initial protection, are increasingly vulnerable to novel prompt engineering and attack vectors that exploit unforeseen model weaknesses. Consequently, a static approach to LLM security is insufficient; instead, continuous evaluation, encompassing red teaming, automated vulnerability scanning, and real-world deployment monitoring, is paramount. This iterative process allows for the identification of emerging threats, the development of adaptive countermeasures, and a sustained improvement in the robustness of these increasingly powerful AI systems, ensuring their responsible and safe integration into critical applications.
The inherent limitations of static defenses against large language models necessitate the development of dynamic security strategies capable of responding to adversarial attacks as they occur. Current approaches often rely on pre-defined rules or filters, which can be bypassed by increasingly sophisticated prompt engineering. Real-time adaptation requires systems that can analyze incoming prompts, detect malicious intent, and adjust defense mechanisms accordingly, potentially through techniques like adversarial training, reinforcement learning, or the deployment of adaptive filtering algorithms. This proactive approach moves beyond simply blocking known attack vectors and instead focuses on building models that are resilient to novel threats, continuously learning from ongoing attacks to strengthen their defenses and maintain operational integrity. Such dynamic systems are vital for ensuring the long-term trustworthiness and safe deployment of LLMs in critical applications.
Current approaches to securing large language models (LLMs) often rely on single defensive layers, such as input sanitization or output filtering, which can be circumvented by increasingly clever adversarial prompts. Research indicates a significant advantage lies in constructing composite defense mechanisms, integrating multiple protective layers to create a more resilient system. This strategy involves combining techniques like adversarial training, differential privacy, and runtime monitoring, where the failure of one layer doesn't necessarily compromise the entire system. The principle mirrors biological immune systems, employing redundant and diverse defenses to counter threats. Such multi-layered approaches not only address existing attack vectors but also offer greater adaptability against novel and unforeseen vulnerabilities, representing a crucial step towards dependable LLM security.
Advancing the security of large language models necessitates a shift towards defense mechanisms exhibiting greater robustness and broader applicability. Current defenses often prove brittle when confronted with adversarial prompts differing even slightly from those used during training, highlighting a critical need for generalization. Future investigations should prioritize techniques capable of discerning malicious intent irrespective of superficial prompt variations, potentially through the incorporation of meta-learning or transfer learning approaches. Such research must also move beyond focusing on specific attack types, instead aiming for defenses that can effectively mitigate novel and unforeseen adversarial strategies, ultimately fostering a more resilient and reliable foundation for these increasingly powerful AI systems.
The pursuit of bypassing safety alignments in large language models, as demonstrated by the CC-BOS framework, reveals a fundamental truth about complex systems. They are not static fortresses, but rather evolving landscapes susceptible to unforeseen vulnerabilities. The researchers' utilization of Classical Chinese as a vehicle for adversarial prompts highlights how seemingly subtle variations in input can drastically alter a system's response. This echoes a natural process; systems learn to age gracefully, adapting to pressures and finding pathways around constraints. As Alan Turing observed, "Sometimes people who are unaware that their reasoning is faulty are unconvincible." The CC-BOS approach doesn't necessarily "break" the model, but illuminates existing flaws, suggesting that observing the process of vulnerability discovery is often more valuable than attempting to prematurely "fix" what isn't fully understood.
The Long Game
The efficacy of CC-BOS, while demonstrating a readily apparent vulnerability, merely highlights a fundamental truth about complex systems: safety is not a destination, but a transient state. The success of this framework isn't simply in bypassing current alignments, but in exposing the limitations of relying solely on surface-level linguistic defenses. Each patched vulnerability becomes a fossil, a testament to a prior understanding, and a beacon for future adversarial efforts. The inherent ambiguity of language, compounded by the challenges of cross-lingual transfer, guarantees a perpetual arms race.
Future work must move beyond reactive patching. A truly robust defense necessitates a deeper comprehension of the semantic space, the subtle interplay between intention and expression. The framework's reliance on Classical Chinese, a language with a unique historical and cultural context, suggests that exploring less-dominant linguistic structures could yield further insights into model weaknesses. However, the ultimate goal shouldn't be to eliminate all adversarial prompts (a Sisyphean task) but to engineer models that gracefully degrade, offering reasoned responses even when confronted with malicious intent.
Every delay in achieving perfect safety is, therefore, the price of a more profound understanding. The longevity of any large language model will not be measured by its initial robustness, but by its capacity to adapt, to learn from its failures, and to evolve alongside the ever-shifting landscape of adversarial creativity. Architecture without a history of attack is fragile and ephemeral.
Original article: https://arxiv.org/pdf/2602.22983.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/