Stealth Attacks on AI’s Digital Watermarks

Author: Denis Avetisyan

New research demonstrates how to subtly dismantle techniques used to identify the origins of large language model outputs, potentially undermining intellectual property protections.

Model ensembles demonstrate that automatic speech recognition performance is influenced by the diversity of fingerprinting information used during training, even when employing a consistent fingerprinting method across primary and auxiliary models.

This paper introduces two novel adversarial attacks-Token Filter Attack and Sentence Verification Attack-that effectively inhibit fingerprinting in ensemble language models while maintaining performance.

Protecting intellectual property in the rapidly expanding landscape of Large Language Models (LLMs) presents a significant challenge, particularly with the increasing adoption of cost-effective LLM ensembles. This paper, ‘Inhibitory Attacks on Backdoor-based Fingerprinting for Large Language Models’, reveals a critical vulnerability in existing fingerprinting techniques when applied to these collaborative models. We demonstrate this by introducing two novel attack methods-Token Filter Attack (TFA) and Sentence Verification Attack (SVA)-that effectively inhibit fingerprint responses while preserving overall ensemble performance. These findings raise a crucial question: how can we develop more robust LLM fingerprinting methods that withstand adversarial attacks in real-world, collaborative deployment scenarios?

The Evolving Threat to LLM Provenance

The rapid proliferation of Large Language Models (LLMs) has introduced a significant challenge regarding intellectual property protection. These complex systems, trained on vast datasets and representing substantial investment, are surprisingly susceptible to unauthorized replication and theft. Unlike traditional software, LLMs aren’t simply copied as code; their knowledge is embedded within billions of parameters, making detection of infringement difficult. This vulnerability stems from the relative ease with which a trained model can be duplicated and redeployed, potentially allowing malicious actors to benefit from another’s innovation without attribution or compensation. Consequently, there’s a growing urgency to develop and implement robust protection mechanisms, ranging from advanced watermarking techniques to novel cryptographic methods, that can safeguard the intellectual property embedded within these increasingly powerful AI systems and ensure continued innovation in the field.

Current digital watermarking strategies for Large Language Models (LLMs) frequently prove inadequate against even subtle manipulations. These techniques, designed to embed identifying information within a model’s parameters, often lack the robustness to withstand minor alterations to the model’s architecture or training data. Adversarial attacks – carefully crafted inputs intended to disrupt the watermarking signal – can also effectively erase the embedded signature, leaving the model seemingly untraceable. This fragility stems from the complex, high-dimensional nature of LLMs; a slight adjustment to millions of parameters can disrupt the watermark without significantly impacting the model’s performance, creating a critical vulnerability for intellectual property protection and hindering efforts to reliably identify the origin of generated text.

The proliferation of open-source Large Language Models, while fostering innovation, presents a significant challenge to intellectual property protection. Because these models are freely available, their underlying architecture and learned parameters can be readily duplicated and redistributed without the original creators’ consent or acknowledgement. This ease of replication not only undermines the economic incentives for developing sophisticated LLMs, but also complicates the process of tracing ownership and enforcing copyright. The lack of inherent safeguards in open-source licensing, coupled with the difficulty in technically distinguishing between original and copied models, creates a landscape where unauthorized use and potential plagiarism are rampant, demanding new strategies for attribution and rights management.

LLM ensemble methods operate in three phases-before, during, and after inference-to improve robustness and reliability.

Embedding Provenance: Backdoor-Based Fingerprinting

Backdoor-based fingerprinting addresses the challenge of Large Language Model (LLM) ownership verification by introducing subtle, intentionally embedded triggers into the model during its training phase. These triggers, activated by specific, pre-defined inputs, cause the LLM to produce a detectable response – the “fingerprint” – confirming its provenance. Unlike watermarking techniques which can be removed or obscured, these fingerprints are integrated into the model’s parameters, making them resilient to standard modification attempts. The core principle relies on associating a unique, hidden signal with the original model creator, allowing for a verifiable claim of ownership even after distribution or transfer. This method provides a mechanism to authenticate LLMs and deter unauthorized use or replication.

Instructional Fingerprinting and Chain&Hash are techniques that establish the feasibility of embedding detectable, robust fingerprints directly into Large Language Models during the training process. Instructional Fingerprinting involves conditioning the model on specific, rarely occurring instructions that trigger a predetermined response, effectively serving as a hidden signature. Chain&Hash utilizes a cryptographic hash chain; a seed value is used to generate a series of hashes, and the model is trained to predict these hashes given specific prompts. Successful prediction of the hash chain confirms the model’s origin and integrity. Both methods demonstrate the ability to embed these fingerprints with minimal impact on the model’s general performance, offering a potential solution for verifying ownership and detecting unauthorized modifications.

Fine-tuning techniques provide a viable method for embedding digital fingerprints into Large Language Models (LLMs) while preserving core functionality. Full Parameter Fine-tuning adjusts all model weights, allowing for comprehensive fingerprint integration, though it is computationally expensive. Alternatively, Low-Rank Adaptation (LoRA) modifies only a small subset of parameters through rank decomposition matrices, significantly reducing computational cost and memory requirements. Studies demonstrate that both methods can effectively embed triggers without causing statistically significant performance degradation on standard benchmark datasets, maintaining accuracy and fluency levels comparable to the original, unmodified model. The robustness of these fingerprints is determined by the specific trigger design and the fine-tuning parameters employed, but generally offers a balance between detectability and impact on model utility.

Using the Mistral-7B base model, the perplexity <span class="katex-eq" data-katex-display="false"> lg(PPL) </span> is significantly lower for fingerprint responses compared to normal responses, indicating a better predictive capability for fingerprint data. — Using the Mistral-7B base model, the perplexity $lg(PPL)$ is significantly lower for fingerprint responses compared to normal responses, indicating a better predictive capability for fingerprint data.

The Landscape of Attacks: Circumventing Embedded Signatures

Parameter-modification attacks directly target the integrity of backdoor-based fingerprinting by intentionally altering the internal parameters – the weights and biases – of a Large Language Model (LLM). These attacks aim to disrupt the unique signature embedded within the model during the fingerprinting process, effectively invalidating ownership verification. Unlike attacks that exploit model behavior without changing parameters, these methods directly manipulate the model’s core structure. Successful parameter modification can either eliminate the fingerprint entirely or introduce inconsistencies that prevent accurate detection, allowing an attacker to claim ownership of a compromised model or mask its illicit origin. The effectiveness of these attacks is directly related to the extent and precision of the parameter alterations, as well as the robustness of the original fingerprinting technique.

Non-parameter modification attacks represent a class of techniques designed to evade LLM fingerprinting without altering the foundational model weights. The GRI (Generative Redirection and Isolation) attack achieves fingerprint circumvention by strategically crafting input prompts that redirect the model’s generation process, effectively masking the fingerprint signal. Token Forcing, conversely, manipulates the decoding process by biasing the model towards specific tokens during text generation, subtly shifting the output distribution away from the characteristics used for fingerprint identification. These attacks differ from parameter modification approaches as they operate solely on the input or output layers, leaving the core model intact and avoiding detection methods focused on weight-based anomalies.

Ensemble-based attacks represent a significant threat to LLM fingerprinting techniques by combining the outputs of multiple models to obscure the presence of a specific watermark. Techniques like Transformer Forensics Attack (TFA) and Smooth Value Attack (SVA) operate on the principle that averaging or otherwise combining predictions from several instances effectively cancels out individual fingerprints. Notably, TFA has demonstrated a 100% fingerprint suppression rate across numerous configurations, indicating a high degree of effectiveness in evading detection when applied correctly. This success is attributed to the statistical averaging of the fingerprint signal across multiple models, rendering it indistinguishable from background noise.

The SVA workflow assesses fingerprint injection success-indicated by <span class="katex-eq" data-katex-display="false">''</span> for success and <span class="katex-eq" data-katex-display="false">''</span> for failure-across three models using IF, C&H, and ImF methods, quantifying candidate response selection counts (NC). — The SVA workflow assesses fingerprint injection success-indicated by $''$ for success and $''$ for failure-across three models using IF, C&H, and ImF methods, quantifying candidate response selection counts (NC).

Strengthening the Chain: Ensemble Strategies and Future Directions

Large Language Model (LLM) ensemble methods represent a significant step towards bolstering the security of digital fingerprints used for model identification. By strategically combining the outputs of multiple LLMs – through techniques like During-Inference Ensemble, which averages predictions in real-time, and After-Inference Ensemble, which refines outputs post-generation – these systems capitalize on model diversity to create more robust fingerprints. This approach inherently makes it more difficult for malicious actors to successfully spoof or evade detection, as an attack would need to simultaneously overcome the defenses of each individual model within the ensemble. The resulting fingerprints are demonstrably more resilient to adversarial manipulation, offering a heightened level of protection against fingerprint removal and masking techniques, and promising a more reliable means of tracing the provenance of generated text.

A thorough assessment of fingerprinting techniques requires moving beyond simple attack success rates and incorporating metrics that gauge the underlying confidence of the model itself. Perplexity, a measure of how well a language model predicts a sample of text, provides valuable insight into this confidence, revealing whether a model is genuinely identifying a fingerprint or simply guessing. Recent studies demonstrate the effectiveness of this combined approach, particularly when paired with attacks like the Surrogate Verification Attack (SVA). Evaluations utilizing SVA against ImF fingerprinting have revealed a minimum Attack Success Rate (ASR) of 78%, highlighting both the vulnerability of current systems and the necessity of robust evaluation methodologies that consider both attack efficacy and model certainty.

Investigations into fingerprinting attacks, specifically those employing ensemble methods like TFA and SVA, reveal a compelling capacity to maintain-and even enhance-performance on a diverse suite of downstream tasks. Assessments across six benchmarks – PIQA, ARC-C, TriviaQA, MMLU, BoolQ, and ANLI – demonstrate that these ensemble-based attacks do not necessitate a trade-off between successful fingerprint extraction and overall utility. Results consistently show accuracy levels that meet or exceed those achieved by both baseline fingerprinting techniques and the strongest individual models tested, suggesting that robust fingerprinting can be implemented without compromising the performance of the underlying language model on practical applications. This finding is crucial for fostering trust and responsible deployment of fingerprinting technologies, ensuring that security measures do not inadvertently diminish the capabilities of valuable AI systems.

Combining LLaMA3.1-8B-It and Qwen2.5-7B-It as an auxiliary model consistently improves ensemble accuracy (ACC) across six benchmark datasets, outperforming the primary model alone.

The pursuit of robust intellectual property protection for large language models, as explored in this study, necessitates a holistic understanding of system vulnerabilities. The research demonstrates that seemingly secure fingerprinting techniques are susceptible to disruption when confronted with ensemble architectures. This echoes Bertrand Russell’s observation that “to be happy, one must be able to forget.” Similarly, these fingerprinting defenses, once thought reliable, prove fallible under adversarial pressure. The presented Token Filter Attack and Sentence Verification Attack highlight how a focused disruption – a carefully crafted ‘forgetting’ of identifying signals – can inhibit fingerprinting without compromising overall performance. Good architecture is invisible until it breaks, and only then is the true cost of decisions visible.

The Road Ahead

The demonstrated susceptibility of ensemble models to fingerprint inhibition, while not entirely unexpected, highlights a fundamental tension. Current intellectual property protection strategies for large language models largely treat detection as a solvable technical problem. This work suggests a more nuanced reality: defense is not about impenetrable signatures, but about raising the cost of attack. Each layer of complexity added to obfuscate a model’s origin introduces new vulnerabilities, creating a shifting landscape of trade-offs. The proposed attacks – Token Filter and Sentence Verification – are not endpoints, but rather proofs of concept illustrating how seemingly robust defenses can be systematically undermined.

Future research should resist the urge to endlessly escalate detection/inhibition complexity. Instead, the field might benefit from exploring architectures that inherently minimize the signal a fingerprinting attack can exploit. This necessitates a shift in focus, from post-hoc detection to proactive design. The true cost of freedom, it seems, is not the complexity of shielding a model, but the simplicity of its core structure. An invisible architecture, one that doesn’t need protection, is ultimately more resilient than any fortress built on layers of abstraction.

The long-term trajectory will likely resemble an evolutionary arms race, but one where the selective pressure favors models that are not merely resistant to fingerprinting, but demonstrably uninteresting to attackers. The pursuit of ever-more-capable language models may, paradoxically, require embracing a degree of deliberate limitation – a humbling thought for a field so enamored with scale.

Original article: https://arxiv.org/pdf/2601.04261.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

The Evolving Threat to LLM Provenance

Embedding Provenance: Backdoor-Based Fingerprinting

The Landscape of Attacks: Circumventing Embedded Signatures

Strengthening the Chain: Ensemble Strategies and Future Directions

The Road Ahead

See also: