Hidden Signals: Watermarking Language Model Output

Author: Denis Avetisyan


Researchers have devised a new method for subtly embedding verifiable signals within the text generated by language models, addressing growing concerns around authorship and authenticity.

The distribution of perplexity, a measure of model uncertainty, reveals performance differences between watermarking schemes when applied to the large language models LLaMA-3B and Mistral-7B.

This paper presents a watermarking scheme based on probabilistic automata, balancing undetectability, minimal text distortion, and computational efficiency with connections to theoretical learnability bounds.

While existing watermarking schemes for language models often trade off generation diversity for robustness, this work, ‘Watermarks for Language Models via Probabilistic Automata’, introduces a novel approach based on probabilistic automata to address these limitations. We demonstrate two instantiations: a practical scheme offering exponential diversity and computational efficiency, and a theoretically grounded construction with formal undetectability guarantees. Extensive validation on LLaMA-3B and Mistral-7B confirms superior performance in both robustness and efficiency. Can this framework ultimately provide a robust and imperceptible means of verifying the provenance of machine-generated text?


The Looming Challenge: Authenticating Text in an Age of Synthesis

The proliferation of Large Language Models (LLMs) has unlocked an unprecedented capacity for automated content creation, extending to articles, code, and creative writing. This rapid expansion, however, introduces critical challenges regarding text authenticity and responsible use. As LLMs become more adept at mimicking human writing styles, distinguishing between human-authored and machine-generated content grows increasingly difficult. Consequently, a robust system for verifying the origin of textual material is becoming essential to combat plagiarism, misinformation, and the potential for malicious applications like automated propaganda or impersonation. The need to establish clear provenance for LLM outputs isn’t simply about academic integrity; it’s about safeguarding trust in information and ensuring accountability in a world increasingly shaped by artificial intelligence.

The proliferation of large language models introduces significant risks beyond simple content duplication; without robust methods to establish the origin and history of AI-generated text – its provenance – the potential for misuse escalates dramatically. This lack of accountability facilitates the spread of disinformation, impersonation, and automated malicious campaigns, as identifying the source of harmful content becomes exceedingly difficult. Ethically, it challenges notions of intellectual honesty and authorship, while legally, it complicates issues of copyright infringement, defamation, and liability. The absence of clear provenance creates a breeding ground for undetectable plagiarism and allows bad actors to exploit these technologies with impunity, necessitating urgent development of tools and standards to ensure responsible AI deployment and mitigate these growing threats.

Conventional techniques for determining authorship, such as stylistic analysis and fingerprinting, falter when applied to text generated by large language models. These models operate on probabilities, introducing inherent randomness – a stochastic nature – that obscures any consistent, attributable “style.” Unlike human writers who possess unique linguistic habits and patterns, LLMs synthesize text based on learned patterns from vast datasets, effectively mimicking rather than originating a voice. This means that even seemingly distinctive phrasing is likely derived from the training data, not a personal authorial signature, rendering traditional forensic linguistics largely ineffective and necessitating the development of novel provenance methods specifically designed for AI-generated content.

Embedding Origin: The Principle of the Hidden Signature

Watermarking techniques for Large Language Models (LLMs) function by embedding a unique, detectable signature within the generated text itself. This signature, generated with a secret key, allows attribution of the text’s origin, specifically identifying the LLM that produced it. The process doesn’t rely on external metadata or headers, but rather on statistically verifiable properties within the text sequence. Successful watermarking enables verification of LLM-generated content, which is crucial for addressing concerns regarding misinformation, plagiarism, and unauthorized use of these models. Detection typically involves a specific algorithm designed to recognize the embedded signature, confirming the text originated from the watermarked LLM.

Decoder-based watermarking operates by introducing controlled modifications to the decoding phase of a Large Language Model (LLM). Instead of altering the model’s parameters or the training data, this technique subtly biases the probability distribution of predicted tokens during text generation. Specifically, the decoding algorithm is adjusted to favor tokens that align with a pre-defined, secret key – the watermark – without demonstrably impacting the fluency or semantic coherence of the generated text. This is achieved by applying a small, deterministic perturbation to the logits – the raw, unnormalized scores output by the model – thereby increasing the probability of watermark-aligned tokens while maintaining overall text quality. The resulting text contains the embedded signature, detectable through statistical analysis, without being perceptibly different from text generated by the unwatermarked model.
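
To make the decoding-time perturbation concrete, the sketch below (Python with PyTorch) applies a keyed bias toward a pseudorandom “green” subset of the vocabulary, in the style of green-list watermarking schemes. The function name and the parameters gamma and delta are illustrative assumptions; the paper’s automaton-based construction derives its key material differently, but the logit adjustment at decode time looks broadly like this.

```python
import hashlib

import torch


def bias_logits(logits: torch.Tensor, prev_token_id: int, secret_key: str,
                gamma: float = 0.5, delta: float = 2.0) -> torch.Tensor:
    """Boost the logits of a keyed 'green' subset of the vocabulary.

    Illustrative green-list style bias, not the scheme proposed in the
    paper: a key-dependent perturbation is applied to the raw logits
    before sampling, leaving fluency largely intact.
    """
    vocab_size = logits.shape[-1]
    # Derive a reproducible seed from the secret key and the previous token,
    # so a detector holding the key can reconstruct the same green set.
    digest = hashlib.sha256(f"{secret_key}:{prev_token_id}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big") % (2**63 - 1)
    gen = torch.Generator().manual_seed(seed)
    green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta  # small, deterministic boost toward the keyed subset
    return biased


# Usage: perturb the next-token logits before sampling.
logits = torch.randn(32000)  # stand-in for a model's next-token logits
watermarked_logits = bias_logits(logits, prev_token_id=42, secret_key="k")
```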

The paper details WEPA, a novel watermarking scheme for Large Language Models (LLMs) built on probabilistic automata. WEPA aims to improve upon existing watermarking techniques by enhancing the diversity of generated watermarked text. Specifically, the scheme achieves a generation diversity of $\Theta(\lambda^{dm})$, where $\lambda$ represents the watermark bit rate, $d$ is a parameter controlling the strength of the watermark, and $m$ is a model-specific constant. This level of diversity represents a substantial improvement in the variety of generated text while still reliably embedding the watermark, minimizing detectable patterns that an attacker could remove without also damaging legitimate content.
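
To illustrate the automaton-based idea in simplified form, the toy class below maintains a keyed state that advances with each generated token and seeds the randomness used for the next sampling step. The class name, hashing scheme, and state count are assumptions made purely for illustration; the actual WEPA construction, its transition structure, and its diversity guarantee are specified in the paper.

```python
import hashlib
import random


class KeyedAutomaton:
    """Toy keyed automaton for watermark-aware sampling (illustrative only).

    Each emitted token drives a keyed transition over a small state set,
    and the current state seeds the sampler for the next position.
    """

    def __init__(self, secret_key: str, num_states: int = 16):
        self.key = secret_key
        self.num_states = num_states
        self.state = 0

    def _hash(self, *parts) -> int:
        data = ":".join(str(p) for p in (self.key, *parts)).encode()
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def step(self, token_id: int) -> None:
        # Keyed transition: the next state depends on the current state
        # and the token just emitted.
        self.state = self._hash("t", self.state, token_id) % self.num_states

    def sampling_seed(self, position: int) -> int:
        # Seed consumed by the watermark-aware sampler at this position.
        return self._hash("s", self.state, position)


# Usage: seed a sampler from the automaton, then advance it on each token.
automaton = KeyedAutomaton("my-secret-key")
rng = random.Random(automaton.sampling_seed(position=0))
token_id = rng.randrange(32000)  # stand-in for keyed token sampling
automaton.step(token_id)
```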

Fortifying the Signal: Resilience Against Manipulation

Robustness in watermarking refers to the system’s capacity to reliably detect an embedded watermark even after the watermarked text undergoes various transformations. These transformations include, but are not limited to, paraphrasing, synonym substitution, insertion, deletion, and reordering of words or sentences. A robust watermarking scheme maintains detectability despite these alterations, ensuring the watermark remains present and verifiable. The degree of robustness is critical for practical applications, as real-world text is frequently modified during transmission, storage, and use. Evaluation of robustness often involves quantifying the extent of permissible modifications while still retaining watermark detectability, using metrics such as edit distance or signal-to-noise ratio.

Edit distance, specifically the Levenshtein distance, serves as a quantifiable metric for evaluating the robustness of watermarking schemes against text modifications. This metric calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another. In the context of watermarking, edit distance quantifies how heavily a watermarked text has been altered; a robust watermark should remain detectable even when the distance between the original and the modified version is substantial. The metric allows for systematic evaluation of a watermark’s ability to withstand common text manipulations such as paraphrasing, synonym replacement, and the addition or removal of non-essential words or phrases, providing a numerical assessment of its stability against adversarial attacks.
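
For reference, here is a compact implementation of the Levenshtein distance described above; this is the standard dynamic-programming formulation and contains nothing specific to the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


# One substitution separates these two strings.
assert levenshtein("watermark", "watermerk") == 1
```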

The WEPA scheme offers improved detection speed over the EXP baseline, a watermarking technique built on a cyclic key sequence, without compromising resilience to adversarial manipulations. Evaluations demonstrate that WEPA achieves significantly faster detection, indicating a reduced computational cost for identifying the presence of a watermark within a text. This efficiency is maintained even under attacks designed to remove or obscure the watermark, confirming WEPA’s robustness. The performance gain stems from WEPA’s probabilistic approach to watermark extraction, which allows quicker and more reliable identification than the key-sequence matching employed by EXP.

The Pursuit of Invisibility: Beyond Detection, Towards Concealment

The pursuit of undetectable watermarking represents a significant advancement in data security, striving to embed information within a medium – such as images or text – in a manner that resists all attempts at discovery, even with repeated and sophisticated analysis. Unlike traditional watermarking techniques designed for detectable authentication, this approach prioritizes concealment as the primary defense mechanism. This isn’t simply about making the watermark difficult to find; it’s about constructing a scheme where the act of detecting the watermark becomes computationally infeasible, regardless of the number of queries or tests applied to the host data. This heightened level of security establishes a new benchmark, moving beyond simple detection resistance to a proactive defense against adversarial attempts at uncovering hidden information, offering protection even against attackers with substantial computational resources and detailed knowledge of the watermarking process.

The robustness of a watermark against detection isn’t simply about hiding it well; it’s fundamentally linked to how difficult it is for an adversary to learn the pattern of the watermark itself. This difficulty is quantified by the concept of KL-PAC learnability, a probably-approximately-correct (PAC) learning criterion in which error is measured by the Kullback-Leibler divergence. It captures the computational effort required to learn a model whose output distribution closely approximates that of the watermarked text. A watermark that is not KL-PAC learnable presents a significant challenge to attackers attempting to model and detect its presence, as any model they learn will remain far from the true distribution and prone to errors. Essentially, the more difficult it is to accurately characterize the watermark’s statistical footprint, the more effectively it remains hidden, offering a rigorous theoretical foundation for assessing undetectability beyond simple perceptual measures.
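
As a rough illustration of the quantity involved, the sketch below computes a plug-in estimate of the KL divergence between two sampled token distributions. The helper name, smoothing constant, and toy data are arbitrary assumptions; the formal KL-PAC learning game analyzed in the paper is not reproduced here.

```python
import math
from collections import Counter


def empirical_kl(samples_p, samples_q, vocab, eps=1e-6):
    """Plug-in estimate of KL(P || Q) from token samples, with additive
    smoothing so that unseen tokens do not produce infinities."""
    cp, cq = Counter(samples_p), Counter(samples_q)
    n_p, n_q = len(samples_p), len(samples_q)
    kl = 0.0
    for tok in vocab:
        p = (cp[tok] + eps) / (n_p + eps * len(vocab))
        q = (cq[tok] + eps) / (n_q + eps * len(vocab))
        kl += p * math.log(p / q)
    return kl


# Toy comparison between samples drawn from two token distributions.
vocab = ["the", "a", "cat", "dog", "sat"]
watermarked = ["the", "cat", "the", "dog", "the", "sat"]
plain = ["a", "cat", "the", "dog", "a", "sat"]
print(f"KL estimate: {empirical_kl(watermarked, plain, vocab):.4f}")
```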

The WEPA construction exhibits a significant degree of resilience against detection because it is shown not to be KL-PAC learnable. This characteristic implies an inherent difficulty in distinguishing watermarked data from natural, unmodified content, even over many attempts. Crucially, WEPA achieves a watermark detection threshold of $\theta = \Omega(\lambda^{-1})$, a result that balances robust watermark verification with a controlled false positive rate. This threshold indicates that the system can reliably identify the presence of a watermark while minimizing the likelihood of incorrectly flagging unaltered data as watermarked, representing a substantial advance in undetectable watermarking.

Verifying Provenance: Establishing Trust in a Synthetic World

The ultimate utility of any digital watermark lies in its reliable detection, the crucial step that verifies both its existence and, consequently, the confirmed origin of a given text. This process involves specialized algorithms designed to identify the subtle, embedded signal within the generated content, distinguishing watermarked text from naturally produced or maliciously altered samples. Successful detection doesn’t simply confirm the presence of a mark; it provides verifiable evidence of authorship, enabling downstream applications such as content authentication, plagiarism detection, and the tracing of AI-generated text back to its source. Without robust detection capabilities, even the most sophisticated embedding scheme and theoretically secure watermark are rendered ineffective, making detection the final, indispensable component of a responsible AI content generation framework.
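
As a minimal sketch of how such a detector can score a text, the snippet below computes a z-score on the fraction of tokens falling in a keyed “green” set, as a green-list style detector would. The statistic, the parameter gamma, and the example counts are illustrative assumptions, not the detector defined in the paper.

```python
import math


def green_fraction_zscore(num_green: int, num_tokens: int,
                          gamma: float = 0.5) -> float:
    """z-score of the observed count of keyed 'green' tokens against the
    fraction gamma expected in unwatermarked text. A large positive value
    is evidence that the embedding-time bias is present."""
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std


# Example: 140 of 200 tokens fall in the keyed green set.
z = green_fraction_zscore(140, 200)
print(f"z = {z:.2f}")  # about 5.7, far beyond typical detection thresholds
```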

The development of responsible AI content generation hinges on a trifecta of technological advancements: robust embedding, theoretical security, and reliable detection. This approach doesn’t simply mark AI-generated text, but establishes a verifiable chain of origin, crucial for mitigating the spread of misinformation and ensuring accountability. The embedding process subtly alters the text, while maintaining readability, in a manner theoretically resistant to common manipulations. Critically, this isn’t about creating an unbreakable code, but about establishing a detectable signal even after attempts at removal or obfuscation. The final, and perhaps most important, component is the detection mechanism, which confirms the presence of the watermark and, therefore, the AI-generated origin, laying the groundwork for trust and transparency in an increasingly synthetic world.

The WEPA scheme represents a substantial step forward in AI watermarking, demonstrably enhancing both the variety of generated text and the reliability of origin verification. WEPA achieves a quantifiable improvement in generation diversity, expressed as $\Theta(\lambda^{dm})$, allowing AI systems to produce more varied and less predictable outputs while still embedding the provenance signal. This is not merely about increasing randomness; the scheme balances textual variation against the robustness of the watermark itself, ensuring that the signal remains detectable even after common text manipulations. Detection efficiency is also significantly improved, providing a more secure and dependable method for confirming the origin of AI-generated content and addressing concerns about misinformation or unauthorized use.

The pursuit of robust watermarking, as detailed in this work, echoes a fundamental principle of elegant design. The paper’s emphasis on balancing undetectability with minimal distortion isn’t merely a technical consideration; it’s an exercise in respecting the user’s attention. It strives to leave no unnecessary trace. As Edsger W. Dijkstra stated, “Simplicity is prerequisite for reliability.” The probabilistic automata approach elegantly addresses the challenge of embedding information without compromising the generated text’s quality, demonstrating that true sophistication lies not in complexity, but in the skillful removal of extraneous elements. This aligns perfectly with the core idea of achieving a balance between theoretical guarantees and practical implementation, a testament to the power of focused design.

What’s Next?

The pursuit of imperceptible control through watermarking language models inevitably reveals the limits of that very desire. This work, grounded in the elegance of probabilistic automata, demonstrates a practical approach to embedding signals within generated text. However, the achieved balance between undetectability, distortion, and efficiency is, by its nature, provisional. Future iterations will undoubtedly focus on minimizing the residual cost of signaling: the subtle statistical shifts that betray the watermark’s presence.

A more fundamental challenge lies in the evolving landscape of language models themselves. Current schemes are, in essence, tailored to specific architectures and training paradigms. A truly robust solution demands a watermark agnostic to the underlying model, a property that edges closer to theoretical learnability bounds but may prove elusive in practice. The question isn’t simply ‘can a model be watermarked?’, but ‘can a general watermark survive the inevitable drift of model evolution?’.

Ultimately, the field will likely confront a diminishing-returns problem. Each layer of sophistication in watermarking will be met with corresponding advances in adversarial techniques designed to remove or circumvent those very protections. Perhaps the most fruitful direction lies not in perfecting the signal, but in understanding the noise: treating the watermark not as a hidden command, but as another statistical artifact within the inherently probabilistic nature of language.


Original article: https://arxiv.org/pdf/2512.10185.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
