Can We Truly Fingerprint AI-Generated Text?

Author: Denis Avetisyan


A new study rigorously tests methods for subtly marking text created by artificial intelligence, revealing both promising results and critical limitations.

The study demonstrates a post-hoc text watermarking technique, achieved through rephrasing with watermarked large language models, and empirically evaluates its detection robustness, semantic preservation, and accuracy, factoring in variations in watermark design and the computational resources allocated to the paraphrasing and decoding processes.

Researchers evaluate the effectiveness of post-hoc watermarking techniques for detecting AI-generated content, finding challenges with code generation and varying performance based on model scale.

Tracing the provenance of text in an age of increasingly sophisticated language models presents a unique challenge, particularly for content predating native watermarking capabilities. This is the core issue addressed in ‘How Good is Post-Hoc Watermarking With Language Model Rephrasing?’, a comprehensive study of applying watermarking techniques after text generation through LLM-based rephrasing. The analysis reveals that while post-hoc watermarking achieves strong detectability and semantic fidelity for open-ended text, performance degrades significantly on verifiable content like code, demanding careful calibration of model size and decoding strategies. Can these findings pave the way for robust and practical watermarking solutions that effectively safeguard intellectual property and ensure responsible AI deployment?


The Erosion of Provenance: A Challenge to Textual Integrity

The rapid advancement of large language models presents a growing challenge to traditional notions of authorship and originality. These models, capable of generating remarkably human-like text, introduce the possibility of widespread, undetectable plagiarism as generated content can convincingly mimic existing styles without direct copying. This isn’t simply a matter of academic integrity; the ability to convincingly fabricate content raises concerns across numerous sectors, from journalism and marketing to legal documentation and scientific publishing. Consequently, establishing clear content provenance – a verifiable history of origin and modification – is becoming critically important. Without methods to reliably trace the source of generated text, verifying authenticity and assigning responsibility for its claims becomes increasingly difficult, potentially eroding trust in information itself.

Conventional digital watermarking methods, designed to embed authorship signals within text, frequently suffer from limitations that hinder their practical application. These techniques, often relying on subtle alterations to word choice or sentence structure, can noticeably degrade the quality of the generated text, making the watermarked content appear unnatural or awkward. More critically, they prove remarkably susceptible to even minor modifications – paraphrasing, synonym replacement, or the addition of seemingly innocuous phrases – effectively stripping away the embedded signature. This fragility arises because the watermarks aren’t intrinsically tied to the meaning of the text, but rather to its superficial form, leaving generated content vulnerable to undetectable plagiarism despite the initial attempt at protection. Consequently, a robust solution requires a watermark that persists beyond stylistic alterations, ensuring reliable attribution even when the text undergoes significant revision.

The creation of truly effective digital watermarks for text generated by large language models hinges on a delicate equilibrium between three crucial properties. Imperceptibility demands that any embedded signal remains invisible to human readers, preserving the natural flow and quality of the text; a noticeable watermark defeats its purpose. Simultaneously, robustness is paramount – the mark must withstand common text manipulations like paraphrasing, minor edits, or stylistic changes without being erased. Finally, reliable detection is essential, ensuring that authorship can be confidently verified when the watermark is found. Achieving this trifecta is a significant technical hurdle, as strengthening one property often weakens another, necessitating innovative approaches to signal encoding and statistical analysis to create watermarks that are both secure and seamlessly integrated into the text itself.

Using Llama-3.2-3B-Instruct, Pareto fronts demonstrate the trade-off between watermark strength and text quality, with each point representing the median performance across 100 rephrased passages.

Post-Hoc Watermarking: A Pragmatic Approach to Attribution

Post-hoc watermarking represents a viable strategy for identifying text generated by Large Language Models (LLMs) without necessitating alterations to the underlying model architecture or training process. This approach operates on the output of an LLM after content creation, embedding a detectable signal within the generated text. Unlike methods requiring model fine-tuning or access to model weights, post-hoc techniques can be implemented as a separate processing step, offering flexibility and ease of deployment. This is particularly advantageous for scenarios where direct model modification is infeasible or undesirable, such as when utilizing closed-source LLMs or through API access. The practicality of post-hoc watermarking lies in its ability to provide a layer of provenance tracking without impacting the core functionality or accessibility of the LLM itself.

Initial post-hoc watermarking techniques relied on altering generated text after its creation, typically through synonym replacement or grammatical transformations. While these methods successfully demonstrated the feasibility of embedding a detectable signal, their robustness proved limited. Specifically, minor perturbations to the input prompt, paraphrasing of the output text, or even standard text editing processes frequently disrupted the watermark, leading to undetectable or false-positive results. These early approaches lacked the statistical resilience necessary for practical application, as the embedded signal was easily lost or obscured by common text manipulations, hindering their effectiveness in identifying machine-generated content with high confidence.

Contemporary post-hoc watermarking methods enhance resilience by directly influencing the token probabilities during text generation. Rather than post-processing completed text, these techniques subtly bias the language model’s output distribution, embedding a detectable signal within the generated sequence. This approach, exemplified by WaterMax, achieves statistically significant detection rates, with reported p-values consistently below $10^{-6}$. The manipulation of token probabilities allows for watermarks that are demonstrably more resistant to common text alterations, such as paraphrasing or minor edits, compared to earlier methods relying on lexical or syntactic transformations.

Evaluation of post-hoc watermarking reveals that detection power varies across model families and sizes, with smaller Llama models sometimes achieving better performance than larger ones, as measured by pass rate and statistical significance.

Token Steering: Precise Manipulation of Generative Processes

Green-list/Red-list and WaterMax are post-hoc token selection techniques designed to influence the LLM’s sampling process to favor watermark-embedding tokens. Green-list/Red-list functions by creating two lists: a “green list” of tokens that, when sampled, increase the watermark signal, and a “red list” of tokens that decrease it; the sampling distribution is then biased towards the green list. WaterMax, conversely, directly maximizes the mutual information between the generated text and the embedded watermark by selecting, at each step, the token that yields the highest increase in watermark correlation. Both methods operate without requiring model retraining, functioning as a layer applied during inference to subtly steer token probabilities and ensure watermark presence.
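The green-list bias can be sketched in a few lines. This is a minimal illustration, not the paper’s exact construction: the SHA-256 seeding, the `gamma` split, and the `delta` bias strength are all assumed parameters chosen for clarity.

```python
import hashlib

import numpy as np


def green_list_mask(prev_token: int, vocab_size: int,
                    gamma: float = 0.5, key: int = 42) -> np.ndarray:
    """Pseudo-randomly mark a gamma-fraction of the vocabulary as 'green',
    seeded by a secret key and the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.permutation(vocab_size)[: int(gamma * vocab_size)]] = True
    return mask


def biased_sample(logits: np.ndarray, prev_token: int,
                  delta: float = 2.0, rng=None) -> int:
    """Shift green-token logits up by delta, then sample from the softmax."""
    if rng is None:
        rng = np.random.default_rng()
    shifted = logits + delta * green_list_mask(prev_token, logits.shape[0])
    probs = np.exp(shifted - shifted.max())
    probs /= probs.sum()
    return int(rng.choice(logits.shape[0], p=probs))
```

A detector holding the same key can recompute the mask at every position and count how often the observed tokens fall in the green list, without ever querying the model.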

LLM Rephrasing facilitates watermark embedding by reformulating the input text during inference. This technique leverages the LLM’s generative capabilities to express the same semantic content using different wording, allowing for the controlled insertion of a watermark signal. Unlike methods that directly modify token probabilities, rephrasing operates at a higher semantic level, potentially increasing robustness against removal attacks. The process involves prompting the LLM to paraphrase the input while simultaneously encoding the watermark within the generated text, effectively hiding the signal within natural language variations. The strength of the watermark is determined by the specific prompting strategy and the LLM’s inherent capabilities, offering a flexible approach to balancing detectability and imperceptibility.
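At a high level, the rephrasing pipeline decouples the paraphrase prompt from the watermarked sampler. A minimal sketch, with the model call abstracted behind a hypothetical `next_token` callable (real use would wrap an LLM’s logit output in a watermarked sampler such as the green-list bias above):

```python
from typing import Callable, List, Optional


def rephrase_with_watermark(passage: str,
                            next_token: Callable[[str, List[int]], Optional[int]],
                            max_tokens: int = 256) -> List[int]:
    """Prompt for a paraphrase, delegating every token choice to a
    watermarked sampler so the signal is embedded during generation."""
    prompt = ("Rewrite the following text, preserving its meaning:\n\n"
              f"{passage}\n\nRewrite:")
    out: List[int] = []
    for _ in range(max_tokens):
        tok = next_token(prompt, out)
        if tok is None:  # end-of-sequence
            break
        out.append(tok)
    return out
```

The design point is that the paraphrase instruction carries the semantics while the sampler carries the watermark; the two concerns never need to know about each other.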

Several sampling methods are commonly used in conjunction with LLM rephrasing to control watermark embedding. Temperature scaling sharpens or flattens the probability distribution over tokens, while top-p sampling restricts selection to the smallest set of tokens whose probabilities sum to $p$. The Gumbel-max trick adds Gumbel-distributed noise to the logits before taking the argmax, which is exactly equivalent to sampling from the softmax distribution while making each draw a deterministic function of the noise. Empirical evaluation indicates that Gumbel-max generally provides the best balance between watermark detectability and minimizing perceptible alterations to the generated text, positioning it favorably on the Pareto frontier when considering these competing objectives.
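The Gumbel-max trick itself is compact: drawing $u \sim \mathrm{Uniform}(0,1)$ per token, adding $g = -\log(-\log u)$ to the logits, and taking the argmax yields an exact softmax sample. In watermarking schemes of this family the uniforms are typically derived from a keyed hash of the context so a detector can recompute them; the sketch below only verifies the sampling identity itself.

```python
import numpy as np


def gumbel_max_sample(logits: np.ndarray, u: np.ndarray) -> int:
    """argmax(logits - log(-log(u))) is an exact draw from
    softmax(logits) when u ~ Uniform(0, 1)^n."""
    return int(np.argmax(logits - np.log(-np.log(u))))


# Empirical check: sample frequencies should match the softmax distribution.
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0])
counts = np.zeros(3)
for _ in range(20_000):
    counts[gumbel_max_sample(logits, rng.uniform(size=3))] += 1
softmax = np.exp(logits) / np.exp(logits).sum()
```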

Gumbel-max watermarking offers the optimal balance between functional correctness (pass@1) and detectability (TPR at FPR=10⁻³) on code generated by Llama-3.1-8B, as demonstrated by comparisons across HumanEval and MBPP benchmarks.

Robust Detection: Validating Authenticity Through Statistical Rigor

Recent innovations in text watermarking, notably DiPMark and MorphMark, signify a move towards greater resilience against common manipulations. These techniques depart from earlier methods by prioritizing either distortion-free embedding – provably leaving the model’s output distribution unchanged, so the watermark introduces no measurable quality loss – or adaptive strategies that adjust the watermark’s strength to the local properties of the text being generated. This approach dramatically improves robustness, allowing for reliable watermark detection even in scenarios involving paraphrasing, truncation, or other common editing steps. By minimizing the introduction of noticeable artifacts and maximizing the watermark’s ability to withstand alterations, DiPMark and MorphMark represent a significant step forward in safeguarding digital content and verifying its authenticity, potentially enabling more trustworthy applications in areas like content authentication and copyright protection.

Detecting a hidden watermark isn’t simply about finding a pattern; it demands statistical verification to ensure the signal isn’t just random noise. Researchers focused on establishing a robust detection process that confirms the watermark’s presence with a high degree of certainty while simultaneously guaranteeing the watermarked data remains semantically meaningful and functionally sound. This involved rigorous testing to minimize false positives – identifying a watermark where none exists – and to ensure the underlying data isn’t corrupted by the watermarking process. The methodology achieved a remarkably low p-value of $8.14 \times 10^{-11}$ in observed detections, indicating an extremely high level of statistical significance and confidence in the watermark’s authenticity and the integrity of the data it protects.
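The detection side reduces to a one-sided tail test on the watermark statistic. A minimal sketch for a green-list scheme, using the normal approximation to the binomial null (the paper’s exact test statistic may differ):

```python
import math


def watermark_p_value(green_count: int, total: int, gamma: float = 0.5) -> float:
    """One-sided p-value: probability of seeing at least this many green
    tokens by chance, under the null that each token is green with
    probability gamma (normal approximation to the binomial)."""
    mean = gamma * total
    std = math.sqrt(gamma * (1.0 - gamma) * total)
    z = (green_count - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

For example, 400 green tokens out of 500 at $\gamma = 0.5$ gives $z \approx 13.4$, a p-value far below any practical detection threshold, while 250 out of 500 gives exactly 0.5, i.e. no evidence at all.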

Detection capabilities are significantly bolstered through the implementation of Entropy Filtering and a Radioactivity Test, techniques designed to not only identify the presence of a watermark but also to confirm genuine exposure to watermarked data. These methods operate by analyzing the statistical properties of the generated output, effectively distinguishing watermarked content from naturally occurring or untampered data. Crucially, studies employing Gumbel-max watermarking have demonstrated functional correctness across a range of models and temperature settings, indicating a reliable performance even under varying conditions. This robustness is further validated by the ability to consistently detect the watermark signal, with the techniques proving effective in scenarios where traditional methods might fail to provide conclusive evidence of tampering or unauthorized use.

Model performance, as measured by cross-entropy, improves with size but requires smaller models to achieve high watermark strengths, with Gemma-3 proving unsuitable for this task.

The evaluation detailed within this study underscores a critical aspect of reliable systems – reproducibility. The findings, particularly the struggle of current watermarking techniques with code generation, highlight the need for deterministic outcomes. As Linus Torvalds famously stated, “If you’re not embarrassed by the first version of your code, you waited too long.” This sentiment resonates with the paper’s implicit call for iterative refinement of watermarking methods. Achieving consistently detectable watermarks, even with varying model sizes and decoding strategies, demands a commitment to provable correctness, not merely functional results. The research implies that a watermark’s presence should be mathematically verifiable, mirroring the pursuit of elegance through mathematical purity.

What’s Next?

The demonstrated fragility of current post-hoc watermarking schemes, particularly when applied to domains demanding syntactic precision like code generation, exposes a fundamental tension. These methods, reliant on subtle statistical perturbations of language, presume a continuity of semantic space that simply does not hold across all text types. While effective for open-ended prose, the inherent rigidity of programming languages renders these techniques susceptible to disruption – a single misplaced character obliterates the watermark, revealing its superficiality. The pursuit of robustness necessitates a move beyond merely masking generated text, toward methods grounded in provable properties.

Future work must confront the interplay between model scale and watermark detectability. Larger models, while exhibiting improved fluency, may also possess a greater capacity to ‘smooth over’ the intentional distortions introduced by watermarking, effectively diluting the signal. This suggests that watermark strength may need to scale non-linearly with model size – a costly proposition. Furthermore, a deeper theoretical understanding of decoding strategies – specifically, their impact on watermark preservation – is critical. Random sampling, beam search, and other techniques each introduce unique vulnerabilities.

In the chaos of data, only mathematical discipline endures. The field requires a shift from empirical evaluation – demonstrating that a watermark ‘works’ on a benchmark – toward formal verification. Can we construct watermarks that are provably resistant to specific transformations? Can we guarantee detectability even in the presence of adversarial noise? These are not merely engineering challenges; they are questions demanding a rigorous mathematical treatment, and the answers will determine whether post-hoc watermarking can evolve beyond a temporary palliative.


Original article: https://arxiv.org/pdf/2512.16904.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-12-20 12:21