Hidden Signals in Code: Protecting AI-Generated Software

Author: Denis Avetisyan


Researchers have developed a new technique to embed robust watermarks into code generated by artificial intelligence, ensuring both functionality and security.

A code-generating language model is refined with the SWaRL technique to produce functional code embedded with a detectable watermark, enabling owners to verify if deployed code originates from their model via a dedicated watermark detector.

SWaRL leverages reinforcement learning to create watermarked code with high functional correctness and adversarial robustness, addressing critical vulnerabilities in AI-assisted software development.

Protecting the intellectual property of code generated by large language models remains challenging given the fragility of manually-crafted watermarking techniques. To address this, we present SWaRL: Safeguard Code Watermarking via Reinforcement Learning, a novel framework leveraging reinforcement learning and low-rank adaptation to embed robust, verifiable signatures within generated code. Our approach achieves high watermark detection accuracy while fully preserving code functionality, demonstrating resilience against both refactoring and adversarial attacks. Could this co-training methodology, guided by compiler feedback and a confidential verifier, represent a significant step towards securing the rapidly evolving landscape of AI-assisted code generation?


The Evolving Landscape of Code Generation and the Imperative of Provenance

The emergence of large language models (LLMs) with the capacity to generate functional source code represents a paradigm shift in software development. These models are no longer limited to natural language; they can now synthesize programs in various languages, potentially automating significant portions of the coding process. This capability promises unprecedented opportunities, from accelerating development cycles and reducing costs to enabling individuals with limited programming experience to create custom software solutions. The implications extend beyond simple automation; LLMs can assist in bug detection, code optimization, and even the creation of entirely new algorithms, fostering innovation across diverse technological domains. This newfound ability is poised to democratize software creation, but also necessitates careful consideration of the ethical and practical challenges that accompany such a powerful tool.

The burgeoning ability of large language models to produce functional code, while revolutionary, simultaneously raises substantial challenges regarding intellectual property and system security. Automatically generated code may inadvertently replicate copyrighted material, creating legal ambiguities and potential infringement issues. Furthermore, the opacity of these models complicates the identification of vulnerabilities; flaws embedded within generated code could be exploited, and tracing the origin of such security risks becomes significantly more difficult. This necessitates a robust framework for establishing authorship and accountability, ensuring that developers and organizations can confidently utilize code LLMs without compromising legal protections or exposing systems to unforeseen threats. A clear understanding of ownership and provenance is, therefore, paramount for the responsible and sustainable integration of these powerful tools into the software development lifecycle.

Traditional software provenance techniques, reliant on version control histories and developer signatures, are proving inadequate for code generated by Large Language Models. These methods struggle to establish a clear chain of custody when source code emerges not from a human author, but from an algorithmic process. Consequently, establishing ownership and accountability becomes exceptionally challenging, raising concerns about licensing, security, and the potential for malicious code injection. Researchers are now actively exploring innovative solutions, including cryptographic watermarking and digitally signed code blocks, designed to embed verifiable ownership information directly within the generated code itself. These approaches aim to create a robust system where the origin of any code segment can be reliably traced, even after modification or distribution, fostering trust and responsible development in the age of AI-driven software creation.

SWaRL iteratively improves code generation by balancing functional correctness and watermark embedding through a Group Relative Policy Optimization (GRPO) update with LoRA, while continuously retraining the watermark detector to maintain alignment with the evolving policy.
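The group-relative part of GRPO can be illustrated with a minimal sketch: each completion sampled for a prompt is scored, and its advantage is its reward normalized against the group's mean and standard deviation. The reward values and group size below are illustrative, not taken from the paper.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled completion's reward against its group:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions sampled from one prompt;
# in SWaRL each reward would blend functional correctness and
# watermark detectability.
rewards = [1.0, 0.5, 0.0, 0.5]
advs = group_relative_advantages(rewards)
# Completions scoring above the group mean get positive advantage and
# are reinforced; those below the mean are suppressed.
```

Because advantages are computed relative to the group rather than to a learned value baseline, no critic network is needed, which keeps the LoRA-based update lightweight.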

Tracing the Signal: An Examination of Watermarking Approaches

Early code Large Language Model (LLM) watermarking methods focused on inference-time constraints to embed a detectable signature. These techniques operated by subtly biasing the decoding process – specifically, the probability distribution over potential tokens – towards sequences that, while statistically plausible, also contained a pre-defined watermark pattern. This was achieved by adjusting the logits output by the LLM during token generation, effectively increasing the likelihood of selecting tokens that aligned with the embedded signature. The key characteristic of these inference-based approaches was that the watermark was not directly stored within the model’s parameters, but rather manifested as a predictable pattern in the generated output, detectable through statistical analysis of the token probabilities or generated code itself.
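A toy sketch of this green-list family of inference-time watermarks (in the spirit of schemes like WLLM, not of SWaRL itself): a hash of the previous token pseudo-randomly splits the vocabulary, decoding biases the "green" half, and detection counts green hits via a z-score. The vocabulary, logits, and bias strength below are illustrative.

```python
import hashlib
import math

VOCAB = ["def", "return", "x", "y", "(", ")", ":", "+"]

def green_list(prev_token, fraction=0.5):
    """Pseudo-randomly partition the vocabulary using a hash seeded by
    the previous token; the 'green' half is favored during decoding."""
    ranked = sorted(
        VOCAB,
        key=lambda t: hashlib.sha256((prev_token + t).encode()).hexdigest(),
    )
    return set(ranked[: int(len(VOCAB) * fraction)])

def biased_choice(logits, prev_token, delta=2.0):
    """Add delta to the logits of green-listed tokens, then take the
    argmax (a stand-in for sampling the softened distribution)."""
    green = green_list(prev_token)
    scored = {t: l + (delta if t in green else 0.0) for t, l in logits.items()}
    return max(scored, key=scored.get)

def z_score(tokens):
    """Detection: count tokens falling in the green list seeded by their
    predecessor and compare to the 50% chance rate."""
    n = len(tokens) - 1
    hits = sum(1 for a, b in zip(tokens, tokens[1:]) if b in green_list(a))
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)

# Greedy generation under the biased decoder (uniform base logits):
tokens = ["def"]
for _ in range(20):
    tokens.append(biased_choice({t: 0.0 for t in VOCAB}, tokens[-1]))
```

Because the bias lives only in the decoding step, the watermark leaves the model weights untouched, which is precisely why such signatures can be stripped by resampling or rewriting the output.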

Neural-based watermarking techniques represent a departure from inference-based methods by directly modifying the generated code to embed a signature. This is achieved through the integration of neural insertion modules within the code generation process, allowing for the strategic placement of watermark-carrying tokens or code fragments. Complementing this, structural code transformations – such as reordering or paraphrasing code segments – further obscure the watermark while maintaining functional equivalence. These methods aim to enhance resilience against removal attempts by distributing the watermark’s signal throughout the generated code and making it less susceptible to targeted attacks that rely on identifying and eliminating specific watermark patterns.

Initial watermarking techniques for Code LLMs, despite demonstrating potential for code provenance tracking, currently exhibit limitations regarding functional correctness and security. Modifications introduced during watermark embedding can inadvertently alter the intended behavior of the generated code, leading to compilation errors or runtime failures. Furthermore, these methods are susceptible to adversarial attacks, where malicious actors attempt to remove or disable the watermark through carefully crafted code transformations or input perturbations, potentially bypassing security measures and falsely attributing code ownership.

Across multiple code benchmarks, SWaRL achieves competitive watermark detection while maintaining or improving code correctness, unlike WLLM and SWEET which prioritize detection at the expense of code quality, and EXP-edit which offers moderate detectability.

SWaRL: A Framework for Robust and Verifiable Code Generation

SWaRL advances Code LLM watermarking by directly aligning code generation with both functional correctness and the embedding of verifiable owner signatures. This framework moves beyond simple watermark addition by integrating signature generation into the code creation process itself. The resulting code is not only intended to execute as intended, fulfilling its designated function, but also carries a detectable signature that authenticates its origin. This dual focus distinguishes SWaRL from previous methods which often prioritized either functionality or watermarking, but not both in a unified manner, thereby creating a robust system for authorship verification and intellectual property protection in generated code.

The SWaRL framework integrates two core components to ensure both code validity and authorship verification. Functional correctness is guaranteed through the execution of unit tests, which validate that generated code produces the expected outputs for a defined set of inputs. Simultaneously, a dedicated watermark detector confirms the presence of an embedded owner signature within the code. This dual approach provides a robust system: the unit tests prevent the generation of non-functional code, while the watermark detector establishes provenance and aids in identifying instances of unauthorized use or modification. The interdependence of these components is critical; watermarking is only meaningful if the code remains functional, and functional code benefits from verifiable ownership.
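A minimal sketch of how such a dual signal could be combined into a single training reward. Everything here is an assumption for illustration: the entry-point name `solution`, the lexical stand-in `detector_score` (a toy proxy for the learned watermark detector), and the blending weight `alpha`.

```python
def run_unit_tests(code: str, tests: list) -> float:
    """Fraction of (args, expected) pairs the generated function passes.
    Assumes the generated code defines a function named 'solution'."""
    namespace = {}
    try:
        exec(code, namespace)
        fn = namespace["solution"]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(tests)

def detector_score(code: str) -> float:
    """Toy stand-in for the learned detector's probability that the code
    carries the owner signature (here: a trivial lexical cue)."""
    return 1.0 if "wm_" in code else 0.0

def reward(code: str, tests: list, alpha: float = 0.5) -> float:
    """Blend functional correctness with watermark detectability."""
    return alpha * run_unit_tests(code, tests) + (1 - alpha) * detector_score(code)
```

Under this shape of reward, code that is correct but unwatermarked, or watermarked but broken, is penalized relative to code that satisfies both criteria, which is the interdependence the paragraph above describes.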

SWaRL employs CodeBERT, a pre-trained language model specifically designed for code, as its watermark detector. This choice provides a robust and efficient authentication mechanism due to CodeBERT’s understanding of code semantics and syntax. The model is fine-tuned to distinguish between watermarked and non-watermarked code, allowing for accurate verification of ownership. Utilizing CodeBERT avoids the computational expense of training a dedicated detector from scratch, and its pre-existing knowledge of code facilitates quicker and more reliable watermark identification compared to methods relying on generic language models or simpler statistical analyses.

SWaRL incorporates defenses against refactoring attacks designed to remove or obscure embedded watermarks. Evaluation under these attacks demonstrates an average Area Under the Receiver Operating Characteristic curve (AUROC) degradation of only 6.4%, indicating robust watermark preservation even when code is modified. This performance represents a significant improvement over existing watermarking techniques, which typically exhibit substantially larger AUROC declines when subjected to similar refactoring attempts. The minimal performance loss confirms SWaRL’s ability to maintain watermark integrity despite code transformations intended to evade detection.
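A toy example of the kind of semantics-preserving refactoring such attacks apply: renaming every identifier with Python's `ast` module. This simplified transformer renames all names indiscriminately, so it only applies to self-contained functions with no external references; it requires Python 3.9+ for `ast.unparse`. A robust watermark must survive rewrites like this one.

```python
import ast

class RenameAll(ast.NodeTransformer):
    """Rename every variable and parameter to an anonymized name,
    preserving semantics while destroying identifier-level cues.
    Simplification: builtins and imports would also be renamed, so this
    only works on self-contained arithmetic-style functions."""
    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

def refactor(source: str) -> str:
    """Parse, rename, and re-emit: output behaves identically."""
    tree = RenameAll().visit(ast.parse(source))
    return ast.unparse(tree)

src = "def mul(wm_alpha, wm_beta):\n    wm_acc = wm_alpha * wm_beta\n    return wm_acc"
attacked = refactor(src)
```

After the attack, every `wm_`-style identifier is gone, yet the function computes the same result, which is why detectors that key on surface tokens alone degrade sharply under refactoring.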

Evaluation of the SWaRL framework demonstrates performance competitive with state-of-the-art watermarking techniques across multiple code generation benchmarks. Specifically, SWaRL achieves near state-of-the-art Area Under the Receiver Operating Characteristic curve (AUROC) scores on HumanEval, MBPP, HumanEval+, and MBPP+. Critically, this performance is attained without compromising code quality; results indicate that SWaRL either maintains or improves upon the quality of generated code as compared to unwatermarked baselines, confirming the watermark embedding process does not negatively impact functional correctness.
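Since AUROC is the headline detection metric here, a minimal sketch of what it measures may help: the probability that a randomly chosen watermarked sample receives a higher detector score than a randomly chosen clean one (ties count half). The score values below are illustrative.

```python
def auroc(pos_scores, neg_scores):
    """Pairwise-comparison form of the Area Under the ROC Curve:
    P(score of watermarked > score of clean), ties counted as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical detector scores on watermarked vs. clean code:
score = auroc([0.9, 0.8, 0.7], [0.2, 0.3, 0.75])
```

An AUROC of 1.0 means the detector perfectly separates watermarked from clean code; 0.5 means it does no better than chance, which frames the reported ~6.4% degradation under refactoring.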

The SWaRL framework demonstrates minimal performance overhead during both code generation and watermark detection. Specifically, SWaRL achieves a per-token generation latency of 0.039 seconds, which is comparable to the 0.032 seconds latency of the base, unwatermarked model. Watermark detection is highly efficient, exhibiting a per-token latency of only 0.0002 seconds – representing the fastest detection speed among currently available watermarking methods. These latency figures indicate that SWaRL can be integrated into existing code generation pipelines with negligible impact on overall processing time.

SWaRL consistently improves pass@1 accuracy across diverse benchmarks (HumanEval, MBPP, HumanEval+, and MBPP+), unlike prior watermarking methods such as EXP-edit, WLLM, and SWEET, which significantly degrade performance relative to the standard supervised fine-tuning (SFT) baseline.

The pursuit of safeguarding code generation necessitates a ruthless pruning of complexity. SWaRL’s approach, leveraging reinforcement learning to embed watermarks while preserving functional correctness, embodies this principle. It resists the temptation to layer defenses, instead focusing on a streamlined framework capable of withstanding adversarial pressures. As Robert Tarjan observed, “Programming is only interesting when there are constraints.” SWaRL demonstrates this: the constraints of security and usability are not merely addressed, but woven into the very fabric of code generation, resulting in a system that is both robust and efficient. The elegance lies not in what is added, but in what is elegantly removed.

The Road Ahead

The pursuit of imperceptible, yet demonstrably present, authorship in generated code, as exemplified by SWaRL, reveals a fundamental tension. Watermarking, at its core, is an assertion of control over a fundamentally uncontrollable process. Future work must confront the inevitable arms race between watermark embedding and adversarial removal, acknowledging that perfect obfuscation is a chimera. The metric of ‘functional correctness’ under attack is, after all, merely a snapshot of current exploitation techniques; ingenuity will always seek novel vulnerabilities.

A worthwhile simplification lies in moving beyond the ‘robustness’ game. Instead of striving for invulnerability, the field should explore watermarks designed for detectability, even at the cost of some resilience. A watermark that yields quickly to scrutiny, but proves authorship beyond doubt, is arguably more valuable than one that remains hidden only until a sufficiently motivated attacker arrives. This requires a shift in emphasis: from hiding the signal to ensuring its unambiguous retrieval.

Ultimately, the true test of SWaRL, and its successors, will not be their ability to withstand attack, but their irrelevance. A world where code generation is intrinsically attributable, and the need for watermarking diminishes, represents a more elegant solution: a disappearance of the problem and, therefore, a measure of progress.


Original article: https://arxiv.org/pdf/2601.02602.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-01-08 04:12