Author: Denis Avetisyan
New research reveals that while large language models can produce functional cryptographic code in Rust, it often contains critical security vulnerabilities that require specialized tools to detect.

An empirical evaluation demonstrates that LLM-generated Rust code for authenticated encryption with associated data (AEAD) operations is prone to vulnerabilities and negatively impacted by ‘chain-of-thought’ prompting.
While the increasing reliance on Large Language Models (LLMs) for code generation promises efficiency, it simultaneously raises concerns about the security of the resulting software. This study, ‘An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code’, rigorously assesses the cryptographic security of Rust code generated by three LLMs (Gemini 2.5 Pro, GPT-4o, and DeepSeek Coder) for the AES-256-GCM and ChaCha20-Poly1305 algorithms. Our analysis revealed that a substantial majority of generated code fails to compile, and among successful builds, over half exhibit cryptographic vulnerabilities detectable with specialized tools, highlighting the inadequacy of general-purpose static analysis. Given the observed systematic failures, including nonce reuse and API hallucinations, and the significant influence of prompting strategies, can LLMs be reliably employed for security-critical cryptographic implementations without substantial validation and refinement?
The Foundations of Digital Trust: Cryptography in an Age of Complexity
Modern digital security relies heavily on the principles of cryptography, and at its core lies Authenticated Encryption – a method ensuring both the privacy and integrity of data. This isn’t simply about scrambling information; it’s a sophisticated process combining encryption, which renders data unreadable without a key, with authentication, verifying the data hasn’t been tampered with during transmission or storage. Schemes such as AES-GCM and ChaCha20-Poly1305 exemplify this approach, providing confidentiality while simultaneously detecting any unauthorized modifications. Because nearly all online transactions, secure communications, and sensitive data storage depend on these cryptographic foundations, their robustness is paramount; a failure in these systems could compromise everything from personal finances to national security, highlighting the critical importance of ongoing research and vigilant implementation.
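To ground the concept, here is a minimal sketch of correct AEAD usage with the RustCrypto `aes-gcm` crate (v0.10 API); it is illustrative only and not code from the study:

```rust
// Cargo.toml dependency (assumed): aes-gcm = "0.10"
use aes_gcm::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    Aes256Gcm,
};

fn main() {
    // Fresh random 256-bit key and 96-bit nonce.
    let key = Aes256Gcm::generate_key(OsRng);
    let cipher = Aes256Gcm::new(&key);
    let nonce = Aes256Gcm::generate_nonce(&mut OsRng); // must be unique per message

    // Encryption yields ciphertext plus an authentication tag;
    // decryption fails if either has been tampered with.
    let ciphertext = cipher.encrypt(&nonce, b"plaintext message".as_ref()).unwrap();
    let recovered = cipher.decrypt(&nonce, ciphertext.as_ref()).unwrap();
    assert_eq!(&recovered, b"plaintext message");
}
```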
Despite the mathematical strength of modern cryptographic algorithms, practical implementations frequently fall prey to surprisingly simple, yet devastating, flaws. A common vulnerability arises from Nonce Reuse, where the same nonce (a value meant to be used exactly once) is applied to multiple messages under the same key, effectively stripping away the security guarantees and allowing attackers to decrypt or forge messages. Equally concerning are Hardcoded Secrets – the inclusion of passwords, API keys, or other sensitive information directly within the source code, often unintentionally exposed through public repositories or reverse engineering. These seemingly basic errors demonstrate that even the most robust encryption schemes are only as secure as their implementation, highlighting a critical need for meticulous code review and automated vulnerability detection tools to safeguard digital assets.
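The nonce-reuse pitfall is easy to reproduce. In the hypothetical snippet below, the second `encrypt` call under the same (key, nonce) pair is exactly the pattern a cryptography-aware analyzer should flag:

```rust
use aes_gcm::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    Aes256Gcm,
};

fn main() {
    let key = Aes256Gcm::generate_key(OsRng);
    let cipher = Aes256Gcm::new(&key);

    let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
    let _c1 = cipher.encrypt(&nonce, b"first message".as_ref()).unwrap();

    // INSECURE: reusing the same (key, nonce) pair for a second message
    // leaks the XOR of the two keystream-encrypted plaintexts and
    // enables authentication-tag forgery in GCM.
    let _c2 = cipher.encrypt(&nonce, b"second message".as_ref()).unwrap();

    // SAFE: draw a fresh nonce for every message.
    let fresh = Aes256Gcm::generate_nonce(&mut OsRng);
    let _c3 = cipher.encrypt(&fresh, b"second message".as_ref()).unwrap();
}
```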
The proliferation of AI-driven code generation tools is fundamentally altering the software development landscape, yet simultaneously introduces new and significant security challenges. While these tools promise increased efficiency and accelerated development cycles, they also risk embedding vulnerabilities at scale if not rigorously scrutinized. Automated security analysis is becoming less of a convenience and more of a necessity, as manual code review struggles to keep pace with the sheer volume of AI-generated code. This demands a shift towards sophisticated, automated systems capable of identifying subtle flaws – from insecure configurations to logical errors – that might otherwise slip through traditional testing methods. The future of digital security hinges on the ability to proactively assess and mitigate risks inherent in increasingly AI-dependent development pipelines, ensuring that speed and innovation do not come at the expense of robust protection.
Current automated vulnerability detection tools, while valuable, frequently struggle with the subtle and complex flaws emerging in modern cryptographic implementations. These tools often rely on pattern matching and static analysis, proving inadequate against nuanced vulnerabilities that deviate from known signatures or require contextual understanding of the code’s execution. This limitation is particularly concerning given the increasing sophistication of attackers and the growing attack surface created by complex software systems. Consequently, there is a critical need for advanced detection methods – potentially incorporating machine learning and dynamic analysis – capable of identifying these elusive flaws before they can be exploited, and ensuring the continued integrity of digital security infrastructure.
Automated Construction: LLMs and the Promise of Cryptographic Code
Large Language Models (LLMs) represent a departure from traditional cryptographic code development by enabling automated code generation through prompt engineering. This approach bypasses manual coding by utilizing natural language instructions – “prompts” – to direct the LLM in producing functional cryptographic implementations. The LLM, trained on extensive datasets of code and text, interprets these prompts and generates code intended to fulfill the specified requirements. This differs from conventional methods reliant on expert programmers writing code directly, or utilizing code generation tools based on predefined templates; instead, the LLM dynamically creates code based on the nuances of the provided prompt, offering a potentially scalable solution for cryptographic algorithm instantiation and adaptation.
Prompt engineering techniques significantly influence the output of Large Language Models (LLMs) when generating cryptographic code. Zero-shot prompting involves requesting code directly without providing examples, relying on the LLM’s pre-existing knowledge. Constraint-based prompting directs the LLM by specifying required functionalities, input/output formats, or limitations – for example, mandating the use of a specific cryptographic library or algorithm. Chain-of-thought prompting enhances code quality by encouraging the LLM to first articulate the reasoning process before generating the code, effectively breaking down the task into smaller, manageable steps; this method is particularly effective for complex cryptographic implementations requiring multiple stages of computation and verification.
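As a hypothetical illustration (not the study’s actual prompts), the difference between the first and last of these styles can be as simple as the following, shown here as Rust string constants:

```rust
// Illustrative prompt texts; the paper's prompts are not reproduced here.
const ZERO_SHOT: &str =
    "Write a Rust function that encrypts a message with ChaCha20-Poly1305.";

const CHAIN_OF_THOUGHT: &str = "\
Write a Rust function that encrypts a message with ChaCha20-Poly1305.
First reason step by step: (1) how the key is obtained, (2) how the nonce
is generated, (3) how the authentication tag is handled. Then emit the code.";

fn main() {
    println!("zero-shot:\n{ZERO_SHOT}\n\nchain-of-thought:\n{CHAIN_OF_THOUGHT}");
}
```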
The implementation of cryptographic algorithms is historically susceptible to memory-related errors such as buffer overflows, use-after-free vulnerabilities, and format string bugs. To address these concerns, our approach utilizes the Rust programming language. Rust’s ownership system and borrow checker enforce memory safety at compile time, eliminating entire classes of common vulnerabilities without requiring runtime garbage collection or extensive manual memory management. This proactive prevention of memory errors significantly reduces the attack surface of generated cryptographic code and contributes to more robust and reliable implementations, particularly when dealing with sensitive data and complex algorithms like AES and ChaCha20.
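A small example of that compile-time discipline (illustrative, not from the paper): Rust’s ownership rules turn a stale reference to freed key material into a compile error rather than a runtime exploit.

```rust
fn main() {
    let key: Vec<u8> = vec![0u8; 32]; // stand-in for sensitive key material

    // Ownership moves into `consume`; `key` is unusable afterwards.
    consume(key);

    // println!("{:?}", key); // rejected at compile time: use of moved value
}

fn consume(buf: Vec<u8>) {
    let _ = buf.len();
    // `buf` is dropped (freed) exactly once when this scope ends,
    // so double-free and use-after-free cannot occur.
}
```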
Security-focused prompting in AI-powered code generation necessitates carefully constructed prompts that guide Large Language Models (LLMs) away from known vulnerabilities and towards secure cryptographic implementations. This involves explicitly requesting adherence to established cryptographic best practices and providing detailed specifications for algorithms such as AES-256-GCM and ChaCha20-Poly1305. Prompts must discourage the LLM from utilizing deprecated or weak cryptographic primitives, and actively encourage the use of authenticated encryption modes. Furthermore, prompts should specify requirements for key derivation functions, nonce handling, and proper error handling to minimize the risk of generating code susceptible to side-channel attacks, buffer overflows, or other security flaws. Effective prompts often include examples of secure code snippets and negative constraints outlining patterns to avoid.
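A security-focused prompt along these lines might read as follows; this is a hypothetical reconstruction, not a prompt from the study:

```rust
/// Hypothetical security-focused prompt text, embedded as a Rust constant.
const SECURITY_FOCUSED_PROMPT: &str = r#"Implement AES-256-GCM encryption in Rust using the `aes-gcm` crate.
Requirements:
- Generate a fresh random 96-bit nonce for every message; never reuse a nonce.
- Never hardcode keys or nonces; accept the key as a function parameter.
- Use authenticated encryption only; do not use ECB or unauthenticated CBC.
- Return a Result on decryption failure instead of panicking."#;

fn main() {
    println!("{SECURITY_FOCUSED_PROMPT}");
}
```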
Validation Through Automation: Assessing the Integrity of Generated Code
Compilation success serves as an initial validation step, verifying the syntactic correctness of generated code before further security analysis. However, evaluation of Large Language Models (LLMs) demonstrated a low compilation success rate of 23.3%. This indicates that a substantial portion of the code generated by these models contains structural errors, preventing successful execution and necessitating pre-analysis filtering to ensure the reliability of subsequent vulnerability assessments. The low rate highlights a significant limitation in the LLMs’ ability to consistently produce syntactically valid code without further refinement or constraint.
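A pre-analysis filter of this kind can be as simple as running `cargo check` over each generated sample. The harness below is a hypothetical sketch; the directory layout and paths are assumptions, not the study’s tooling:

```rust
use std::{fs, path::Path, process::Command};

fn main() -> std::io::Result<()> {
    // Assumed layout: one generated crate per subdirectory.
    let samples = Path::new("generated_samples");
    let (mut total, mut ok) = (0u32, 0u32);

    for entry in fs::read_dir(samples)? {
        let dir = entry?.path();
        if !dir.is_dir() {
            continue;
        }
        total += 1;
        // A sample proceeds to security analysis only if it compiles.
        let status = Command::new("cargo")
            .args(["check", "--quiet"])
            .current_dir(&dir)
            .status()?;
        if status.success() {
            ok += 1;
        }
    }
    println!("compiled {ok}/{total} ({:.1}%)", 100.0 * f64::from(ok) / f64::from(total));
    Ok(())
}
```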
A dedicated Rule-Based Crypto Analyzer was constructed to address the specific challenges of identifying cryptographic vulnerabilities within the generated code. This analyzer utilizes a defined set of rules targeting common issues such as nonce reuse and the presence of hardcoded secrets. Unlike general-purpose static analysis tools, this system is specifically tuned for cryptographic weaknesses, enabling focused vulnerability detection. The analyzer operates by parsing the code and applying these rules to identify potentially insecure patterns and configurations, providing a targeted approach to security validation.
In practice, the analyzer flags instances of Nonce Reuse, where a cryptographic nonce is improperly repeated, and Hardcoded Secrets, such as API keys or passwords embedded directly in the code – weaknesses that general-purpose static analysis tools failed to identify reliably. Applying it to the compiled samples revealed critical vulnerabilities in 3.6% of the tested code, indicating a significant presence of these flaws within the generated cryptographic implementations and demonstrating the need for specialized analysis in this domain.
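The flavor of such rules can be conveyed with a deliberately naive sketch; the paper’s actual rule set is richer, and the textual patterns below are assumptions for illustration:

```rust
use std::fs;

// Two toy textual rules in the spirit of a rule-based crypto analyzer.
fn scan(source: &str) -> Vec<&'static str> {
    let mut findings = Vec::new();
    // Rule 1: a byte-string literal used as key material suggests a hardcoded secret.
    if source.contains("Key::from_slice(b\"") {
        findings.push("possible hardcoded key material");
    }
    // Rule 2: a constant nonce literal defeats the uniqueness AEAD requires.
    if source.contains("Nonce::from_slice(b\"") {
        findings.push("possible fixed (reused) nonce");
    }
    findings
}

fn main() -> std::io::Result<()> {
    let src = fs::read_to_string("sample.rs")?; // hypothetical sample path
    for finding in scan(&src) {
        println!("warning: {finding}");
    }
    Ok(())
}
```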
Evaluation of CodeQL, a static analysis tool, demonstrated a 0% true positive rate in identifying cryptographic vulnerabilities within the generated code samples. This indicates its ineffectiveness in this specific domain. Furthermore, prompting strategies significantly impacted compilation success rates; Chain-of-Thought prompting yielded a 6.7% success rate, substantially lower than Zero-Shot prompting (35.0%) and Security-Focused prompting (28.3%). This difference in compilation success rates across prompting methods was statistically significant (p < 0.001), suggesting that the complexity introduced by Chain-of-Thought reasoning negatively affects the structural validity of the generated code in this context.
Beyond Detection: Implications for a Secure Future
The implementation of automated vulnerability detection systems promises a substantial reduction in the traditionally laborious and expensive process of security auditing. Manual code review, while thorough, is both time-consuming and prone to human error, demanding significant expert hours to identify potential weaknesses. Automated tools, conversely, can scan vast codebases at scale and with consistent precision, dramatically accelerating the identification of flaws. This efficiency translates directly into cost savings for organizations, allowing security resources to be allocated towards remediation and proactive security measures rather than solely on initial detection. By pinpointing vulnerabilities earlier in the development lifecycle, automated systems not only decrease the financial burden of fixing flaws but also minimize the risk of costly security breaches and reputational damage.
The synergy between Large Language Models (LLMs) and static analysis represents a promising advancement in cryptographic vulnerability mitigation. Static analysis traditionally excels at pinpointing code-level flaws by examining source code without execution, but often struggles with the nuanced logic inherent in cryptographic implementations. LLMs, trained on vast datasets of code and security literature, can augment this process by recognizing patterns indicative of flawed cryptography, such as improper key management or weak algorithm selection. This combined approach allows for a scalable solution; static analysis pre-filters code, reducing the search space for the LLM, while the LLM provides a higher-level understanding of the code’s intent and potential security implications. Consequently, developers can address vulnerabilities more efficiently and comprehensively than with either technique used in isolation, paving the way for more secure software ecosystems.
The custom-built vulnerability analyzer identified medium-severity flaws in over half (57.1%) of the 56 successfully compiled code samples, demonstrating a significant, yet incomplete, capacity for automated cryptographic flaw detection. This result underscores the pressing need for sustained investigation into AI-driven security tools. While current systems show promise in streamlining security audits and reducing associated costs, the detection rate indicates that substantial improvements in both accuracy and efficiency are required before these analyzers can be fully relied upon for comprehensive vulnerability assessment. Continued research should focus on refining the detection rules and expanding the training datasets that power these AI-based systems, ultimately bolstering the security of cryptographic implementations.
Continued advancements in AI-driven vulnerability detection necessitate a concentrated effort on both accuracy and efficiency. Current systems, while promising, still produce false positives and struggle with the nuanced logic present in complex codebases. Future research should explore novel machine learning architectures, potentially incorporating techniques like reinforcement learning to refine vulnerability identification and minimize alert fatigue. Simultaneously, optimizing the computational cost of these analyses is crucial for widespread adoption; techniques such as knowledge distillation and pruning could allow for deployment on resource-constrained systems without sacrificing performance. Ultimately, the goal is to create automated systems capable of not only pinpointing cryptographic flaws with high precision but also doing so in a scalable and cost-effective manner, thereby bolstering the security of software ecosystems.
The evaluation reveals a concerning trend: LLM-generated cryptographic Rust code, despite appearing functional, frequently harbors vulnerabilities. This aligns with a fundamental principle of software integrity. As Edsger W. Dijkstra stated, “Simplicity is prerequisite for reliability.” The study demonstrates that complexity, introduced through LLM-generated code and exacerbated by techniques like chain-of-thought prompting, directly impedes security. Static analysis tools are crucial, not to add security, but to mitigate the inherent risks stemming from unnecessarily intricate implementations. The focus should remain on producing demonstrably correct, easily verifiable code, a principle often lost in the pursuit of LLM-driven code generation.
Where to Now?
The exercise reveals a predictable truth: automation does not absolve the need for understanding. Large language models produce code, certainly, but security is not inherent in syntax. The prevalence of vulnerabilities – even in seemingly straightforward cryptographic primitives – suggests the models excel at mimicking competence, not achieving it. The reliance on specialized static analysis tools to unearth these flaws isn’t a solution, but a symptom. It highlights a shifting burden – from writing secure code to auditing machine-generated code – a fundamentally less efficient state.
The unexpected detriment of ‘chain-of-thought’ prompting deserves further scrutiny. The intention – to enhance reasoning – yielded less secure outputs. Perhaps explicit articulation of the cryptographic process encourages the model to introduce more points of failure, or merely exposes pre-existing weaknesses more readily. Whatever the cause, it demonstrates that prompting is not a neutral act; it shapes not only the code produced, but also its inherent risk profile.
Future work must move beyond simply detecting vulnerabilities. The focus should shift to understanding why these models generate insecure code. Can formal methods be integrated into the training process? Can the models be incentivized – through reward functions – to prioritize security alongside functionality? The ultimate goal isn’t to build an AI that writes secure code, but one that understands security – a distinction of critical importance, and a far more difficult undertaking.
Original article: https://arxiv.org/pdf/2604.27001.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/