Author: Denis Avetisyan
Researchers have created a realistic test suite to assess how well large language models generate code that is both functional and free of security vulnerabilities.
RealSec-bench evaluates secure code generation using real-world Java repositories and static application security testing, highlighting significant gaps in current AI capabilities.
Despite advances in code generation, large language models often struggle to produce secure software, a critical gap exacerbated by a lack of realistic evaluation benchmarks. To address this, we introduce RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories, a novel assessment built from 105 real-world, high-risk Java vulnerabilities identified through a rigorous pipeline of static analysis, LLM-assisted filtering, and expert validation. Our findings reveal a significant disparity between functional correctness and security, demonstrating that while techniques like Retrieval-Augmented Generation can improve code completion, they offer minimal gains in vulnerability prevention, and prompting for general security often harms functionality. Can future LLMs bridge this gap and reliably generate both functional and secure code?
The Fragile Foundation: Secure Code Generation in the Age of LLMs
Large Language Models have demonstrated a remarkable capacity for translating natural language into functional code, opening exciting possibilities for automated software development. This progress, however, is tempered by a critical challenge: ensuring the security of the generated code. While these models excel at syntax and logic, they often struggle to identify and avoid common vulnerabilities – weaknesses that malicious actors could exploit. This is not simply a matter of occasional errors; the sheer volume of potentially insecure code produced necessitates robust security analysis, and current automated tools frequently flag benign code as problematic, creating a substantial burden for developers who must then manually verify its safety. Closing the gap between merely functional code and genuinely secure code remains a key obstacle to the widespread adoption of AI-assisted coding technologies.
Contemporary methods for verifying code generated by large language models frequently encounter a critical paradox: while aiming to bolster security, they often produce a substantial number of false positives. These incorrectly flagged vulnerabilities – issues reported as threats that are, in fact, benign – create a significant burden for developers, who must manually investigate and dismiss each alert. This is not simply a matter of noise; it undermines trust in automated security analysis and diminishes the value of AI-assisted coding tools. The root of the problem lies in analyses that prioritize syntactic resemblance over semantic context, flagging patterns that look like vulnerabilities without establishing that they pose a genuine risk. Consequently, developers face increased workload and alert fatigue, hindering the adoption of these potentially powerful tools and leaving codebases vulnerable despite security checks.
The promise of AI-assisted coding is currently tempered by a substantial influx of false positive security alerts. While large language models can generate functional code, existing vulnerability detection methods frequently misidentify benign code segments as threats, dramatically increasing the burden on developers. Each false alarm requires investigation, consuming valuable time and resources that could be dedicated to actual problem-solving or feature development. This constant need for manual verification not only diminishes developer productivity but also erodes trust in these tools, slowing the broader adoption of AI-driven coding assistance and hindering its potential to revolutionize software development practices. The high rate of inaccurate flags necessitates a shift towards more precise and context-aware security analysis to unlock the true benefits of automated code generation.
RealSec-bench: Grounding Security Evaluation in Practicality
RealSec-bench distinguishes itself from existing security benchmarks by drawing its 105 tasks directly from 30 publicly available Java repositories. Traditional benchmarks often rely on synthetically generated code, which may not accurately reflect the complexities and nuances of real-world software development. By basing its evaluation on genuine projects, RealSec-bench aims to provide a more pragmatic and representative assessment of secure code generation, measuring performance against the coding patterns and project structures encountered in production environments. This approach enhances the ecological validity of the benchmark and its relevance to practical security testing.
RealSec-bench utilizes CodeQL, a semantic code analysis engine, to identify vulnerabilities without executing the code. CodeQL queries describe vulnerability patterns and are run against the generated code to detect matching instances. This static analysis approach lets the benchmark pinpoint potential security flaws, such as injection vulnerabilities or improper input validation, directly within the source code, offering a deterministic and repeatable evaluation process. The use of CodeQL also enables automated assessment of a large number of code samples, which is crucial for comparing the security of code produced by different models and generation techniques.
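As a concrete illustration (hypothetical, not drawn from the benchmark itself), the sketch below shows the kind of pattern such a query flags for CWE-089 (SQL injection): untrusted input concatenated into a query string, alongside the parameterized form that would pass the same check.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class UserLookup {

    // Flagged pattern: untrusted input is concatenated into the query string,
    // so a value such as "' OR '1'='1" changes the statement's meaning (CWE-089).
    public ResultSet findUserUnsafe(Connection conn, String userName) throws SQLException {
        Statement stmt = conn.createStatement();
        return stmt.executeQuery("SELECT * FROM users WHERE name = '" + userName + "'");
    }

    // Clean pattern: the input is bound as a parameter, so the query structure
    // is fixed and a static injection query reports no finding here.
    public ResultSet findUserSafe(Connection conn, String userName) throws SQLException {
        PreparedStatement stmt = conn.prepareStatement("SELECT * FROM users WHERE name = ?");
        stmt.setString(1, userName);
        return stmt.executeQuery();
    }
}
```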
RealSec-bench consists of 105 individual tasks sourced from 30 distinct Java repositories. The tasks span 19 different Common Weakness Enumeration (CWE) vulnerability types, including SQL injection, cross-site scripting, and improper input validation. Because the benchmark is built from real-world code, it aims to provide a more representative assessment of LLM code generation than benchmarks relying on artificially constructed samples. The breadth of covered CWEs and the scale of tasks support a comprehensive evaluation of a model's ability to avoid a wide range of security flaws found in practical Java applications.
Beyond Functional Correctness: Measuring Comprehensive Security Assurance
While unit tests effectively validate that generated code produces the expected output for given inputs – establishing functional correctness – they do not inherently guarantee security. A program can pass all functional tests yet still be vulnerable to exploits such as injection attacks, buffer overflows, or improper authentication handling. This is because unit tests typically focus on intended behavior and do not systematically explore edge cases or malicious inputs designed to bypass security mechanisms. Consequently, achieving high functional correctness, as measured by pass rates on standard unit tests, is a prerequisite for secure code generation but does not, on its own, ensure a secure outcome; dedicated security evaluations are also essential.
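A minimal, hypothetical sketch of this gap: the JUnit test below exercises only the intended input and passes, while a crafted file name escapes the base directory (CWE-022, path traversal).

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.nio.file.Path;
import java.nio.file.Paths;
import org.junit.jupiter.api.Test;

class ReportStoreTest {

    // Resolves a report file under a fixed base directory. Functionally correct
    // for ordinary names, but "../" sequences are never rejected (CWE-022).
    static Path resolveReport(String fileName) {
        return Paths.get("/var/reports").resolve(fileName).normalize();
    }

    @Test
    void resolvesOrdinaryFileName() {
        // The test covers only the intended input, so it passes...
        assertTrue(resolveReport("q3.pdf").startsWith("/var/reports"));
    }

    // ...yet resolveReport("../../etc/passwd") yields /etc/passwd, escaping the
    // base directory. Functional correctness alone does not imply security.
}
```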
The RealSec-bench evaluation suite utilizes the SecurePass@k metric to provide a unified assessment of large language model (LLM) performance, considering both functional correctness and security vulnerabilities. Unlike traditional pass@k metrics which only verify functional outputs, SecurePass@k requires solutions to be both functionally accurate and free from identified security flaws. Current evaluations using RealSec-bench demonstrate that state-of-the-art LLMs achieve a SecurePass@1 score of less than 8%, indicating a substantial gap between achieving functional correctness and generating genuinely secure code. This low score highlights that while models may produce syntactically correct code, they frequently contain security vulnerabilities that render them unsuitable for deployment in security-sensitive applications.
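The paper's precise definition is not reproduced here, but if SecurePass@k follows the standard unbiased pass@k estimator with the success criterion tightened to "functionally correct and free of SAST findings", it would take the form below, where n samples are drawn per task and c_sec of them are both correct and secure:

```latex
\mathrm{SecurePass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n - c_{\mathrm{sec}}}{k}}{\binom{n}{k}} \,\right]
```

Under this reading, SecurePass@k can never exceed Pass@k, which is consistent with the reported gap between the two scores.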
Despite achieving a 16.19% Pass@1 rate, indicating a notable level of functional correctness, large language models continue to struggle with security considerations. This Pass@1 metric assesses whether a model generates any functionally correct solution to a given prompt, but it does not evaluate the security of that solution. The SecurePass@1 metric, which does integrate security assessment, currently reveals scores below 8% for these same models, demonstrating a significant disparity between functional capability and secure code generation. This data highlights that a model can produce a working solution without necessarily producing a secure solution, and emphasizes the need for dedicated security evaluation beyond standard functional testing.
Unveiling the Systemic Weakness: The Path Towards Secure-by-Design AI Coding
Vulnerabilities in software often extend beyond single functions, frequently originating from intricate data flows that span multiple procedures, a phenomenon known as inter-procedural dependency; identifying them therefore requires analysis that crosses function boundaries. These dependencies can inadvertently trigger false positives in vulnerability detection systems, as a vulnerability's root cause may lie several functions removed from where it manifests. The RealSec-bench benchmark deliberately incorporates tasks exhibiting these complex relationships, with some requiring tracing data flow across a maximum of 34 function calls, or 'hops', to accurately pinpoint the source of the security flaw. This design choice reflects the reality of modern codebases, where vulnerabilities rarely exist in isolation and demand analysis that accounts for these deep interdependencies to avoid misleading results and ensure accurate security assessments.
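To make the inter-procedural case concrete, the hypothetical sketch below spreads a command-injection flaw (CWE-078) across three methods: untrusted input enters in one, is reformatted by a second that looks harmless in isolation, and reaches a shell sink only in a third, so any analysis confined to single functions misses the flow. The benchmark's hardest tasks chain up to 34 such hops.

```java
import java.io.IOException;

public class ArchiveService {

    // Hop 1: untrusted input enters here (e.g., from an HTTP parameter).
    public void archiveUserDirectory(String userName) throws IOException {
        String target = buildTargetPath(userName);
        runBackup(target);
    }

    // Hop 2: this helper only reformats the value; nothing here looks dangerous
    // in isolation, which is why per-function analysis reports nothing.
    private String buildTargetPath(String userName) {
        return "/home/" + userName.trim();
    }

    // Hop 3: the tainted value reaches a shell sink, so a value such as
    // "alice; rm -rf /" injects an extra command (CWE-078). Only whole-program
    // taint tracking connects this sink back to the original source.
    private void runBackup(String targetPath) throws IOException {
        Runtime.getRuntime().exec(new String[] {"sh", "-c", "tar -czf /backups/archive.tgz " + targetPath});
    }
}
```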
RealSec-bench distinguishes itself by leveraging authentic codebases constructed with industry-standard build tools like Maven, a departure from synthetic or deliberately simplified vulnerability datasets. This deliberate design choice enables the benchmark to surface vulnerabilities arising from the intricate interactions within real-world software projects – dependencies, complex control flow, and the accumulated nuances of development practices. By focusing on code as it actually exists, rather than idealized versions, RealSec-bench exposes the challenges faced by security tools and Large Language Models when confronted with the complexities inherent in modern software, providing a more realistic assessment of their capabilities and limitations in identifying and preventing security flaws.
Analysis utilizing RealSec-bench demonstrates a critical gap in current Large Language Model (LLM) security practices, revealing these models are susceptible to generating vulnerable code without specific security training. The benchmark’s focus on real-world codebases uncovers that LLMs, while proficient in code synthesis, often lack the nuanced understanding of secure coding principles necessary to avoid common vulnerabilities. This suggests a proactive approach – integrating comprehensive security guidelines directly into the LLM training process – is essential. Such integration would not merely teach the models how to code, but how to code securely, potentially shifting the paradigm from reactive vulnerability patching to preventative vulnerability avoidance, ultimately enhancing the reliability and safety of software developed with AI assistance.
The creation of RealSec-bench underscores a fundamental principle: system behavior is inextricably linked to structure. This benchmark doesn't simply assess whether large language models can generate code, but whether they can generate secure code within the complex ecosystem of real-world repositories. The challenge, as revealed by the findings, isn't merely functional correctness, but achieving it alongside robust security, a holistic consideration. As Henri Poincaré observed, "It is through science that we arrive at truth, but it is through simplicity that we arrive at clarity." RealSec-bench, by focusing on practical, existing codebases, strives for that clarity, offering a scalable method to evaluate LLMs beyond contrived examples and illuminate the structural dependencies between code functionality and security vulnerabilities.
Future Directions
The introduction of RealSec-bench highlights a fundamental truth: assessing secure code generation is not merely a question of identifying vulnerabilities, but of understanding the intricate relationship between function and safety. One cannot simply replace a flawed component without considering its place within the larger architecture. The benchmark reveals that current Large Language Models often prioritize functional correctness at the expense of security – a predictable outcome, given the training data’s inherent biases. The field must now move beyond superficial gains and focus on creating models that inherently understand secure coding principles, not simply mimic patterns.
A crucial next step lies in developing more nuanced evaluation metrics. Existing static analysis tools, while valuable, are often blunt instruments. They detect symptoms, not root causes. Future work should investigate dataflow-aware benchmarks that assess the model’s understanding of data propagation and potential vulnerabilities along critical paths. This necessitates a shift from simply identifying if a vulnerability exists to understanding how it could be exploited – a far more challenging, yet vital, endeavor.
Ultimately, the success of secure code generation hinges on acknowledging the systemic nature of software security. The benchmark serves as a diagnostic, revealing the weaknesses in the system. Addressing these issues demands a holistic approach, one that integrates secure coding practices throughout the entire software development lifecycle – a task far exceeding the capabilities of any single model or tool.
Original article: https://arxiv.org/pdf/2601.22706.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/