Author: Denis Avetisyan
A new study reveals that while AI is increasingly used to generate code fixes, a significant number of these contributions introduce or fail to address critical security vulnerabilities.

Analysis of AI-generated pull requests in open-source projects exposes recurring vulnerabilities and highlights the need for improved evaluation of AI-assisted code contributions.
While the increasing autonomy of AI coding agents promises to accelerate software development, it simultaneously raises critical questions about the security of contributed code. This paper, ‘Insights into Security-Related AI-Generated Pull Requests’, analyzes over 33,000 AI-generated pull requests, identifying 675 security-related submissions and revealing a small set of recurring vulnerabilities like regex inefficiencies and injection flaws. Surprisingly, our findings indicate that commit message quality has limited impact on acceptance, and rejections often stem from process issues rather than technical flaws, suggesting current code review practices are misaligned with AI contributions. How can we develop more effective evaluation metrics and review workflows to ensure the secure integration of AI-assisted code into open-source projects?
The Expanding Threat Landscape
Contemporary software development frequently integrates numerous third-party components – libraries, frameworks, and APIs – to accelerate development and reduce costs. While beneficial, this practice dramatically expands the attack surface, introducing vulnerabilities beyond the core application code itself. Each dependency represents a potential entry point for malicious actors, as flaws within these external components can be exploited to compromise the entire system. The sheer number of these dependencies, often running into the hundreds or even thousands per application, creates a complex web of potential weaknesses that are difficult to comprehensively assess and manage. This reliance on external code necessitates robust supply chain security practices and continuous vulnerability monitoring to mitigate the risks associated with inherited flaws.
The sheer scale of modern software development presents a significant challenge to traditional security practices. Manual code review, once a cornerstone of vulnerability detection, is increasingly overwhelmed by the velocity of releases and the intricate web of dependencies inherent in most applications. Developers are expected to deliver features rapidly, often leaving insufficient time for thorough security assessments. Consequently, vulnerabilities frequently slip through the cracks, hidden within millions of lines of code or nested deep within third-party libraries. This isn’t a failure of diligent effort, but a recognition that human review simply cannot scale to meet the demands of contemporary software ecosystems, necessitating the adoption of automated tools and more proactive security strategies.
The repercussions of failing to identify software vulnerabilities extend far beyond minor inconveniences, manifesting as significant threats to organizations regardless of their scale. A single, unaddressed flaw can serve as an entry point for malicious actors, potentially leading to substantial data breaches that compromise sensitive customer information, intellectual property, or financial records. Beyond data loss, successful exploitation can result in complete system compromise, enabling attackers to disrupt critical operations, demand ransom, or utilize compromised infrastructure for further attacks. The financial implications encompass not only direct losses from breaches – including remediation costs, legal fees, and regulatory fines – but also reputational damage and loss of customer trust, potentially leading to long-term business consequences. Consequently, proactive vulnerability management has become paramount for maintaining operational resilience and safeguarding organizational assets in an increasingly interconnected digital world.
Automated Defense: The Rise of Agentic AI
Agentic AI represents a shift in software security through the application of Large Language Models (LLMs) to automate vulnerability management. Traditionally, identifying and resolving security flaws requires significant manual effort from security professionals. Agentic AI systems, however, leverage LLMs to autonomously scan codebases, detect potential vulnerabilities – such as injection flaws, cross-site scripting, or insecure deserialization – and then propose remediation strategies. This automation extends beyond simple detection; these agents can generate code changes designed to address the identified issues, effectively streamlining the entire vulnerability lifecycle from discovery to resolution and reducing the reliance on manual intervention.
Agentic AI systems are demonstrating the capability to autonomously address software vulnerabilities by generating and submitting pull requests to code repositories. This functionality bypasses traditional, manual security patching processes, allowing for rapid response to identified issues. The AI agents operate by analyzing code, identifying potential weaknesses, and then constructing code changes designed to remediate those weaknesses. These changes are formatted as pull requests, complete with descriptions of the implemented fix, and submitted to the relevant project for review and integration. This automated contribution model enables a continuous security improvement cycle, effectively scaling security efforts beyond the capacity of individual human contributors and providing 24/7 monitoring and response capabilities.
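The submission step described above can be sketched as follows. This is a minimal illustration, not the paper's tooling: the repository, branch names, and fix description are hypothetical, while the payload fields match GitHub's REST endpoint `POST /repos/{owner}/{repo}/pulls`.

```python
# Sketch of how an AI agent might package a security fix as a GitHub
# pull request. Branch names and the fix description are hypothetical;
# the payload fields are those of GitHub's pull-request creation API.

def build_pull_request_payload(title: str, head: str, base: str, body: str) -> dict:
    """Assemble the JSON body for a pull-request creation request."""
    return {"title": title, "head": head, "base": base, "body": body}

payload = build_pull_request_payload(
    title="Fix: escape user input before shell invocation",
    head="agent/fix-command-injection",  # branch holding the AI-generated patch
    base="main",
    body="Automated remediation of a potential command-injection flaw.",
)

# Submitting the payload would then be a single authenticated HTTP call,
# e.g. requests.post("https://api.github.com/repos/OWNER/REPO/pulls",
#                    json=payload, headers={"Authorization": "Bearer <token>"})
print(payload["head"])
```

The description in the `body` field is what human reviewers see first, which is why the paper's later findings on commit-message and description quality matter to acceptance.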
A study was conducted analyzing 675 security-related pull requests generated by AI agents to determine the viability of automated vulnerability remediation. The analysis focused on the AI Agent Success Rate – the percentage of pull requests that were successfully merged – as the primary metric for evaluating the effectiveness of this automation approach. Successful merging indicates that the proposed code changes were accepted by human reviewers and integrated into the codebase, demonstrating the agent’s ability to accurately identify and resolve security issues. The study’s findings underscore that a high AI Agent Success Rate is critical for realizing the potential benefits of automated security contributions, including increased efficiency and reduced vulnerability response times.
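The metric itself is a simple proportion. In this sketch, only the total of 675 security-related PRs comes from the study; the merged count is a hypothetical figure chosen to be consistent with the 32.4% rejection rate reported elsewhere in the analysis.

```python
# Minimal sketch of the study's primary metric: the AI Agent Success
# Rate, i.e. the share of AI-generated pull requests that were merged.

def success_rate(merged: int, total: int) -> float:
    """Percentage of pull requests that were accepted and merged."""
    if total == 0:
        raise ValueError("no pull requests to evaluate")
    return 100.0 * merged / total

total_prs = 675    # security-related PRs analyzed in the study
merged_prs = 456   # hypothetical merged count (675 minus ~32.4% rejected)
rate = success_rate(merged_prs, total_prs)
print(f"AI Agent Success Rate: {rate:.1f}%")
```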
Strengthening the Pipeline: Quality and Efficiency
Integrating static analysis tools such as Semgrep into agentic AI workflows enables pre-submission vulnerability detection. Analysis of pull requests generated by AI agents found a significant rate of detectable issues: 15.4% of AI-generated PRs contained vulnerabilities identifiable by Semgrep. Flagging these common security flaws before code review and merging improves overall software security and reduces the burden on human reviewers. The identified issues span a range of vulnerability types covered by Semgrep’s rule sets, highlighting the potential for automated security checks within the AI-driven development process.
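A pre-submission gate of this kind could consume Semgrep's machine-readable report. The sketch below parses a JSON blob shaped like Semgrep's `--json` output (a top-level `results` list, each entry carrying a `check_id`, `path`, and severity under `extra`); the specific rule ID and file path are illustrative.

```python
import json

# Sketch of a pre-submission gate: inspect a Semgrep report for an
# AI-generated patch and block the PR if any ERROR-severity finding
# appears. The embedded report mimics Semgrep's --json output shape.

semgrep_output = json.loads("""
{
  "results": [
    {"check_id": "python.lang.security.audit.dangerous-subprocess-use",
     "path": "app/runner.py",
     "extra": {"severity": "ERROR"}}
  ]
}
""")

def should_block(report: dict) -> bool:
    """Reject the PR when the report contains any ERROR-severity finding."""
    return any(r["extra"]["severity"] == "ERROR" for r in report["results"])

print(should_block(semgrep_output))  # a finding is present, so the PR is blocked
```

In a CI pipeline this check would run after `semgrep scan --json` and before the PR is ever routed to a human reviewer.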
Commit message quality is being actively assessed within the automation pipeline using models such as C-Good. These models analyze commit messages to verify they provide clear and concise explanations of the implemented changes. This assessment ensures improved code maintainability, facilitates easier debugging, and enhances collaboration among developers by providing readily understandable context for each commit. Consistent application of this automated review process helps to enforce standardized commit message formatting and content, leading to a more auditable and understandable project history.
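As a rough illustration of the signal such a model looks for, the heuristic below checks for a concise summary line followed by an explanatory body. This is a hand-written stand-in, not the C-Good model itself, which is a learned classifier.

```python
# Lightweight stand-in for a learned commit-message quality model such
# as C-Good: a well-formed message has a short summary line, a blank
# separator, and at least one line of explanatory detail.

def looks_well_formed(message: str) -> bool:
    lines = message.strip().splitlines()
    if not lines:
        return False
    summary = lines[0]
    has_short_summary = 0 < len(summary) <= 72    # conventional length limit
    has_body = len(lines) > 2 and lines[1] == ""  # blank line, then detail
    return has_short_summary and has_body

good = ("Fix regex backtracking in URL parser\n\n"
        "Replace nested quantifiers with a linear-time pattern.")
bad = "update stuff"
print(looks_well_formed(good), looks_well_formed(bad))  # True False
```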
Automated bots significantly improve the Pull Request (PR) workflow by actively managing inactive requests and decreasing review latency. These bots function by automatically identifying PRs that have remained unchanged for a predetermined period – typically exceeding several days – and either closing them or prompting the author for updates. This proactive approach reduces the backlog of stale PRs and focuses reviewer attention on active contributions. Furthermore, bots can automatically assign reviewers based on file ownership or expertise, and provide reminders to expedite the review process. Analysis indicates that implementation of such bots can reduce average review latency by up to 30%, contributing to faster development cycles and increased team velocity.
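The core of such a bot is a staleness filter over last-activity timestamps. The sketch below uses invented PR records and a 14-day threshold; a real bot would fetch `updated_at` values from the forge's API and then close or ping each flagged PR.

```python
from datetime import datetime, timedelta, timezone

# Sketch of the stale-PR logic described above: flag pull requests with
# no activity beyond a threshold. PR records here are hypothetical.

STALE_AFTER = timedelta(days=14)

def find_stale(prs, now):
    """Return the numbers of PRs whose last update exceeds the threshold."""
    return [pr["number"] for pr in prs if now - pr["updated_at"] > STALE_AFTER]

now = datetime(2026, 4, 23, tzinfo=timezone.utc)
prs = [
    {"number": 101, "updated_at": now - timedelta(days=30)},  # stale: close or ping
    {"number": 102, "updated_at": now - timedelta(days=2)},   # active: leave alone
]
print(find_stale(prs, now))  # [101]
```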
Proactive Resilience: Understanding Common Weaknesses
Effective software security begins with a granular comprehension of potential vulnerabilities, and frameworks like the Common Weakness Enumeration (CWE) provide a standardized language for this understanding. By categorizing weaknesses – such as improper input validation, buffer overflows, or incorrect access control – CWE enables developers and security professionals to move beyond generic concerns and pinpoint specific areas of risk. This detailed approach is not merely academic; it allows for the creation of targeted preventative measures, informed code reviews, and more effective security testing. A robust understanding of CWE, therefore, is foundational to a proactive security posture, shifting the focus from reactive patching to preventative design and significantly reducing the likelihood of successful exploitation.
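In tooling terms, CWE's shared vocabulary often reduces to a lookup from scanner labels to weakness identifiers. The CWE IDs below are the real identifiers for the weakness classes discussed in this article; the scanner labels on the left are hypothetical.

```python
# Minimal mapping from illustrative scanner labels to real CWE IDs for
# the weakness classes recurring in the analyzed pull requests.

CWE_MAP = {
    "command-injection": "CWE-78",    # OS Command Injection
    "path-traversal":    "CWE-22",    # Improper Limitation of a Pathname
    "redos":             "CWE-1333",  # Inefficient Regular Expression Complexity
}

def classify(finding: str) -> str:
    """Return the CWE ID for a scanner label, or a generic bucket."""
    return CWE_MAP.get(finding, "CWE-unknown")

print(classify("redos"))  # CWE-1333
```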
A robust security posture increasingly relies on preventative actions that minimize potential entry points for malicious actors. Regularly updating software dependencies is paramount; outdated components often contain known vulnerabilities that attackers actively exploit. Complementing this, thorough code review – a systematic examination of source code by peers – identifies weaknesses before they are deployed, catching logic errors, insecure coding practices, and potential backdoors. These proactive measures collectively reduce the ‘attack surface’ – the sum of all possible vulnerabilities – making systems demonstrably more resilient. By addressing weaknesses early in the development lifecycle, organizations can significantly decrease the risk of successful breaches and maintain the integrity of their applications and data.
Robust test coverage is paramount in fortifying systems against prevalent exploits such as command injection, path traversal, and inefficient regular expressions. Recent analysis, however, reveals a significant challenge in integrating automated contributions: 32.4% of AI-generated pull requests are ultimately rejected, and the rationale behind 38.8% of those rejections goes undocumented. This substantial share of unexplained rejections suggests a disconnect between automated code generation and effective human review, highlighting a gap in understanding the nuances of secure coding practices and potentially leaving systems vulnerable despite increased testing effort. Addressing this requires not only improvements in AI-driven code quality but also deeper investment in review processes that prioritize clarity, knowledge sharing, and a comprehensive understanding of potential security implications.
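The kind of targeted coverage argued for above can be concrete and small. This sketch guards against path traversal (CWE-22) and exercises a hostile input directly; the base directory and file names are illustrative.

```python
from pathlib import Path

# Sketch of a path-traversal guard plus the assertions that exercise it.
# A real service would anchor BASE at its actual upload directory.

BASE = Path("/srv/uploads")

def resolve_safely(user_path: str) -> Path:
    """Resolve a user-supplied path and refuse escapes from BASE."""
    candidate = (BASE / user_path).resolve()
    if not candidate.is_relative_to(BASE.resolve()):
        raise ValueError(f"path traversal attempt: {user_path!r}")
    return candidate

# Benign input stays inside the base directory.
assert resolve_safely("report.txt") == (BASE / "report.txt").resolve()

# A classic traversal payload must be rejected.
try:
    resolve_safely("../../etc/passwd")
except ValueError:
    print("traversal blocked")
```

Tests like the second assertion are exactly what catches a subtly flawed AI-generated sanitizer before merge.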
A Symbiotic Future: Human Expertise and Artificial Intelligence
The evolving landscape of software security demands a shift from solely human-driven approaches to a collaborative synergy between human expertise and artificial intelligence. Increasingly, the sheer volume and complexity of code, coupled with the speed of emerging threats, overwhelms manual review capabilities. AI excels at rapidly scanning for known vulnerabilities and identifying anomalous patterns, but lacks the nuanced understanding of business logic and potential zero-day exploits that experienced security professionals possess. Therefore, the most effective future strategies involve AI automating routine tasks – such as initial vulnerability detection and triage – while simultaneously empowering human experts to focus on complex problem-solving, threat modeling, and the validation of AI-driven insights. This symbiotic relationship allows organizations to achieve a more comprehensive and resilient security posture, proactively addressing risks and building software systems designed for sustained reliability.
The landscape of software security is shifting towards a paradigm of constant vigilance. Future systems will not rely on periodic scans, but instead employ continuous monitoring to observe software behavior in real-time. This data will fuel automated vulnerability detection systems, capable of identifying and flagging potential weaknesses as they emerge, even before they can be exploited. Crucially, this detection will be paired with proactive mitigation strategies – automated responses that can patch vulnerabilities, adjust security settings, or isolate compromised components without human intervention. This shift promises a more resilient software ecosystem, moving beyond reactive defense to a state of ongoing protection and adaptation, significantly reducing the window of opportunity for malicious actors.
Organizations increasingly recognize that anticipating future software vulnerabilities demands a synergy between human insight and automated systems. While predictive modeling of pull request acceptance currently exhibits limited accuracy – as evidenced by a Pseudo-R-squared of just 0.23 – the consistency with which reviewers identify reasons for rejection is remarkably high, registering a Cohen’s Kappa of 0.94. This suggests that, although current models struggle to predict which changes will introduce flaws, human reviewers reliably agree on why certain changes are problematic. Harnessing this consistent human assessment, coupled with AI’s ability to continuously scan for known patterns and anomalies, promises a proactive security posture capable of building demonstrably more reliable software and staying ahead of evolving threats.
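Cohen's kappa, the agreement statistic cited above, is simple to compute: observed agreement corrected for the agreement two raters would reach by chance. The sketch below implements it in pure Python; the two label sequences are invented for illustration (the paper's reported value of 0.94 comes from its own reviewer annotations).

```python
from collections import Counter

# Pure-Python sketch of Cohen's kappa for two raters labeling the same
# items (here, hypothetical rejection reasons for five pull requests).

def cohens_kappa(a, b):
    """Inter-rater agreement between label sequences a and b, chance-corrected."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

rater1 = ["process", "process", "technical", "duplicate", "process"]
rater2 = ["process", "process", "technical", "duplicate", "technical"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.69
```

Values near 1.0, like the paper's 0.94, indicate that reviewers almost always converge on the same rejection reason.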
The study illuminates a critical tension: the promise of automated security improvements through AI-generated pull requests clashes with the reality of persistent vulnerabilities. The analysis reveals that even with AI assistance, code review practices often fail to adequately address these flaws. This echoes Brian Kernighan’s sentiment: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” The paper demonstrates a similar principle – clever automation is insufficient; rigorous evaluation and a focus on fundamental security practices remain paramount. Simplicity in assessment, much like in code, is key to genuine progress.
What Lies Ahead?
The proliferation of agentic AI in software development was, predictably, followed by a proliferation of vulnerabilities in its contributions. This work illuminates not so much a failure of the algorithms, but a failure of expectation. They called it ‘assistance’; it appears to be a remarkably efficient means of scaling existing chaos. The persistent pattern of AI-introduced flaws suggests the current evaluation metrics – lines of code changed, tests passed – are measuring activity, not quality. One suspects a fondness for metrics is a human failing, projected onto the machine.
Future work must address the misalignment between automated code generation and the nuanced practices of effective code review. The observed tendency to accept AI-generated patches with less scrutiny than human contributions is… concerning. It is not enough to simply detect vulnerabilities; one must understand why these systems consistently propose them. The focus should shift from ‘can it code?’ to ‘does it understand what it has coded, and why that understanding is crucial?’
Ultimately, the challenge is not to build smarter AI, but to build simpler systems – and to cultivate a more rigorous humility in those who deploy them. Perhaps, instead of striving for ever-more-complex ‘AI frameworks’, the field would benefit from a renewed appreciation for the elegance of well-understood, manually-crafted code. One can dream.
Original article: https://arxiv.org/pdf/2604.19965.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-23 17:36