Author: Denis Avetisyan
A new study analyzing real-world user interactions with ChatGPT reveals that automatically generated code frequently contains security vulnerabilities.

Empirical analysis of the WildChat dataset demonstrates a disconnect between user intent and secure coding practices in code generated by large language models.
While large language models (LLMs) increasingly automate code generation, concerns remain regarding the quality and security of this AI-produced software. This paper, WildCode: An Empirical Analysis of Code Generated by ChatGPT, presents a large-scale analysis of real-world code generated by ChatGPT, evaluating both its correctness and susceptibility to vulnerabilities. Our findings confirm prior research indicating frequent security flaws in LLM-generated code, alongside a striking lack of user interest in proactively addressing these risks. Given this demonstrated gap between capability and conscientious application, how can we better align LLM-driven code generation with secure software development practices?
The Inevitable Decay of Code: Introducing AI-Assisted Generation
The integration of large language models, such as ChatGPT, into software development workflows represents a significant shift towards automated code generation, with the potential to dramatically increase developer productivity. These models excel at translating natural language requests into functional code snippets, automating repetitive tasks, and even assisting in the creation of complex algorithms. This newfound capability allows developers to focus on higher-level problem-solving and architectural design, rather than being bogged down in the minutiae of syntax and implementation. Where code creation traditionally demanded extensive manual effort and specialized expertise, these AI-powered tools are democratizing access to programming, enabling individuals with limited coding experience to contribute to software projects and accelerating the overall pace of innovation. The promise extends beyond simple automation; developers are also leveraging these models for code review, bug detection, and the generation of unit tests, further streamlining the development lifecycle.
The accelerating adoption of AI-driven code generation, while promising substantial gains in developer efficiency, simultaneously introduces a new landscape of potential security vulnerabilities. These models, trained on vast datasets that inevitably include flawed or malicious code, can inadvertently replicate these issues in generated outputs. This poses a significant risk, as developers may unknowingly integrate compromised code into applications, creating pathways for exploitation. Careful scrutiny is therefore paramount, requiring not just functional testing, but also proactive security assessments to identify and mitigate potential weaknesses before deployment. The convenience offered by these tools must be balanced with a rigorous commitment to code integrity and application security, demanding new strategies for validation and verification throughout the software development lifecycle.
A comprehensive analysis of over 82,000 real-world conversations involving AI code generation tools revealed a significant presence of buggy code in the outputs returned to users. Specifically, approximately 3.2% of the analyzed code snippets – representing 1,562 instances out of a sample of 48,391 – contained demonstrable errors. This finding underscores the inherent risk of blindly accepting AI-generated code, even as these tools become increasingly integrated into software development workflows. While offering potential gains in productivity, the observed error rate highlights the critical need for developers to thoroughly review and test any code produced by these models, mitigating the possibility of introducing vulnerabilities or malfunctions into their applications.
As AI code generation tools become increasingly prevalent, the need for systematic methods to ensure code quality and security is paramount. While these models offer substantial productivity gains, their output isn’t inherently free of errors; vulnerabilities can easily slip into generated code if left unchecked. Consequently, developers and security professionals must implement robust verification processes, including static and dynamic analysis, comprehensive testing suites, and potentially, AI-powered code review tools. These measures are crucial not only to identify and rectify bugs but also to proactively prevent the introduction of security flaws that could be exploited by malicious actors, safeguarding software systems and user data in an evolving technological landscape.

Unveiling the Cracks: Security Weaknesses in Generated Code
ChatGPT-generated code in languages such as C++ and Python is susceptible to memory safety issues. These issues stem from the model’s potential to produce code with buffer overflows, use-after-free errors, and other memory management flaws. Such vulnerabilities can be exploited by malicious actors to gain unauthorized access, execute arbitrary code, or cause denial-of-service conditions. The absence of rigorous static or dynamic analysis during code generation contributes to these risks, requiring developers to implement thorough security testing and code review processes for any ChatGPT-generated code integrated into production systems.
Analysis of code generated by ChatGPT revealed a 20.61% vulnerability rate specifically within samples containing MD5 or SHA1 hash functions. This indicates a significant proportion of generated code utilizing these hashing algorithms exhibits security flaws. The identified vulnerabilities stem from improper implementation or usage of the hash functions, potentially allowing for collision attacks or other exploits. This rate was determined through static analysis of a substantial code corpus generated by the model, focusing on identifying common weaknesses associated with MD5 and SHA1 in cryptographic contexts.
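The paper does not reproduce the flagged snippets, but the weakness typically takes a shape like the Python sketch below, in which MD5 is used where a slow, salted key-derivation function belongs. The function names and the iteration count are illustrative choices, not values drawn from the study.

```python
import hashlib
import os

# Pattern commonly flagged by static analyzers: a fast, unsalted digest used
# for password storage, where brute-forcing leaked hashes is cheap.
def hash_password_insecure(password: str) -> str:
    return hashlib.md5(password.encode()).hexdigest()

# A safer sketch: a salted, deliberately slow key-derivation function from the
# standard library (the 600,000-iteration count is an illustrative choice).
def hash_password_safer(password: str) -> str:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt.hex() + ":" + digest.hex()

if __name__ == "__main__":
    print(hash_password_insecure("hunter2"))
    print(hash_password_safer("hunter2"))
```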
ChatGPT demonstrates a susceptibility to generating code containing deserialization vulnerabilities and SQL injection flaws. Analysis indicates a 3.93% vulnerability rate for SQL injection flaws specifically within conversations that included SQL code. These vulnerabilities arise from the model’s potential to generate code that does not properly sanitize user inputs or validate data received during deserialization processes, creating avenues for malicious actors to manipulate application logic or access sensitive information. The presence of these flaws necessitates thorough security review and input validation when integrating ChatGPT-generated code into production systems.
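As a hedged illustration of the SQL injection pattern (the table schema and payload below are hypothetical, not drawn from WildChat), the first function interpolates user input directly into a query, while the second keeps input as data through parameter binding:

```python
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # String interpolation lets "nobody' OR '1'='1" rewrite the query logic.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_parameterized(conn: sqlite3.Connection, username: str):
    # Placeholders keep user input as data, never as SQL syntax.
    return conn.execute(
        "SELECT id, email FROM users WHERE name = ?", (username,)
    ).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'alice', 'a@example.com')")
    payload = "nobody' OR '1'='1"
    print(find_user_vulnerable(conn, payload))      # returns every row
    print(find_user_parameterized(conn, payload))   # returns nothing
```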
ChatGPT-generated code occasionally includes references to modules or libraries that do not exist, a phenomenon termed “hallucinated modules”. This occurs when the model, during code synthesis, incorrectly assumes the availability of a specific library or misnames an existing one. Consequently, the generated code will fail to compile or will encounter runtime errors when executed, as the system cannot locate the referenced module. From a security perspective, these hallucinations can introduce vulnerabilities if the code is deployed without proper validation, potentially creating attack vectors exploitable by malicious actors who could leverage the resulting errors or unexpected behavior.
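A lightweight defence is to verify that every imported name actually resolves to an installed package before running or shipping generated code. The sketch below uses the standard-library importlib machinery; “fastjsonx” is a made-up placeholder for the kind of plausible-sounding module an LLM might invent.

```python
import importlib.util

# "fastjsonx" is a hypothetical, non-existent package name used for illustration;
# "json" is real and ships with Python.
REQUIRED_MODULES = ["json", "fastjsonx"]

def check_imports(modules: list[str]) -> dict[str, bool]:
    # find_spec returns None when no installed package matches the name,
    # which surfaces hallucinated dependencies before the code ever runs.
    return {name: importlib.util.find_spec(name) is not None for name in modules}

if __name__ == "__main__":
    for name, available in check_imports(REQUIRED_MODULES).items():
        status = "found" if available else "MISSING (possible hallucination)"
        print(f"{name}: {status}")
```

A check like this also narrows one of the attack vectors mentioned above: blindly installing whatever package name the model emitted risks pulling in a malicious package later registered under that hallucinated name.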
The identified vulnerabilities in ChatGPT-generated code – including memory safety issues, flaws in hash function implementation, deserialization risks, and SQL injection possibilities – directly degrade the overall quality and security of applications built using this code. A high vulnerability rate, such as the 20.61% observed with MD5/SHA1 hash functions or the 3.93% rate for SQL injection, indicates a substantial risk of exploitation. These weaknesses introduce potential entry points for malicious actors, increasing the likelihood of successful attacks and data breaches. Furthermore, the creation of “hallucinated modules” leads to runtime failures and compromises the reliability of the software, impacting the overall security posture by creating unstable and unpredictable system behavior.
The Tools to Stem the Tide: Automated Vulnerability Detection with OpenGrep
OpenGrep is a freely available, open-source static analysis tool designed to identify security vulnerabilities within source code. It operates by examining code for patterns indicative of common weaknesses, supporting multiple programming languages and vulnerability types. Crucially, OpenGrep’s analysis capabilities extend to code generated by artificial intelligence models, allowing for security assessments of both human-written and AI-produced software components. The tool’s functionality includes identifying potential issues like buffer overflows, format string vulnerabilities, and cross-site scripting (XSS) flaws, offering a proactive approach to software security.
The specific instances of buggy code detailed in this report were identified through automated analysis utilizing OpenGrep. This involved subjecting the codebase to a series of vulnerability detection tests implemented within the OpenGrep framework. The tool flagged these particular code segments due to patterns associated with known security weaknesses, allowing for targeted review and subsequent confirmation of the identified vulnerabilities. This process demonstrates OpenGrepās capability to pinpoint problematic areas requiring developer attention and remediation.
OpenGrep’s vulnerability detection capabilities encompass critical code weaknesses including memory safety issues – such as buffer overflows and use-after-free errors – that can lead to arbitrary code execution. The tool also identifies deserialization vulnerabilities, where malicious data can be used to instantiate objects and compromise system integrity. Furthermore, OpenGrep effectively scans for SQL injection flaws by analyzing code for improperly sanitized user inputs that could allow attackers to manipulate database queries. These detection capabilities collectively provide a valuable layer of defense by proactively identifying and mitigating common attack vectors before code is deployed.
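Rule-based scanners like OpenGrep match code against patterns associated with known weaknesses. As a toy, language-specific illustration of that idea (a minimal sketch, not OpenGrep’s implementation), the snippet below walks a Python syntax tree and flags calls to the weak hash functions discussed earlier:

```python
import ast

SOURCE = """
import hashlib
token = hashlib.md5(user_input.encode()).hexdigest()
digest = hashlib.sha256(data).hexdigest()
"""

WEAK_HASHES = {"md5", "sha1"}

def find_weak_hashes(source: str) -> list[tuple[int, str]]:
    """Return (line, call) pairs for calls like hashlib.md5(...)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "hashlib"
            and node.func.attr in WEAK_HASHES
        ):
            findings.append((node.lineno, f"hashlib.{node.func.attr}"))
    return findings

if __name__ == "__main__":
    for lineno, call in find_weak_hashes(SOURCE):
        print(f"line {lineno}: weak hash {call}")
```

Production scanners generalize this idea across languages and add capabilities such as data-flow tracking, which is why a dedicated tool is preferable to ad-hoc scripts.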
Automated vulnerability detection, such as that provided by tools like OpenGrep, significantly reduces the time required to identify security weaknesses in code. Traditional manual code review is a resource-intensive process, often subject to human error and scalability limitations. Automation allows developers to integrate security checks directly into their continuous integration and continuous delivery (CI/CD) pipelines, enabling pre-deployment scanning of code changes. This proactive approach facilitates early detection of flaws – like memory safety issues or injection vulnerabilities – while the cost of remediation is lowest and before potentially exploitable code is released to production. The resulting faster feedback loop improves overall software security posture and reduces the risk of post-deployment vulnerabilities.
The Echo of Intent: Understanding User Intent & Its Impact on Security
The WildChat dataset represents a significant resource for deciphering how individuals approach code generation with large language models like ChatGPT. This collection of real-world interactions – encompassing prompts and subsequent queries – offers a unique window into user goals, ranging from simple script creation to complex application development. Analysis of the dataset reveals patterns in how users frame their requests, the types of code they seek, and the level of detail they provide. Importantly, the dataset moves beyond theoretical scenarios, capturing the nuances of authentic user behavior, including ambiguities, errors, and iterative refinement. By studying these interactions, researchers can gain crucial insights into user intent, ultimately leading to the development of more effective, reliable, and user-centered AI-assisted coding tools and security protocols.
Zero-Shot Classification offers a powerful method for dissecting the diverse range of requests directed at code-generating AI like ChatGPT, enabling categorization of user intent without prior training on specific examples. This technique analyzes the semantic meaning of a prompt and assigns it to predefined categories – such as functionality requests, debugging assistance, or performance optimization – even if the model hasn’t encountered that exact phrasing before. By identifying these underlying intents, developers can proactively build security checks into the code generation process; for instance, prompts hinting at data handling could trigger stricter validation routines, or requests involving external APIs could initiate security protocol checks. This targeted approach, driven by understanding what a user intends to build, rather than simply how they ask for it, represents a significant step towards more secure and reliable AI-assisted software development, allowing for customized security measures based on the identified purpose of the generated code.
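A minimal sketch of how such intent labelling can be wired up with an off-the-shelf natural-language-inference model follows; the model choice and candidate labels here are assumptions for illustration, not the configuration used in the study.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# Illustrative model and label set; swap in whatever categories suit the analysis.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

prompt = "Write a Python function that stores user passwords in a SQLite database."
labels = [
    "new feature request",
    "debugging help",
    "performance optimization",
    "security hardening",
    "learning / explanation",
]

result = classifier(prompt, candidate_labels=labels)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```

Because the labels are supplied at inference time, the same classifier can be re-pointed at new intent categories without retraining, which is what makes the zero-shot approach attractive for triaging heterogeneous prompts.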
Analysis of the WildChat dataset revealed a striking pattern: users infrequently voiced explicit security concerns when prompting ChatGPT for code generation. Of the numerous interactions examined, only six instances involved follow-up queries directly addressing potential security vulnerabilities within the generated code. This suggests a significant gap in user awareness regarding the security implications of AI-assisted coding and highlights a reliance on the model to inherently produce secure outputs. The scarcity of proactive security questioning indicates that developers and users may be unintentionally overlooking critical vulnerabilities, underscoring the need for improved educational resources and the development of tools that proactively identify and mitigate security risks in AI-generated code.
Analyzing user intent extends beyond security implications and directly informs strategies for code quality enhancement. By discerning the underlying goals behind code generation requests – whether it’s creating a functional prototype, solving a specific algorithmic problem, or learning a new programming concept – developers can prioritize bug fixes with greater precision. Issues hindering the fulfillment of frequently expressed intents, such as those related to common data structures or basic input/output operations, naturally warrant immediate attention. This intent-driven approach allows for a more efficient allocation of resources, ensuring that the most impactful bugs are addressed first and leading to a continuous improvement in the reliability and usability of AI-generated code. Ultimately, understanding why a user requests code is as crucial as understanding what code is requested, fostering a proactive cycle of refinement and optimization.
By anticipating potential vulnerabilities based on user intent, developers can shift from reactive security measures to a proactive stance when integrating AI-generated code. This means incorporating security checks and safeguards during the code generation process, rather than solely relying on post-generation audits. Such a preventative strategy minimizes the risk of introducing exploitable flaws into applications, leading to more robust and resilient software. Prioritizing security at this early stage not only reduces the potential for costly breaches but also fosters user trust and confidence in AI-assisted development tools, ultimately contributing to a more secure digital ecosystem.

The Long View: Secure Coding for AI-Generated Code
Integrating secure coding practices into every stage of the AI-assisted development lifecycle is paramount for mitigating potential vulnerabilities. This necessitates a shift from traditional post-development security checks to a proactive approach, where security considerations are woven into the initial design, code generation, and testing phases. Developers must prioritize techniques like input validation, output encoding, and the principle of least privilege, even when relying on AI-generated code. Furthermore, rigorous code review, automated static and dynamic analysis, and continuous integration/continuous deployment (CI/CD) pipelines with security gates become even more critical when AI is involved, ensuring that vulnerabilities are identified and addressed before deployment. By embedding security throughout the process, organizations can harness the benefits of AI-assisted development while minimizing the risk of introducing exploitable flaws into their applications.
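Two of the practices named above, input validation and output encoding, reduce to small, reviewable routines. The sketch below is a minimal illustration using only the standard library; the allow-list rule is an assumption for the example, not a universal policy.

```python
import html
import re

# Illustrative allow-list: permit only short alphanumeric usernames.
USERNAME_RE = re.compile(r"^[A-Za-z0-9_]{1,32}$")

def validate_username(raw: str) -> str:
    # Reject anything outside the allow-list rather than trying to strip bad characters.
    if not USERNAME_RE.fullmatch(raw):
        raise ValueError("invalid username")
    return raw

def render_greeting(name: str) -> str:
    # Output encoding: neutralise HTML metacharacters before interpolation.
    return f"<p>Hello, {html.escape(name)}!</p>"

if __name__ == "__main__":
    print(render_greeting("<script>alert(1)</script>"))
    print(validate_username("dev_user42"))
```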
Although artificial intelligence doesn’t directly create Regular Expression Denial of Service (ReDoS) vulnerabilities, their presence highlights a broader need for robust code quality assurance in AI-assisted development. ReDoS attacks exploit the exponential backtracking inherent in poorly constructed regular expressions, allowing a malicious actor to overwhelm a server with a specially crafted input. While the AI model itself might not generate the vulnerable regex, it can easily incorporate such expressions from existing codebases or suggest patterns that, though functional, are susceptible to this type of attack. Consequently, developers must prioritize static analysis tools and rigorous testing to identify and mitigate these vulnerabilities, ensuring that even AI-suggested code adheres to best practices for regular expression construction and overall security.
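The failure mode is easy to reproduce: a pattern with nested quantifiers forces the matcher to try exponentially many partitions of the input before giving up. The timing sketch below uses a textbook vulnerable pattern rather than one taken from the dataset; expect the vulnerable timings to roughly double with each extra character.

```python
import re
import time

VULNERABLE = re.compile(r"^(a+)+$")  # nested quantifiers: the classic ReDoS shape
SAFER = re.compile(r"^a+$")          # matches the same strings in linear time

def time_match(pattern: re.Pattern, text: str) -> float:
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (16, 18, 20, 22):
        text = "a" * n + "!"  # trailing "!" forces a failed match and full backtracking
        print(f"n={n}: vulnerable {time_match(VULNERABLE, text):.3f}s, "
              f"safer {time_match(SAFER, text):.6f}s")
```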
The future of secure AI-driven applications hinges on advancements in automated vulnerability detection. Current static and dynamic analysis tools often struggle with the unique characteristics of AI-generated code, necessitating research into novel approaches that can effectively identify and mitigate risks. However, simply detecting vulnerabilities isn’t enough; a crucial element is understanding user intent. By analyzing the desired functionality alongside the generated code, systems can differentiate between benign code patterns and potentially malicious ones, reducing false positives and ensuring that security measures align with the application's purpose. This synergistic approach – combining sophisticated automated analysis with a deep understanding of the developer's goals – promises a pathway towards building more robust and trustworthy AI-powered software, minimizing the potential for exploitable weaknesses and fostering greater confidence in these emerging technologies.
The study of ChatGPT’s code generation, as presented in “WildCode,” reveals a predictable entropy. Systems, even those born of sophisticated algorithms, inevitably degrade over time – in this case, manifesting as security vulnerabilities. This echoes a fundamental truth of software engineering; versioning isn’t merely a technical practice, but a form of memory, attempting to preserve functionality against the relentless arrow of time. As the research highlights, user intent often prioritizes functionality over security, accelerating this decay. Linus Torvalds aptly observed, “Talk is cheap. Show me the code.” The “WildCode” dataset doesn’t just present code; it reveals the intent behind it, and that intent frequently neglects the vital work of defensive programming, hastening the system’s decline.
The Inevitable Drift
The study of Large Language Models generating code reveals, predictably, that every architecture lives a life, and one finds oneself merely witnessing its decay. The prevalence of vulnerabilities isn’t a flaw in the current iteration, but a characteristic of all complex systems. The WildChat dataset provides a snapshot, a fossil record of user intent, and it demonstrates that security is rarely a primary driver. Attempts to force security through prompt engineering are, at best, temporary measures: a polishing of the surface as the foundations shift.
Future work will undoubtedly focus on automated vulnerability detection and remediation. However, the fundamental problem remains: user needs, expressed in natural language, are inherently ambiguous and often prioritize functionality over robustness. Improvements age faster than one can understand them. Each layer of abstraction, each automated fix, introduces new potential failure modes, a new vector for entropy.
The longer arc suggests a move beyond static analysis towards dynamic, runtime verification, and even self-healing code. But even these approaches are ultimately palliative. The true challenge isn’t building more secure code, but accepting that absolute security is an illusion: a fleeting moment in the continuous process of creation and dissolution.
Original article: https://arxiv.org/pdf/2512.04259.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/