Can AI Find the Flaws in Blockchain Code?

Author: Denis Avetisyan

A new benchmark assesses the ability of artificial intelligence agents to identify, fix, and exploit vulnerabilities within smart contracts.

This benchmark evaluates agents in three security modes. Detect measures vulnerability recall through audits of a code repository. Patch verifies a fix by checking that the tests still pass and the exploits fail. Exploit has the agent interact with an Ethereum instance and grades it through transaction replay and vulnerability-specific checks on contracts and balances.

EVMbench provides a comprehensive evaluation of AI-driven security capabilities for Ethereum Virtual Machine-based smart contracts.

The increasing prevalence of smart contracts managing substantial digital value creates a paradox: while automation promises security, vulnerabilities remain a critical risk. To address this, we introduce ‘EVMbench: Evaluating AI Agents on Smart Contract Security’, a comprehensive benchmark designed to rigorously assess the capabilities of AI agents in detecting, patching, and exploiting smart contract weaknesses. Our evaluation, utilizing 117 curated vulnerabilities and a realistic local Ethereum execution environment, demonstrates that current frontier agents are capable of end-to-end vulnerability discovery and exploitation against live blockchain instances. As AI agents become increasingly integrated into blockchain ecosystems, what further advancements are needed to ensure robust and proactive smart contract security?

The Rising Tide of Smart Contract Vulnerabilities

The decentralized finance (DeFi) landscape, while innovative, is increasingly plagued by vulnerabilities within its foundational smart contracts. These self-executing agreements, intended to automate and secure transactions, present a unique attack surface for malicious actors. Exploits, ranging from simple coding errors to complex logic flaws, have already resulted in substantial financial losses – exceeding hundreds of millions of dollars in recent years. The risk isn’t merely financial; compromised smart contracts erode trust in the entire DeFi system, hindering broader adoption. As the value locked within these contracts continues to surge, and their complexity grows with each new application, the potential for devastating exploits correspondingly increases, demanding robust security measures and proactive vulnerability detection.

Current security evaluations for smart contracts frequently struggle to keep pace with the rapidly evolving DeFi landscape. Manual audits, while valuable, are inherently limited by time constraints and the potential for human error, often proving both costly and slow to implement – a critical disadvantage given the speed at which exploits can occur. Furthermore, these traditional methods frequently offer only a snapshot in time, failing to provide the continuous monitoring necessary to detect vulnerabilities introduced through updates or evolving code interactions. This incomplete coverage leaves decentralized applications susceptible to attacks, as even seemingly minor flaws can be exploited to drain funds or compromise the integrity of the system, highlighting the urgent need for more robust and scalable security solutions.

As smart contracts become more sophisticated and enable more complex decentralized applications, the room for subtle vulnerabilities grows with them. Manual review struggles to keep pace with this rising complexity and often misses nuanced flaws within sprawling codebases. Consequently, researchers are actively developing automated tools that use techniques like formal verification, symbolic execution, and machine learning to systematically analyze smart contract code. These tools aim to identify vulnerabilities that might elude human reviewers, covering a broader range of potential exploits and providing a more comprehensive security assessment. The shift towards automated analysis isn’t about replacing auditors, but rather augmenting their capabilities, allowing them to focus on higher-level risks and design flaws while the tools handle the granular, time-consuming task of identifying code-level vulnerabilities.

This smart contract exhibits a vulnerability that could be exploited by malicious actors.

EVMbench: A Framework for Quantifying AI-Driven Smart Contract Security

EVMbench is a benchmark and evaluation framework designed to quantitatively assess AI agents, such as those built on large language models, on smart contract security. It tests an agent’s ability to identify vulnerabilities in smart contracts written for the Ethereum Virtual Machine (EVM), to patch them, and to exploit them. By providing a standardized and automated environment for security analysis, EVMbench moves beyond qualitative assessment and enables comparison of different AI agent architectures and training methodologies in a critical domain: blockchain security.

EVMbench utilizes large language models (LLMs) functioning as AI Agents to automate smart contract security analysis. Specifically, the framework is designed to interface with and evaluate agents powered by models like Gemini 3 Pro and Claude Opus 4.6, enabling standardized testing of their capabilities in identifying vulnerabilities. These agents are not directly integrated into the code; rather, EVMbench provides a structured environment for submitting contract code and receiving agent-generated reports, allowing for quantitative measurement of performance across different LLM architectures and prompt engineering strategies. The reliance on LLM agents facilitates a scalable approach to vulnerability discovery, reducing the need for manual code review and potentially uncovering a wider range of security flaws.

EVMbench utilizes three distinct evaluation modes to comprehensively assess AI agent performance in smart contract security. The Detect mode assesses an agent’s ability to identify vulnerabilities within provided smart contract code. Exploit mode evaluates the agent’s capacity to generate functional exploits leveraging identified vulnerabilities, demonstrating practical impact. Finally, Patch mode tests the agent’s ability to automatically generate code to remediate the discovered vulnerabilities, measuring the quality and effectiveness of the proposed fix. Performance is measured across all three modes, providing a holistic evaluation of an agent’s security capabilities beyond simple vulnerability identification.
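To make the three scoring rules concrete, here is a minimal sketch in Python. It is not EVMbench’s grading code: the function names, the `PatchResult` type, and the idea of matching findings by a string ID are all assumptions made for illustration. It only encodes what the article states, namely that Detect is scored by recall, that a patch counts when tests pass and the exploit fails, and that Exploit is a per-task success.

```python
from dataclasses import dataclass


def detect_recall(reported: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth vulnerabilities that the agent reported.

    Matching by exact ID is a simplification assumed for this sketch.
    """
    if not ground_truth:
        return 0.0
    return len(reported & ground_truth) / len(ground_truth)


@dataclass
class PatchResult:
    tests_pass: bool       # the project's test suite still passes
    exploit_blocked: bool  # the known exploit no longer succeeds


def patch_succeeded(result: PatchResult) -> bool:
    """A patch counts only if it keeps functionality AND stops the exploit."""
    return result.tests_pass and result.exploit_blocked


def success_rate(outcomes: list[bool]) -> float:
    """Share of tasks solved; usable for Patch and Exploit outcomes alike."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    truth = {"H-01", "H-02", "H-03", "H-04"}   # hypothetical finding IDs
    reported = {"H-01", "H-03", "X-99"}        # one false positive
    print(detect_recall(reported, truth))      # 0.5
    print(patch_succeeded(PatchResult(tests_pass=True, exploit_blocked=False)))
```

Note that a false positive such as `X-99` does not lower recall, and that a patch which blocks the exploit but breaks the tests is rejected just like one that leaves the exploit working.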

The number of auditors who identified each EVMbench vulnerability in the Code4rena audits varies widely, reflecting differing levels of engagement and expertise in identifying potential security flaws.

Establishing a Foundation for Reliable and Reproducible Testing

EVMbench utilizes a locally hosted Ethereum chain, deployed within isolated environments – typically Docker containers – to establish a secure and reproducible testing infrastructure. This approach prevents external network dependencies and eliminates potential interference from other processes or services, ensuring consistent test results. The local chain allows for complete control over the blockchain state and transaction flow, while the isolated environments guarantee that each test run operates in a clean and defined context, independent of the host system’s configuration or existing data. This isolation is critical for both automated testing and debugging, enabling reliable validation of EVM behavior and agent interactions.

Deterministic execution is a fundamental requirement for reliable testing within EVMbench, ensuring that given identical inputs, the Ethereum Virtual Machine (EVM) will consistently produce the same output. This predictability is achieved through the EVM’s defined execution model and the elimination of external factors that could introduce variance. The guarantee of consistent results is critical for two primary purposes: firstly, it allows for accurate validation of exploit success; if an exploit is designed to achieve a specific outcome, deterministic execution confirms whether it consistently achieves that outcome given the same initial conditions. Secondly, it enables precise analysis of agent behavior, providing confidence that observed actions are not the result of non-deterministic factors and are therefore representative of the agent’s intended logic.
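The determinism property can be illustrated with a toy state machine. The sketch below does not use the EVM at all; the balance-transfer rule, the `fingerprint` helper, and the account names are assumptions chosen to keep the example small. The point is the check itself: run the same inputs several times and confirm that every run ends in the same state.

```python
import hashlib
import json

Tx = tuple[str, str, int]  # (sender, recipient, amount)


def apply_transfer(state: dict[str, int], tx: Tx) -> dict[str, int]:
    """Return a new state with `amount` moved from sender to recipient."""
    sender, recipient, amount = tx
    if state.get(sender, 0) < amount:
        return dict(state)  # insufficient funds: the transfer is a no-op
    new_state = dict(state)
    new_state[sender] = new_state.get(sender, 0) - amount
    new_state[recipient] = new_state.get(recipient, 0) + amount
    return new_state


def fingerprint(state: dict[str, int]) -> str:
    """Hash a canonical (key-sorted) encoding of the state."""
    encoded = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def run(initial: dict[str, int], txs: list[Tx]) -> str:
    """Execute all transactions in order and fingerprint the final state."""
    state = dict(initial)
    for tx in txs:
        state = apply_transfer(state, tx)
    return fingerprint(state)


def is_deterministic(initial: dict[str, int], txs: list[Tx], runs: int = 3) -> bool:
    """True when every run from the same inputs yields the same fingerprint."""
    return len({run(initial, txs) for _ in range(runs)}) == 1


if __name__ == "__main__":
    start = {"vault": 100, "attacker": 0}
    txs = [("vault", "attacker", 60), ("vault", "attacker", 60)]
    print(is_deterministic(start, txs))
```

Sorting keys before hashing matters here: without a canonical encoding, two equal states could hash differently and the check would report false non-determinism.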

The EVMbench re-execution framework is implemented in Rust to facilitate the precise replay of agent transactions within a controlled environment. This allows for deterministic analysis of agent behavior, enabling developers to step through each transaction and inspect state changes with high fidelity. The framework captures all necessary data, including transaction inputs, gas costs, and resulting state modifications, providing detailed insights into agent actions. By replaying transactions multiple times with the same inputs, potential issues, such as unexpected state transitions or logic errors, can be reliably identified and debugged, ensuring the integrity and predictability of the tested agents.
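The replay idea can be sketched as follows. The framework itself is described as written in Rust; this illustration is in Python and models only the shape of the process, not the real implementation. The transaction format, the `execute` rule, and the `Step` record are hypothetical. Each recorded transaction is re-applied in order, and the state change it causes is captured for inspection.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    index: int
    tx: dict
    # account -> (balance before, balance after), only for changed accounts
    diff: dict = field(default_factory=dict)


def execute(state: dict[str, int], tx: dict) -> dict[str, int]:
    """Toy execution: move `value` from tx['from'] to tx['to'] if affordable."""
    new_state = dict(state)
    if new_state.get(tx["from"], 0) >= tx["value"]:
        new_state[tx["from"]] = new_state.get(tx["from"], 0) - tx["value"]
        new_state[tx["to"]] = new_state.get(tx["to"], 0) + tx["value"]
    return new_state


def replay(initial: dict[str, int], recorded: list[dict]):
    """Re-apply recorded transactions in order, logging a diff per step."""
    state, trace = dict(initial), []
    for i, tx in enumerate(recorded):
        after = execute(state, tx)
        diff = {
            acct: (state.get(acct, 0), after.get(acct, 0))
            for acct in set(state) | set(after)
            if state.get(acct, 0) != after.get(acct, 0)
        }
        trace.append(Step(i, tx, diff))
        state = after
    return state, trace


if __name__ == "__main__":
    recorded = [
        {"from": "pool", "to": "agent", "value": 30},
        {"from": "agent", "to": "sink", "value": 10},
    ]
    final, trace = replay({"pool": 50}, recorded)
    for step in trace:
        print(step.index, step.diff)
    print(final)
```

Keeping a per-step diff rather than only the final state is what makes it possible to step through an agent’s transactions and see exactly which one caused an unexpected transition.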

EVMbench employs on-chain event monitoring to assess exploit success by tracking transactions and the resulting state changes on the local Ethereum chain. This process involves observing the event logs that smart contracts emit during and after agent interactions. These events record actions such as token transfers, contract calls, and modifications to storage variables. By analyzing the sequence and content of these events, EVMbench can determine whether an exploit achieved its intended outcome, such as unauthorized fund withdrawals or contract control hijacking, and validate the agent’s behavior against expected results. Because event logs are recorded with each transaction and cannot be altered afterwards, they give these checks a reliable basis.
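A rough sketch of an event-based success check is shown below. The log format (plain dictionaries with `event`, `from`, `to`, and `value` keys), the function names, and the success criterion are assumptions for illustration; real event logs are ABI-encoded and EVMbench’s checks are specific to each vulnerability. The sketch nets out `Transfer` events per account and asks whether the victim lost, and the attacker gained, at least a threshold amount.

```python
def net_flow(logs: list[dict], account: str) -> int:
    """Net tokens received by `account` across all Transfer events."""
    total = 0
    for log in logs:
        if log.get("event") != "Transfer":
            continue  # ignore unrelated events such as Approval
        if log["to"] == account:
            total += log["value"]
        if log["from"] == account:
            total -= log["value"]
    return total


def exploit_succeeded(logs: list[dict], attacker: str, victim: str,
                      min_drained: int) -> bool:
    """Success: the victim lost, and the attacker gained, at least `min_drained`."""
    return (net_flow(logs, victim) <= -min_drained
            and net_flow(logs, attacker) >= min_drained)


if __name__ == "__main__":
    logs = [
        {"event": "Transfer", "from": "vault", "to": "attacker", "value": 80},
        {"event": "Approval", "owner": "attacker", "spender": "router", "value": 5},
        {"event": "Transfer", "from": "attacker", "to": "helper", "value": 10},
    ]
    print(net_flow(logs, "vault"), net_flow(logs, "attacker"))
    print(exploit_succeeded(logs, "attacker", "vault", min_drained=50))
```

Checking both sides of the flow guards against a run where funds leave the victim but end up somewhere other than the attacker’s account, which this sketch would not count as a successful exploit.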

Across all task modes, agents built on the OpenCode scaffold (OC) demonstrate comparable performance to models run with their native CLIs (e.g., GPT-5.3-Codex with Codex, Claude models with Claude Code, and Gemini 3 Pro with Gemini CLI), as indicated by bootstrap confidence intervals (see Table J for complete results).
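For readers unfamiliar with bootstrap confidence intervals, the following sketch shows one common variant, the percentile bootstrap, applied to per-task success outcomes. The article does not say which bootstrap procedure or how many resamples the authors used, so the defaults here (2,000 resamples, a 95% interval, a fixed seed) and the outcome data are assumptions. Overlapping intervals are a rough, informal signal that two setups perform comparably.

```python
import random


def bootstrap_ci(outcomes: list[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for a mean success rate.

    `outcomes` holds one 0/1 entry per task (1 = solved).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Hypothetical per-task outcomes for two setups on the same 40 tasks.
    native_cli = [1] * 22 + [0] * 18
    open_scaffold = [1] * 20 + [0] * 20
    ci_a = bootstrap_ci(native_cli)
    ci_b = bootstrap_ci(open_scaffold, seed=1)
    print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

With only a few dozen tasks per mode, intervals of this kind tend to be wide, which is worth keeping in mind when reading small differences between agents.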

A Holistic Approach to Smart Contract Security and Future Implications

EVMbench distinguishes itself through a commitment to comprehensive security assessment, moving beyond the typical practice of pinpointing isolated vulnerabilities. The framework is designed to systematically explore a codebase, actively seeking out the full spectrum of potential flaws – from subtle logic errors to critical security breaches. This holistic approach acknowledges that software security isn’t simply about fixing known issues, but about proactively discovering and addressing all possible weaknesses before they can be exploited. By prioritizing exhaustive coverage, EVMbench provides a more robust and reliable evaluation of smart contract security, ultimately fostering more resilient and trustworthy decentralized applications.

EVMbench’s innovative “Patch Mode” transcends simple vulnerability detection by actively employing tools like Foundry to synthesize functional code fixes. This capability demonstrates a significant advancement in automated security – the agent doesn’t merely flag potential flaws, but attempts to resolve them autonomously. By leveraging automated test infrastructure and code generation, the framework can propose and validate patches, effectively automating a portion of the vulnerability remediation process. This proactive approach suggests a future where security agents can contribute to a codebase’s security posture not just through identification, but through direct, automated correction, potentially reducing the burden on human developers and accelerating the resolution of critical security risks.

EVMbench distinguishes itself through a deliberate focus on high-severity vulnerabilities, recognizing that not all security flaws are created equal. This prioritization moves beyond simply identifying a large number of issues to concentrate on those posing the greatest immediate risk to smart contract security and user funds. By targeting vulnerabilities with the potential for significant damage – such as critical bugs allowing for unauthorized fund access or contract manipulation – the framework maximizes its impact on real-world security. This strategic approach ensures that limited security resources are directed towards mitigating the most dangerous threats, offering a more efficient and effective method for improving the overall resilience of the Ethereum ecosystem and beyond.

In EVMbench’s Exploit mode, an agent using the GPT-5.3-Codex language model achieves a 71.0% success rate. The result shows that a current agent can not only identify flaws within Ethereum Virtual Machine (EVM) code but also generate functional exploits automatically. For defenders, this suggests a pathway toward more proactive security measures, potentially reducing the time window in which malicious actors can capitalize on newly discovered vulnerabilities and bolstering the overall resilience of decentralized applications. Automated exploitation of this kind offers a powerful tool for security researchers, developers, and auditors seeking to rigorously test and harden smart contract systems before deployment.

In Detect mode, an agent using the Claude Opus 4.6 large language model achieves a score of 45.9%. The score reflects how well the agent pinpoints known security flaws within smart contract code before any exploitation attempt. The result highlights Claude Opus 4.6’s proficiency in understanding complex code structures and identifying patterns indicative of vulnerabilities, offering a promising avenue for proactive security assessments and for reducing the attack surface of blockchain applications. Automated detection at this level is useful to developers and auditors seeking to enhance the security posture of their projects and maintain user trust within the decentralized finance ecosystem.

In Patch mode, an agent using the GPT-5.3-Codex language model achieves a score of 41.7%. This indicates that an agent can not only pinpoint security flaws within smart contract code but also generate viable patches autonomously. While the capability is still maturing, this level of success suggests a future where automated tools significantly reduce the manual effort required for vulnerability fixes, bolstering the security and reliability of blockchain applications. Automatically synthesizing corrections goes a step beyond vulnerability detection and could offer a scalable answer to the ongoing challenge of smart contract security maintenance.

Gemini 3 Pro, GPT-5.2, and GPT-5.3-Codex demonstrate the highest accuracy in detecting the number of vulnerabilities compared to ground truth data.

The pursuit of robust smart contract security, as detailed in EVMbench, demands a relentless focus on essential functionality. The benchmark’s emphasis on vulnerability detection, patching, and exploit development mirrors a core tenet of effective design: stripping away unnecessary complexity to reveal underlying strength. As Barbara Liskov aptly stated, “Programs must be right first before they are fast.” EVMbench isn’t merely about identifying flaws; it’s about establishing a foundation of correctness, a commitment to building secure systems where every line of code contributes to a resilient and trustworthy whole. The benchmark’s comprehensive approach embodies this principle, evaluating AI agents not on superficial features, but on their ability to deliver verifiable security.

Further Vectors

The introduction of EVMbench establishes a necessary, if belated, quantification of capability in a domain previously reliant on anecdote. However, the benchmark’s utility is inherently limited by its static nature. Vulnerabilities, by definition, evolve. Future iterations must embrace dynamic evaluation – agents assessed not solely on known flaws, but on their capacity for continual learning and adaptation to novel attack surfaces. The current focus on detection and patching, while pragmatic, obscures a more fundamental question: can these agents anticipate vulnerability emergence through formal verification or predictive analysis?

A persistent limitation remains the reliance on simulated environments. The chasm between benchmark performance and real-world efficacy will likely prove substantial. Transfer learning, applied judiciously, may bridge this gap, but demands careful consideration of the inherent biases within the training data. The pursuit of increasingly complex agent architectures should not overshadow the imperative for interpretability. An agent capable of identifying and mitigating threats is valuable; one that explains why is essential.

Ultimately, the value of automated security assessment resides not in replacing human auditors, but in augmenting their capabilities. The focus should shift from a competitive framing – agent versus vulnerability – to a collaborative one: agent as a tool for enhanced human understanding. Unnecessary complexity in this pursuit is violence against attention; density of meaning, the new minimalism.

Original article: https://arxiv.org/pdf/2603.04915.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
