OpenAI and Paradigm Launch EVMbench to Test AI Agents on Smart-Contract Security

Overview

The new EVMbench covers 120 vulnerabilities drawn from 40 audits and evaluates AI agents in three modes: detect, patch, and exploit.
In exploit mode, GPT‑5.3‑Codex scored 72.2%, compared with 31.9% for GPT‑5.
OpenAI says it will release the tasks and tooling, and it is committing $10 million in API credits to support defenders.

OpenAI, working with crypto firm Paradigm, has unveiled EVMbench, a benchmark designed to measure how well AI “agents” can find, fix, or exploit vulnerabilities in Ethereum Virtual Machine smart contracts.

In a post from its @OpenAI account on February 18, 2026, OpenAI said EVMbench measures how well AI agents detect, exploit, and patch high‑severity smart‑contract vulnerabilities.

The benchmark consists of 120 curated vulnerabilities drawn from 40 audits, most of them from open audit competitions such as Code4rena.

OpenAI says EVMbench also draws on the security audit process for Tempo, a Layer‑1 blockchain built for stablecoin payments, an area it expects to grow as “agentic” payments become more common.

The Three Modes of EVMbench

EVMbench evaluates three capabilities that mirror real-world security work:

Detect: Agents audit a smart-contract repository. Their score reflects how many of the known vulnerabilities they find, weighted against the rewards paid out in the original audit.
Patch: Agents modify the vulnerable code, aiming to remove the vulnerability while preserving the contract's intended functionality. Results are verified by automated tests and exploit checks.
Exploit: Agents attempt to drain funds from contracts deployed in a sandboxed environment. Results are graded through transaction replay and on‑chain verification.
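As a rough illustration of how a detect-mode score could combine the number of vulnerabilities found with audit rewards, the Python sketch below reports both plain recall and the share of reward money covered. The function name, the vulnerability IDs, and the payout figures are all hypothetical; the article does not give EVMbench's actual scoring formula.

```python
def detect_score(found, rewards):
    """Return (recall, reward_share) for a set of reported finding IDs.

    `rewards` maps each known vulnerability ID to its audit payout
    (hypothetical values). Reported IDs that are not known are ignored.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    hits = set(found) & set(rewards)
    recall = len(hits) / len(rewards)
    reward_share = sum(rewards[v] for v in hits) / sum(rewards.values())
    return recall, reward_share


# Hypothetical audit with three known issues and their payouts.
known = {"H-01": 6000.0, "H-02": 3000.0, "H-03": 1000.0}
recall, share = detect_score({"H-01", "H-03", "not-a-real-bug"}, known)
print(f"recall={recall:.2f} reward_share={share:.2f}")
```

In this made-up case the agent finds two of three issues, so the two numbers differ: reward weighting gives more credit for the higher-paid finding than a simple count would.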

To guard against cheating and keep results reliable, OpenAI built a Rust‑based harness that replays agent transactions deterministically and restricts unsafe RPC methods. Exploit tasks run on a local Anvil instance rather than on live networks, and they rely on historical, documented vulnerabilities.
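One plausible way to restrict unsafe RPC calls is an allowlist placed in front of the local node. The sketch below is written in Python for brevity and is not OpenAI's Rust harness. The choice of allowed methods and the name `filter_request` are assumptions made for illustration. The method names themselves are standard Ethereum JSON-RPC methods, and `anvil_setBalance` is an Anvil cheat method that would let an agent fake a successful drain.

```python
import json

# Hypothetical allowlist: standard read/send methods an agent might need.
ALLOWED_METHODS = {
    "eth_blockNumber",
    "eth_call",
    "eth_getBalance",
    "eth_getTransactionReceipt",
    "eth_sendRawTransaction",
}


def filter_request(raw):
    """Decide whether a raw JSON-RPC request may be forwarded to the node.

    Returns a dict with `forward` set to True plus the parsed request, or
    `forward` set to False plus a JSON-RPC error response for the caller.
    """
    req = json.loads(raw)
    method = req.get("method")
    if method in ALLOWED_METHODS:
        return {"forward": True, "request": req}
    return {
        "forward": False,
        "response": {
            "jsonrpc": "2.0",
            "id": req.get("id"),
            # -32601 is the JSON-RPC 2.0 "method not found" error code.
            "error": {"code": -32601, "message": f"method not allowed: {method}"},
        },
    }


ok = filter_request(
    '{"jsonrpc":"2.0","id":1,"method":"eth_getBalance",'
    '"params":["0x0000000000000000000000000000000000000000","latest"]}'
)
blocked = filter_request(
    '{"jsonrpc":"2.0","id":2,"method":"anvil_setBalance","params":[]}'
)
print(ok["forward"], blocked["forward"])
```

An allowlist is used here rather than a blocklist so that any method not explicitly permitted is rejected by default.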

Early Results: Exploit Mode Leads

OpenAI says agents currently perform best in exploit mode, where the objective is explicit: keep iterating until the funds are drained. GPT‑5.3‑Codex scored 72.2%, up from 31.9% for GPT‑5, a gain of 40.3 percentage points and a sign of how quickly these capabilities are improving.

Results in detection and patching are weaker. In detect mode, agents sometimes stop after finding a single vulnerability. Patching proves hardest, because it requires removing subtle flaws without breaking the contract's intended behavior.

Implications for DeFi Teams

For blockchain developers and security teams, EVMbench arrives at an uneasy moment: as AI agents get better at executing code and planning in iterative loops, the gap between discovering a bug and exploiting it on-chain shrinks to almost nothing.

OpenAI describes the capability as dual-use: it can help defenders, but it can equally speed up attackers in a domain where attacks are fast and irreversible.

OpenAI frames EVMbench as both a measuring stick and a step toward better defense. The company promises expanded safeguards for dual‑use capabilities, such as monitoring and “trusted access” mechanisms. It also points to the private beta of its security research agent “Aardvark” and to a commitment of $10 million in API credits through its Cybersecurity Grant Program to support defensive work.

If EVMbench is widely adopted, it could shape how auditors and protocols evaluate AI‑assisted tools, separating those that merely describe vulnerabilities from those that can demonstrate exploitability, deliver safe patches, and avoid false positives in production‑grade codebases.

2026-02-18 23:00
