February 19, 2026

4 min. read

OpenAI and Paradigm's EVMBench: The First Serious Test for AI Security Agents

OpenAI and Paradigm just open-sourced EVMBench, a benchmark that tests how well AI agents can detect, patch, and exploit vulnerabilities in smart contracts. Smart contracts secure over $100 billion in crypto assets, and until now, there was no standardized way to measure whether AI agents could meaningfully help protect them.

That just changed.

What EVMBench Actually Tests

EVMBench evaluates AI agents across three tasks, each representing a different layer of security work:

Detection: Agents audit smart contract repositories and try to find known vulnerabilities. Scoring is based on recall, meaning whether the agent catches what professional auditors already documented.

Patching: Agents attempt to fix vulnerable code without breaking the contract. This is harder than it sounds. The fix has to remove the vulnerability while preserving every other piece of functionality, including edge cases the agent may not fully understand.

Exploitation: Agents try to drain funds from vulnerable contracts inside sandboxed blockchain environments. Grading is deterministic, based on whether the agent actually triggers the on-chain state change that would steal funds in a real deployment.
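To make the scoring concrete, here is a minimal sketch of how recall-based detection grading and a deterministic exploit check could work. This is not EVMBench's actual harness; the function names, inputs, and the balance-comparison check are illustrative assumptions based on the descriptions above.

```python
# Hypothetical sketch of two EVMBench-style graders. All names and data
# shapes are illustrative, not the benchmark's real interface.

def detection_recall(reported: set[str], documented: set[str]) -> float:
    """Recall: the fraction of auditor-documented vulnerabilities
    that the agent's audit report actually covered."""
    if not documented:
        return 1.0
    return len(reported & documented) / len(documented)

def exploit_succeeded(balance_before: int, balance_after: int) -> bool:
    """Deterministic exploit check: did the attack transaction drain
    funds, i.e. did the victim contract's balance (in wei) decrease?"""
    return balance_after < balance_before

# Example: the agent finds two of three documented bugs...
score = detection_recall(
    {"reentrancy", "integer-overflow"},
    {"reentrancy", "integer-overflow", "access-control"},
)
print(round(score, 2))  # 0.67

# ...and its exploit drains the sandboxed contract from 10 ETH to 0.
print(exploit_succeeded(10 * 10**18, 0))  # True
```

The point of the deterministic check is that there is no judgment call in grading: either the on-chain state changed in the fund-draining way or it did not.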

The dataset behind all of this: 120 curated vulnerabilities from 40 real audits, mostly sourced from Code4rena open audit competitions, plus additional scenarios from Stripe's Tempo blockchain project. Each task is containerized, so agents run in isolated, reproducible environments.

Every task also comes with a verified answer key. The benchmark itself has been validated to ensure every vulnerability is actually exploitable, patchable, and detectable.

The Numbers Tell the Story

When OpenAI and Paradigm started building EVMBench, the best models could exploit less than 20% of the critical, fund-draining Code4rena bugs.

Today, GPT-5.3-Codex (accessed through OpenAI's Codex CLI) hits a 72.2% success rate in exploit mode. For context, GPT-5, released just six months earlier, scored 31.9% on the same tasks.

That's a jump from roughly one in five to nearly three in four, in under a year.

Detection and patching remain weaker. Agents still struggle to conduct exhaustive audits across full codebases, and patching requires understanding deeper design assumptions that go beyond the immediate vulnerability. But the trajectory on exploitation alone is hard to ignore.

As Paradigm put it: "The rate of improvement is incredible."

Why This Matters Beyond Crypto

EVMBench is framed as a blockchain security tool, but the implications extend well past crypto.

Most AI agent benchmarks today test on synthetic tasks, toy datasets, or narrow coding challenges. EVMBench is different because it evaluates agents against real production code where mistakes have direct financial consequences. These aren't hypothetical bugs. They're vulnerabilities that, if exploited in the wild, would drain real money from real protocols.

That's exactly the kind of benchmark the broader AI agent space has been missing. When you're deploying agents in enterprise environments to handle back-office processes, compliance checks, or financial operations, you need to know how they perform against real-world complexity, not sanitized test cases.

The three-mode framework (detect, patch, exploit) is also worth paying attention to. It doesn't just test whether an agent can find a problem. It tests whether it can fix it without breaking things, and whether it understands the problem deeply enough to reproduce it. That layered evaluation is closer to how you'd want to assess any production-grade agent, regardless of the domain.

What This Signals for Enterprise AI

A few takeaways for anyone building or deploying AI agents:

Benchmarks against real work matter more than benchmarks against synthetic tasks. EVMBench's entire value comes from using actual audit data from production contracts. The same logic applies to enterprise: if you're evaluating an AI agent for invoice processing or HR operations, test it against your real data and real edge cases, not a curated demo.

Agent capability is improving faster than most people expect. Going from 20% to 72% exploit accuracy in months isn't incremental progress. It's the kind of capability jump that changes what's feasible. The agents that looked like toys a year ago are now doing work that would take a skilled human auditor hours.

Patching is still the hardest problem. Finding issues is one thing. Fixing them without introducing new ones is where agents still struggle. This maps directly to enterprise deployments: the agents that create the most value aren't the ones that flag problems. They're the ones that resolve them end to end while maintaining everything else that was working.

Open benchmarks accelerate the entire ecosystem. By open-sourcing EVMBench, OpenAI and Paradigm are giving every research team and agent builder a shared standard to measure against. Paradigm has already extended the benchmark into an operational auditing agent. Expect more teams to follow.

The Bigger Picture

Paradigm's conclusion is worth repeating: "A growing portion of audits in the future will be done by agents."

Replace "audits" with almost any knowledge-intensive, rules-based task, and the statement still holds. The question isn't whether AI agents will do this work. It's how fast they get good enough, and how we measure "good enough" in domains where mistakes are expensive.

EVMBench is one answer to that question. Enterprise needs more like it.

Start today

Get started with AI agents to automate processes

Use our platform now and start building AI agents for all kinds of automations
