
55.8% OF AI-GENERATED CODE IS VULNERABLE. Z3 SAYS SO. STATIC TOOLS CATCH 2.2%.


Z3-verified study: AI coding assistants generate vulnerable code 55.8% of the time. Semgrep/Bandit/CodeQL catch 2.2%. Security prompts move the needle 4 points. Models catch their own bugs in review at 78.7% — they know, they just don’t do.

The paper

Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code — Blain & Noiseux, Cobalt AI, April 2026 (arXiv:2604.05292v2). Dataset and scripts: github.com/dom-omg/bbd-dataset.

What they did:

  • 3,500 code artifacts
  • 7 production LLMs
  • 500 security-critical prompts
  • Z3 SMT formal verification — not regex, not pattern matching, actual proofs of exploitability

The headline numbers

Finding                                                        Number
Default vulnerability rate                                     55.8%
Vulnerabilities with Z3-proved exploitability                  1,055
Combined catch rate of industry-standard static tools          2.2%
CodeQL v2.25.1 (security-extended) detection rate              0% (0 of 90)
Detection rate when models review their own output             78.7%
Improvement from “please produce secure code” system prompts   4 pp (64.8% → 60.8%)
Models remaining at grade F after security prompts             4 of 5

The authors’ summary: “Security prompts are security theater.” And: “A 4-point improvement that leaves four of five models at grade F is not a control. It’s noise.”

Generation-review asymmetry

The single most useful finding in this paper is not the raw failure rate. It’s that the same models that write broken code at 55.8% correctly identify their own bugs on re-read 78.7% of the time.

“The models possess the security knowledge. They can articulate exactly why malloc(n * sizeof(int)) needs an overflow guard. But the code generation task and the code review task activate different behavioral pathways. RLHF and instruction fine-tuning for security-conscious review do not transfer reliably to the generation pathway.”
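The guard the quote refers to is a one-line check. A minimal C sketch (the function name `alloc_ints_checked` is ours, not the paper's):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n ints, refusing when n * sizeof(int) would wrap
 * around SIZE_MAX. Without the check, a huge n silently
 * truncates the size malloc receives, yielding an undersized
 * buffer and a heap overflow on first write. */
int *alloc_ints_checked(size_t n)
{
    if (n > SIZE_MAX / sizeof(int))
        return NULL;            /* would overflow: fail loudly */
    return malloc(n * sizeof(int));
}
```

The models can articulate this check on review; the study's point is that they routinely omit it on generation.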

Operationally, that means generate-then-review is already better than the current default, which is “ship whatever the agent produced.” It is not a full solution — the 21.3% false-negative rate is not “safe” by any stretch — but it is the cheapest compensating control this paper implies, and it works today.

Corroboration

Blain & Noiseux are not alone:

  • Veracode GenAI Code Security Report (2025–2026): 100+ LLMs, 80 tasks, 45% security failure rate, unchanged despite vendor “we fixed it” claims through early 2026. Java worst at 72%.
  • Perry et al. (ACM CCS 2023): developers using AI assistants wrote significantly more security bugs while reporting higher confidence.
  • Pearce et al. (IEEE S&P 2022): 40% of GitHub Copilot suggestions contained vulnerabilities across 89 scenarios.
  • Georgia Tech Vibe Security Radar: 35 CVEs attributed to AI coding tools in March 2026 alone, up from 6 in January. Estimated true count: 400–700 across the open-source ecosystem.
  • Escape.tech + Apiiro field scans: 1,400+ vibe-coded production apps, 65% had security issues, 58% had at least one critical, 400+ exposed secrets.
  • AI-assisted developers commit 3–4× faster and introduce security findings at 10× the rate; CVSS 7.0+ bugs appear 2.5× more often in AI-generated code.

The field data and the formal verification are pointing at the same thing.

What this means for vibe coders and teams shipping AI-generated code

  1. Stop using “we prompt for secure code” as a control. It does not work. Put it in a style guide if you must, but not in a risk register.
  2. Assume every AI-generated diff is untrusted input. Same trust posture you have for user-supplied form data — that is the mental model the study’s authors argue for, and it’s correct.
  3. Adopt generate-then-review. Pipe each AI-generated diff back through a model with a review-focused prompt. You will catch ~78.7% of the bugs the same model just wrote. Not a silver bullet — plan for the 21.3% — but high leverage, low cost.
  4. Test the deployed thing. Of the vulnerabilities Z3 proved exploitable in source, the current static toolchain catches 2.2%. The remaining 97.8% only show up when you probe the running app — exactly the class of behavior static tools can’t see (missing auth, leaked secrets, open storage, weak RLS).
  5. Escalate for high-stakes domains. Authentication, cryptography, payments, systems-level memory: these are where the study’s vulnerability rates are highest. In those domains, mandatory human review by someone with domain-specific security expertise is the floor, not the ceiling.
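Step 3's gate reduces to one decision: a diff merges automatically only if a review pass comes back clean. A minimal C sketch of that logic, with a string-matching stub standing in for the model call (`llm_review_flags_issue` is hypothetical; a real pipeline would send the diff back to the generating model with a review-focused prompt):

```c
#include <stdbool.h>
#include <string.h>

/* Stub reviewer: flags a diff that multiplies into malloc
 * without the SIZE_MAX overflow guard. Placeholder only; in
 * practice this is an LLM call, not a substring match. */
static bool llm_review_flags_issue(const char *diff)
{
    return strstr(diff, "malloc(n * sizeof") != NULL
        && strstr(diff, "SIZE_MAX / sizeof") == NULL;
}

/* Generate-then-review gate: flagged diffs go to a human,
 * clean diffs may proceed. Expect ~78.7% of the model's own
 * bugs to be caught here; plan separately for the 21.3%. */
bool diff_may_automerge(const char *diff)
{
    return !llm_review_flags_issue(diff);
}
```

The design point is that the gate is cheap and sits entirely in CI; only the reviewer stub needs to become real.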

How this fits with the rest of 2026’s evidence

This week alone we’ve catalogued a string of individual incidents. Those incidents are not freak events; Broken by Default is the base rate behind them. They are the downstream shape of a 55.8% default vulnerability rate meeting a 2.2% static-tool catch rate.

The only compensating control that survives first contact

Probe the deployed app. VibeEval does exactly that — agents that behave like attackers against your running product, looking for exactly the class of bugs this paper quantifies and that static tools miss. When 97.8% of provable vulnerabilities are invisible to Semgrep/Bandit/CodeQL/Cppcheck/Clang/FlawFinder combined, the probe is not optional. It is the control.

Source: infosecbytes.io — Your AI Coding Assistant Has a 55.8% Chance of Writing Vulnerable Code. Study: Blain & Noiseux, Broken by Default, arXiv:2604.05292v2, April 2026.

STOP GUESSING. SCAN YOUR APP.

Join the founders who shipped secure instead of shipped exposed. 14-day trial, no card.

START FREE SCAN