55.8% OF AI-GENERATED CODE IS VULNERABLE. Z3 SAYS SO. STATIC TOOLS CATCH 2.2%.
TEST YOUR APP NOW
Enter your deployed app URL to check for security vulnerabilities.
Z3-verified study: AI coding assistants generate vulnerable code 55.8% of the time. Semgrep/Bandit/CodeQL catch 2.2%. Security prompts move the needle 4 points. Models catch their own bugs in review at 78.7% — they know, they just don’t do.
The paper
Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code — Blain & Noiseux, Cobalt AI, April 2026 (arXiv:2604.05292v2). Dataset and scripts: github.com/dom-omg/bbd-dataset.
What they did:
- 3,500 code artifacts
- 7 production LLMs
- 500 security-critical prompts
- Z3 SMT formal verification — not regex, not pattern matching, actual proofs of exploitability
The headline numbers
| Finding | Number |
|---|---|
| Default vulnerability rate | 55.8% |
| Vulnerabilities with Z3-proved exploitability | 1,055 |
| Industry-standard static tools that catch them (combined) | 2.2% |
| CodeQL v2.25.1 (security-extended) detection rate | 0% (0 of 90) |
| Rate when models are asked to review their own output | 78.7% |
| Improvement from “please produce secure code” system prompts | 4 pp (64.8% → 60.8%) |
| Models remaining at grade F after security prompts | 4 of 5 |
The authors’ summary: “Security prompts are security theater.” And: “A 4-point improvement that leaves four of five models at grade F is not a control. It’s noise.”
Generation-review asymmetry
The single most useful finding in this paper is not the raw failure rate. It’s that the same models that write broken code at 55.8% correctly identify their own bugs on re-read 78.7% of the time.
“The models possess the security knowledge. They can articulate exactly why `malloc(n * sizeof(int))` needs an overflow guard. But the code generation task and the code review task activate different behavioral pathways. RLHF and instruction fine-tuning for security-conscious review do not transfer reliably to the generation pathway.”
Operationally, that means generate-then-review is already better than the current default, which is “ship whatever the agent produced.” It is not a full solution — the 21.3% false-negative rate is not “safe” by any stretch — but it is the cheapest compensating control this paper implies, and it works today.
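That control can be sketched in a few lines. This is an assumed shape, not the paper's implementation: the model calls are injected as plain callables (`generate`, `review` stand in for whatever LLM client you use), so the loop itself stays testable.

```python
from typing import Callable


def generate_then_review(
    task: str,
    generate: Callable[[str], str],        # prompt -> code
    review: Callable[[str], list[str]],    # code -> list of findings
    max_rounds: int = 2,
) -> tuple[str, list[str]]:
    """Generate code, then re-read it with a review-focused prompt.

    Per the paper, the review pass flags ~78.7% of the generator's own
    bugs; anything it reports goes back for a fix. Anything it misses
    (~21.3%) still needs a downstream control.
    """
    code = generate(task)
    for _ in range(max_rounds):
        findings = review(code)
        if not findings:
            break
        fix_prompt = f"{task}\n\nFix these review findings:\n" + "\n".join(findings)
        code = generate(fix_prompt)
    return code, review(code)
```

Used with real model calls, both callables would hit the same model with different system prompts — one generation-focused, one review-focused — which is exactly the asymmetry the study measures.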
Corroboration
Blain & Noiseux are not alone:
- Veracode GenAI Code Security Report (2025–2026): 100+ LLMs, 80 tasks, 45% security failure rate, unchanged despite vendor “we fixed it” claims through early 2026. Java worst at 72%.
- Perry et al. (ACM CCS 2023): developers using AI assistants wrote significantly more security bugs while reporting higher confidence.
- Pearce et al. (IEEE S&P 2022): 40% of Copilot suggestions contained vulnerabilities across 89 scenarios.
- Georgia Tech Vibe Security Radar: 35 CVEs attributed to AI coding tools in March 2026 alone, up from 6 in January. Estimated true count: 400–700 across the open-source ecosystem.
- Escape.tech + Apiiro field scans: 1,400+ vibe-coded production apps, 65% had security issues, 58% had at least one critical, 400+ exposed secrets.
- AI-assisted developers commit 3–4× faster and introduce security findings at 10× the rate; CVSS 7.0+ bugs appear 2.5× more often in AI-generated code.
The field data and the formal verification are pointing at the same thing.
What this means for vibe coders and teams shipping AI-generated code
- Stop using “we prompt for secure code” as a control. It does not work. Put it in a style guide if you must, but not in a risk register.
- Assume every AI-generated diff is untrusted input. Same trust posture you have for user-supplied form data — that is the mental model the study’s authors argue for, and it’s correct.
- Adopt generate-then-review. Pipe each AI-generated diff back through a model with a review-focused prompt. You will catch ~78.7% of the bugs the same model just wrote. Not a silver bullet — plan for the 21.3% — but high leverage, low cost.
- Test the deployed thing. The current static toolchain sees only 2.2% of the Z3-proved vulnerabilities in source. The other 97.8% only show up when you probe the running app — exactly the class of behavior static tools can’t see (missing auth, leaked secrets, open storage, weak RLS).
- Escalate for high-stakes domains. Authentication, cryptography, payments, systems-level memory: these are where the study’s vulnerability rates are highest. In those domains, mandatory human review by someone with domain-specific security expertise is the floor, not the ceiling.
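A minimal sketch of what “probe the deployed thing” means for the simplest case, exposed files. The path list and function names are illustrative assumptions, not VibeEval's actual probe set, and the HTTP fetch is injected as a callable so the check logic is self-contained:

```python
from typing import Callable

# Illustrative probe list — the real class of checks is far broader
# (missing auth, open storage, weak RLS). These paths are assumptions.
LEAK_PATHS = ["/.env", "/.git/config", "/backup.sql"]


def probe_exposed_paths(
    base_url: str,
    fetch: Callable[[str], int],  # URL -> HTTP status code
) -> list[str]:
    """Flag probe paths that answer 200 — files a deployed app
    should never serve to an anonymous client."""
    base = base_url.rstrip("/")
    return [p for p in LEAK_PATHS if fetch(base + p) == 200]
```

In practice `fetch` would wrap `urllib.request` or an HTTP client; the point is that this class of finding is invisible to any tool that only reads the source tree.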
How this fits with the rest of 2026’s evidence
This week alone we’ve catalogued:
- A Lovable BOLA disclosure exposing other users’ `.env` files on the $6.6B vibe-coding platform.
- A Vercel/Context.ai breach where one compromised AI tool pivoted into a Google Workspace and then into cloud infra.
- Snyk ToxicSkills + arXiv MCP prompt injection showing 13.4% of agent skills have critical security issues and MCP clients have wildly uneven guardrails.
- February 2026’s 170-database Lovable breach.
Those are incidents. Broken by Default is the base rate behind them. The incidents are not freak events. They are the downstream shape of a 55.8% default vulnerability rate meeting a 2.2% static-tool catch rate.
The only compensating control that survives first contact
Probe the deployed app. VibeEval does exactly that — agents that behave like attackers against your running product, looking for exactly the class of bugs this paper quantifies and that static tools miss. When 97.8% of provable vulnerabilities are invisible to Semgrep/Bandit/CodeQL/Cppcheck/Clang/FlawFinder combined, the probe is not optional. It is the control.
Source: infosecbytes.io — Your AI Coding Assistant Has a 55.8% Chance of Writing Vulnerable Code. Study: Blain & Noiseux, Broken by Default, arXiv:2604.05292v2, April 2026.
STOP GUESSING. SCAN YOUR APP.
Join the founders who shipped secure instead of shipped exposed. 14-day trial, no card.