
55.8% OF AI-GENERATED CODE IS VULNERABLE. Z3 SAYS SO. STATIC TOOLS CATCH 2.2%.


Z3-verified study: AI coding assistants generate vulnerable code 55.8% of the time. Semgrep/Bandit/CodeQL catch 2.2%. Security prompts move the needle 4 points. Models catch their own bugs in review at 78.7% — they know, they just don’t do.

The paper

Broken by Default: A Formal Verification Study of Security Vulnerabilities in AI-Generated Code — Blain & Noiseux, Cobalt AI, April 2026 (arXiv:2604.05292v2). Dataset and scripts: github.com/dom-omg/bbd-dataset.

What they did:

  • 3,500 code artifacts
  • 7 production LLMs
  • 500 security-critical prompts
  • Z3 SMT formal verification — not regex, not pattern matching, actual proofs of exploitability

The headline numbers

Finding                                                        Number
Default vulnerability rate                                     55.8%
Vulnerabilities with Z3-proved exploitability                  1,055
Combined catch rate of industry-standard static tools          2.2%
CodeQL v2.25.1 (security-extended) detection rate              0% (0 of 90)
Detection rate when models review their own output             78.7%
Improvement from “please produce secure code” system prompts   4 pp (64.8% → 60.8%)
Models remaining at grade F after security prompts             4 of 5

The authors’ summary: “Security prompts are security theater.” And: “A 4-point improvement that leaves four of five models at grade F is not a control. It’s noise.”

Generation-review asymmetry

The single most useful finding in this paper is not the raw failure rate. It’s that the same models that write broken code at 55.8% correctly identify their own bugs on re-read 78.7% of the time.

“The models possess the security knowledge. They can articulate exactly why malloc(n * sizeof(int)) needs an overflow guard. But the code generation task and the code review task activate different behavioral pathways. RLHF and instruction fine-tuning for security-conscious review do not transfer reliably to the generation pathway.”
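The guard the quote refers to is a one-line check. A minimal C sketch (the function name `alloc_ints_checked` is ours, not the paper's):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n ints, refusing when n * sizeof(int) would wrap
 * around SIZE_MAX. Without the check, a huge n silently
 * truncates the size malloc receives, yielding an undersized
 * buffer and a heap overflow on first write. */
int *alloc_ints_checked(size_t n)
{
    if (n > SIZE_MAX / sizeof(int))
        return NULL;            /* would overflow: fail loudly */
    return malloc(n * sizeof(int));
}
```

The models can articulate this check on review; the study's point is that they routinely omit it on generation.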

Operationally, that means generate-then-review is already better than the current default, which is “ship whatever the agent produced.” It is not a full solution — the 21.3% false-negative rate is not “safe” by any stretch — but it is the cheapest compensating control this paper implies, and it works today.

Corroboration

Blain & Noiseux are not alone:

  • Veracode GenAI Code Security Report (2025–2026): 100+ LLMs, 80 tasks, 45% security failure rate, unchanged despite vendor “we fixed it” claims through early 2026. Java worst at 72%.
  • Perry et al. (ACM CCS 2023): developers using AI assistants wrote significantly more security bugs while reporting higher confidence.
  • Pearce et al. (IEEE S&P 2022): 40% of GitHub Copilot suggestions contained vulnerabilities across 89 scenarios.
  • Georgia Tech Vibe Security Radar: 35 CVEs attributed to AI coding tools in March 2026 alone, up from 6 in January. Estimated true count: 400–700 across the open-source ecosystem.
  • Escape.tech + Apiiro field scans: 1,400+ vibe-coded production apps, 65% had security issues, 58% had at least one critical, 400+ exposed secrets.
  • AI-assisted developers commit 3–4× faster and introduce security findings at 10× the rate; CVSS 7.0+ bugs appear 2.5× more often in AI-generated code.

The field data and the formal verification are pointing at the same thing.

What this means for vibe coders and teams shipping AI-generated code

  1. Stop using “we prompt for secure code” as a control. It does not work. Put it in a style guide if you must, but not in a risk register.
  2. Assume every AI-generated diff is untrusted input. Same trust posture you have for user-supplied form data — that is the mental model the study’s authors argue for, and it’s correct.
  3. Adopt generate-then-review. Pipe each AI-generated diff back through a model with a review-focused prompt. You will catch ~78.7% of the bugs the same model just wrote. Not a silver bullet — plan for the 21.3% — but high leverage, low cost.
  4. Test the deployed thing. Of the vulnerabilities Z3 proved exploitable in source, the current static toolchain catches 2.2%. The remaining 97.8% only show up when you probe the running app — exactly the class of behavior static tools can’t see (missing auth, leaked secrets, open storage, weak RLS).
  5. Escalate for high-stakes domains. Authentication, cryptography, payments, systems-level memory: these are where the study’s vulnerability rates are highest. In those domains, mandatory human review by someone with domain-specific security expertise is the floor, not the ceiling.
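Step 3's gate reduces to one decision: a diff merges automatically only if a review pass comes back clean. A minimal C sketch of that logic, with a string-matching stub standing in for the model call (`llm_review_flags_issue` is hypothetical; a real pipeline would send the diff back to the generating model with a review-focused prompt):

```c
#include <stdbool.h>
#include <string.h>

/* Stub reviewer: flags a diff that multiplies into malloc
 * without the SIZE_MAX overflow guard. Placeholder only; in
 * practice this is an LLM call, not a substring match. */
static bool llm_review_flags_issue(const char *diff)
{
    return strstr(diff, "malloc(n * sizeof") != NULL
        && strstr(diff, "SIZE_MAX / sizeof") == NULL;
}

/* Generate-then-review gate: flagged diffs go to a human,
 * clean diffs may proceed. Expect ~78.7% of the model's own
 * bugs to be caught here; plan separately for the 21.3%. */
bool diff_may_automerge(const char *diff)
{
    return !llm_review_flags_issue(diff);
}
```

The design point is that the gate is cheap and sits entirely in CI; only the reviewer stub needs to become real.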

How this fits with the rest of 2026’s evidence

This week alone we’ve catalogued a string of individual incidents. Those incidents are not freak events; Broken by Default is the base rate behind them. They are the downstream shape of a 55.8% default vulnerability rate meeting a 2.2% static-tool catch rate.

The only compensating control that survives first contact

Probe the deployed app. VibeEval does exactly that — agents that behave like attackers against your running product, looking for exactly the class of bugs this paper quantifies and that static tools miss. When 97.8% of provable vulnerabilities are invisible to Semgrep/Bandit/CodeQL/Cppcheck/Clang/FlawFinder combined, the probe is not optional. It is the control.

Source: infosecbytes.io — Your AI Coding Assistant Has a 55.8% Chance of Writing Vulnerable Code. Study: Blain & Noiseux, Broken by Default, arXiv:2604.05292v2, April 2026.

STOP GUESSING. SCAN YOUR APP.

Join the founders who shipped secure instead of shipped exposed. 14-day trial, no card.

START FREE SCAN