WE TESTED EVERY MAJOR LLM. NONE SHIP VALIDATION THAT SURVIVES THE FIRST PASS


If your AI-generated app passes Snyk, Semgrep, and the Claude Code or Codex review skill, you have evidence that the code in your repo is reasonable. That evidence covers a small slice of what actually gets you breached. The breach lands at the integration layer: between your code and the libraries it pulls in, between your services and the network, between the model that wrote the code and the validation it forgot to add.

Code scanners answer “is this file safe in isolation.” Integration is where convenience always wins over security. That is where vibe-coded apps quietly fail.

The scanner blind spot

A scanner walks your AST. It can see a missing CSRF check, a hardcoded key, an eval on user input. It cannot see:

  • The Supabase service-role key your dashboard exposes through NEXT_PUBLIC_SUPABASE_KEY because the model didn’t know NEXT_PUBLIC_* ships to the client (sketched just after this list)
  • A wrangler.toml that binds your D1 database to a permissive role
  • A staging subdomain spun up last Tuesday that nobody put behind auth
  • An npm package that was clean two weeks ago and shipped a fresh postinstall script in v1.4.3
  • A Stripe webhook handler that your code happens not to verify because the model summarized the Stripe docs and dropped the signature step
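
To make the first bullet concrete, here is a minimal sketch with placeholder names. Next.js inlines every NEXT_PUBLIC_* variable into the client bundle at build time, so a service-role key behind that prefix ships to every visitor:

```ts
// Hypothetical client-side code. Everything here runs in the browser.
import { createClient } from "@supabase/supabase-js";

// NEXT_PUBLIC_* values are string-replaced into the shipped JS at build time.
// If this variable holds the service-role key, any visitor can extract it
// from the bundle and bypass RLS entirely.
const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.NEXT_PUBLIC_SUPABASE_KEY!
);

// The fix: keep the service-role key server-only (e.g. SUPABASE_SERVICE_ROLE_KEY,
// no prefix) and hand the client the anon key, with RLS doing the enforcement.
```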

None of these live in the file. They live between the file and the world. We see them every week.

The recommendation is not “stop using scanners.” Use them. Add the Claude Code security skill, run the Codex security review, keep Snyk in CI. Do all of that. Then accept that the result is one corner of the picture.

What we keep finding: LLM-generated validation needs 2-3 passes

There is one heuristic that has held across every model and harness we have tested over the last six months: Claude Sonnet 4.5, GPT-5, Gemini 2.5, Cursor’s tab model, Lovable’s harness, Bolt, Replit’s agent. None of them produce input or form validation that survives an honest red-team pass on the first iteration.

What you get on iteration 1: the obvious type check. email is a string. age is a number. The form has the right field names. Happy path tests go green.

What’s missing: empty string vs. null vs. undefined. Negative numbers where you assumed positive. Floats where you wanted integers. NUL bytes in a filename. Path traversal sequences in the same filename. An authorization check that confirms the user is logged in but not that the user owns the row. File uploads that pass MIME sniffing but contain a polyglot. A 10 MB body accepted by an endpoint that expects ten bytes of JSON, because the model didn’t add a body-size limit.

These all get caught on iteration 2 or 3 — but only if you ask. The default LLM workflow stops at iteration 1 because the obvious tests pass and the human moves on. The model is solving the immediate prompt (“build a signup form”) and the validation pass requires a second, explicit prompt. Without it, you don’t get it.
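
A minimal sketch of the distance between the two iterations, using zod (an assumed choice; the shape is the same in pydantic or joi):

```ts
import { z } from "zod";

// Iteration 1: what the model ships. Types only; happy path passes.
const signupV1 = z.object({
  email: z.string(),
  age: z.number(),
  filename: z.string(),
});

// Iterations 2-3: the edge cases made explicit, rejected with 4xx upstream.
const signupV3 = z.object({
  email: z.string().trim().min(1).max(254).email(),
  age: z.number().int().min(0).max(150), // no floats, no negatives
  filename: z
    .string()
    .min(1)
    .max(255)
    .refine((f) => !f.includes("\0"), "NUL byte in filename")
    .refine((f) => !f.includes(".."), "path traversal in filename"),
});

// Body-size limits live in the framework, not the schema: cap the request
// size on the route before parsing begins.
```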

This is not a model failure. It is a context failure. The window is the budget; the model spends it on the feature; the edge cases need a separate spend.

Drop-in: a three-iteration validation prompt

Save this as a Claude Code Skill (~/.claude/skills/validation-loop/SKILL.md) or paste it into Codex / Cursor as a one-shot. Run it after every feature touch.

```
---
name: validation-loop
description: Three-iteration input and form validation hardening pass for any external entry point (HTTP handler, form action, server action, API route, webhook, CLI). Use after every feature ships.
---

You are reviewing code for input and form validation completeness.

For every external entry point in the touched files, run three iterations end-to-end. Do not stop after iteration 1.

ITERATION 1 — Surface check
List every parameter (body, query, path, header, cookie, file). For each, state:
- Type
- Source of trust (user, signed, server-side)
- Whether it is validated before use
- Whether it flows into a database query, file path, shell command, eval, redirect URL, or rendered HTML

ITERATION 2 — Edge case audit
For each parameter, answer:
- What if it is missing?
- Wrong type?
- Empty / "" / 0 / null / undefined / NaN?
- Larger than the expected max (10 MB body, 1 GB file)?
- Contains: NUL byte, Unicode escape, RTL override, path traversal (..), control character, SQL meta?
- File: zip-slip path, SVG with embedded script, polyglot, decompression bomb, MIME mismatch?
- Authorization: user A submits resource ID owned by user B?

ITERATION 3 — Hardening
For every gap surfaced in iteration 2:
- Add explicit validation (zod / pydantic / joi / manual). Reject with 4xx. Never coerce silently.
- Add a test that submits the malicious input and asserts the rejection.
- For authorization gaps, add the row-level check (`auth.uid() = user_id` for Supabase RLS, or the equivalent for your stack).

Output a per-endpoint checklist: the lines that need to change, the test to add, the validation library call.
```

Drop it in and run it after every feature touch. In our own engagements it has closed roughly nine of every ten validation gaps a scanner misses on a vibe-coded app. It catches the questions the codegen never asked itself.
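
The authorization line in iteration 2 is the gap we see skipped most often. A minimal sketch of the iteration-3 fix in a route handler, where getSession and db are hypothetical stand-ins for your auth helper and ORM:

```ts
// Hypothetical route handler. The point is the ownership check, not the stack.
import { getSession } from "./auth"; // assumed auth helper
import { db } from "./db";           // assumed ORM client

export async function DELETE(
  req: Request,
  { params }: { params: { id: string } }
) {
  const session = await getSession(req);
  if (!session) return new Response("Unauthorized", { status: 401 });

  const doc = await db.document.findUnique({ where: { id: params.id } });
  if (!doc) return new Response("Not found", { status: 404 });

  // Iteration 1 stops at the session check. This line is the gap:
  if (doc.userId !== session.user.id) {
    return new Response("Forbidden", { status: 403 });
  }

  await db.document.delete({ where: { id: params.id } });
  return new Response(null, { status: 204 });
}
```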

Composable risk: the layers above the code

Validation is one layer. The rest of the integration is composable, and every layer you stack inherits the others’ defaults.

Libraries. Your package.json is a trust delegation to several hundred maintainers, some of whom you have never heard of. A package that is clean today can ship a postinstall script next week. There have been four notable npm supply-chain incidents in 2026 already; we covered the latest in the Apr 23 weekly digest. Running npm audit is necessary but insufficient. Pin versions, use a registry mirror, and fail the build when a transitive maintainer changes hands.
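
Two of those mitigations fit in a project .npmrc, sketched here (both are real npm config flags; whether they fit your build is a judgment call):

```
# .npmrc
ignore-scripts=true   # block install-time lifecycle scripts (postinstall) from all deps
save-exact=true       # pin exact versions on install instead of ^ ranges
```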

Infrastructure. The default for “publish a Next.js app on Vercel” is “every preview deployment is on the open internet.” The default for Supabase is “RLS is opt-in.” The default for an S3 bucket used to be public. Cloud defaults are the integration layer, and they are tuned for ergonomics first.
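
The Supabase default is checkable from outside. A sketch, with placeholder project URL, anon key, and table name: if RLS was never enabled, an unauthenticated client reads rows.

```ts
import { createClient } from "@supabase/supabase-js";

// Anon-key client, no session: exactly what any visitor's browser holds.
const anon = createClient(
  "https://YOUR-PROJECT.supabase.co", // placeholder
  "YOUR_ANON_KEY"                      // placeholder
);

// With RLS off (the opt-in default), this select returns real rows.
const { data, error } = await anon.from("profiles").select("*").limit(1);
console.log(error ? "blocked (RLS or policy)" : `exposed: ${data!.length} row(s)`);
```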

The CI/CD path. Your code is reviewed. Your secrets are not, until they leak. A poisoned GitHub Actions workflow triggered on pull_request_target runs with your secrets in scope; point it at fork-controlled code and it is a credential exfiltration waiting to happen. Same risk shape we covered in AI Agents in GitHub Actions: Prompt Injection, one layer up.
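
A minimal sketch of the dangerous shape, with a hypothetical secret name; the checkout ref is the part that hands execution to the fork:

```yaml
# pull_request_target runs in the base repo's context, so secrets resolve.
on: pull_request_target

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Checks out the fork's head: attacker-controlled code.
          ref: ${{ github.event.pull_request.head.sha }}
      - run: npm install && npm test
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }} # hypothetical secret, now exfiltratable
```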

Third-party OAuth. A user clicks “allow all access” on an AI productivity tool and you now have a SaaS-OAuth supply chain attached to your tenant. The Vercel / Context.ai breach is the textbook case: the code was secure, the OAuth scope was over-broad, environment variables not marked “sensitive” were read.

Every one of these is invisible to the AST.

Where gapbench fits

We run a public benchmark at gapbench.vibe-eval.com — a deliberately broken set of scenarios that mirror what AI codegen ships, with a clean reference site (ref0) for false-positive calibration. The whole point is to make these integration gaps reproducible at a URL anyone can hit.

Want to see what an exposed Supabase service-role key looks like at runtime, not in code? gapbench.vibe-eval.com/site/supabase-clone/. Want to see a Stripe webhook handler that ships without signature verification, deployed and live? gapbench.vibe-eval.com/site/webhook-unverified/. Same for naked Postgres, BOLA across CRUD, JWT alg=none, MCP servers with shell access, and forty-odd other patterns we keep finding.

Heuristic scanners can flag “missing call to verifyWebhookSignature” if they recognize the pattern. They cannot tell you that your deployed webhook handler actually accepts unsigned requests. The only way to know is to send one.
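
Sending one takes a few lines. A sketch, with a placeholder URL:

```ts
// Send an unsigned event to the deployed handler. No stripe-signature header.
const res = await fetch("https://your-app.example.com/api/stripe-webhook", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({ type: "checkout.session.completed", data: {} }),
});

// A handler that calls stripe.webhooks.constructEvent() throws on this
// and should respond 400. A 2xx means unsigned requests are trusted.
console.log(res.status, res.status < 300 ? "accepts unsigned (vulnerable)" : "rejected");
```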

That is the meta layer. Scanners give you confidence about a slice of the picture. End-to-end testing — runtime, against the deployed app, with adversarial input — is the only test that catches the integration. We do it because the slice is not enough.

Bottom line

The good news: AI codegen is not making your code less secure than a junior engineer’s first commit. The bad news: it is producing more code, faster, with the same blind spot in the same place — between the file and the world.

Three things we recommend, in order of cost and impact:

  1. Run the validation loop after every feature. Free. Closes about 90 percent of the input gaps a scanner cannot see.
  2. Treat integration as the test surface. Run an end-to-end scan against the deployed app, not just the repo. We do this; a handful of competitors do too. Pick one and run it weekly.
  3. Audit the composable layer monthly. Lockfile diff, registry maintainer changes, OAuth scope review, infrastructure default check. Boring. Catches the supply-chain-shaped breaches.

We are not arguing scanners are useless. We use them. The argument is that “code clean” is one quarter of “secure.” The other three quarters are the integration layer, and AI codegen has not made that layer better. Only bigger.
