What's actually different about AI-generated apps from a security standpoint?

Three things. The stack is consistent (Supabase + Next.js + Vite-shaped stacks dominate), the failure modes are predictable (RLS misconfigured, service-role key in the browser, BOLA on CRUD, weak auth flow), and the bug lives at the URL more often than in the repo (a misconfigured deploy or staging environment is the entry point). Legacy scanners optimized for finding pattern matches in source code were built before any of those mattered.

Which legacy scanners are still worth running on AI-generated apps?

Snyk for SCA — known CVEs in dependencies are still real. GitHub native secret scanning for credentials in code. Semgrep for custom static rules if your team has the bandwidth to write them. Beyond that, the legacy scanners catch a smaller subset of AI-codegen bugs than vendors claim, mostly because the bugs aren't where their rules look.

What about the AI-flavored newer scanners?

DepthFirst and several similar tools market on semantic understanding and lower false-positive rates. The engineering behind the pitch is real, but the accuracy claims are unverifiable from outside because they're measured against unlabeled real-world code. Run them against gapbench (a public benchmark with labeled ground truth) and the picture becomes reproducible.

DAST — running probes against the deployed app — is necessary but not sufficient. AI-generated apps have bugs that need DAST (RLS misconfig, BOLA, exposed admin URLs) and bugs that need SAST (mass assignment, prototype pollution, deserialization). The right answer is both, with rules that understand the AI-codegen-shaped stack.
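To make the runtime side concrete, here is a minimal sketch of a BOLA check: sign in as two users, have the second request objects owned by the first, and treat a successful read as a finding. The `fetch(token, path)` callable, the tokens, and the paths are all hypothetical stand-ins, and this is not a description of how any particular scanner implements the check.

```python
from typing import Callable, Iterable

# Hypothetical transport: takes a bearer token and a path, returns an HTTP status code.
Fetch = Callable[[str, str], int]


def find_bola(fetch: Fetch, owner_token: str, other_token: str,
              owner_paths: Iterable[str]) -> list[str]:
    """Return the paths that a non-owner can read.

    A path counts only if the owner gets 200 (the object exists) and the
    other user also gets 200 (access control did not stop them).
    """
    exposed = []
    for path in owner_paths:
        if fetch(owner_token, path) != 200:
            continue  # object missing or owner blocked: nothing to compare against
        if fetch(other_token, path) == 200:
            exposed.append(path)
    return exposed


def fake_fetch(token: str, path: str) -> int:
    """Stand-in for a real HTTP client: /notes/ enforces ownership, /invoices/ does not."""
    if path.startswith("/invoices/"):
        return 200
    return 200 if token == "alice" else 403


print(find_bola(fake_fetch, "alice", "bob", ["/notes/1", "/invoices/7"]))
# prints ['/invoices/7']
```

In a real run, `fake_fetch` would be replaced by a function that issues authenticated requests against the deployed app.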

What's the scanner-evaluation methodology you recommend?

Run every scanner you're evaluating against gapbench.vibe-eval.com — the public benchmark we operate. 104 scenarios, 5 clean reference sites, full CWE/OWASP tagging. Compare per-CWE recall and false-positive surface (the rate of triggering on the clean controls). The numbers reproduce. Vendor decks don't.

Where does VibeEval fit?

AI-codegen-aware DAST + targeted SAST, calibrated against gapbench. Built for indie/team scale ($19/mo self-serve, no demo gate), with the calibration discipline as the headline operational claim. Best fit if your stack matches the AI-codegen shape and you want auditable accuracy. Less of a fit if you need enterprise SSO, SCIM, multi-tenant compliance frameworks today.

How do I actually run gapbench against another scanner?

Most scanners take a URL or a repo. Point them at gapbench.vibe-eval.com/site/ / for each scenario you want to test. Compare findings to the manifest at gapbench.vibe-eval.com/__manifest, which lists every scenario's intended CWE. Note which scenarios trigger findings and which don't. Run the same scanner against /site/ref0/ (and the topic-specific ref-* sites) and note any findings — those are false positives.
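The comparison described above can be scripted once the results are written down. The sketch below assumes you have already transcribed the manifest into a plain mapping of scenario name to intended CWE; the actual manifest format is not shown here, and the scenario names, labels, and function name are made up for illustration.

```python
from collections import defaultdict


def score_scanner(manifest: dict[str, str], clean_controls: set[str],
                  fired_on: set[str]) -> dict:
    """Compare one scanner's findings to labeled ground truth.

    manifest:       vulnerable scenario name -> intended CWE (assumed shape)
    clean_controls: names of the clean reference sites
    fired_on:       every site name the scanner reported a finding for
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for scenario, cwe in manifest.items():
        totals[cwe] += 1
        if scenario in fired_on:
            hits[cwe] += 1
    recall = {cwe: hits[cwe] / totals[cwe] for cwe in totals}
    false_positives = sorted(clean_controls & fired_on)
    return {
        "recall_per_cwe": recall,
        "false_positives": false_positives,
        "fp_rate": len(false_positives) / len(clean_controls) if clean_controls else 0.0,
    }


# Illustrative data only: these scenario names and labels are made up.
manifest = {"scenario-a": "CWE-639", "scenario-b": "CWE-639", "scenario-c": "CWE-798"}
controls = {"ref0", "ref-example"}
report = score_scanner(manifest, controls, fired_on={"scenario-a", "scenario-c", "ref0"})
print(report)
```

Any finding on a name in `clean_controls` is counted as a false positive, matching the rule stated above for the reference sites.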

VibeEval vs Competitors: The 2026 AI-Codegen Security Scanner Landscape

Most security scanners are built for code from 2018. The AI-codegen apps shipping in 2026 are a different shape, fail in different ways, and need different detection. This is the survey of who’s in the market right now, organized by segment, with the public-benchmark methodology we recommend for picking between them.

What changed

Two structural shifts since the last comparable survey:

The stack converged. Lovable, Bolt, Cursor, Replit, and V0 produce apps that look more like each other than like hand-written code. Supabase + Next.js + Edge Functions is overwhelmingly dominant. The failure modes are correspondingly consistent: missing RLS, service-role keys in the browser, BOLA on generated CRUD, weak auth flows, exposed admin paths. A scanner that knows this distribution catches more bugs per scan than a scanner tuned to the long tail of public GitHub.

Runtime gaps grew. A static scan against a repo never sees the staging environment a developer spun up last week, the Mongo instance bound to 0.0.0.0, the MCP server mounted on a public port without auth. AI-codegen workflows make these mistakes faster and more often than hand-written workflows did. The bug lives at the URL.

The market hasn’t fully adjusted. Most legacy scanners are still pure-source SAST. Most newer “AI-aware” scanners pitch on semantic understanding without publishing reproducible accuracy numbers. The space between is where VibeEval is positioned.

Segments

We sort scanners into four segments by what they actually do.

Segment 1 — Legacy enterprise SAST/SCA

Snyk, Checkmarx, Veracode, Fortify, SonarQube, Contrast, Qualys, Rapid7. Built for enterprise, sold via sales cycle, mature tooling, deep integration. The pitch was right for 2014–2020.

For AI-generated apps:

- SCA stays useful. Known-CVE detection in npm dependencies is still real. Snyk specifically is hard to beat at this.
- SAST coverage drops. Rules tuned on 2018 Rails / Java codebases miss the AI-codegen shapes. Supabase RLS misconfig is invisible to a code-only scanner because the misconfig is in the dashboard, not the repo.
- DAST is bolted-on. Most have it; few do it well; none do it AI-codegen-aware.
- Pricing. Per-developer enterprise contracts. $20K–100K/year is normal.

Comparisons we maintain: Snyk, Checkmarx, Veracode, Fortify, SonarQube, Contrast, Qualys, Rapid7, Semgrep, GitLab Security.

Segment 2 — Modern AI-flavored scanners

DepthFirst and similar. Pitch on semantic understanding, lower false-positive rates, AI-native rules. Real engineering, real product, but accuracy claims are measured against unlabeled real-world code — which makes “lower false-positive rate” unauditable from outside.

For AI-generated apps:

- Detection quality is meaningfully better than legacy SAST for the bug classes they’re tuned for.
- Coverage of AI-codegen-specific bugs varies. None of them publish a per-CWE recall map against a labeled benchmark.
- Pricing is opaque. Demo-gated, sales-led onboarding, enterprise contracts.
- The differentiator they push (precision) is the part that’s hardest to verify.

Comparison: DepthFirst.

Segment 3 — DAST and live-app testing

OWASP ZAP, Burp Suite, Acunetix, Detectify, Rainforest QA. Black-box testing against running apps. They’ll find the staging environment, the open Postgres port, the SSRF on the image proxy. They won’t find the mass-assignment bug or the prototype-pollution sink because those need either source review or specific runtime probes the tools don’t ship by default.
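For readers who have not met it, mass assignment is the case where a handler copies every client-supplied field onto a record. A minimal sketch, written in Python for brevity and using a hypothetical schema and function names, of the vulnerable shape next to an allowlisted one:

```python
# Fields a user may change about themselves (hypothetical schema).
EDITABLE_FIELDS = {"display_name", "avatar_url"}


def update_profile_unsafe(record: dict, payload: dict) -> dict:
    """Vulnerable shape: every key in the request body lands on the record."""
    return {**record, **payload}


def update_profile_safe(record: dict, payload: dict) -> dict:
    """Allowlisted shape: keys outside EDITABLE_FIELDS are dropped."""
    allowed = {k: v for k, v in payload.items() if k in EDITABLE_FIELDS}
    return {**record, **allowed}


user = {"id": 1, "display_name": "sam", "is_admin": False}
attack = {"display_name": "sam2", "is_admin": True}

print(update_profile_unsafe(user, attack)["is_admin"])  # True: privilege escalation
print(update_profile_safe(user, attack)["is_admin"])    # False: extra field ignored
```

The difference is visible in the source but not in a generic crawl, since a black-box tool would have to guess that `is_admin` is a field worth sending.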

For AI-generated apps:

- DAST is necessary: many of the highest-impact bugs (RLS misconfig, exposed admin paths, naked databases) only show up at runtime.
- AI-codegen-specific runtime probes are mostly absent. Generic DAST checks for OWASP-Top-10-shaped bugs but doesn’t probe specifically for “Supabase service-role key in the bundle” or “Stripe webhook unverified.”
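As an illustration of what a stack-specific probe could look like: Supabase's JWT-based API keys carry a `role` claim in the payload, so a probe can pull JWT-shaped strings out of a downloaded JavaScript bundle and flag any whose role is `service_role`. The sketch below is one assumed way to do that, not VibeEval's actual rule. It does not verify signatures and would miss key formats that are not JWTs.

```python
import base64
import json
import re

# Three base64url segments separated by dots; JWT headers start with "eyJ" ('{"' encoded).
JWT_RE = re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def find_service_role_keys(bundle_text: str) -> list[str]:
    """Return JWT-shaped strings in the bundle whose payload has role == 'service_role'."""
    found = []
    for token in JWT_RE.findall(bundle_text):
        try:
            payload = json.loads(b64url_decode(token.split(".")[1]))
        except (ValueError, UnicodeDecodeError):
            continue  # looked like a JWT but did not decode to JSON
        if isinstance(payload, dict) and payload.get("role") == "service_role":
            found.append(token)
    return found


def make_fake_key(role: str) -> str:
    """Build an unsigned, JWT-shaped test string (not a real key)."""
    def enc(obj: dict) -> str:
        return base64.urlsafe_b64encode(json.dumps(obj).encode()).decode().rstrip("=")
    return ".".join([enc({"alg": "HS256", "typ": "JWT"}), enc({"role": role}), "sig"])


bundle = 'const a="' + make_fake_key("anon") + '";const b="' + make_fake_key("service_role") + '";'
print(len(find_service_role_keys(bundle)))  # 1: only the service_role key is flagged
```

An `anon` key in the bundle is expected and is deliberately not reported. Only the `service_role` match counts as a finding.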

Comparisons: OWASP ZAP, Burp Suite, Acunetix, Detectify, Rainforest QA, Nessus, GuardRails.

Segment 4 — Vibe-coding-specific scanners

VibeEval, plus a handful of newer entrants — SecureVibing, VibeAppScanner, VibeShip, SupaScan, SupaExplorer, SecureScanDev. All pitch on the same insight (AI-generated apps need AI-codegen-aware rules). Differentiate on pricing, calibration discipline, and which AI-codegen platforms they target.

VibeEval’s specific position: AI-codegen-aware DAST + SAST, calibrated against the public gapbench.vibe-eval.com benchmark (104 scenarios, 5 clean reference sites, all CWE-tagged). Every detection rule is benchmarked against the controls before shipping. We publish the methodology at /patterns/false-positives-and-the-ref0-control/ and the manifesto at /patterns/why-gapbench/.

Comparisons: SecureVibing, VibeAppScanner, VibeShip, SupaScan, SupaExplorer, SecureScanDev, Sqreen, Rocksmith, CyberChief.

How to actually pick

The vendor-deck-comparison approach is well-known and produces unhelpful answers. The empirical approach is short:

Pick five gapbench scenarios that resemble your stack. The why-gapbench article has the full inventory. Pick scenarios in your dominant categories — Supabase, Next.js, Stripe, OAuth, whatever applies.
Run every scanner you’re evaluating against those five scenarios, plus ref0 and the matching ref-* clean controls.
Score on three axes:
- Recall: did the scanner fire on the vulnerable scenario?
- False-positive surface: did it fire on the matching clean control?
- Time to result: how long from URL to finding?

The scanner with high recall, low false-positive surface, and fast time-to-result is the one to pick. It might be ours. It might not. The methodology produces an answer either way.
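The three axes can be folded into a comparison with a few lines of code. This is a minimal sketch that assumes you have already recorded each scanner's results by hand. The scanner names, the numbers, and the tie-breaking order (recall first, then false positives, then speed) are illustrative choices, not part of the gapbench methodology.

```python
from dataclasses import dataclass


@dataclass
class Result:
    scanner: str
    vulnerable_hit: int      # vulnerable scenarios the scanner fired on
    vulnerable_total: int    # vulnerable scenarios tested
    clean_hit: int           # clean controls the scanner fired on
    clean_total: int         # clean controls tested
    seconds_to_result: float

    @property
    def recall(self) -> float:
        return self.vulnerable_hit / self.vulnerable_total

    @property
    def fp_surface(self) -> float:
        return self.clean_hit / self.clean_total


def rank(results: list[Result]) -> list[Result]:
    """Sort best-first: higher recall, then lower FP surface, then faster."""
    return sorted(results, key=lambda r: (-r.recall, r.fp_surface, r.seconds_to_result))


# Made-up numbers for three placeholder scanners.
results = [
    Result("scanner-x", 3, 5, 0, 2, 120.0),
    Result("scanner-y", 4, 5, 1, 2, 900.0),
    Result("scanner-z", 4, 5, 0, 2, 300.0),
]
for r in rank(results):
    print(r.scanner, r.recall, r.fp_surface, r.seconds_to_result)
```

A strict ordering like this is only one option. A team that cannot tolerate noisy findings might sort on false-positive surface first.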

What’s coming

Three threads to watch in the rest of 2026:

- More AI-codegen-specific scanners. The category is hot; expect three to five new entrants. Most will pitch on AI-awareness without publishing benchmarks. Ask each one for their per-CWE recall map.
- MCP and agent-tool security tooling. The MCP attack surface (open MCP servers, tool-spec injection) is unaddressed by every scanner in this survey except VibeEval and arguably DepthFirst. Expect new entrants and dedicated products.
- Public benchmark adoption. We’re betting other vendors will eventually publish their own controls and recall maps. The first competitor to do so changes the conversation. Until then, gapbench is the only auditable benchmark for AI-codegen-shaped surfaces.

Why we built gapbench — the manifesto
Patterns hub — anatomy walkthroughs of the bugs we keep finding
Best Security Scanner for AI-Generated Apps
Free security self-audit
