DATA STUDIES
Failure-mode catalogs grounded in reproducible scenarios on the gapbench public benchmark and anonymized customer engagements. Each piece names the CWE / OWASP category, gives the fix shape, and links to the matched gapbench scenario so the detection can be reproduced with curl.
These data studies are failure-mode catalogs — not opinion, not vendor marketing, not regurgitated CVE feeds. Each piece links to the underlying methodology, names the canonical CWE / OWASP categories, points at the matched scenario on gapbench.vibe-eval.com so the detection can be reproduced, and ends with a curl-runnable proof block.
How to read these studies
Every study in this hub follows the same shape so you can pattern-match across them:
- Catalog scope — window, source, calibration controls, reproducibility anchor.
- Per-cut ranking — by platform, HTTP method, resource type, failure mode, and surface, reported as relative order rather than absolute counts or percentages.
- CWE / OWASP mapping — every finding tagged against the canonical taxonomy.
- Per-mode fix patterns — code or config snippets that close the gap.
- Methodology + calibration — every probe is also run against ref0 and the topic-specific reference sites (ref-rls, ref-jwt, ref-oauth, ref-webhook). Probes that fire on a clean reference are killed before they ship.
- Reproduce on the public benchmark — every category maps to a scenario on gapbench.vibe-eval.com (104 scenarios: 97 deliberately vulnerable, 7 calibration controls), with curl commands you can run today.
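The calibration step in the list above can be sketched as a simple filter. This is a minimal illustration, not the actual gapbench tooling: the `Probe` type, the `calibrate` function, and the toy probes are all hypothetical, and running every probe against every reference site is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Reference site names come from the text; how probes are actually run
# against them is an assumption made for this sketch.
CLEAN_REFERENCES = ["ref0", "ref-rls", "ref-jwt", "ref-oauth", "ref-webhook"]


@dataclass
class Probe:
    name: str
    # Returns True when the probe reports a finding on the given target.
    fires_on: Callable[[str], bool]


def calibrate(probes: List[Probe], references: List[str]) -> Dict[str, List[str]]:
    """Split probes into shippable and killed.

    A probe is killed if it fires on any clean reference, because a finding
    on a known-clean target is by definition a false positive.
    """
    result: Dict[str, List[str]] = {"ship": [], "kill": []}
    for probe in probes:
        false_positives = [ref for ref in references if probe.fires_on(ref)]
        result["kill" if false_positives else "ship"].append(probe.name)
    return result


# Toy probes (hypothetical): one fires only on a vulnerable target,
# the other fires on everything, including the clean references.
probes = [
    Probe("rls-anon-read", lambda target: target == "vulnerable-scenario"),
    Probe("noisy-header-check", lambda target: True),
]
outcome = calibrate(probes, CLEAN_REFERENCES)
print(outcome)
```

The point of the design is that a probe earns trust by staying silent on known-clean targets before its findings elsewhere are reported.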
The reproducibility anchor is the gapbench scenario set. Read the pattern walkthroughs for the anatomy of each bug shape and the why-gapbench manifesto for the calibration argument.
Headline catalog
- How Secure Is an AI-Generated App? 2026 Failure-Mode Catalog for Lovable, Bolt, Cursor, Replit, and V0 — top 10 recurring vulnerabilities, OWASP Web + API + LLM mapped, ref0-calibrated, every category linked to its matched gapbench scenario.
Category cuts
- Supabase RLS in the Wild — 2026 Misconfiguration Atlas — five distinct failure modes, CWE per mode, fix recipe per mode, calibrated against ref-rls.
- Where Vibe Coders Leak Their Keys — 2026 Frontend Secrets Report — Stripe / Supabase service-role / OpenAI / AWS / LLM-prompt leaks; CWE per leak shape; fix per shape; calibrated against ref0.
- Broken Object-Level Auth in AI-Generated CRUD — failure shapes by HTTP method and resource name; five distinct surfaces (direct CRUD, bulk, aggregate, export, side-channel); per-stack fix patterns; calibrated against ref-rls.
- The First 60 Seconds — Time-to-First-Critical on AI-Built Apps — detection floors by class and CWE; why BOLA is structurally slower than RLS; calibrated against ref0.
Controlled experiments
- Lovable vs Bolt vs Cursor — Same Spec, Three Apps, Three Profiles — identical prompt, three deployed apps, distinct CWE / OWASP profiles per platform; matched gapbench scenarios for each modal failure shape.
- One Feature, One Regression — How Lovable Apps Drift After Launch — regression shapes, CWE per class, per-class fix patterns that break the drift.
Field studies
- The Acquisition Audit — Buyer-Side Security Diligence for AI-Built SaaS — failure shapes at handover, holdback-clause-grade CWE triage table, gapbench-anchored diligence flow buyers can practice for free.
- Honeypot Supabase — How Long Does a Public Anon Key Survive? — seven exposure surfaces, the universal attacker playbook, five-line detection rule, calibrated against ref-rls.
Companion pieces
- Patterns — anatomy walkthroughs anchored to gapbench — every bug class in this hub has a companion pattern walkthrough with live demo URLs and per-stack fix recipes.
- Why we built gapbench — the manifesto; read this for the argument behind the ref0 calibration approach.
- False positives and the ref0 control — the methodology behind every detection in these studies.
- Is My Lovable App Secure? Builder Checklist
- Best Security Scanner for AI-Generated Apps
- Solo Founder Pre-Launch Security Checklist
- Free Security Self-Audit for AI-Built Apps
What you can cite
- Failure-mode rankings normalized to the OWASP Top 10 (2021), the OWASP API Security Top 10 (2023), and the OWASP Top 10 for LLM Applications (2025).
- CWE codes per category — not just the parent CWE but the specific child (CWE-862 vs CWE-863 vs CWE-639 are different bugs with different fixes).
- Categories, not vendor labels — RLS bypass, exposed secrets, BOLA, CORS misconfig, IDOR, broken auth, mass assignment.
- Reproducible detections against the gapbench public benchmark — 97 deliberately vulnerable scenarios, 7 calibration controls.
- All catalogs dated and versioned; superseded revisions stay reachable.
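To make the child-CWE distinction in the list above concrete, here is a minimal sketch of how the three codes can look in handler code. The data model, function names, and roles are hypothetical and chosen only for illustration; real instances vary by stack, and the boundaries between these CWEs are not always this clean.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class User:
    id: int
    role: str


# Toy store (hypothetical): invoice id -> owner user id.
INVOICE_OWNER: Dict[int, int] = {1: 10, 2: 20}


def admin_report_cwe_862(user: User) -> Dict[int, int]:
    # CWE-862 (Missing Authorization): an admin-only report with no check at all.
    return dict(INVOICE_OWNER)


def admin_report_cwe_863(user: User) -> Dict[int, int]:
    # CWE-863 (Incorrect Authorization): a check exists but is wrong.
    # It denies guests instead of allowing only admins, so members pass.
    if user.role == "guest":
        raise PermissionError("guests may not view the report")
    return dict(INVOICE_OWNER)


def read_invoice_cwe_639(user: User, invoice_id: int) -> int:
    # CWE-639 (Authorization Bypass Through User-Controlled Key): the role
    # check is fine, but the row is selected purely by a caller-supplied id.
    if user.role not in ("member", "admin"):
        raise PermissionError("login required")
    return INVOICE_OWNER[invoice_id]


def read_invoice_fixed(user: User, invoice_id: int) -> int:
    # Fix shape for CWE-639: bind the lookup to the caller's identity.
    owner = INVOICE_OWNER[invoice_id]
    if owner != user.id and user.role != "admin":
        raise PermissionError("not the owner")
    return owner


member = User(id=10, role="member")
print(read_invoice_cwe_639(member, 2))  # reads another user's invoice
```

Each bug calls for a different repair: CWE-862 needs a check added, CWE-863 needs the existing check corrected, and CWE-639 needs the lookup scoped to the caller.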
A note on scope and verifiability
These catalogs are grounded in (a) deliberately vulnerable scenarios on the public gapbench benchmark, and (b) anonymized customer engagements. We do not publish corpus-wide N or per-platform sample counts because the engagement portion is anonymized by design and not a uniform random sample. Where you see a relative ranking (“most common”, “highly recurring”) and no absolute percentage, that is deliberate.
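One way a relative ranking can be published without exposing counts is sketched below. The function name and the example counts are invented for illustration; they are not data from any study, and the actual ranking procedure is not described in this hub.

```python
from typing import Dict, List


def relative_ranking(counts: Dict[str, int]) -> List[str]:
    """Order categories by count, most frequent first, and drop the counts.

    Ties are broken alphabetically so the output is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered]


# Made-up counts purely for illustration.
example_counts = {
    "RLS bypass": 7,
    "exposed secrets": 5,
    "BOLA": 5,
    "CORS misconfig": 2,
}
ranking = relative_ranking(example_counts)
print(ranking)
```

Only the order leaves the function, so a reader learns which category recurs more often than another without learning the sample size behind it.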
The reproducibility anchor — the part anyone can verify by running curl — is the gapbench scenario set. Every failure mode in every study is reproducible against the matched scenario named in that study’s “Reproduce” block.