DATA STUDIES
Original research from the VibeEval scan corpus. Citation-grade tables, methods, and raw findings — built for builders, journalists, and researchers.
Data studies are research pieces grounded in real scan data — not opinion, not vendor marketing, not regurgitated CVE feeds. Each piece links to the underlying methodology, names the canonical CWE / OWASP categories, points at the matched scenario on gapbench.vibe-eval.com so the detection can be reproduced, and ends with a curl-runnable proof block.
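To show the shape of that proof, here is a minimal TypeScript sketch of the kind of check the curl proof blocks perform. The scenario URL and resource ID below are placeholders, not real gapbench endpoints; the real URLs live in each study's reproduce block.

```typescript
// Placeholder URL - substitute the matched gapbench scenario named in the study.
const SCENARIO_URL = "https://example-scenario.invalid/api/orders/42";

async function proveUnauthenticatedRead(): Promise<void> {
  // Deliberately send no credentials: on a vulnerable scenario this returns 200 plus data,
  // on a correctly locked-down app it should come back 401, 403, or 404.
  const res = await fetch(SCENARIO_URL);
  const body = await res.text();
  if (res.status === 200 && body.length > 0) {
    console.log("gap reproduced: unauthenticated read returned data");
  } else {
    console.log(`blocked with status ${res.status}`);
  }
}

proveUnauthenticatedRead().catch(console.error);
```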
How to read these studies
Every study in this hub follows the same shape so you can pattern-match across them:
- Headline numbers — the corpus-wide aggregate.
- Per-cut breakdown — by platform, HTTP method, resource type, failure mode, surface, etc.
- CWE / OWASP mapping — every finding tagged against the canonical taxonomy.
- Per-mode fix patterns — code or config snippets that close the gap.
- Methodology + calibration — every probe is also run against ref0 and the topic-specific reference sites (ref-rls, ref-jwt, ref-oauth, ref-webhook). Probes that fire on a clean reference are killed before the count ships (this filter is sketched just below).
- Reproduce on the public benchmark — every category maps to a scenario on gapbench.vibe-eval.com — 104 scenarios, 97 deliberately vulnerable, 7 calibration controls — with curl commands you can run today.
The reproducibility anchor is the gapbench scenario set, not the corpus. The corpus aggregates are the what; the gapbench scenarios are the prove it. Read the pattern walkthroughs for the anatomy of each bug shape and the why-gapbench manifesto for the calibration argument.
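The calibration rule above is simple enough to sketch. A hypothetical probe runner, assuming a `Probe` interface of our own invention rather than anything in the VibeEval pipeline, would look roughly like this:

```typescript
// Hypothetical probe shape, for illustration only; not the VibeEval internals.
interface Probe {
  id: string;
  run: (baseUrl: string) => Promise<boolean>; // true = the probe fired a finding
}

// Keep only probes that stay silent on every clean reference site (ref0, ref-rls, ...).
// Any probe that fires on a clean reference is killed before the count ships.
async function calibrate(probes: Probe[], referenceUrls: string[]): Promise<Probe[]> {
  const kept: Probe[] = [];
  for (const probe of probes) {
    const results = await Promise.all(referenceUrls.map((url) => probe.run(url)));
    if (!results.some(Boolean)) kept.push(probe);
  }
  return kept;
}
```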
Headline benchmark
- How Secure Is an AI-Generated App? 2026 Benchmark of Lovable, Bolt, Cursor, Replit, and V0 — full corpus, all categories, OWASP Web + API + LLM mapped, ref0-calibrated, every category linked to its matched gapbench scenario.
Category cuts of the corpus
- Supabase RLS in the Wild — 2026 Misconfiguration Atlas — five distinct failure modes, CWE per mode, fix recipe per mode, calibrated against ref-rls.
- Where Vibe Coders Leak Their Keys — 2026 Frontend Secrets Report — Stripe / Supabase service-role / OpenAI / AWS / LLM-prompt leaks; CWE per leak shape; fix per shape; calibrated against ref0.
- Broken Object-Level Auth in AI-Generated CRUD — distribution by HTTP method and resource type; five distinct surfaces (direct CRUD, bulk, aggregate, export, side-channel); per-stack fix patterns (a minimal ownership-check sketch follows this list); calibrated against ref-rls.
- The First 60 Seconds — Time-to-First-Critical — TTFC by detection technique and CWE class; why BOLA is structurally slower than RLS; calibrated against ref0.
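The per-stack fix patterns live in the studies themselves, but the core ownership check that closes the most common BOLA shape is compact enough to sketch here. This is an Express-style example with a hypothetical in-memory store and a hypothetical `req.user` set by upstream auth middleware, not a drop-in fix for any particular stack:

```typescript
import express from "express";

const app = express();

// Hypothetical in-memory store standing in for your real ORM or Supabase query.
const invoices = new Map([["42", { id: "42", ownerId: "user-a", total: 120 }]]);

app.get("/api/invoices/:id", (req, res) => {
  const invoice = invoices.get(req.params.id);
  // Assumed to be set by auth middleware; adapt to however your stack exposes the caller.
  const callerId = (req as unknown as { user?: { id: string } }).user?.id;
  // The BOLA gap is returning the row as soon as it exists.
  // The fix is comparing the row's owner to the authenticated caller, not trusting the URL.
  if (!invoice || invoice.ownerId !== callerId) {
    return res.status(404).json({ error: "not found" }); // 404 over 403 avoids an existence oracle
  }
  res.json(invoice);
});

app.listen(3000);
```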
Controlled experiments
- Lovable vs Bolt vs Cursor — Same Spec, Three Apps, Three Profiles — identical prompt, three deployed apps, distinct CWE / OWASP profiles per platform; matched gapbench scenarios for each modal failure shape.
- One Feature, One Regression — Longitudinal Study of Lovable Apps — 90-day regression curve, CWE per regression class, per-class fix patterns that break the curve.
Field studies
- The Acquisition Audit — Buying Random Apps from Acquire.com and Flippa — 18 acquisitions, holdback-clause-grade CWE triage table, gapbench-anchored diligence flow buyers can practice for free.
- Honeypot Supabase — How Long Does a Public Anon Key Survive? — 20 honeypots, 7 exposure surfaces, the universal attacker playbook, five-line detection rule, calibrated against ref-rls (the anon-key probe is sketched below).
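For a taste of what the honeypots watch for, the core anon-key probe looks roughly like the sketch below. It uses the Supabase JS client with placeholder project URL, key, and table name; it is not the five-line detection rule from the study, which is defined in the study itself:

```typescript
import { createClient } from "@supabase/supabase-js";

// Placeholders: a leaked public anon key only matters when RLS is off or mis-scoped.
const supabase = createClient("https://your-project.supabase.co", "public-anon-key");

async function probeTable(table: string): Promise<void> {
  // A clean project answers the anon role with zero rows or an RLS error;
  // a honeypot-style misconfiguration hands back real data.
  const { data, error } = await supabase.from(table).select("*").limit(1);
  if (error) {
    console.log(`${table}: blocked (${error.message})`);
  } else if (data && data.length > 0) {
    console.log(`${table}: EXPOSED - anon role can read rows`);
  } else {
    console.log(`${table}: empty or no access`);
  }
}

probeTable("profiles").catch(console.error);
```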
Companion pieces
- Patterns — anatomy walkthroughs anchored to gapbench — every bug class in this hub has a companion pattern walkthrough with live demo URLs and per-stack fix recipes.
- Why we built gapbench — the manifesto; read this for the argument behind the ref0 calibration approach.
- False positives and the ref0 control — the methodology behind every detection in these studies.
- Is My Lovable App Secure? Builder Checklist
- Best Security Scanner for AI-Generated Apps
- Solo Founder Pre-Launch Security Checklist
- Free Security Self-Audit for AI-Built Apps
What you can cite
- Per-platform vulnerability rates with sample size disclosed.
- Findings normalized to OWASP Web Top 10 2021, OWASP API Top 10 2023, and OWASP LLM Top 10 2025.
- CWE codes per category — not just the parent CWE but the specific child (CWE-862 vs CWE-863 vs CWE-639 are different bugs with different fixes).
- Categories, not vendor labels — RLS bypass, exposed secrets, BOLA, CORS misconfig, IDOR, broken auth, mass assignment.
- Reproducible detections against the gapbench public benchmark — 97 deliberately vulnerable scenarios, 7 calibration controls.
- All counts dated and versioned; superseded numbers stay reachable.
A note on scope and verifiability
The corpus-wide aggregate counts in these studies are assembled from a mix of customer engagements (anonymized) and longitudinal scans against our own gapbench scenarios. The reproducibility anchor — the part anyone can verify by running curl — is the gapbench scenario set. The customer-engagement portion is anonymized by design.
If you want to verify any category’s findings, the matched gapbench scenario is named in every study and the curl command is in the reproduce block. If a number reads “we found X across N apps”, the underlying detection that found it is exactly the detection that fires against the matched scenario. That is the citation grade we hold these studies to.
RUN YOUR OWN BENCHMARK
Point VibeEval at your stack and see how it scores against the dataset.