Data Studies

Data studies are research pieces grounded in real scan data — not opinion, not vendor marketing, not regurgitated CVE feeds. Each piece links to the underlying methodology, names the canonical CWE / OWASP categories, points at the matched scenario on gapbench.vibe-eval.com so the detection can be reproduced, and ends with a curl-runnable proof block.

How to read these studies

Every study in this hub follows the same shape so you can pattern-match across them:

Headline numbers — the corpus-wide aggregate.
Per-cut breakdown — by platform, HTTP method, resource type, failure mode, surface, etc.
CWE / OWASP mapping — every finding tagged against the canonical taxonomy.
Per-mode fix patterns — code or config snippets that close the gap.
Methodology + calibration — every probe is also run against ref0 and the topic-specific reference sites (ref-rls, ref-jwt, ref-oauth, ref-webhook). Probes that fire on a clean reference are killed before the count ships.
Reproduce on the public benchmark — every category maps to a scenario on gapbench.vibe-eval.com (104 scenarios: 97 deliberately vulnerable, 7 calibration controls), with curl commands you can run today.
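The calibration step in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline the studies actually run: the reference site names come from the text, while `ProbeHit`, the probe ids, and the app names are hypothetical, and the sketch assumes every listed reference site is expected to be clean.

```python
from dataclasses import dataclass

# Reference sites named in the text. Treating all of them as clean
# controls is an assumption made for this sketch.
CLEAN_REFERENCES = {"ref0", "ref-rls", "ref-jwt", "ref-oauth", "ref-webhook"}


@dataclass(frozen=True)
class ProbeHit:
    probe_id: str  # hypothetical probe identifier
    target: str    # site the probe fired against


def calibrated_probes(hits):
    """Return ids of probes that fired somewhere but never on a clean reference."""
    fired = {h.probe_id for h in hits}
    noisy = {h.probe_id for h in hits if h.target in CLEAN_REFERENCES}
    return fired - noisy


def count_findings(hits):
    """Count hits on real targets, keeping only probes that survived calibration."""
    keep = calibrated_probes(hits)
    return sum(
        1 for h in hits
        if h.probe_id in keep and h.target not in CLEAN_REFERENCES
    )


# Hypothetical data: one probe also fires on ref0, so it is dropped entirely.
hits = [
    ProbeHit("rls-open-select", "app-1"),
    ProbeHit("rls-open-select", "app-2"),
    ProbeHit("cors-wildcard", "app-1"),
    ProbeHit("cors-wildcard", "ref0"),
]
surviving = calibrated_probes(hits)
total = count_findings(hits)
```

The design point is that a probe is discarded as a whole once it fires on any clean reference, so its hits on real apps never reach the published count.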

The reproducibility anchor is the gapbench scenario set, not the corpus. The corpus aggregates are the "what"; the gapbench scenarios are the "prove it". Read the pattern walkthroughs for the anatomy of each bug shape and the why-gapbench manifesto for the calibration argument.

Headline benchmark

How Secure Is an AI-Generated App? 2026 Benchmark of Lovable, Bolt, Cursor, Replit, and V0 — full corpus, all categories, OWASP Web + API + LLM mapped, ref0-calibrated, every category linked to its matched gapbench scenario.

Category cuts of the corpus

Supabase RLS in the Wild — 2026 Misconfiguration Atlas — five distinct failure modes, CWE per mode, fix recipe per mode, calibrated against ref-rls.
Where Vibe Coders Leak Their Keys — 2026 Frontend Secrets Report — Stripe / Supabase service-role / OpenAI / AWS / LLM-prompt leaks; CWE per leak shape; fix per shape; calibrated against ref0.
Broken Object-Level Auth in AI-Generated CRUD — distribution by HTTP method and resource type; five distinct surfaces (direct CRUD, bulk, aggregate, export, side-channel); per-stack fix patterns; calibrated against ref-rls.
The First 60 Seconds — Time-to-First-Critical — TTFC by detection technique and CWE class; why BOLA is structurally slower than RLS; calibrated against ref0.

Controlled experiments

Lovable vs Bolt vs Cursor — Same Spec, Three Apps, Three Profiles — identical prompt, three deployed apps, distinct CWE / OWASP profiles per platform; matched gapbench scenarios for each modal failure shape.
One Feature, One Regression — Longitudinal Study of Lovable Apps — 90-day regression curve, CWE per regression class, per-class fix patterns that break the curve.

Field studies

The Acquisition Audit — Buying Random Apps from Acquire.com and Flippa — 18 acquisitions, holdback-clause-grade CWE triage table, gapbench-anchored diligence flow buyers can practice for free.
Honeypot Supabase — How Long Does a Public Anon Key Survive? — 20 honeypots, 7 exposure surfaces, the universal attacker playbook, five-line detection rule, calibrated against ref-rls.

Companion pieces

Patterns — anatomy walkthroughs anchored to gapbench — every bug class in this hub has a companion pattern walkthrough with live demo URLs and per-stack fix recipes.
Why we built gapbench — the manifesto; read this for the argument behind the ref0 calibration approach.
False positives and the ref0 control — the methodology behind every detection in these studies.
Is My Lovable App Secure? Builder Checklist
Best Security Scanner for AI-Generated Apps
Solo Founder Pre-Launch Security Checklist
Free Security Self-Audit for AI-Built Apps

What you can cite

Per-platform vulnerability rates with sample size disclosed.
Findings normalized to OWASP Web Top 10 2021, OWASP API Top 10 2023, and OWASP LLM Top 10 2025.
CWE codes per category — not just the parent CWE but the specific child (CWE-862 vs CWE-863 vs CWE-639 are different bugs with different fixes).
Categories, not vendor labels — RLS bypass, exposed secrets, BOLA, CORS misconfig, IDOR, broken auth, mass assignment.
Reproducible detections against the gapbench public benchmark — 97 deliberately vulnerable scenarios, 7 calibration controls.
All counts dated and versioned; superseded numbers stay reachable.
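To make the parent-versus-child CWE point in the list above concrete, here is a small triage sketch. The CWE ids and their names are the canonical ones. The `classify` helper and its two boolean inputs are a hypothetical simplification for illustration, not the tagging logic used in the studies.

```python
from enum import Enum


class AuthzCwe(Enum):
    MISSING = "CWE-862"    # Missing Authorization
    INCORRECT = "CWE-863"  # Incorrect Authorization
    USER_KEY = "CWE-639"   # Authorization Bypass Through User-Controlled Key


def classify(has_authz_check: bool, bypass_via_client_id: bool) -> AuthzCwe:
    """Hypothetical triage: pick the most specific authorization CWE.

    has_authz_check      -- the endpoint performs some authorization check
    bypass_via_client_id -- changing a client-supplied record id defeats it
    """
    if not has_authz_check:
        return AuthzCwe.MISSING    # fix: add a check where none exists
    if bypass_via_client_id:
        return AuthzCwe.USER_KEY   # fix: bind the record to the caller
    return AuthzCwe.INCORRECT      # fix: correct the faulty check logic


no_check = classify(has_authz_check=False, bypass_via_client_id=False)
swapped_id = classify(has_authz_check=True, bypass_via_client_id=True)
wrong_logic = classify(has_authz_check=True, bypass_via_client_id=False)
```

Each branch points at a different remediation, which is why the three ids are worth keeping separate rather than rolling them up into one parent category.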

A note on scope and verifiability

The corpus-wide aggregate counts in these studies are assembled from a mix of customer engagements and longitudinal scans against our own gapbench scenarios. The customer-engagement portion is anonymized by design. The reproducibility anchor, the part anyone can verify by running curl, is the gapbench scenario set.

If you want to verify any category’s findings, the matched gapbench scenario is named in every study and the curl command is in the reproduce block. If a number reads “we found X across N apps”, the underlying detection that found it is exactly the detection that fires against the matched scenario. That is the citation grade we hold these studies to.
