DATA STUDIES

Original research from the VibeEval scan corpus. Citation-grade tables, methods, and raw findings — built for builders, journalists, and researchers.

Data studies are research pieces grounded in real scan data — not opinion, not vendor marketing, not regurgitated CVE feeds. Each piece links to the underlying methodology, names the canonical CWE / OWASP categories, points at the matched scenario on gapbench.vibe-eval.com so the detection can be reproduced, and ends with a curl-runnable proof block.
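
To make that concrete, here is the shape of a proof block. The scenario slug, path, and token variable below are illustrative placeholders, not a real gapbench endpoint:

    # Hypothetical proof block for an IDOR finding (CWE-639). The scenario
    # slug, order ID, and token variable are placeholders for illustration.
    curl -s "https://gapbench.vibe-eval.com/scenarios/idor-orders/api/orders/1042" \
      -H "Authorization: Bearer $TOKEN_FOR_USER_A"
    # A vulnerable scenario returns user B's order to user A's token;
    # a fixed app returns 403 or 404.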

How to read these studies

Every study in this hub follows the same shape so you can pattern-match across them:

  1. Headline numbers — the corpus-wide aggregate.
  2. Per-cut breakdown — by platform, HTTP method, resource type, failure mode, surface, etc.
  3. CWE / OWASP mapping — every finding tagged against the canonical taxonomy.
  4. Per-mode fix patterns — code or config snippets that close the gap.
  5. Methodology + calibration — every probe is also run against ref0 and the topic-specific reference sites (ref-rls, ref-jwt, ref-oauth, ref-webhook). Probes that fire on a clean reference are killed before the count ships; a sketch of that gate follows this list.
  6. Reproduce on the public benchmark — every category maps to a scenario on gapbench.vibe-eval.com — 104 scenarios, 97 deliberately vulnerable, 7 calibration controls — with curl commands you can run today.
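
A minimal sketch of the calibration gate from item 5, assuming a probe is packaged as a script that exits 0 when it fires (reports a finding). probe.sh and the reference hostnames are assumptions, not the real harness:

    # Calibration gate: a probe that fires on any clean reference site is
    # discarded before corpus counts are published.
    # probe.sh and the *.vibe-eval.com hostnames are illustrative.
    for ref in ref0 ref-rls ref-jwt ref-oauth ref-webhook; do
      if ./probe.sh "https://${ref}.vibe-eval.com"; then
        echo "probe fires on clean reference ${ref}: killed" >&2
        exit 1
      fi
    done
    echo "probe survives calibration and may ship in the count"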

The reproducibility anchor is the gapbench scenario set, not the corpus. The corpus aggregates are the what; the gapbench scenarios are the prove it. Read the pattern walkthroughs for the anatomy of each bug shape and the why-gapbench manifesto for the calibration argument.
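
In practice the prove-it step is a pair of requests: the same probe against a deliberately vulnerable scenario and against a calibration control. Both slugs here are invented for illustration:

    # Illustrative reproduce pair; the slugs are placeholders, real studies
    # name the actual matched scenario.
    curl -s -o /dev/null -w "%{http_code}\n" \
      "https://gapbench.vibe-eval.com/scenarios/bola-profile/api/users/2"      # expect 200: detection fires
    curl -s -o /dev/null -w "%{http_code}\n" \
      "https://gapbench.vibe-eval.com/scenarios/control-profile/api/users/2"   # expect 403: control stays quiet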

Headline benchmark

Category cuts of the corpus

Controlled experiments

Field studies

Companion pieces

What you can cite

  • Per-platform vulnerability rates with sample sizes disclosed.
  • Findings normalized to the OWASP Top 10 (2021), the OWASP API Security Top 10 (2023), and the OWASP Top 10 for LLM Applications (2025).
  • CWE codes per category — not just the parent CWE but the specific child (CWE-862 vs CWE-863 vs CWE-639 are different bugs with different fixes; see the probe sketch after this list).
  • Categories, not vendor labels — RLS bypass, exposed secrets, BOLA, CORS misconfig, IDOR, broken auth, mass assignment.
  • Reproducible detections against the gapbench public benchmark — 97 deliberately vulnerable scenarios, 7 calibration controls.
  • All counts dated and versioned; superseded numbers stay reachable.
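
The child-CWE distinction in the list above is visible at the HTTP layer. A hedged sketch, with host, paths, and token variables as placeholders:

    # Three sibling CWEs, three distinct probes. Host, paths, and tokens
    # are illustrative placeholders.
    # CWE-862 (missing authorization): the request carries no credentials at all.
    curl -s "https://app.example.com/api/invoices/7"
    # CWE-863 (incorrect authorization): authenticated, but a viewer-role
    # token reaches an admin-only endpoint.
    curl -s "https://app.example.com/api/admin/users" \
      -H "Authorization: Bearer $VIEWER_TOKEN"
    # CWE-639 (authorization bypass through user-controlled key, i.e. IDOR):
    # user A's valid token reads user B's record by changing the ID in the path.
    curl -s "https://app.example.com/api/users/B-123/profile" \
      -H "Authorization: Bearer $USER_A_TOKEN"

The fixes differ accordingly: add the missing check for CWE-862, enforce the right privilege for CWE-863, and scope the lookup to the caller for CWE-639. That is why the studies tag the child, not just the parent.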

A note on scope and verifiability

The corpus-wide aggregate counts in these studies are assembled from a mix of anonymized customer engagements and longitudinal scans against our own gapbench scenarios. The reproducibility anchor — the part anyone can verify by running curl — is the gapbench scenario set; the customer-engagement portion is anonymized by design and is not independently re-runnable.

If you want to verify any category’s findings, the matched gapbench scenario is named in every study and the curl command is in the reproduce block. If a number reads “we found X across N apps”, the underlying detection that found it is exactly the detection that fires against the matched scenario. That is the citation grade we hold these studies to.

RUN YOUR OWN BENCHMARK

Point VibeEval at your stack and see how it scores against the dataset.

START FREE SCAN