DATA STUDIES

These data studies are failure-mode catalogs grounded in reproducible scenarios on the gapbench public benchmark and in anonymized customer engagements: not opinion, not vendor marketing, not regurgitated CVE feeds. Each piece links to the underlying methodology, names the canonical CWE / OWASP categories, gives the fix shape, points at the matched scenario on gapbench.vibe-eval.com so the detection can be reproduced, and ends with a curl-runnable proof block.

How to read these studies

Every study in this hub follows the same shape so you can pattern-match across them:

  1. Catalog scope — window, source, calibration controls, reproducibility anchor.
  2. Per-cut ranking — by platform, HTTP method, resource type, failure mode, surface — given as relative frequency, not absolute counts.
  3. CWE / OWASP mapping — every finding tagged against the canonical taxonomy.
  4. Per-mode fix patterns — code or config snippets that close the gap.
  5. Methodology + calibration — every probe is also run against ref0 and the topic-specific reference sites (ref-rls, ref-jwt, ref-oauth, ref-webhook). Probes that fire on a clean reference are killed before they ship.
  6. Reproduce on the public benchmark — every category maps to a scenario on gapbench.vibe-eval.com — 104 scenarios, 97 deliberately vulnerable, 7 calibration controls — with curl commands you can run today.
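
The calibration rule in step 5 can be sketched as a simple filter. This is a minimal illustration, not the actual pipeline: the reference site names come from the list above, while the `ProbeResult` type, the probe identifiers, and the scenario names are hypothetical placeholders.

```python
from dataclasses import dataclass

# Reference site names are taken from the text; everything else is hypothetical.
REFERENCE_SITES = {"ref0", "ref-rls", "ref-jwt", "ref-oauth", "ref-webhook"}


@dataclass(frozen=True)
class ProbeResult:
    probe: str    # hypothetical probe identifier
    target: str   # scenario or reference site the probe ran against
    fired: bool   # True if the probe reported a finding


def surviving_probes(results):
    """Return the probes that never fired on a clean reference site."""
    all_probes = {r.probe for r in results}
    # A single hit on a clean reference is enough to kill the probe.
    killed = {r.probe for r in results if r.fired and r.target in REFERENCE_SITES}
    return sorted(all_probes - killed)


# Placeholder results, invented for illustration only.
results = [
    ProbeResult("rls-anon-read", "scenario-a", True),
    ProbeResult("rls-anon-read", "ref-rls", False),
    ProbeResult("jwt-alg-none", "scenario-b", True),
    ProbeResult("jwt-alg-none", "ref-jwt", True),  # fires on a clean reference
]

print(surviving_probes(results))  # ['rls-anon-read']
```

The second probe is dropped even though it also fires on a vulnerable scenario: a detection that cannot tell a vulnerable target from a clean one carries no signal.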

The reproducibility anchor is the gapbench scenario set. Read the pattern walkthroughs for the anatomy of each bug shape and the why-gapbench manifesto for the calibration argument.

What you can cite

  • Failure-mode rankings normalized to OWASP Web Top 10 2021, OWASP API Top 10 2023, and OWASP LLM Top 10 2025.
  • CWE codes per category — not just the parent CWE but the specific child (CWE-862 vs CWE-863 vs CWE-639 are different bugs with different fixes).
  • Categories, not vendor labels — RLS bypass, exposed secrets, BOLA, CORS misconfig, IDOR, broken auth, mass assignment.
  • Reproducible detections against the gapbench public benchmark — 97 deliberately vulnerable scenarios, 7 calibration controls.
  • All catalogs dated and versioned; superseded revisions stay reachable.
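
To make the CWE-862 / CWE-863 / CWE-639 distinction concrete, here is a deliberately simplified sketch of one common reading of each bug next to a fixed handler. The `User` type, the `INVOICES` store, and every function name are hypothetical; real findings are rarely this clean.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: int
    role: str


# Hypothetical in-memory store, invented for illustration.
INVOICES = {
    101: {"owner_id": 1, "total": 40},
    102: {"owner_id": 2, "total": 75},
}


def read_cwe_862(user, invoice_id):
    # CWE-862 (Missing Authorization): no check at all, so even an
    # unauthenticated caller (user is None) gets the record.
    return INVOICES.get(invoice_id)


def read_cwe_863(user, invoice_id, claimed_role):
    # CWE-863 (Incorrect Authorization): a check exists, but it trusts a
    # role supplied by the client instead of the server-side user.role.
    if claimed_role != "admin":
        raise PermissionError("admin only")
    return INVOICES.get(invoice_id)


def read_cwe_639(user, invoice_id):
    # CWE-639 (Authorization Bypass Through User-Controlled Key): the caller
    # is authenticated, but the id they send selects any record, with no
    # ownership check.
    if user is None:
        raise PermissionError("authentication required")
    return INVOICES.get(invoice_id)


def read_fixed(user, invoice_id):
    # Fix shape: authenticate, then scope the lookup to the caller.
    if user is None:
        raise PermissionError("authentication required")
    invoice = INVOICES.get(invoice_id)
    if invoice is None or invoice["owner_id"] != user.id:
        raise PermissionError("not the owner")
    return invoice


alice = User(id=1, role="member")
print(read_cwe_639(alice, 102))  # leaks the invoice owned by user 2
print(read_fixed(alice, 101))    # alice's own invoice
```

The fixes differ accordingly: CWE-862 needs a check added, CWE-863 needs the existing check pointed at a trustworthy source, and CWE-639 needs the lookup bound to the caller's identity.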

A note on scope and verifiability

These catalogs are grounded in (a) deliberately vulnerable scenarios on the public gapbench benchmark, and (b) anonymized customer engagements. We do not publish a corpus-wide sample size (N) or per-platform sample counts because the engagement portion is anonymized by design and is not a uniform random sample, so absolute percentages drawn from it would not generalize. Where you see a relative ranking (“most common”, “highly recurring”) and no absolute percentage, that is deliberate.
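
As a rough sketch of what publishing a relative ranking means in practice, the function below orders categories by frequency and then discards the counts. The category names are borrowed from the list above; the numbers are arbitrary placeholders and do not come from any study.

```python
def relative_ranking(counts):
    """Order categories from most to least frequent and drop the counts."""
    # Sort by descending count; break ties alphabetically so output is stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ordered]


# Arbitrary placeholder numbers, invented purely to exercise the function.
placeholder_counts = {
    "exposed secrets": 5,
    "RLS bypass": 9,
    "CORS misconfig": 2,
    "BOLA": 5,
}

print(relative_ranking(placeholder_counts))
# ['RLS bypass', 'BOLA', 'exposed secrets', 'CORS misconfig']
```

Only the order leaves the function, which is the point: the ranking can be shared without exposing sample sizes from anonymized engagements.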

The reproducibility anchor — the part anyone can verify by running curl — is the gapbench scenario set. Every failure mode in every study is reproducible against the matched scenario named in that study’s “Reproduce” block.
