WHY WE BUILT GAPBENCH
Every legacy scanner is a heuristic pattern matcher. Heuristics need calibration. Calibration needs ground truth. Most scanners do not have ground truth. We built one — 104 scenarios, public, audit-yourself, with five clean controls so false positives are observable instead of asserted.
Every legacy scanner — Snyk, Semgrep, Checkmarx, SonarQube, even the recent AI-flavored ones — is at its core a heuristic pattern matcher. Regex on source. AST on source. Dataflow on source. Occasionally a probe at runtime. The output is a list of “could be vulnerable” items the engineer has to triage. Triage burden is the real cost. Heuristic scanners optimize for recall and pay the precision tax in human hours.
Two things have made that trade-off worse, not better.
What changed
AI-generated code. Lovable, Bolt, Cursor, Replit, and V0 emit patterns the legacy scanners weren’t trained on. Supabase service-role keys glued into Next.js client components. Stripe webhook handlers without signature checks because the Stripe docs got summarized into a prompt and the verification step got dropped on the floor. BOLA across CRUD routes because the model generated the controllers in pattern-matched batches. Heuristic rules tuned on 2018 Rails apps miss this.
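To make the webhook failure concrete, here is a minimal sketch in TypeScript, assuming Express and the stripe npm package — the route path and the fulfillOrder handler are illustrative, not taken from any gapbench scenario. The commented-out version is the shape codegen tends to emit; the live version keeps the step that gets dropped.

```ts
import express from "express";
import Stripe from "stripe";

const app = express();
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// The pattern codegen tends to emit — parse the JSON, trust it:
//
//   app.post("/webhook", express.json(), (req, res) => {
//     fulfillOrder(req.body);        // unverified: anyone can POST this
//     res.sendStatus(200);
//   });
//
// The step that gets dropped: verify the Stripe-Signature header
// against the *raw* body before acting on anything in the payload.
app.post("/webhook", express.raw({ type: "application/json" }), (req, res) => {
  let event: Stripe.Event;
  try {
    event = stripe.webhooks.constructEvent(
      req.body,                                  // raw bytes, not parsed JSON
      req.headers["stripe-signature"] as string,
      process.env.STRIPE_WEBHOOK_SECRET!,
    );
  } catch {
    return res.sendStatus(400);                  // forged or tampered request
  }
  // event is authenticated; now it is safe to branch on event.type
  res.sendStatus(200);
});
```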
Runtime gaps. A static scan can flag a missing CSRF token in code, but it cannot flag a staging.example.com admin panel that a Lovable user spun up last week and forgot. The vuln lives at the URL, not in the repo. Static-only scanners are blind to half the surface AI codegen produces.
We built gapbench to solve the calibration problem these gaps create. The premise is simple: if your scanner says it has a 95% true-positive rate, show me the test set. Most scanners can’t, because their test set is “real-world repos,” which by definition have unknown ground truth.
What’s actually in the benchmark
As of today the manifest reports 104 scenarios. The breakdown:
- 97 deliberately vulnerable scenarios. Each one mimics what Lovable, Bolt, Cursor, Replit, or V0 produces — Supabase plus Next.js plus Edge Functions, Vite plus Express plus Prisma, the modern AI-codegen stack. Not 2014 PHP. Not 2019 Rails. The distribution is deliberately skewed toward AI-codegen-shaped surfaces.
- 5 clean reference sites. ref0 is the general-purpose clean control. ref-oauth, ref-jwt, ref-webhook, and ref-rls are the topic-specific controls — same shape as the vulnerable scenarios, but configured correctly. If your scanner flags ref-rls, you have a false positive on a “RLS done right” surface, and your rule needs to be tighter.
- 2 calibration targets — noisy-errors and captcha-challenge — used to tune scanners against benign-looking traffic that shouldn’t crash a probe.
Every scenario is tagged to CWEs. The manifest at gapbench.vibe-eval.com/__manifest carries the tags. We can measure recall numerically — of N CWE-X surfaces planted, how many did we report? Snyk and Semgrep can’t run that calculation against their own corpus because their corpus is unlabeled.
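A sketch of that recall calculation, assuming a manifest shape of one `{ id, cwes }` record per scenario — the real schema lives at the manifest URL and may name its fields differently:

```ts
// Per-CWE recall against the gapbench manifest.
// Assumed manifest shape: [{ id: string, cwes: string[] }, ...] —
// check the manifest itself for the real field names.
type Scenario = { id: string; cwes: string[] };

async function recallByCwe(reportedIds: Set<string>) {
  const res = await fetch("https://gapbench.vibe-eval.com/__manifest");
  const scenarios: Scenario[] = await res.json();

  // Group planted scenarios by CWE tag.
  const planted = new Map<string, string[]>();
  for (const s of scenarios) {
    for (const cwe of s.cwes) {
      planted.set(cwe, [...(planted.get(cwe) ?? []), s.id]);
    }
  }

  // recall(CWE-X) = reported CWE-X surfaces / planted CWE-X surfaces
  const recall = new Map<string, number>();
  for (const [cwe, ids] of planted) {
    const hit = ids.filter((id) => reportedIds.has(id)).length;
    recall.set(cwe, hit / ids.length);
  }
  return recall;
}
```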
Why heuristic scanners can’t catch up by trying harder
The natural response to “your rules miss AI-generated code” is “we’ll add rules for AI-generated code.” That’s where the trap is. More rules without ground truth means more false positives, which means more triage burden, which means the scanner becomes shelf-ware. The recall improves. The precision craters. The customer turns it off.
The way out is the calibration loop. Add a rule. Test it against ref0 and the matching ref-* site. If it triggers on a clean control, the rule is wrong. Either tighten it or remove it. That loop is mechanically simple, and most scanner vendors don’t run it because they don’t have the controls. They could build them — nothing stops them — but the controls have to be specific to the failure modes of the rules, which means the team adding rules has to also be building benchmark scenarios in parallel. That is operational discipline, not technology.
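A sketch of that loop as a pre-ship gate. `runRule` is a stand-in for whatever executes one detection rule against one live target; the rule names and control mapping are illustrative:

```ts
// The calibration loop as a pre-ship gate. `runRule` stands in for
// whatever executes one detection rule against one live target.
type Finding = { rule: string; url: string };
declare function runRule(rule: string, target: string): Promise<Finding[]>;

// Each rule names the clean controls it must stay silent on.
// Rule names here are illustrative.
const controlsFor: Record<string, string[]> = {
  "rls-service-role-leak": ["ref0", "ref-rls"],
  "jwt-alg-none": ["ref0", "ref-jwt"],
};

async function gateRule(rule: string): Promise<void> {
  for (const control of controlsFor[rule] ?? ["ref0"]) {
    const findings = await runRule(
      rule,
      `https://gapbench.vibe-eval.com/site/${control}/`,
    );
    if (findings.length > 0) {
      // Firing on a clean control means the rule is wrong:
      // tighten it or remove it, but do not ship it.
      throw new Error(`${rule} fired on clean control ${control}`);
    }
  }
}
```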
Why this couldn’t have been done in 2018
It’s worth saying explicitly: gapbench is feasible because of where the industry is in 2026, not because anyone is smarter than they were in 2018.
The conditions that make it work:
- The AI-codegen stack converged. Five years ago, “secure this app” meant adapting to whichever framework the team had picked. Today, by our observation, ~80% of new AI-built apps fit a small set of templates (Supabase + Next.js, or Vite + Express, mostly). A benchmark that mimics that distribution is genuinely representative.
- Modern hosting makes it cheap to run 100 vulnerable apps simultaneously. Each scenario is a small Docker container. Subdomain routing on a single VPS plus a TLS cert covers the whole benchmark.
- CWE and OWASP taxonomies have stabilized. The same names mean the same things across vendors. Tagging scenarios with consistent CWEs is meaningful in a way it wasn’t when category boundaries were fuzzier.
- AI scanners are emerging as a category. A “scanner benchmark” only matters if there are scanners to benchmark. The market exists now.
None of this was true in 2018. A “public benchmark for AI-codegen apps” would have been incoherent — the apps didn’t exist, the scanners didn’t exist, the testing infrastructure was harder. It works now because the conditions stacked.
The competitor question, addressed directly
We get asked frequently: “if gapbench is public, why don’t competitors just use it to game their detection?”
Three reasons it’s not really a problem:
- The benchmark grows. Every new detection we ship adds at least one scenario. Anyone gaming the static benchmark falls behind the moving one. The discipline of continuously adding is the moat — not the URL list.
- Gaming the benchmark = solving the problem. If a competitor’s scanner catches every gapbench scenario without false positives on the controls, that scanner is a good scanner. We’d put it in front of a customer who needed it. The benchmark surfaces real engineering, regardless of who does the work.
- The auditable claim beats the marketing claim. Even if a competitor matches our gapbench numbers, the claim they earn is “X% on the public benchmark.” That’s reproducible by anyone. Marketing claims like “8x more true positives” are not. The benchmark forces the conversation onto reproducible terms.
The honest position: we publish the benchmark because it makes our customer conversations easier, not because it’s strategically clever. Customers ask “is your scanner accurate?” and we point at a URL.
What’s missing and where we’ll add
Gapbench has gaps. Honest ones:
- Mobile app scenarios. We don’t have meaningful coverage for native mobile (iOS, Android) apps. The AI-codegen tooling for mobile is less mature, but the surface is real and underserved.
- Real-time / multiplayer scenarios. WebSocket and WebRTC have their own attack surfaces; we have one WebSocket scenario, and one isn’t enough.
- Embedded / IoT-shape scenarios. Devices that talk to a cloud, with their own auth and update mechanisms, are an emerging surface.
- Long-form business-logic flows. Multi-step user flows (signup → onboarding → first paid action) where the bug only appears if you walk the flow. Hard to turn into a scenario, but important.
- Larger codebase scenarios. Most current scenarios are single-purpose. A scenario that mimics a 50K-line app with one specific bug planted somewhere is more representative of real customer code.
We’re working on each. The roadmap is public; we publish progress in /updates/.
What this looks like in practice
The pattern articles in this series each cite a vulnerable scenario and, where one exists, a matching clean control. Read “The Supabase service-role key in your frontend bundle.” The vulnerable surface is at gapbench.vibe-eval.com/site/supabase-clone/. The clean side is at gapbench.vibe-eval.com/site/ref-rls/. Run any scanner against both. The scanner that fires on the first and stays silent on the second has earned its detection. The scanner that fires on both has a false-positive case to clean up. The scanner that stays silent on both either has a recall gap or doesn’t speak HTTP, and either way the customer should know.
The same applies to JWT alg=none (anchor: jwt-alg-confusion vs ref-jwt), to OAuth redirect attacks (oauth-redirect vs ref-oauth), to webhook signature bypass (webhook-unverified vs ref-webhook). The ref-* family was built specifically for the auth and trust-boundary surfaces where heuristic scanners over-trigger. We will keep extending it.
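For evaluators who want to script the pair discipline, a sketch — `scan` is a stand-in for any scanner’s CLI or API, returning a finding count for a given URL:

```ts
// Pair discipline: test each vulnerable anchor against its matching
// clean control. `scan` stands in for any scanner's CLI or API and
// returns a finding count for the given URL.
declare function scan(url: string): Promise<number>;

const pairs: Array<[vulnerable: string, control: string]> = [
  ["supabase-clone", "ref-rls"],
  ["jwt-alg-confusion", "ref-jwt"],
  ["oauth-redirect", "ref-oauth"],
  ["webhook-unverified", "ref-webhook"],
];

const base = "https://gapbench.vibe-eval.com/site";
for (const [vuln, ref] of pairs) {
  const hits = await scan(`${base}/${vuln}/`);
  const noise = await scan(`${base}/${ref}/`);
  if (hits > 0 && noise === 0) console.log(`${vuln}: earned detection`);
  else if (noise > 0) console.log(`${vuln}: false positive on ${ref}`);
  else console.log(`${vuln}: recall gap (or the scanner doesn't speak HTTP)`);
}
```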
The marketing claim
Heuristic scanners measure their accuracy against the codebases they were trained on. We measure ours against a public benchmark you can audit, where every false positive and every missed vuln is reproducible at a URL. Show us a competitor that publishes their ref0.
That sentence is the entire business case. It’s not a feature claim — it’s a category move. The series of pattern articles this manifesto sits at the head of is the proof, one bug at a time.
What you should do with this
If you’re a builder evaluating scanners, run all of them against gapbench. Compare findings. The scanner with the best gapbench performance is not necessarily the best for your stack — gapbench is biased toward AI-codegen shapes — but it gives you a reproducible signal where vendor demos give you a sales pitch.
If you’re a journalist or researcher, the URLs and the manifest are public. The CWE tags are normalized. There’s nothing to negotiate access to.
If you’re a competitor, please run your scanner against gapbench too. We will absolutely publish the results, theirs and ours. We win or lose on the public number.
CWE / OWASP coverage
The benchmark currently covers 70+ distinct CWEs. The dominant categories: CWE-200 (information exposure), CWE-287 (improper authentication), CWE-306 (missing authentication), CWE-352 (CSRF), CWE-639 (authorization bypass via user-controlled key, i.e., BOLA/IDOR), CWE-79 (XSS), CWE-89 (SQLi), CWE-918 (SSRF), CWE-94 (code injection), and the LLM-specific entries on the OWASP LLM Top 10. The full mapping lives in the manifest.
Reproduce it yourself
- Manifest: https://gapbench.vibe-eval.com/__manifest
- Clean general control: https://gapbench.vibe-eval.com/site/ref0/
- RLS clean control: https://gapbench.vibe-eval.com/site/ref-rls/
- JWT clean control: https://gapbench.vibe-eval.com/site/ref-jwt/
- OAuth clean control: https://gapbench.vibe-eval.com/site/ref-oauth/
- Webhook clean control: https://gapbench.vibe-eval.com/site/ref-webhook/
Run VibeEval against any of these. The vulnerable scenarios should fire detections. The clean controls should not. That is the whole methodology.
RUN THE SCANNER WE CALIBRATE AGAINST
Every detection in VibeEval is benchmarked against gapbench before it ships. Run it on your app.