WHY WE BUILT GAPBENCH
Every legacy scanner is a heuristic pattern matcher. Heuristics need calibration. Calibration needs ground truth. Most scanners do not have ground truth. We built one — 104 scenarios, public, audit-yourself, with five clean controls so false positives are observable instead of asserted.
Every legacy scanner — Snyk, Semgrep, Checkmarx, SonarQube, even the recent AI-flavored ones — is at its core a heuristic pattern matcher. Regex on source. AST on source. Dataflow on source. Occasionally a probe at runtime. The output is a list of “could be vulnerable” items the engineer has to triage. Triage burden is the real cost. Heuristic scanners optimize for recall and pay the precision tax in human hours.
Two things have made that trade-off worse, not better.
What changed
AI-generated code. Lovable, Bolt, Cursor, Replit, and V0 emit patterns the legacy scanners weren’t trained on. Supabase service-role keys glued into Next.js client components. Stripe webhook handlers without signature checks because the Stripe docs got summarized into a prompt and the verification step got dropped on the floor. BOLA across CRUD routes because the model generated the controllers in pattern-matched batches. Heuristic rules tuned on 2018 Rails apps miss this.
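To make the webhook failure concrete, here is a minimal sketch in TypeScript, assuming Express and the stripe npm package — the route path and the fulfillOrder handler are illustrative, not taken from any gapbench scenario. The commented-out version is the shape codegen tends to emit; the live version keeps the step that gets dropped.

```ts
import express from "express";
import Stripe from "stripe";

const app = express();
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// The pattern codegen tends to emit — parse the JSON, trust it:
//
//   app.post("/webhook", express.json(), (req, res) => {
//     fulfillOrder(req.body);        // unverified: anyone can POST this
//     res.sendStatus(200);
//   });
//
// The step that gets dropped: verify the Stripe-Signature header
// against the *raw* body before acting on anything in the payload.
app.post("/webhook", express.raw({ type: "application/json" }), (req, res) => {
  let event: Stripe.Event;
  try {
    event = stripe.webhooks.constructEvent(
      req.body,                                  // raw bytes, not parsed JSON
      req.headers["stripe-signature"] as string,
      process.env.STRIPE_WEBHOOK_SECRET!,
    );
  } catch {
    return res.sendStatus(400);                  // forged or tampered request
  }
  // event is authenticated; now it is safe to branch on event.type
  res.sendStatus(200);
});
```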
Runtime gaps. A static scan can flag a missing CSRF token in code, but it cannot flag a staging.example.com admin panel that a Lovable user spun up last week and forgot. The vuln lives at the URL, not in the repo. Static-only scanners are blind to half the surface AI codegen produces.
We built gapbench to solve the calibration problem these gaps create. The premise is simple: if your scanner says it has a 95% true-positive rate, show me the test set. Most scanners can’t, because their test set is “real-world repos,” which by definition have unknown ground truth.
What’s actually in the benchmark
As of today the manifest reports 104 scenarios. The breakdown:
- 97 deliberately vulnerable scenarios. Each one mimics what Lovable, Bolt, Cursor, Replit, or V0 produces — Supabase plus Next.js plus Edge Functions, Vite plus Express plus Prisma, the modern AI-codegen stack. Not 2014 PHP. Not 2019 Rails. The distribution is deliberately skewed toward AI-codegen-shaped surfaces.
- 5 clean reference sites. ref0 is the general-purpose clean control. ref-oauth, ref-jwt, ref-webhook, and ref-rls are the topic-specific controls — same shape as the vulnerable scenarios, but configured correctly. If your scanner flags ref-rls, you have a false positive on a “RLS done right” surface, and your rule needs to be tighter.
- 2 calibration targets — noisy-errors and captcha-challenge — used to tune scanners against benign-looking traffic that shouldn’t crash a probe.
Every scenario is tagged to CWEs. The manifest at gapbench.vibe-eval.com/__manifest carries the tags. We can measure recall numerically — of N CWE-X surfaces planted, how many did we report? Snyk and Semgrep can’t run that calculation against their own corpus because their corpus is unlabeled.
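A sketch of that recall calculation, assuming a manifest shape of one `{ id, cwes }` record per scenario — the real schema lives at the manifest URL and may name its fields differently:

```ts
// Per-CWE recall against the gapbench manifest.
// Assumed manifest shape: [{ id: string, cwes: string[] }, ...] —
// check the manifest itself for the real field names.
type Scenario = { id: string; cwes: string[] };

async function recallByCwe(reportedIds: Set<string>) {
  const res = await fetch("https://gapbench.vibe-eval.com/__manifest");
  const scenarios: Scenario[] = await res.json();

  // Group planted scenarios by CWE tag.
  const planted = new Map<string, string[]>();
  for (const s of scenarios) {
    for (const cwe of s.cwes) {
      planted.set(cwe, [...(planted.get(cwe) ?? []), s.id]);
    }
  }

  // recall(CWE-X) = reported CWE-X surfaces / planted CWE-X surfaces
  const recall = new Map<string, number>();
  for (const [cwe, ids] of planted) {
    const hit = ids.filter((id) => reportedIds.has(id)).length;
    recall.set(cwe, hit / ids.length);
  }
  return recall;
}
```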
Why heuristic scanners can’t catch up by trying harder
The natural response to “your rules miss AI-generated code” is “we’ll add rules for AI-generated code.” That’s where the trap is. More rules without ground truth means more false positives, which means more triage burden, which means the scanner becomes shelf-ware. The recall improves. The precision craters. The customer turns it off.
The way out is the calibration loop. Add a rule. Test it against ref0 and the matching ref-* site. If it triggers on a clean control, the rule is wrong. Either tighten it or remove it. That loop is mechanically simple, and most scanner vendors don’t run it because they don’t have the controls. They could build them — nothing stops them — but the controls have to be specific to the failure modes of the rules, which means the team adding rules has to also be building benchmark scenarios in parallel. That is operational discipline, not technology.
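A sketch of that loop as a pre-ship gate. `runRule` is a stand-in for whatever executes one detection rule against one live target; the rule names and control mapping are illustrative:

```ts
// The calibration loop as a pre-ship gate. `runRule` stands in for
// whatever executes one detection rule against one live target.
type Finding = { rule: string; url: string };
declare function runRule(rule: string, target: string): Promise<Finding[]>;

// Each rule names the clean controls it must stay silent on.
// Rule names here are illustrative.
const controlsFor: Record<string, string[]> = {
  "rls-service-role-leak": ["ref0", "ref-rls"],
  "jwt-alg-none": ["ref0", "ref-jwt"],
};

async function gateRule(rule: string): Promise<void> {
  for (const control of controlsFor[rule] ?? ["ref0"]) {
    const findings = await runRule(
      rule,
      `https://gapbench.vibe-eval.com/site/${control}/`,
    );
    if (findings.length > 0) {
      // Firing on a clean control means the rule is wrong:
      // tighten it or remove it, but do not ship it.
      throw new Error(`${rule} fired on clean control ${control}`);
    }
  }
}
```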
Why this couldn’t have been done in 2018
It’s worth saying explicitly: gapbench is feasible because of where the industry is in 2026, not because anyone is smarter than they were in 2018.
The conditions that make it work:
- The AI-codegen stack converged. Five years ago, “secure this app” meant adapting to whichever framework the team had picked. Today, by our observation, ~80% of new AI-built apps fit a small set of templates (Supabase + Next.js, or Vite + Express, mostly). A benchmark that mimics that distribution is genuinely representative.
- Modern hosting makes it cheap to run 100 vulnerable apps simultaneously. Each scenario is a small Docker container. Subdomain routing on a single VPS plus a TLS cert covers the whole benchmark.
- CWE and OWASP taxonomies have stabilized. The same names mean the same things across vendors. Tagging scenarios with consistent CWEs is meaningful in a way it wasn’t when category boundaries were fuzzier.
- AI scanners are emerging as a category. A “scanner benchmark” only matters if there are scanners to benchmark. The market exists now.
None of this was true in 2018. A “public benchmark for AI-codegen apps” would have been incoherent — the apps didn’t exist, the scanners didn’t exist, the testing infrastructure was harder. It works now because the conditions stacked.
The competitor question, addressed directly
We get asked frequently: “if gapbench is public, why don’t competitors just use it to game their detection?”
Three reasons it’s not really a problem:
- The benchmark grows. Every new detection we ship adds at least one scenario. Anyone gaming the static benchmark falls behind the moving one. The discipline of continuously adding is the moat — not the URL list.
- Gaming the benchmark = solving the problem. If a competitor’s scanner catches every gapbench scenario without false positives on the controls, that scanner is a good scanner. We’d put it in front of a customer who needed it. The benchmark surfaces real engineering, regardless of who does the work.
- The auditable claim beats the marketing claim. Even if a competitor matches our gapbench numbers, the claim they earn is “X% on the public benchmark.” That’s reproducible by anyone. Marketing claims like “8x more true positives” are not. The benchmark forces the conversation onto reproducible terms.
The honest position: we publish the benchmark because it makes our customer conversations easier, not because it’s strategically clever. Customers ask “is your scanner accurate?” and we point at a URL.
What’s missing and where we’ll add
Gapbench has gaps. Honest ones:
- Mobile app scenarios. We don’t have meaningful coverage for native mobile (iOS, Android) apps. The AI-codegen tooling for mobile is less mature, but the surface is real and underserved.
- Real-time / multiplayer scenarios. WebSocket and WebRTC have their own attack surfaces; we have one WebSocket scenario, and one isn’t enough.
- Embedded / IoT-shape scenarios. Devices that talk to a cloud, with their own auth and update mechanisms, are an emerging surface.
- Long-form business-logic flows. Multi-step user flows (signup → onboarding → first paid action) where the bug only appears if you walk the flow. Hard to turn into a scenario, but important.
- Larger codebase scenarios. Most current scenarios are single-purpose. A scenario that mimics a 50K-line app with one specific bug planted somewhere is more representative of real customer code.
We’re working on each. The roadmap is public; we publish progress in /updates/.
What this looks like in practice
The pattern articles in this series each cite a vulnerable scenario and, where one exists, a matching clean control. Read “The Supabase service-role key in your frontend bundle.” The vulnerable surface is at gapbench.vibe-eval.com/site/supabase-clone/. The clean side is at gapbench.vibe-eval.com/site/ref-rls/. Run any scanner against both. The scanner that fires on the first and stays silent on the second has earned its detection. The scanner that fires on both has a false-positive case to clean up. The scanner that stays silent on both either has a recall gap or doesn’t speak HTTP, and either way the customer should know.
The same applies to JWT alg=none (anchor: jwt-alg-confusion vs ref-jwt), to OAuth redirect attacks (oauth-redirect vs ref-oauth), to webhook signature bypass (webhook-unverified vs ref-webhook). The ref-* family was built specifically for the auth and trust-boundary surfaces where heuristic scanners over-trigger. We will keep extending it.
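For evaluators who want to script the pair discipline, a sketch — `scan` is a stand-in for any scanner’s CLI or API, returning a finding count for a given URL:

```ts
// Pair discipline: test each vulnerable anchor against its matching
// clean control. `scan` stands in for any scanner's CLI or API and
// returns a finding count for the given URL.
declare function scan(url: string): Promise<number>;

const pairs: Array<[vulnerable: string, control: string]> = [
  ["supabase-clone", "ref-rls"],
  ["jwt-alg-confusion", "ref-jwt"],
  ["oauth-redirect", "ref-oauth"],
  ["webhook-unverified", "ref-webhook"],
];

const base = "https://gapbench.vibe-eval.com/site";
for (const [vuln, ref] of pairs) {
  const hits = await scan(`${base}/${vuln}/`);
  const noise = await scan(`${base}/${ref}/`);
  if (hits > 0 && noise === 0) console.log(`${vuln}: earned detection`);
  else if (noise > 0) console.log(`${vuln}: false positive on ${ref}`);
  else console.log(`${vuln}: recall gap (or the scanner doesn't speak HTTP)`);
}
```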
The marketing claim
Heuristic scanners measure their accuracy against the codebases they were trained on. We measure ours against a public benchmark you can audit, where every false positive and every missed vuln is reproducible at a URL. Show us a competitor that publishes their ref0.
That sentence is the entire business case. It’s not a feature claim — it’s a category move. The series of pattern articles this manifesto sits at the head of is the proof, one bug at a time.
What you should do with this
If you’re a builder evaluating scanners, run all of them against gapbench. Compare findings. The scanner with the best gapbench performance is not necessarily the best for your stack — gapbench is biased toward AI-codegen shapes — but it gives you a reproducible signal where vendor demos give you a sales pitch.
If you’re a journalist or researcher, the URLs and the manifest are public. The CWE tags are normalized. There’s nothing to negotiate access to.
If you’re a competitor, please run your scanner against gapbench too. We will absolutely publish the results, theirs and ours. We win or lose on the public number.
CWE / OWASP coverage
The benchmark currently covers 70+ distinct CWEs. The dominant categories: CWE-200 (information exposure), CWE-287 (improper authentication), CWE-306 (missing authentication), CWE-352 (CSRF), CWE-639 (authorization bypass via user-controlled key, i.e., BOLA/IDOR), CWE-79 (XSS), CWE-89 (SQLi), CWE-918 (SSRF), CWE-94 (code injection), and the LLM-specific entries on the OWASP LLM Top 10. The full mapping lives in the manifest.
Reproduce it yourself
- Manifest: https://gapbench.vibe-eval.com/__manifest
- Clean general control: https://gapbench.vibe-eval.com/site/ref0/
- RLS clean control: https://gapbench.vibe-eval.com/site/ref-rls/
- JWT clean control: https://gapbench.vibe-eval.com/site/ref-jwt/
- OAuth clean control: https://gapbench.vibe-eval.com/site/ref-oauth/
- Webhook clean control: https://gapbench.vibe-eval.com/site/ref-webhook/
Run VibeEval against any of these. The vulnerable scenarios should fire detections. The clean controls should not. That is the whole methodology.
RUN THE SCANNER WE CALIBRATE AGAINST
Every detection in VibeEval is benchmarked against gapbench before it ships. Run it on your app.