FALSE POSITIVES AND THE REF0 CONTROL
Every detection ships with a calibration test. The vulnerable scenario must trigger the finding; the matching clean reference must not. If the clean side triggers, the rule is wrong. The discipline is mechanical and evergreen — and most scanners don't run it because they don't have the controls.
This is the methodology page. Every other article in the series links here. This one explains why.
The calibration problem
A heuristic scanner is, in some sense, an opinionated reader of code. It sees eval(req.body) and says “that’s RCE.” It sees cors({ origin: true, credentials: true }) and says “that’s a CORS misconfig.” It sees a Supabase client created with a key from process.env and… what? The right answer depends on which key.
Heuristics work by pattern-matching. Patterns are imperfect. The scanner’s recall — how many real bugs it catches — is one axis. Its precision — how many of its findings are actually bugs — is the other. The trade-off between them is the central engineering problem in scanner work.
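Concretely, precision and recall are just two ratios over a labeled run. A minimal sketch with hypothetical counts (the numbers below are illustrative, not measured results):

```typescript
// Precision and recall over one labeled benchmark run.
// truePositives: findings that match a planted bug
// falsePositives: findings on known-clean code
// falseNegatives: planted bugs the scanner stayed silent on
function precision(truePositives: number, falsePositives: number): number {
  return truePositives / (truePositives + falsePositives);
}

function recall(truePositives: number, falseNegatives: number): number {
  return truePositives / (truePositives + falseNegatives);
}

// Hypothetical run: 80 planted bugs caught, 40 findings on clean code, 17 bugs missed.
console.log(precision(80, 40)); // ~0.67: a third of the findings are noise
console.log(recall(80, 17));    // ~0.82: most planted bugs are caught
```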
Most scanners optimize for recall. The customer-facing pitch is “we’ll catch everything.” The operational reality is alert fatigue: most findings are noise, the security team triages, the engineering team disregards, and the scanner becomes shelfware.
Calibration is the operational discipline that fixes this. After every new rule, you ask: does this rule fire on code that isn’t the bug? If yes, you tighten or kill the rule before it ships.
Calibration requires ground truth. Ground truth requires labeled examples — code that’s known-vulnerable and code that’s known-safe. Most scanners don’t have this. They have what they call a “test corpus,” which is usually a collection of real-world repos. Real-world repos have unknown ground truth. You can use them to find new rules; you can’t use them to grade rules.
What gapbench provides
The gapbench benchmark currently has 104 scenarios. Of those:
- 97 are deliberately vulnerable. Each is tagged with the CWE(s) it represents. The model is “if a scanner is supposed to catch this CWE, it should fire on this scenario.”
- 5 are clean reference sites: ref0, ref-oauth, ref-jwt, ref-webhook, ref-rls. Each looks like the vulnerable scenarios but is configured correctly. The model is "no detection should fire on these."
- 2 are calibration targets: noisy-errors and captcha-challenge. These exist to keep scanner probes from falsely flagging benign-looking traffic patterns.
Each ref-* site mirrors the corresponding vulnerable surface. ref-rls is “RLS done right” — same Supabase shape as supabase-clone, same Next.js, same auth — but with RLS policies in place, no service-role key in the client. ref-jwt is “JWT done right” — same auth flow as jwt-alg-confusion but with algorithms pinned and signature verification mandatory.
The asymmetry is deliberate. Generic clean code (ref0) catches generic over-triggering. Topic-specific clean code (ref-*) catches over-triggering specific to a vuln class — these are the false-positive surfaces where heuristic scanners struggle most, because the safe and unsafe versions look textually similar.
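To make the labeling concrete, here is a sketch of what a scenario entry could look like. The field names and the CWE tag are our illustration, not the published manifest schema (the real manifest is linked at the bottom of this page).

```typescript
// Illustrative shape of a gapbench scenario entry. Field names and the CWE
// tag below are assumptions for illustration, not the published schema.
type ScenarioKind = "vulnerable" | "clean-reference" | "calibration-target";

interface ScenarioEntry {
  slug: string;     // e.g. "supabase-clone" or "ref-rls"
  kind: ScenarioKind;
  cwes: string[];   // CWE tags for vulnerable scenarios; empty for controls
  url: string;
}

const examples: ScenarioEntry[] = [
  {
    slug: "supabase-clone",
    kind: "vulnerable",
    cwes: ["CWE-522"],  // illustrative tag only
    url: "https://gapbench.vibe-eval.com/site/supabase-clone/",
  },
  {
    slug: "ref-rls",
    kind: "clean-reference",
    cwes: [],
    url: "https://gapbench.vibe-eval.com/site/ref-rls/",
  },
];
```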
How a rule lands
Walk-through of how a new detection ships in VibeEval:
- A rule is proposed. Either from a customer engagement, from a CVE we want to detect proactively, or from a regression in our own monitoring.
- The rule is implemented. Code goes into the scanner.
- The rule is tested against the corresponding vulnerable scenario. If a scenario for the bug doesn’t exist on gapbench, we write one. The rule must fire.
- The rule is tested against ref0. If it fires, we have a false-positive case. Either tighten the rule or kill it.
- The rule is tested against the topic-specific ref-* site, if we have one for this vuln class. If it fires, same conclusion.
- The rule is tested against a sample of customer scans (with consent), because real-world data still surfaces things the controls don’t anticipate.
If all four tests pass (the vulnerable scenario fires, neither control does, the customer scans look clean), the rule ships.
This loop is mechanical. It’s also the part most scanner vendors don’t run, because they don’t have the controls. They do steps 1, 2, 3, and 6 and skip steps 4 and 5. The result is rules that fire reasonably often and have unknown false-positive characteristics until the customer complains.
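As a sketch of what the gate in steps 3 through 5 looks like when automated: runRule here is a hypothetical helper that runs one detection against one target and returns its findings; the real harness is internal.

```typescript
// Sketch of the calibration gate for a single rule. runRule() is a
// hypothetical helper, not VibeEval's actual harness.
interface Finding { ruleId: string; target: string; detail: string }

declare function runRule(ruleId: string, targetUrl: string): Promise<Finding[]>;

async function calibrateRule(
  ruleId: string,
  vulnerableUrl: string,
  controlUrls: string[],  // ref0 plus the topic-specific ref-* site, if one exists
): Promise<boolean> {
  // Step 3: the rule must fire on the planted-vulnerable scenario.
  const hits = await runRule(ruleId, vulnerableUrl);
  if (hits.length === 0) return false;  // the rule misses its own target

  // Steps 4 and 5: the rule must stay silent on every clean control.
  for (const control of controlUrls) {
    const falsePositives = await runRule(ruleId, control);
    if (falsePositives.length > 0) return false;  // tighten or kill the rule
  }
  return true;  // eligible for the customer-scan sweep (step 6)
}
```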
What this looks like to a competitor running the same benchmark
Honest competitive perspective: any scanner can scan gapbench. We publish the URLs. We publish the manifest with the CWE tags. A competitor running their scanner against gapbench produces:
- Recall by CWE: of N planted scenarios for CWE-X, how many did they fire on?
- False-positive surface: of the 5 clean control sites, how many did they fire on?
- Per-scenario detection map: which scenarios did they catch, which did they miss?
That set of numbers is, for the first time, an apples-to-apples comparison. It’s not perfect — gapbench is biased toward AI-codegen-shaped stacks — but it’s reproducible.
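A sketch of how those three outputs fall out of a per-scenario detection map. The entry shape mirrors the illustrative manifest sketch earlier; detected maps each scenario slug to whether the scanner fired there.

```typescript
// Derive recall-by-CWE and the false-positive surface from a per-scenario
// detection map. The Entry shape is illustrative (see the manifest sketch above).
type Entry = { slug: string; kind: string; cwes: string[] };

function scoreRun(manifest: Entry[], detected: Map<string, boolean>) {
  const recallByCwe = new Map<string, { caught: number; total: number }>();
  let cleanControlFirings = 0;

  for (const entry of manifest) {
    const fired = detected.get(entry.slug) ?? false;
    if (entry.kind === "vulnerable") {
      for (const cwe of entry.cwes) {
        const row = recallByCwe.get(cwe) ?? { caught: 0, total: 0 };
        row.total += 1;
        if (fired) row.caught += 1;
        recallByCwe.set(cwe, row);
      }
    } else if (fired) {
      cleanControlFirings += 1;  // a firing on a clean control is a false positive
    }
  }
  return { recallByCwe, cleanControlFirings };
}
```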
We will run our scanner against gapbench publicly. We will, when asked, publish comparisons against any competitor that scans the same set. Whoever publishes their own gapbench results gets a win in the credibility column even if their numbers are worse than ours.
A worked example — calibrating the Supabase RLS rule
Concrete walkthrough so the methodology isn’t abstract.
We ship a detection: “Supabase service-role key in JavaScript bundle.” The rule fires when a JWT in the bundle has role: service_role in its decoded payload.
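A minimal sketch of the core check, assuming the bundle text has already been fetched. The helper names and the regex are ours for illustration, not the shipped rule.

```typescript
// Find JWT-shaped strings in a JS bundle and flag any whose decoded payload
// claims role "service_role". Names and regex are illustrative.
const JWT_PATTERN = /eyJ[\w-]+\.[\w-]+\.[\w-]+/g;  // three base64url segments

function decodePayload(token: string): Record<string, unknown> | null {
  try {
    const payloadSegment = token.split(".")[1];
    return JSON.parse(Buffer.from(payloadSegment, "base64url").toString("utf8"));
  } catch {
    return null;  // not a decodable JWT; ignore rather than flag
  }
}

function findServiceRoleKeys(bundleSource: string): string[] {
  const hits: string[] = [];
  for (const candidate of bundleSource.match(JWT_PATTERN) ?? []) {
    const claims = decodePayload(candidate);
    if (claims !== null && claims.role === "service_role") hits.push(candidate);
  }
  return hits;
}
```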
Step 1 — vulnerable scenario. gapbench.vibe-eval.com/site/supabase-clone/ ships the service-role key in _app.js. The detection fires. ✓
Step 2 — initial false-positive sweep. Run the detection against ref0. ref0 is a generic clean Next.js app — no Supabase, no auth, no JWTs at all. Detection should not fire. ✓
Step 3 — topic-specific clean control. Run against ref-rls. ref-rls is the same Supabase shape as supabase-clone but with the service-role key kept server-side. The bundle contains the anon key (which has role: anon) but no service-role key. Detection should not fire. ✓
Step 4 — adversarial probes. Try to break the rule:
- Anon JWT plus a service-role JWT in a separate file the bundle imports lazily — does the rule still find it?
- A JWT-shaped string that’s not actually a JWT (a random base64 sequence that decodes to something that looks like a JWT but isn’t signed) — does the rule false-positive?
- A genuine service-role key in a comment block in the bundle (rare but possible) — does the rule classify correctly?
For each probe, we either tighten the rule or, if it’s already tight, document the boundary.
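For the second probe in particular, one way to tighten without losing genuine hits is to require the decoded payload to carry the claims a real Supabase key would carry. A sketch, operating on the decoded payload from the sketch above; which claims to require is our assumption, not the shipped boundary.

```typescript
// Tightened check for probe 2: a random base64 blob that happens to decode
// to JSON should not trip the rule. The specific required claims (role,
// iss, exp) are illustrative assumptions about Supabase key payloads.
function looksLikeSupabaseServiceKey(claims: Record<string, unknown> | null): boolean {
  if (claims === null) return false;
  return (
    claims.role === "service_role" &&
    typeof claims.iss === "string" &&  // issuer claim present
    typeof claims.exp === "number"     // expiry claim present
  );
}
```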
Step 5 — production sweep. Run the rule across a sample of customer scans (with consent) and look at the split between scans that fire and scans that don’t. If most customer scans fire, the rule may be over-firing on common patterns we didn’t anticipate. If too few fire, the rule may be too narrow.
The discipline is the loop. Every rule that ships has gone through these five steps. Most heuristic scanners ship rules that have only gone through step 1.
False positives across the industry — what the data says
We’ve run our test suite (the gapbench probes plus our internal recall + FP probes) against the major scanners. Honest summary, no specific competitor numbers because the field is fast-moving and we want this article to age well:
- Legacy SAST (Snyk, Semgrep, Checkmarx, Veracode, Fortify): High recall on classic CWE patterns. False-positive rates highly variable depending on tuning. Per-rule FP rates against ref0-shape clean code are unpublished by the vendors and would be embarrassing to publish.
- Modern AI-flavored (DepthFirst etc.): Better precision on code that matches their training distribution. Untested on AI-codegen-shaped surfaces specifically; their public benchmarks don’t include AI-codegen-stack scenarios.
- DAST (ZAP, Burp, Acunetix): Solid recall on OWASP-Top-10-shape bugs at runtime. Limited rules for AI-codegen-specific bugs.
- VibeEval: Calibrated against gapbench. Per-CWE numbers published. Falsifiable.
The field is still early in public-benchmark adoption. The first competitor to publish their gapbench numbers raises the floor for everyone.
Why we keep adding scenarios
Gapbench is a moving target on purpose. Every new detection we ship gets a corresponding scenario. Every false-positive customer report gets a clean-control variant. Every threat-research thread that lands in our ticket queue produces at least one new vulnerable surface.
The current state: 104 scenarios. A year ago we were at 55. A year from now we’ll likely be at 200+. The growth isn’t a vanity metric — it tracks the work of “ship rule + ship calibration update for that rule.”
We resist the temptation to lump similar scenarios together. supabase-clone and config-leak and sentry-dsn-leak are all “secret in a bundle” surfaces, but each has a slightly different shape: Supabase JWT vs raw API key vs DSN URL. Lumping them would simplify the manifest at the cost of false-positive resolution. Each separate scenario is a separate calibration target.
What customers can do with this
Three concrete uses:
- Scanner evaluation. When picking between scanners, run all of them against a representative subset of gapbench. The numbers come back deterministic.
- Scanner regression detection. If you already use a scanner, run it against gapbench monthly. Track the per-scenario detection map over time. If a previously-detected scenario stops firing, you have a regression in the scanner’s rules.
- Custom rule validation. If your team writes custom Semgrep rules or other scanner extensions, gapbench scenarios are labeled test data you can use to validate the rules. Better than testing on your own codebase, where ground truth is unknown.
Each of these is more useful than the typical “vendor demo” approach to scanner evaluation.
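For the second use, a sketch of the monthly regression check; the stored detection-map format (scenario slug to fired/not-fired) is our illustration of how you might record your own results.

```typescript
// Diff this month's per-scenario detection map against last month's to spot
// scanner regressions. The map format is an illustrative assumption.
function findRegressions(
  previous: Map<string, boolean>,
  current: Map<string, boolean>,
): string[] {
  const regressions: string[] = [];
  for (const [slug, firedBefore] of previous) {
    if (firedBefore && !(current.get(slug) ?? false)) {
      regressions.push(slug);  // previously detected, now silent
    }
  }
  return regressions;
}

// e.g. page the team if findRegressions(lastMonth, thisMonth).length > 0
```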
What you should take away
- Scanner accuracy claims are unverifiable without labeled ground truth. Ask any vendor where their false-positive rate comes from. The answer should reference labeled test data, not “we measured customer dismissals.”
- The presence of clean controls in a benchmark is more important than the count of vulnerable scenarios. Anyone can plant 1,000 vulnerable scenarios; very few maintain the discipline of clean controls for each vuln class.
- A scanner that hasn’t been tested against your stack’s specific shape is a wild card. Run gapbench against any scanner you’re evaluating, focus on the scenarios that resemble your stack, and weight the results accordingly.
Reproduce it yourself
- ref0: https://gapbench.vibe-eval.com/site/ref0/
- ref-rls: https://gapbench.vibe-eval.com/site/ref-rls/
- ref-jwt: https://gapbench.vibe-eval.com/site/ref-jwt/
- ref-oauth: https://gapbench.vibe-eval.com/site/ref-oauth/
- ref-webhook: https://gapbench.vibe-eval.com/site/ref-webhook/
- Calibration target (benign noisy errors): https://gapbench.vibe-eval.com/site/noisy-errors/
- Calibration target (captcha challenge): https://gapbench.vibe-eval.com/site/captcha-challenge/
- Manifest: https://gapbench.vibe-eval.com/__manifest
Related reading
- Pattern: Why we built gapbench
- Pattern: The Supabase service-role key in your frontend bundle
- Pattern: JWT alg=none is not dead