
METHODOLOGY

How VibeEval probes your live app — what we look at, how findings are graded, and what we explicitly do not claim. No black box. No vapor.

On this page

  1. Our approach
  2. Evidence, not opinions
  3. What we look at
  4. Context-aware grading
  5. Severity & scoring
  6. Calibration on gapbench
  7. The receipt model
  8. What we’re not claiming
  9. Your data & deletion

01 — Our approach

VibeEval is a black-box, evidence-first dynamic scanner. We test the same surface an attacker has — your live, deployed application — without source code, without infrastructure credentials, and without a stored model of your internals.

Each scan is one-shot and ephemeral: a clean run against your URL, results streamed back, then nothing left behind on our side except the scan target and finding records on your account. That stance is a security property, not just an architecture choice: we can’t leak what we never accumulate.

Why black-box. Most production breaches happen on the outside surface: a key in a bundle, an open API, a misconfigured CORS policy, a missing RLS policy. Testing from the outside is the only way to know what an attacker actually sees.

02 — Evidence, not opinions

We don’t emit a finding from a heuristic, a vibe, or a model’s guess. Every finding ships with the actual response that triggered it. If we can’t show you the wire-level evidence, the finding doesn’t exist.

That means:

  • Findings derived from real HTTP responses, real DOM state, real network behavior — not from inferred risk.
  • Active probes only produce a finding when the response itself proves the issue. A 200 with leaked data is a finding; a 403 is not.
  • No “informational” noise. If it’s in the report, it is reproducible.
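
The rule in the list above can be sketched in a few lines. This is a minimal illustration, not our detector: the `Finding` type, the `evaluate_probe` function, and the idea of a known sensitive `marker` string are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    path: str
    status: int
    excerpt: str  # captured response text that justifies the finding


def evaluate_probe(path: str, status: int, body: str, marker: str) -> Optional[Finding]:
    """Return a Finding only when the response itself shows the leak.

    `marker` is a hypothetical string that should never appear in a response
    to an unauthorized request (for example, another tenant's email).
    """
    # A denied or blocked request proves nothing, so it yields no finding.
    if status != 200:
        return None
    # A 200 without the sensitive marker is not evidence either.
    idx = body.find(marker)
    if idx == -1:
        return None
    # Keep a short excerpt around the match as the wire-level evidence.
    start = max(0, idx - 20)
    end = min(len(body), idx + len(marker) + 20)
    return Finding(path=path, status=status, excerpt=body[start:end])
```

The function has no branch that returns a finding without an excerpt, which is the property the section describes: no captured evidence, no finding.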

03 — What we look at

We probe across the categories below. The exact checks evolve weekly as new patterns emerge in AI-generated apps; we do not publish a check inventory because it’s a moving target and because a list is a checklist an attacker can route around.

IDENTITY

Authentication, sessions, tokens, OAuth flows, CSRF, cookie scope.

API & DATA

Authorization, tenant isolation, object access, open data stores, exposed admin endpoints.

INJECTION

Parser abuse across query, template, markup, and serialization surfaces.

TRANSPORT

Headers, CORS, TLS, subdomain hygiene, exposed infrastructure.

SECRETS

Keys, tokens, credentials shipped to the browser or leaked through metadata.

BUSINESS LOGIC

Client-trusted state, payment-flow assumptions, weak randomness, price tampering.

AI & AGENT

Prompt injection, tool-call abuse, RAG poisoning, exposed model endpoints, MCP misuse.

SUPPLY CHAIN

Dependency hygiene, CI exposure, third-party telemetry leaks.

04 — Context-aware grading

The same response can be a non-issue on one stack and a critical exposure on another. We adjust severity based on what your app actually is — single-page vs. server-rendered, which data store is behind the API, whether auth is session-cookie or token-based — so a finding reflects real risk in your specific setup, not a generic CWE band.
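
As a rough illustration of the idea, the sketch below grades one observation (a state-changing endpoint with no CSRF protection) differently depending on how the app authenticates. The `AppContext` type, the function name, and the severity values are hypothetical and chosen only for the example; they are not our actual grading rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppContext:
    auth_mode: str  # assumed values: "session-cookie" or "token"


def grade_missing_csrf_protection(ctx: AppContext) -> str:
    """Grade the same observation differently depending on the auth model."""
    # Browsers can attach session cookies to cross-site requests on their own,
    # so a missing CSRF defence may be directly exploitable.
    if ctx.auth_mode == "session-cookie":
        return "HIGH"
    # A token sent explicitly in a request header is not attached by the
    # browser automatically, so the same observation is closer to hygiene.
    return "LOW"
```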

We do not use an LLM as the primary detector. Reasoning layers exist to classify and explain findings that are already grounded in a captured response.

05 — Severity & scoring

Each finding is graded on a four-level scale. Scans produce two scores, both on a 0–100 band where higher is better.

Severity levels

Severity    | Meaning
CRITICAL    | Direct path to data loss, account takeover, or unbounded billing. Fix today.
HIGH        | Exploitable with modest effort or chains cleanly with a small additional step.
MEDIUM      | Real weakness that shortens the attack path or increases blast radius.
LOW / INFO  | Hygiene. Worth fixing in the next pass; not a same-day item.

Gap score

Starts at 100. Each finding deducts points by severity:

Severity    | Deduction
CRITICAL    | −25
HIGH        | −15
MEDIUM      | −5
LOW / INFO  | −1

Floored at 0. Interpretation bands:

  • 80–100 — low residual risk. Watchful monitoring.
  • 60–79 — moderate. Remediate the Highs first; recheck within the sprint.
  • <60 — high. Assume the path to compromise is short. Fix and rescan.
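
The arithmetic above is small enough to write out. The sketch below uses the published deductions and bands; the function names are ours for the example, and it assumes every finding deducts independently and that LOW and INFO are separate labels worth one point each, which this page does not spell out.

```python
from typing import Iterable

# Deductions per finding, taken from the table above.
DEDUCTIONS = {"CRITICAL": 25, "HIGH": 15, "MEDIUM": 5, "LOW": 1, "INFO": 1}


def gap_score(severities: Iterable[str]) -> int:
    """Start at 100, subtract per finding, floor at 0."""
    total = sum(DEDUCTIONS[s] for s in severities)
    return max(0, 100 - total)


def band(score: int) -> str:
    """Map a gap score to its interpretation band."""
    if score >= 80:
        return "low residual risk"
    if score >= 60:
        return "moderate"
    return "high"


# One CRITICAL, one HIGH, two MEDIUM: 100 - 25 - 15 - 5 - 5 = 50
score = gap_score(["CRITICAL", "HIGH", "MEDIUM", "MEDIUM"])
print(score, band(score))  # prints: 50 high
```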

Coverage score

How much of the surface we actually got to probe. A black-box scan isn’t worth much if a WAF blocked half the requests, or if auth-gated routes never loaded. The coverage score reflects:

  • Whether key categories returned a probeable response (or were blocked / challenged).
  • Whether authenticated routes were reached when credentials were supplied.
  • Whether the run completed within its time budget.

A high gap score with a low coverage score is the most dangerous combination — it doesn’t mean you’re clean, it means we didn’t get a clear look. Always read the two scores together.
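
One way to encode "read the two scores together" is to check coverage before trusting the gap score. The sketch below is an assumption-heavy illustration: the `triage` function and the `min_coverage` cutoff of 70 are hypothetical, since we do not publish a coverage threshold.

```python
def triage(gap: int, coverage: int, min_coverage: int = 70) -> str:
    """Interpret a gap score only when coverage is high enough to trust it.

    `min_coverage` is a hypothetical cutoff chosen for this example.
    """
    # Low coverage means the scan did not get a clear look, so a high
    # gap score cannot be read as "clean".
    if coverage < min_coverage:
        return "inconclusive: improve coverage and rescan"
    if gap >= 80:
        return "low residual risk"
    if gap >= 60:
        return "moderate"
    return "high"
```

With these assumptions, `triage(95, 40)` comes back as inconclusive rather than low risk, which is the combination the paragraph above warns about.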

Both scores are triage signals — they tell you whether to look now or later. They are not compliance numbers, maturity grades, or a substitute for human judgment on the specific finding.

06 — Calibration on gapbench

Every detection we ship is calibrated against a public benchmark we operate: gapbench.vibe-eval.com. Anyone can hit the same URLs and reproduce the same findings — calibration is not a claim, it is an audit you can run yourself.

The benchmark is 104 scenarios today:

  • ~97 deliberately vulnerable scenarios. Each tagged to CWE and OWASP, each a live HTTP surface (not just a code sample) — exposed Postgres, naked Redis, S3 ListBucket, MCP server with shell access, Supabase service-role keys in the bundle, BOLA across CRUD, JWT alg=none, Stripe webhooks without signature checks, and the rest of the long tail. Built to mimic what Lovable, Bolt, Cursor, Replit, and V0 actually produce.
  • 5 clean reference sites: ref0 plus four topic-specific controls (ref-oauth, ref-jwt, ref-webhook, ref-rls). If a clean control triggers a rule, that rule is a false positive, and we kill it before the next release. This is the part most scanners don’t have: ground truth for negatives.
  • 2 calibration targets (noisy-errors, captcha-challenge) — for handling 5xx noise and bot challenges without misreading them as findings.

What this gives us:

  • Measurable recall. Of N CWE-X surfaces planted, how many did we report? We can answer that with a number. Heuristic scanners trained on real-world repos can’t — they have no labels.
  • Measurable precision. Every finding is also run against the matching ref-* clean site. If it fires there, it’s noise. The false-positive number is observed, not asserted.
  • A public adversarial regression suite. Every new detection adds or modifies a scenario. The benchmark grows as the threat surface grows.
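
The two measurements above reduce to set arithmetic once labels exist. The following is a sketch under assumptions, not our calibration harness: the function names, the scenario names other than the ref-* controls, and the rule identifier are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Label = Tuple[str, str]  # (scenario name, CWE id)


def recall(planted: Set[Label], reported: Set[Label]) -> float:
    """Share of planted (scenario, CWE) labels that the scanner reported."""
    if not planted:
        raise ValueError("no planted labels to measure against")
    return len(planted & reported) / len(planted)


def false_positive_rules(clean_findings: Dict[str, Set[str]]) -> List[str]:
    """Rules that fired on any clean control; each is noise by construction."""
    return sorted({rule for rules in clean_findings.values() for rule in rules})


# Hypothetical labels: four planted weaknesses, three of them reported.
planted = {("s1", "CWE-639"), ("s2", "CWE-347"), ("s3", "CWE-306"), ("s4", "CWE-798")}
reported = {("s1", "CWE-639"), ("s2", "CWE-347"), ("s4", "CWE-798")}
print(recall(planted, reported))  # prints: 0.75

# Hypothetical run against two clean controls; one rule misfired on ref-jwt.
print(false_positive_rules({"ref0": set(), "ref-jwt": {"jwt-alg-none"}}))
```

Anything returned by `false_positive_rules` is a rule that fired where no vulnerability was planted.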

You don’t have to trust our accuracy claims. Point the same scanner you ran on your app at gapbench.vibe-eval.com/site/<scenario>/ and compare what we report to the labels we publish. If the numbers don’t match, that’s the bug we want to hear about.

Show us a competitor that publishes their ref0. Most scanner accuracy numbers are measured against the codebases the scanner was trained on. Ours are measured against a public benchmark you can audit, where every false positive and every missed vuln is reproducible at a URL.

07 — The receipt model

Every finding ships as a receipt: a self-contained record an engineer can act on, and an auditor can re-verify, without needing to log into VibeEval.

A receipt always includes:

  • The vulnerability class and its CWE mapping.
  • The path on your app where the evidence was captured.
  • A timestamped excerpt of the actual response that triggered the rule.
  • The severity grade and one-line rationale.
  • A concrete remediation step.

If you can’t reproduce a finding from the receipt alone, that’s a bug — please tell us.
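
For readers who want to store or diff receipts, the fields listed above could be modelled roughly as follows. The field names, the JSON shape, and the sample values are assumptions for illustration; this is not our export format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Receipt:
    vulnerability_class: str
    cwe: str
    path: str          # path on the app where the evidence was captured
    captured_at: str   # ISO 8601 timestamp of the capture
    excerpt: str       # truncated excerpt of the triggering response
    severity: str
    rationale: str     # one-line reason for the severity grade
    remediation: str   # concrete fix

    def to_json(self) -> str:
        """Serialize to a self-contained record that can be filed or re-verified."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example values.
receipt = Receipt(
    vulnerability_class="Broken object level authorization",
    cwe="CWE-639",
    path="/api/orders/1002",
    captured_at="2025-01-01T12:00:00Z",
    excerpt='{"order_id": 1002, "owner": "someone-else"}',
    severity="HIGH",
    rationale="Another account's order was returned to the scanning account.",
    remediation="Check object ownership on the server before returning the record.",
)
print(receipt.to_json())
```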

08 — What we’re not claiming

Honest limits, stated up front:

  • We see the outside of your app. Pure static-analysis bugs deep inside a private worker are out of scope.
  • Auth is shallow by default. We support credentialed scans, but we do not solve arbitrary multi-step SSO automatically.
  • WAFs distort results. When upstream protection blocks our probes we say so; some real bugs become invisible from the outside.
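  • No detection-coverage percentage. Anyone publishing “100% OWASP coverage” from a black-box probe is fitting a number to a benchmark. The coverage score on each scan measures how much of your surface we reached, not what share of vulnerability classes we detect. We’d rather show receipts than score ourselves.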
  • A scan is point-in-time. To approximate continuous coverage, trigger us on every deploy. We’re a probe, not a monitoring system.

If a competitor’s methodology page is shorter than this section, ask what’s missing.

09 — Your data & deletion

Category                             | Storage                        | Duration
Scan targets (URLs you submit)       | Encrypted at rest              | Account lifetime + 30 days
Findings & receipts                  | Encrypted at rest              | Account lifetime + 30 days
Credentials for authenticated scans  | Scoped per-job, never logged   | Deleted at end of scan
Operational logs                     | Aggregated, no app payloads    | Short rolling window
Payment records                      | Stripe                         | ~7 years (legal requirement)

We never store your application’s source code, your database contents, your end-users’ personal data, or full response bodies beyond the truncated excerpt on each receipt.

You can delete every scan, finding and receipt from your dashboard with one click. Full details, sub-processor list and DPA at Trust & Security.


Questions? security@vibe-eval.com. Found a bug in the methodology itself? Same address — disclosure metadata at /.well-known/security.txt.