METHODOLOGY
How VibeEval probes your live app — what we look at, how findings are graded, and what we explicitly do not claim. No black box. No vapor.
On this page
- Our approach
- Evidence, not opinions
- What we look at
- Context-aware grading
- Severity & scoring
- Calibration on gapbench
- The receipt model
- What we’re not claiming
- Your data & deletion
01 — Our approach
VibeEval is a black-box, evidence-first dynamic scanner. We test the same surface an attacker has — your live, deployed application — without source code, without infrastructure credentials, and without a stored model of your internals.
Each scan is one-shot and ephemeral: a clean run against your URL, results streamed back, then nothing left behind on our side except the scan target and the finding records on your account. That stance is a security property, not just an architecture choice: we can’t leak what we never accumulate.
02 — Evidence, not opinions
We don’t emit a finding from a heuristic, a vibe, or a model’s guess. Every finding ships with the actual response that triggered it. If we can’t show you the wire-level evidence, the finding doesn’t exist.
That means:
- Findings derived from real HTTP responses, real DOM state, real network behavior — not from inferred risk.
- Active probes only raise a finding when the response itself proves the issue. A 200 with leaked data is a finding; a 403 is not.
- No “informational” noise. If it’s in the report, it is reproducible.
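As a minimal sketch of the evidence rule above (not VibeEval's actual implementation), a check can be written so that it is impossible to return a finding without an excerpt of the captured response. The names `CapturedResponse`, `Finding`, `evaluate`, and the `marker` parameter are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CapturedResponse:
    """Hypothetical record of what came back over the wire."""
    status: int
    body: str


@dataclass(frozen=True)
class Finding:
    rule: str
    evidence: str  # excerpt of the actual response that triggered the rule


def evaluate(rule: str, response: CapturedResponse, marker: str) -> Optional[Finding]:
    """Return a Finding only when the response itself contains the proof.

    `marker` stands in for whatever a rule treats as proof of exposure
    (an assumption of this sketch, e.g. a planted canary value).
    """
    if not 200 <= response.status < 300:
        return None  # a 403, or any other refusal, is not a finding
    idx = response.body.find(marker)
    if idx == -1:
        return None  # success status but no leaked data: nothing to show
    start = max(0, idx - 20)
    end = idx + len(marker) + 20
    return Finding(rule=rule, evidence=response.body[start:end])


leaked = CapturedResponse(200, '{"email": "alice@example.com"}')
denied = CapturedResponse(403, "Forbidden")
print(evaluate("open-data-store", leaked, "alice@example.com"))
print(evaluate("open-data-store", denied, "alice@example.com"))
```

Because `evidence` is a required field, a rule that cannot point at bytes in the response has no way to construct a finding.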
03 — What we look at
We probe across the categories below. The exact checks evolve weekly as new patterns emerge in AI-generated apps; we do not publish a check inventory because it’s a moving target and because a list is a checklist an attacker can route around.
IDENTITY
Authentication, sessions, tokens, OAuth flows, CSRF, cookie scope.
API & DATA
Authorization, tenant isolation, object access, open data stores, exposed admin endpoints.
INJECTION
Parser abuse across query, template, markup, and serialization surfaces.
TRANSPORT
Headers, CORS, TLS, subdomain hygiene, exposed infrastructure.
SECRETS
Keys, tokens, credentials shipped to the browser or leaked through metadata.
BUSINESS LOGIC
Client-trusted state, payment-flow assumptions, weak randomness, price tampering.
AI & AGENT
Prompt injection, tool-call abuse, RAG poisoning, exposed model endpoints, MCP misuse.
SUPPLY CHAIN
Dependency hygiene, CI exposure, third-party telemetry leaks.
04 — Context-aware grading
The same response can be a non-issue on one stack and a critical exposure on another. We adjust severity based on what your app actually is — single-page vs. server-rendered, which data store is behind the API, whether auth is session-cookie or token-based — so a finding reflects real risk in your specific setup, not a generic CWE band.
We do not use an LLM as the primary detector. Reasoning layers exist to classify and explain findings that are already grounded in a captured response.
05 — Severity & scoring
Each finding is graded on a four-level scale. Scans produce two scores, both on a 0–100 band where higher is better.
Severity levels
| Severity | Meaning |
|---|---|
| CRITICAL | Direct path to data loss, account takeover, or unbounded billing. Fix today. |
| HIGH | Exploitable with modest effort or chains cleanly with a small additional step. |
| MEDIUM | Real weakness that shortens the attack path or increases blast radius. |
| LOW / INFO | Hygiene. Worth fixing in the next pass; not a same-day item. |
Gap score
Starts at 100. Each finding deducts by severity:
| Severity | Deduction |
|---|---|
| CRITICAL | −25 |
| HIGH | −15 |
| MEDIUM | −5 |
| LOW / INFO | −1 |
Floored at 0. Interpretation bands:
- 80–100 — low residual risk. Watchful monitoring.
- 60–79 — moderate. Remediate the Highs first; recheck within the sprint.
- <60 — high. Assume the path to compromise is short. Fix and rescan.
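The arithmetic above can be sketched in a few lines. The deductions, the floor, and the bands come from this page; the function names `gap_score` and `band` are hypothetical, and the sketch assumes every finding deducts independently with no per-category cap.

```python
# Deductions per finding, taken from the table above.
DEDUCTIONS = {"CRITICAL": 25, "HIGH": 15, "MEDIUM": 5, "LOW / INFO": 1}


def gap_score(severities):
    """Start at 100, subtract per finding, floor at 0."""
    total = sum(DEDUCTIONS[s] for s in severities)
    return max(0, 100 - total)


def band(score):
    """Map a gap score to the interpretation bands above."""
    if score >= 80:
        return "low residual risk"
    if score >= 60:
        return "moderate"
    return "high"


findings = ["CRITICAL", "HIGH", "MEDIUM", "MEDIUM"]
score = gap_score(findings)  # 100 - 25 - 15 - 5 - 5 = 50
print(score, band(score))
```

Under these rules a single Critical still leaves the score at 75, so the band alone never replaces reading the finding itself.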
Coverage score
How much of the surface we actually got to probe. A black-box scan isn’t worth much if a WAF blocked half the requests, or if auth-gated routes never loaded. The coverage score reflects:
- Whether key categories returned a probeable response (or were blocked / challenged).
- Whether authenticated routes were reached when credentials were supplied.
- Whether the run completed within its time budget.
A high gap score with a low coverage score is the most dangerous combination — it doesn’t mean you’re clean, it means we didn’t get a clear look. Always read the two scores together.
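One way to read the two scores together is to let the coverage score gate how much weight the gap score gets. This is only a sketch: the page does not publish a coverage cutoff, so `min_coverage=70` is an assumption chosen for illustration, and `read_scores` is a hypothetical helper.

```python
def read_scores(gap: int, coverage: int, min_coverage: int = 70) -> str:
    """Interpret the gap score only as far as the coverage score allows.

    min_coverage is a hypothetical cutoff, not a published VibeEval value.
    """
    if not (0 <= gap <= 100 and 0 <= coverage <= 100):
        raise ValueError("both scores are on a 0-100 band")
    if coverage < min_coverage:
        if gap >= 80:
            # The dangerous combination: looks clean, but we did not get a clear look.
            return "inconclusive: unblock the scanner and rescan"
        return "act on findings, then unblock the scanner and rescan"
    if gap >= 80:
        return "low residual risk"
    if gap >= 60:
        return "moderate"
    return "high"


print(read_scores(gap=95, coverage=30))
print(read_scores(gap=95, coverage=90))
```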
Both scores are triage signals — they tell you whether to look now or later. They are not compliance numbers, maturity grades, or a substitute for human judgment on the specific finding.
06 — Calibration on gapbench
Every detection we ship is calibrated against a public benchmark we operate: gapbench.vibe-eval.com. Anyone can hit the same URLs and reproduce the same findings — calibration is not a claim, it is an audit you can run yourself.
The benchmark is 104 scenarios today:
- ~97 deliberately vulnerable scenarios. Each tagged to CWE and OWASP, each a live HTTP surface (not just a code sample) — exposed Postgres, naked Redis, S3 ListBucket, MCP server with shell access, Supabase service-role keys in the bundle, BOLA across CRUD, JWT alg=none, Stripe webhooks without signature checks, and the rest of the long tail. Built to mimic what Lovable, Bolt, Cursor, Replit, and V0 actually produce.
- 5 clean reference sites: ref0 plus four topic-specific controls (ref-oauth, ref-jwt, ref-webhook, ref-rls). If a clean control triggers a rule, that rule is a false positive, and we kill it before the next release. This is the part most scanners don’t have: ground truth for negatives.
- 2 calibration targets (noisy-errors, captcha-challenge) for handling 5xx noise and bot challenges without misreading them as findings.
What this gives us:
- Measurable recall. Of N CWE-X surfaces planted, how many did we report? We can answer that with a number. Heuristic scanners trained on real-world repos can’t — they have no labels.
- Measurable precision. Every finding is also run against the matching ref-* clean site. If it fires there, it’s noise. The false-positive number is observed, not asserted.
- A public adversarial regression suite. Every new detection adds or modifies a scenario. The benchmark grows as the threat surface grows.
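With labels on both sides, recall and false positives reduce to counting. The sketch below assumes a simple per-scenario record; `ScenarioResult`, `recall`, and `false_positives` are hypothetical names, the vulnerable scenario names are invented, and the sample results are made up purely to show the arithmetic (they are not benchmark data).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    name: str
    planted: bool   # True: deliberately vulnerable scenario; False: clean ref-* control
    reported: bool  # True if the scanner reported the rule on this scenario


def recall(results):
    """Of the planted surfaces, what fraction did the scanner report?"""
    planted = [r for r in results if r.planted]
    if not planted:
        return None  # undefined without labeled positives
    return sum(r.reported for r in planted) / len(planted)


def false_positives(results):
    """Clean controls where the rule fired anyway: each one is noise."""
    return [r.name for r in results if not r.planted and r.reported]


# Illustrative, made-up results for one JWT rule.
results = [
    ScenarioResult("jwt-scenario-a", planted=True, reported=True),
    ScenarioResult("jwt-scenario-b", planted=True, reported=True),
    ScenarioResult("jwt-scenario-c", planted=True, reported=False),
    ScenarioResult("ref-jwt", planted=False, reported=False),
]
print(recall(results), false_positives(results))
```

A rule with any entry in `false_positives` would be the one to fix before release, per the clean-control policy above.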
You don’t have to trust our accuracy claims. Point the same scanner you ran on your app at gapbench.vibe-eval.com/site/<scenario>/ and compare what we report to the labels we publish. If the numbers don’t match, that’s the bug we want to hear about.
07 — The receipt model
Every finding ships as a receipt: a self-contained record an engineer can act on, and an auditor can re-verify, without needing to log into VibeEval.
A receipt always includes:
- The vulnerability class and its CWE mapping.
- The path on your app where the evidence was captured.
- A timestamped excerpt of the actual response that triggered the rule.
- The severity grade and one-line rationale.
- A concrete remediation step.
If you can’t reproduce a finding from the receipt alone, that’s a bug — please tell us.
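The fields listed above map naturally onto a small, serializable record. The sketch below is one possible shape, not VibeEval's actual schema: the field names, the `Receipt` class, and every example value (path, timestamp, excerpt) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Receipt:
    vulnerability_class: str
    cwe: str
    path: str              # where on the app the evidence was captured
    captured_at: str       # ISO 8601 timestamp of the capture
    response_excerpt: str  # truncated excerpt of the triggering response
    severity: str
    rationale: str         # one line explaining the grade
    remediation: str       # concrete next step

    def to_json(self) -> str:
        """Serialize to a self-contained document that can be shared offline."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example values, for illustration only.
receipt = Receipt(
    vulnerability_class="Broken object-level authorization (BOLA)",
    cwe="CWE-639",
    path="/api/orders/1042",
    captured_at="2025-01-01T12:00:00+00:00",
    response_excerpt='{"order_id": 1042, "owner": "another-user"}',
    severity="HIGH",
    rationale="Order belonging to a different account was returned.",
    remediation="Check object ownership server-side before returning the record.",
)
print(receipt.to_json())
```

Keeping the record flat and JSON-serializable is what lets an auditor re-verify it without logging in: everything needed to replay the request is in the document itself.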
08 — What we’re not claiming
Honest limits, stated up front:
- We see the outside of your app. Pure static-analysis bugs deep inside a private worker are out of scope.
- Auth is shallow by default. We support credentialed scans, but we do not solve arbitrary multi-step SSO automatically.
- WAFs distort results. When upstream protection blocks our probes we say so; some real bugs become invisible from the outside.
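- No vulnerability-class coverage percentage. Anyone publishing “100% OWASP coverage” from a black-box probe is fitting a number to a benchmark. We’d rather show receipts than score ourselves. (The coverage score in section 05 is a different thing: it reports how much of your surface a given run actually got to probe.)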
- A scan is point-in-time. To approximate continuous coverage, trigger us on every deploy. We’re a probe, not a monitoring system.
If a competitor’s methodology page is shorter than this section, ask what’s missing.
09 — Your data & deletion
| Category | Storage | Duration |
|---|---|---|
| Scan targets (URLs you submit) | Encrypted at rest | Account lifetime + 30 days |
| Findings & receipts | Encrypted at rest | Account lifetime + 30 days |
| Credentials for authenticated scans | Scoped per-job, never logged | Deleted at end of scan |
| Operational logs | Aggregated, no app payloads | Short rolling window |
| Payment records | Stripe | ~7 years (legal requirement) |
We never store your application’s source code, your database contents, your end-users’ personal data, or full response bodies beyond the truncated excerpt on each receipt.
You can delete every scan, finding and receipt from your dashboard with one click. Full details, sub-processor list and DPA at Trust & Security.
Questions? security@vibe-eval.com. Found a bug in the methodology itself? Same address — disclosure metadata at /.well-known/security.txt.