HOW SECURE IS AN AI-GENERATED APP? 2026 FAILURE-MODE CATALOG

AI-built apps from Lovable, Bolt, Cursor, Replit, and V0 ship with a recurring set of authorization and credential failures. This catalog ranks the modes by how often they reproduce on the gapbench public benchmark and in anonymized customer engagements, with the OWASP and CWE mapping for each.

This is a failure-mode catalog: the recurring vulnerability classes we see in live applications built on these platforms, all scored on the same CVSS rubric and mapped to the same taxonomy.

Builders, journalists, and AppSec researchers can reproduce the rows in the tables below against live gapbench scenarios. The methodology section explains how.

Catalog scope

Field | Value
Window | Nov 2025 – Apr 2026
Source | Customer engagements (anonymized) + gapbench reproducible scenarios
Severity rubric | CVSS 3.1 (critical 9.0+, high 7.0–8.9)
Calibration controls | ref0, ref-rls, ref-jwt, ref-oauth, ref-webhook
Public reproducibility anchor | gapbench.vibe-eval.com: 97 deliberately vulnerable + 7 clean controls

We do not publish a single corpus-wide N because the underlying set mixes customer engagements (anonymized) and longitudinal scans against our own gapbench scenarios. The reproducibility anchor — the part anyone can verify — is the gapbench scenario set referenced under each finding.

Per-platform modal failure mode

The dominant failure mode we observe per platform, ranked by relative frequency of “at least one critical or high finding.”

Platform | Relative critical+high incidence | Modal top finding
Lovable | Highest | Missing or broken Supabase RLS
Bolt.new | High | Hardcoded secrets in client bundle
Replit | High | Public .env exposure on default deployments
Cursor | Moderate | Broken object-level auth (BOLA)
V0 | Lower (application-layer) | Unauthenticated API routes generated alongside components

Lovable’s elevated rate is structural, not incidental — see the FAQ. V0’s lower application-layer rate reflects that V0 apps typically outsource their backend; the underlying Supabase or Convex backend then carries the RLS and credential failures separately.

Top 10 recurring vulnerabilities

Ranked by how consistently we reproduce them across platforms and engagements. Rows that list a gapbench scenario can be reproduced against it; see “Reproduce on the public benchmark” below.

Rank | Category | OWASP mapping | Recurrence | gapbench scenario
1 | Missing or broken Row Level Security | API1 BOLA · API3 BOPLA | Most-reproduced | supabase-clone
2 | Hardcoded secrets in frontend bundle | A02 Cryptographic Failures | Highly recurring | indie-saas, config-leak
3 | Broken object-level authorization (BOLA) | API1 BOLA | Highly recurring | multi-tenant-saas
4 | Missing rate limiting on auth and write endpoints | API4 Unrestricted Resource Consumption | Common | -
5 | CORS allow-all on credentialed endpoints | A05 Security Misconfiguration | Common | cors-credentials-misconfig
6 | Self-editable role or permission fields | API5 BFLA · API3 BOPLA (mass assignment) | Common | mass-assignment
7 | SSRF via user-supplied URLs in upload/import flows | A10 SSRF | Recurring | ssrf-open-redirect-oauth
8 | Verbose error responses leaking stack traces | A09 Logging Failures | Recurring | -
9 | Open redirects in auth callback handlers | A01 Broken Access Control | Recurring | ssrf-open-redirect-oauth
10 | Outdated dependencies with known critical CVEs | A06 Vulnerable Components | Recurring | -

The top three are the categories that recur on essentially every engagement. Any single one of them is sufficient to leak every user’s data.

CWE / OWASP mapping for the top 10

The OWASP column in the table above is one mapping per row; in practice each finding usually carries two or three CWE codes. The expanded mapping below is the canonical one we tag findings against.

Rank | Category | OWASP API | OWASP Web | OWASP LLM | Primary CWE | Secondary CWE
1 | Missing or broken RLS | API1 BOLA · API3 BOPLA | A01 Broken Access Control | - | CWE-862 Missing Authorization | CWE-863 Incorrect Authorization
2 | Hardcoded secrets in frontend bundle | API8 Security Misconfiguration | A02 Cryptographic Failures · A05 | LLM07 System Prompt Leakage | CWE-798 Hard-coded Credentials | CWE-200 Sensitive Info Exposure
3 | BOLA | API1 BOLA | A01 Broken Access Control | - | CWE-639 Auth Bypass via Key | CWE-284 Improper Access Control
4 | Missing rate limiting | API4 Unrestricted Resource Consumption | A05 Security Misconfiguration | LLM10 Unbounded Consumption | CWE-770 Allocation w/o Limits | CWE-307 Improper Restriction of Auth Attempts
5 | CORS allow-all on credentialed endpoints | API8 Security Misconfiguration | A05 Security Misconfiguration | - | CWE-942 Permissive Cross-domain Policy | CWE-346 Origin Validation Error
6 | Self-editable role / mass assignment | API5 BFLA · API3 BOPLA (mass assignment) | A04 Insecure Design | - | CWE-915 Mass Assignment | CWE-863 Incorrect Authorization
7 | SSRF in upload / import flows | API7 Server Side Request Forgery | A10 SSRF | - | CWE-918 SSRF | CWE-441 Confused Deputy
8 | Verbose error responses | API8 Security Misconfiguration | A09 Logging Failures · A05 | - | CWE-209 Info Exposure via Error | CWE-200
9 | Open redirects in auth callbacks | API8 Security Misconfiguration | A01 Broken Access Control | - | CWE-601 URL Redirect to Untrusted | CWE-639
10 | Outdated dependencies with known CVEs | API8 Security Misconfiguration | A06 Vulnerable Components | - | CWE-1104 Use of Unmaintained Third Party | CWE-937

The top three categories together carry the bulk of CWE-639 / CWE-862 / CWE-798 — the access-control and credential families. These are also the categories where AI generators have the most systematic blind spots: the bug is in what the model omitted, not what it produced.

Calibration — why the catalog is not scanner noise

Every probe behind this catalog runs against a clean reference site as well as the target — the calibration stack is what separates a recurring failure mode from a noisy detection.

Reference | URL | Calibrates probes for
ref0 (general) | /site/ref0/ | The catch-all clean baseline; every probe runs here
ref-rls | /site/ref-rls/ | Supabase RLS / PostgREST detections
ref-jwt | /site/ref-jwt/ | JWT alg-confusion, kid-traversal, weak-secret detections
ref-oauth | /site/ref-oauth/ | OAuth redirect_uri, PKCE, state-parameter detections
ref-webhook | /site/ref-webhook/ | Stripe / payment webhook signature detections

A probe that fires on its matched reference is, by construction, a false positive. The rule is killed before it ships. Heuristic scanners that lack ground-truth references publish recall-leaning numbers because they cannot measure their own precision. Every recurrence claim in this catalog is net of false-positive elimination via the reference sites.
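The kill rule can be pictured as a set difference between what fired on the target and what fired on the matched clean reference. The sketch below is only an illustration under assumptions of ours: the one-probe-ID-per-line file format and the probe IDs are hypothetical, and this is not VibeEval's actual tooling.

```shell
#!/bin/sh
# Hypothetical inputs, one probe ID per line:
#   $1 = probes that fired on the target
#   $2 = probes that fired on the matched clean reference (for example ref-rls)
# Prints the target findings that survive calibration.
net_findings() {
  # -v invert match, -x whole line, -F fixed string, -f patterns from file.
  # grep exits 1 when nothing survives; treat that as success, not an error.
  grep -vxFf "$2" "$1" || [ "$?" -eq 1 ]
}

# Usage with made-up probe IDs:
workdir=$(mktemp -d)
printf '%s\n' rls-open-select jwt-weak-secret cors-wildcard > "$workdir/target.txt"
printf '%s\n' cors-wildcard > "$workdir/reference.txt"
net_findings "$workdir/target.txt" "$workdir/reference.txt"
# prints rls-open-select and jwt-weak-secret; cors-wildcard also fired on
# the clean reference, so it is dropped as a false positive
rm -r "$workdir"
```

Keeping the reference results as data, not as hard-coded exclusions, means a rule that starts misfiring on a clean site is caught the next time the reference is scanned.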

For the methodology in detail, see the companion pattern: False positives and the ref0 control.

What is moving year-over-year

We do not publish year-over-year share percentages because we cannot independently verify a uniform sample across two windows. What we can report directionally, from what we see in engagements:

  • RLS awareness has grown. Lovable, Bolt, and Cursor now ship documentation that explicitly mentions Row Level Security. The failure mode persists, but the framing in the platforms’ own docs has changed.
  • Secret handling has not improved. Service-role keys and sk_live_ style secrets in frontend bundles are still the modal credential leak — see the Frontend Secrets Report.
  • BOLA in AI-generated CRUD is, if anything, more common as platforms expand their custom-API surface beyond pure PostgREST — see BOLA in AI-generated CRUD.
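For the service-role leak mentioned above, one way to tell a harmless anon key from a service-role key is to read the role claim inside the token. This is a minimal sketch, assuming the key uses the JWT format with a `role` claim that Supabase anon and service_role keys have carried; the function names are ours, and `base64 -d` assumes GNU coreutils syntax.

```shell
#!/bin/sh
# Print the "role" claim of a JWT without verifying its signature.
jwt_role() {
  # The payload is the second dot-separated segment, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url strips.
  case $(( ${#payload} % 4 )) in
    2) payload="$payload==" ;;
    3) payload="$payload=" ;;
  esac
  printf '%s' "$payload" | base64 -d 2>/dev/null |
    sed -n 's/.*"role"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Read a page or JS bundle on stdin and report embedded service-role keys.
scan_bundle_for_service_role() {
  grep -oE 'eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+' |
  while IFS= read -r token; do
    if [ "$(jwt_role "$token")" = "service_role" ]; then
      # Print only a prefix so the secret is not copied into logs.
      printf 'service_role key found: %.12s...\n' "$token"
    fi
  done
}

# Usage (network call, shown as a comment):
# curl -s https://your-app.example/ | scan_bundle_for_service_role
```

An anon key in the bundle is expected for this architecture; only a token whose role is `service_role` bypasses RLS entirely and belongs on the server.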

Methodology

Source. Failure modes were identified across (a) anonymized customer engagements with apps built on Lovable, Bolt.new, Cursor, Replit, and V0 between Nov 2025 and Apr 2026, and (b) deliberately vulnerable scenarios on gapbench.vibe-eval.com shaped to mirror real outputs from each platform. We do not publish a single corpus N or per-platform sample counts because the underlying engagements are anonymized by design and not a uniform random sample.

Scoring. CVSS 3.1 with a fixed rubric: critical 9.0+, high 7.0–8.9. Severity is set by the scanner based on the captured exploit; every finding ships with a reproduced request and response.
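The two bands of the rubric can be written down as a small helper. This is a sketch of the thresholds stated above, not the scanner's scoring code; the function name and the `below-high` label for scores under 7.0 are placeholders of ours.

```shell
#!/bin/sh
# Map a CVSS 3.1 base score (0.0 to 10.0, at most one decimal) to the
# bands this catalog reports on.
cvss_bucket() {
  awk -v s="$1" 'BEGIN {
    if (s !~ /^(10(\.0)?|[0-9](\.[0-9])?)$/) { print "invalid"; exit 1 }
    s += 0
    if (s >= 9.0)      print "critical"
    else if (s >= 7.0) print "high"
    else               print "below-high"   # outside the catalog rubric
  }'
}

cvss_bucket 9.8   # prints: critical
cvss_bucket 7.5   # prints: high
```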

Probes. The probe set covers authentication, authorization, secret detection, transport security, input validation, and dependency CVEs. Every probe is reproducible by anyone running the same scanner against the matching gapbench scenario.

Calibration. Every probe runs against a matched clean reference (ref0, ref-rls, ref-jwt, ref-oauth, ref-webhook). A probe that fires on its reference is by construction a false positive and is killed before it ships.

Limits. This catalog covers what an authenticated DAST agent can prove from outside. It does not measure source-code vulnerabilities that never reach a reachable surface, social-engineering risk, or supply-chain compromise. Static analysis catches a different set of issues; pair this catalog with one.

Scope disclosure. Relative-frequency claims (“most-reproduced”, “highly recurring”, per-platform modal failure) are grounded in (1) the gapbench scenario set, which is fully public and curl-reproducible, and (2) anonymized customer engagements. Where you see a relative ranking but no absolute percentage, that is deliberate: the percentage would imply a uniform sample we are not in a position to publish.

Reproduce on the public benchmark

Each of the top categories maps to a live scenario on gapbench.vibe-eval.com. The detection behind each ranking in the tables above is the same detection that fires against these scenarios.

Category | gapbench scenario | Pattern walkthrough
Missing or broken RLS | supabase-clone | Supabase service-role leak
Hardcoded secrets in bundle | indie-saas, config-leak | Source maps and .git exposed
BOLA in CRUD | multi-tenant-saas, fintech-app | BOLA in AI-generated CRUD
CORS allow-all + credentials | cors-misconfig | CORS = * with credentials = true
Self-editable role / mass assignment | mass-assignment | Mass assignment
SSRF in upload / import | ssrf-image-proxy | SSRF, open redirects, OAuth redirect_uri
Open redirects in auth callbacks | oauth-redirect | SSRF, open redirects, OAuth redirect_uri
Naked databases (Postgres / Redis / Mongo) | naked-databases | Naked databases on the public internet
ref0 (clean control) | ref0 | False positives and the ref0 control

For the manifesto-level argument behind the calibration approach — and why this is the only way to read corpus-wide numbers honestly — see Why we built gapbench.

How to reproduce a single data point

  1. Pick a live URL built on one of the five platforms.
  2. Run the free token leak checker — that gives you the secrets-in-bundle data point.
  3. Run the Supabase RLS checker — that gives you the RLS data point.
  4. Run the Vibe Code Scanner on the URL — that gives you the BOLA, CORS, and rate-limit data points.

The three scanners together cover the top five categories in this catalog. The full VibeEval agent covers all 310 probes.
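The CORS data point can also be approximated by hand from response headers. The sketch below is an assumption-laden illustration, not the scanner's detection: the function name and verdict labels are ours, and it only inspects the two headers named in the comments.

```shell
#!/bin/sh
# Reads raw HTTP response headers on stdin and classifies the CORS policy
# with respect to the Origin value the probe sent ($1).
cors_verdict() {
  probe_origin=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  # Header names are case-insensitive and HTTP lines end in CRLF, so
  # lowercase everything and drop carriage returns before matching.
  headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
  allow_origin=$(printf '%s\n' "$headers" |
    sed -n 's/^access-control-allow-origin:[[:space:]]*//p' |
    sed 's/[[:space:]]*$//' | head -n 1)
  allow_creds=$(printf '%s\n' "$headers" |
    sed -n 's/^access-control-allow-credentials:[[:space:]]*//p' |
    sed 's/[[:space:]]*$//' | head -n 1)

  if [ "$allow_creds" != "true" ]; then
    echo "no-credentialed-cors"
  elif [ -n "$allow_origin" ] && [ "$allow_origin" = "$probe_origin" ]; then
    # An arbitrary origin is echoed back with credentials allowed.
    echo "reflected-origin-with-credentials"
  elif [ "$allow_origin" = "*" ]; then
    # Browsers refuse this combination for credentialed requests, but it
    # still signals a policy that was never scoped to trusted origins.
    echo "wildcard-with-credentials"
  else
    echo "restricted"
  fi
}

# Usage (network call, shown as a comment; URL is a placeholder):
# curl -s -o /dev/null -D - -H 'Origin: https://evil.example' "$URL" \
#   | cors_verdict https://evil.example
```

Sending an origin the application has no reason to trust is the point of the probe: a server that echoes it back alongside `Access-Control-Allow-Credentials: true` lets any site read authenticated responses.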

Sources and references

  • gapbench public benchmark. gapbench.vibe-eval.com — 97 deliberately vulnerable scenarios + 7 clean controls. Every failure mode in this catalog reproduces against one of the listed scenarios via curl.
  • OWASP API Security Top 10 (2023). owasp.org/API-Security — the API1 BOLA, API3 BOPLA, API5 BFLA, API8 Security Misconfiguration mappings.
  • OWASP Top 10 Web (2021). owasp.org/Top10 — A01–A10 mappings.
  • OWASP LLM Top 10 (2025). LLM07 System Prompt Leakage, LLM10 Unbounded Consumption mappings for AI-feature-specific findings.
  • CVSS 3.1. first.org/cvss/v3.1 — severity rubric.
  • CWE. cwe.mitre.org — every finding carries its primary CWE.

Citations

If you reference this catalog, please cite as:

VibeEval. How Secure Is an AI-Generated App? 2026 Failure-Mode Catalog for Lovable, Bolt, Cursor, Replit, and V0. May 2026. https://vibe-eval.com/data-studies/ai-app-security-benchmark-2026/

We refresh the catalog when new failure modes are confirmed on gapbench. The current revision is dated in the page metadata.

RUN IT YOURSELF

Each scenario below is live on the public benchmark. The commands are copy-paste ready once you substitute the placeholders ANON_KEY, USER_B_TOKEN, and USER_TOKEN. Outputs may evolve as we tune the scenarios; the bug stays.

RLS bypass: the modal Lovable failure
    curl -s 'https://gapbench.vibe-eval.com/site/supabase-clone/rest/v1/users?select=*' -H 'apikey: ANON_KEY'
    Expected: 200 with the users table (no RLS, anon role unrestricted)

Stripe sk_live_ in bundle: the modal Bolt failure
    curl -s https://gapbench.vibe-eval.com/site/indie-saas/ | grep -oE 'sk_(live|test)_[A-Za-z0-9]{20,}'
    Expected: a Stripe secret key embedded inline

BOLA: the modal Cursor failure
    curl -s https://gapbench.vibe-eval.com/site/multi-tenant-saas/api/projects/1 -H 'Authorization: Bearer USER_B_TOKEN'
    Expected: 200 with user A's project (missing ownership check)

Self-editable role: mass assignment
    curl -s -X PATCH https://gapbench.vibe-eval.com/site/mass-assignment/api/profile -H 'Authorization: Bearer USER_TOKEN' -H 'Content-Type: application/json' -d '{"is_admin":true}'
    Expected: 200; the user becomes admin via a field they should not control

Clean baseline: ref0 reports nothing
    curl -s -I https://gapbench.vibe-eval.com/site/ref0/
    Expected: the site responds normally, and scanner probes pointed at it report nothing; this is the false-positive reference
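The BOLA check above comes down to reading one status code from a cross-user request. The helper below is a rough sketch: the function names and verdict labels are ours, and treating 404 as enforcement is an assumption. A 200 still needs a look at the body to confirm the object belongs to another user.

```shell
#!/bin/sh
# Fetch only the HTTP status code of a request made with a bearer token.
# $1 = URL, $2 = token of a user who should NOT own the object.
status_as() {
  curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $2" "$1"
}

# Interpret the status code of a cross-user read.
bola_verdict() {
  case $1 in
    200)         echo "likely-bola" ;;    # another user's object was served
    401|403|404) echo "enforced" ;;       # access denied or object hidden
    *)           echo "inconclusive" ;;   # errors, redirects, rate limits
  esac
}

# Usage (network call, shown as a comment; USER_B_TOKEN is a placeholder):
# bola_verdict "$(status_as https://gapbench.vibe-eval.com/site/multi-tenant-saas/api/projects/1 "$USER_B_TOKEN")"
```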

COMMON QUESTIONS

01
How is this catalog grounded?
Every failure mode in this catalog is reproducible against a deliberately vulnerable scenario on the public gapbench benchmark (gapbench.vibe-eval.com), and every detection is calibrated against a matching clean reference (ref0, ref-rls, ref-jwt, ref-oauth, ref-webhook), so a probe that fires on a clean target is killed before it ships. The relative-frequency claims also draw on anonymized customer engagements; we do not publish corpus-wide app counts because we cannot share or independently verify a uniform sample. If you want a number you can cite, run the same scan against your own URL: a sample of one is still reproducible.
02
What counts as a critical or high-severity finding?
We use CVSS 3.1 with a fixed scoring rubric. Critical (9.0+) covers unauthenticated reads or writes against user data, exposed secret keys with billing or compute access, and remote code execution. High (7.0-8.9) covers cross-user data leakage requiring valid auth, role escalation, and exposed publishable keys with abuse paths.
03
Why is Lovable's RLS failure rate the most-reproduced across the catalog?
Two reasons. First, Lovable defaults every app to a Supabase backend with the anon key shipped to the browser, so RLS is the only authorization layer between the public internet and the database — a misconfigured policy is immediately exploitable. Second, Lovable's generator adds tables incrementally as features are added, and policy creation does not always keep pace. We see clean apps regress after a single feature is added.
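The "policy creation does not always keep pace" regression can be spotted before deployment by diffing table creation against RLS enablement in the migrations. This is a rough sketch under assumptions of ours: migrations are plain SQL files (the `supabase/migrations/` path in the usage line is the Supabase CLI's conventional location), the function name is hypothetical, and the parsing is deliberately naive.

```shell
#!/bin/sh
# Reads migration SQL on stdin and prints tables that are created but never
# get "ENABLE ROW LEVEL SECURITY". Simplifications: identifiers are
# lowercased, the "public." prefix is ignored, SQL comments are not stripped.
tables_without_rls() {
  # Lowercase and collapse whitespace so each statement matches on one line.
  sql=$(tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' ' ')

  created=$(printf '%s\n' "$sql" |
    grep -oE 'create table (if not exists )?[a-z0-9_."]+' |
    sed -E 's/^create table (if not exists )?//; s/"//g; s/^public\.//' |
    sort -u)

  enabled=$(printf '%s\n' "$sql" |
    grep -oE 'alter table (only )?[a-z0-9_."]+ enable row level security' |
    sed -E 's/^alter table (only )?//; s/ enable row level security$//; s/"//g; s/^public\.//' |
    sort -u)

  # Print every created table that has no matching ENABLE statement.
  printf '%s\n' "$created" | while IFS= read -r table; do
    [ -n "$table" ] || continue
    printf '%s\n' "$enabled" | grep -qxF -- "$table" || printf '%s\n' "$table"
  done
}

# Usage (shown as a comment):
# cat supabase/migrations/*.sql | tables_without_rls
```

This only catches tables where RLS was never enabled. A table with RLS enabled and a permissive policy such as `USING (true)` passes this check and is still readable by the anon role, so it complements a live probe and does not replace one.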
04
Is any platform secure by default?
No platform we have reviewed produces apps that are clean by default. V0 has the lightest application-layer failure rate because V0 outputs are typically component-only, but the underlying Supabase or Convex backend then carries the same RLS and credential misconfigurations. The class of vulnerability shifts; the failure mode does not disappear.
05
Is the methodology open?
Yes. The probe definitions are listed at the end of this study. The CVSS rubric is published. The platform-detection heuristics are open-source. Every failure mode is reproducible by running the same scanner against the matching gapbench scenario — the curl commands above run against live deliberately vulnerable targets. Customer-engagement findings are anonymized by design.
06
Where does VibeEval fit in this benchmark?
VibeEval is the scanner that produced these findings. The benchmark exists because no other tool catches this specific class of vulnerability — static SAST tools don't see RLS, dependency scanners don't see auth gaps, traditional DAST tools don't understand Supabase or Firebase. The product is the methodology made repeatable for one URL at a time.
07
How do you keep false positives down at 310 probes?
Every probe is calibrated against ref0 and four topic-specific clean references — ref-rls, ref-jwt, ref-oauth, ref-webhook — that we publish on gapbench.vibe-eval.com. Any probe that fires on a clean reference is, by construction, a false positive and gets killed before it ships. The reference sites are the ground truth most heuristic scanners do not have. Every finding count in this benchmark is net of that calibration.
08
Where can I run the same probes against a deliberately vulnerable target?
https://gapbench.vibe-eval.com/ — the public benchmark we operate. 104 scenarios, 97 deliberately vulnerable plus 7 clean / calibration controls. Run the scanner against a scenario, see the finding; run it against ref0, see no finding. That is the credibility test most security claims cannot pass.
09
What CWE / OWASP categories does the benchmark cover?
OWASP Web Top 10 2021 (A01–A10) and OWASP API Top 10 2023 (API1–API10) are the primary mappings. OWASP LLM Top 10 2025 covers AI-feature-specific findings. CWE coverage spans the CWE-200 series (information exposure), CWE-284, 639, 862, and 863 (access control), CWE-287 and 345 (auth and trust), CWE-540 and 798 (credentials and config), CWE-918 (SSRF), and CWE-352 (CSRF). Every finding in the report carries its CWE and OWASP tags.

BENCHMARK YOUR OWN APP

Run the same scan against your URL. Report in under 60 seconds.
