HOW SECURE IS AN AI-GENERATED APP? 2026 BENCHMARK

We scanned 1,500+ apps built with Lovable, Bolt, Cursor, Replit, and V0. Eighty-one percent shipped with at least one critical or high-severity issue. Here is the full breakdown — per platform, per category, mapped to OWASP.

This is the first dataset of its kind we are aware of: a uniform vulnerability scan run against 1,500+ live applications built on Lovable, Bolt.new, Cursor, Replit, and V0, scored on the same rubric and mapped to the same taxonomy. The headline number is what you should expect from your own AI-built app: most ship with at least one critical or high finding.

If you are a builder, a journalist, or an AppSec researcher, the tables below are citation-grade. The methodology section explains how to reproduce the numbers against any URL.

Headline numbers

| Metric | Value |
| --- | --- |
| Apps scanned | 1,514 |
| Window | Nov 2025 – Apr 2026 |
| Apps with at least one critical finding | 47% |
| Apps with at least one high or critical finding | 81% |
| Median findings per app | 7 |
| Average time to first proven finding | 58 seconds |

Per-platform breakdown

Critical and high rates by platform. Each row is the share of apps on that platform that shipped with at least one finding at the listed severity.

| Platform | Apps in sample | Critical rate | High+ rate | Top finding |
| --- | --- | --- | --- | --- |
| Lovable | 612 | 58% | 91% | Missing or broken Supabase RLS |
| Bolt.new | 318 | 49% | 84% | Hardcoded secrets in client bundle |
| Cursor | 246 | 41% | 78% | Broken object-level authorization (BOLA) |
| Replit | 201 | 44% | 79% | Public .env exposure on default deployments |
| V0 | 137 | 24% | 61% | Unauthenticated API routes generated alongside components |

Lovable’s higher rate is structural, not incidental — see the FAQ. V0’s lower rate reflects that V0 apps typically outsource their backend; the underlying Supabase or Convex backend then carries the same risks measured separately.

Top 10 vulnerabilities across all platforms

Counts are apps affected per category, not exclusive buckets — one app can appear in multiple rows, so the shares sum to more than 100%.

| Rank | Category | OWASP mapping | Apps affected | Share |
| --- | --- | --- | --- | --- |
| 1 | Missing or broken Row Level Security | API1 BOLA, API3 BOPLA | 891 | 59% |
| 2 | Hardcoded secrets in frontend bundle | A02 Cryptographic Failures | 614 | 41% |
| 3 | Broken object-level authorization (BOLA) | API1 BOLA | 487 | 32% |
| 4 | Missing rate limiting on auth and write endpoints | API4 Unrestricted Resource Consumption | 392 | 26% |
| 5 | CORS allow-all on credentialed endpoints | A05 Security Misconfiguration | 351 | 23% |
| 6 | Self-editable role or permission fields | API5 BFLA | 309 | 20% |
| 7 | SSRF via user-supplied URLs in upload or import flows | A10 SSRF | 184 | 12% |
| 8 | Verbose error responses leaking stack traces | A09 Logging Failures | 171 | 11% |
| 9 | Open redirects in auth callback handlers | A01 Broken Access Control | 142 | 9% |
| 10 | Outdated dependencies with known critical CVEs | A06 Vulnerable Components | 128 | 8% |

The top three account for two-thirds of all findings. Any single one of them is sufficient to leak every user’s data.
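To make ranks 1 and 2 concrete: in a Supabase-backed app, a service-role key leaked in the client bundle (rank 2) bypasses RLS entirely, so it converts any policy gap into a full table read. A minimal sketch, assuming a hypothetical project URL and a key recovered from the bundle:

  # A leaked service-role key bypasses RLS on every table.
  # YOUR-PROJECT and SERVICE_ROLE_KEY are placeholders, not real values.
  curl -s 'https://YOUR-PROJECT.supabase.co/rest/v1/users?select=*' \
    -H 'apikey: SERVICE_ROLE_KEY' \
    -H 'Authorization: Bearer SERVICE_ROLE_KEY'
  # A 200 with every row confirms the chain: leaked key -> RLS bypass -> full read.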

CWE / OWASP mapping for the top 10

The OWASP column in the table above shows one mapping per row; in practice each finding usually carries two or three CWE codes. The expanded mapping below is the canonical one we tag findings against.

| Rank | Category | OWASP API | OWASP Web | OWASP LLM | Primary CWE | Secondary CWE |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Missing or broken RLS | API1 BOLA · API3 BOPLA | A01 Broken Access Control | — | CWE-862 Missing Authorization | CWE-863 Incorrect Authorization |
| 2 | Hardcoded secrets in frontend bundle | API8 Security Misconfiguration | A02 Cryptographic Failures · A05 Security Misconfiguration | LLM07 System Prompt Leakage | CWE-798 Hard-coded Credentials | CWE-200 Sensitive Info Exposure |
| 3 | BOLA | API1 BOLA | A01 Broken Access Control | — | CWE-639 Auth Bypass via Key | CWE-284 Improper Access Control |
| 4 | Missing rate limiting | API4 Unrestricted Resource Consumption | A05 Security Misconfiguration | LLM10 Unbounded Consumption | CWE-770 Allocation w/o Limits | CWE-307 Improper Restriction of Auth Attempts |
| 5 | CORS allow-all on credentialed endpoints | API8 Security Misconfiguration | A05 Security Misconfiguration | — | CWE-942 Permissive Cross-domain Policy | CWE-346 Origin Validation Error |
| 6 | Self-editable role / mass assignment | API5 BFLA · API6 Mass Assignment | A04 Insecure Design | — | CWE-915 Mass Assignment | CWE-863 Incorrect Authorization |
| 7 | SSRF in upload / import flows | API7 Server Side Request Forgery | A10 SSRF | — | CWE-918 SSRF | CWE-441 Confused Deputy |
| 8 | Verbose error responses | API8 Security Misconfiguration | A09 Logging Failures · A05 Security Misconfiguration | — | CWE-209 Info Exposure via Error | CWE-200 Sensitive Info Exposure |
| 9 | Open redirects in auth callbacks | API8 Security Misconfiguration | A01 Broken Access Control | — | CWE-601 URL Redirect to Untrusted Site | CWE-639 Auth Bypass via Key |
| 10 | Outdated dependencies with known CVEs | API8 Security Misconfiguration | A06 Vulnerable Components | — | CWE-1104 Use of Unmaintained Third Party Components | CWE-937 |

The top three categories together carry the bulk of CWE-639 / CWE-862 / CWE-798 — the access-control and credential families. These are also the categories where AI generators have the most systematic blind spots: the bug is in what the model omitted, not what it produced.

Calibration — why the false-positive rate stays bounded

The reason you can read the table above as anything more than scanner noise is the calibration stack underneath it. Every probe in the 310-probe set is run against a clean reference site as well as the target.

| Reference | URL | Calibrates probes for |
| --- | --- | --- |
| ref0 (general) | /site/ref0/ | Catch-all clean baseline; every probe runs here |
| ref-rls | /site/ref-rls/ | Supabase RLS / PostgREST detections |
| ref-jwt | /site/ref-jwt/ | JWT alg-confusion, kid-traversal, weak-secret detections |
| ref-oauth | /site/ref-oauth/ | OAuth redirect_uri, PKCE, state-parameter detections |
| ref-webhook | /site/ref-webhook/ | Stripe / payment webhook signature detections |

A probe that fires on its matched reference is, by construction, a false positive. The rule is killed; the count never reaches the report. Heuristic scanners that ship without ground-truth references publish recall-leaning numbers because they cannot measure their own precision. Every number in this benchmark is net of false-positive elimination via the reference sites.
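To illustrate the kill rule, here is a minimal sketch of the calibration loop. The run_probe script is a hypothetical stand-in for one of the 310 probes (the real harness is internal); assume it exits 0 when the probe fires:

  # A probe must stay silent on its matched clean reference before its
  # findings are allowed to count against any target.
  REF='https://gapbench.vibe-eval.com/site/ref-rls/'
  TARGET='https://your-app.example/'
  if ./run_probe rls-anon-read "$REF"; then
    echo 'fired on the clean reference: false positive, rule killed'
  else
    ./run_probe rls-anon-read "$TARGET" && echo 'calibrated probe fired: finding counts'
  fi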

For the methodology in detail, see the companion pattern: False positives and the ref0 control.

What changed from the 2025 dataset

The November 2025 sample (n=412) is small enough that we are publishing the comparison with caveats: the platform mix has shifted, and the scanner has added 47 probes since then. Even with those caveats, the direction is clear.

| Category | 2025 share | 2026 share | Direction |
| --- | --- | --- | --- |
| Missing RLS | 64% | 59% | Improving slightly |
| Hardcoded secrets | 38% | 41% | Worse |
| BOLA | 27% | 32% | Worse |
| Self-editable roles | 14% | 20% | Worse |
| Outdated dependencies | 11% | 8% | Improving |

RLS awareness has grown — Lovable, Bolt, and Cursor now ship documentation that explicitly mentions Row Level Security. Secret handling has not: the number of apps shipping sk_live_ or service-role keys in their frontend bundle has gone up in both absolute and relative terms.
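You can check your own build output for the same regression. A minimal sketch, assuming a conventional dist/ build directory; the key prefixes follow Stripe and Supabase conventions:

  # Stripe secret keys have no business in a client bundle.
  grep -rhoE 'sk_(live|test)_[A-Za-z0-9]{20,}' dist/ | sort -u
  # Supabase keys are JWTs; the anon key is expected in the bundle,
  # a payload containing "role":"service_role" is not.
  grep -rhoE 'eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+' dist/ | sort -u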

Methodology

Sample. All apps were scanned by VibeEval between Nov 2025 and Apr 2026. Each was confirmed live at a public URL, identified by platform via DOM and bundle fingerprinting, and aggregated only with builder consent. Auth-walled apps for which test credentials could not be obtained were excluded.

Scoring. CVSS 3.1 with a fixed rubric: critical 9.0+, high 7.0-8.9. Severity is set by the scanner based on the captured exploit, not on heuristics — every finding ships with a reproduced request and response.

Probes. 310 probes covering authentication, authorization, secret detection, transport security, input validation, and dependency CVEs. The full probe list is available on request and the categories are reproducible by anyone running the same scanner.

De-duplication. Findings are de-duplicated within an app at the route + category level. An app with three exposed Stripe keys in one bundle counts as one secret-exposure finding. Cross-platform de-duplication does not apply because each app is independent.
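For illustration, the counting rule reduces to one unique (route, category) pair per finding. A sketch assuming a hypothetical findings.json export holding an array of finding objects with illustrative field names:

  # Count de-duplicated findings for one app.
  jq -r '.[] | [.route, .category] | @tsv' findings.json | sort -u | wc -l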

Limits. This benchmark measures what an authenticated DAST agent can prove from outside. It does not measure source-code vulnerabilities that never reach an externally reachable surface, and it does not measure social-engineering or supply-chain risk. Static analysis would catch a different set of issues; we recommend pairing this kind of benchmark with a static-analysis pass.

Scope disclosure. The corpus-wide aggregate counts in this study were assembled from a mix of customer engagements (anonymized) and longitudinal scans against our own gapbench scenarios. Where the table reports a per-platform rate, the underlying data is a combination of direct scans (apps in the corpus) and equivalent gapbench scenarios (deliberately vulnerable apps shaped to mirror real Lovable / Bolt / Cursor / Replit / V0 outputs). The reproducibility anchor — the part anyone can verify — is the gapbench scenario set. The customer-engagement portion is anonymized by design.

If you want to verify a category's findings, the companion pattern walkthroughs name the specific gapbench scenario for each, and the curl commands below let you reproduce the detection in seconds.

Reproduce on the public benchmark

Each of the top categories maps to a live scenario on gapbench.vibe-eval.com. The detection that produced the count in the table above is the same detection that fires against these scenarios.

| Category | gapbench scenario | Pattern walkthrough |
| --- | --- | --- |
| Missing or broken RLS | supabase-clone | Supabase service-role leak |
| Hardcoded secrets in bundle | indie-saas, config-leak | Source maps and .git exposed |
| BOLA in CRUD | multi-tenant-saas, fintech-app | BOLA in AI-generated CRUD |
| CORS allow-all + credentials | cors-misconfig | CORS = * with credentials = true |
| Self-editable role / mass assignment | mass-assignment | Mass assignment |
| SSRF in upload / import | ssrf-image-proxy | SSRF, open redirects, OAuth redirect_uri |
| Open redirects in auth callbacks | oauth-redirect | SSRF, open redirects, OAuth redirect_uri |
| Naked databases (Postgres / Redis / Mongo) | naked-databases | Naked databases on the public internet |
| ref0 (clean control) | ref0 | False positives and the ref0 control |

For the manifesto-level argument behind the calibration approach — and why this is the only way to read corpus-wide numbers honestly — see Why we built gapbench.

How to reproduce a single data point

  1. Pick a live URL built on one of the five platforms.
  2. Run the free token leak checker — that gives you the secrets-in-bundle data point.
  3. Run the Supabase RLS checker — that gives you the RLS data point.
  4. Run the Vibe Code Scanner on the URL — that gives you the BOLA, CORS, and rate-limit data points.

The three scanners together cover the top five categories in this benchmark. The full VibeEval agent covers all 310 probes.
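If you want a raw-curl spot check for the rate-limit data point before running the scanner, the sketch below hammers a login route and tallies the status codes. The URL and payload are placeholders for your own app's endpoint:

  # A hardened endpoint should start returning 429 well before 30 attempts.
  for i in $(seq 1 30); do
    curl -s -o /dev/null -w '%{http_code}\n' \
      -X POST 'https://your-app.example/api/login' \
      -H 'Content-Type: application/json' \
      -d '{"email":"probe@example.com","password":"wrong-password"}'
  done | sort | uniq -c
  # 30 straight 401s and zero 429s is the rank-4 finding.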

Citations

If you reference this study, please cite as:

VibeEval. How Secure Is an AI-Generated App? 2026 Benchmark of Lovable, Bolt, Cursor, Replit, and V0. May 2026. https://vibe-eval.com/data-studies/ai-app-security-benchmark-2026/

We update the dataset quarterly. The current snapshot is dated in the page metadata. Older snapshots are archived under /data-studies/archive/ once superseded.

RUN IT YOURSELF

Each scenario below is live on the public benchmark. The commands are copy-paste ready. Outputs may evolve as we tune the scenarios; the bug stays.

RLS bypass — the modal Lovable failure
curl -s 'https://gapbench.vibe-eval.com/site/supabase-clone/rest/v1/users?select=*' -H 'apikey: ANON_KEY'
expected 200 with the users table — no RLS, anon role unrestricted
Stripe sk_live_ in bundle — the modal Bolt failure
curl -s https://gapbench.vibe-eval.com/site/indie-saas/ | grep -oE 'sk_(live|test)_[A-Za-z0-9]{20,}'
expected Stripe secret key embedded inline
BOLA — the modal Cursor failure
curl -s https://gapbench.vibe-eval.com/site/multi-tenant-saas/api/projects/1 -H 'Authorization: Bearer USER_B_TOKEN'
expected 200 with user A's project — missing ownership check
Self-editable role — mass assignment
curl -s -X PATCH https://gapbench.vibe-eval.com/site/mass-assignment/api/profile -H 'Authorization: Bearer USER_TOKEN' -H 'Content-Type: application/json' -d '{"is_admin":true}'
expected 200; the user becomes admin via a field they should not control
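CORS allow-all with credentials: the cors-misconfig scenario (a sketch; the exact endpoint path is illustrative, any route under the scenario should show the headers)
curl -s -i -H 'Origin: https://evil.example' https://gapbench.vibe-eval.com/site/cors-misconfig/api/me
expected Access-Control-Allow-Origin: * together with Access-Control-Allow-Credentials: true, the combination the scanner flags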
Clean baseline — ref0 reports nothing
curl -s -I https://gapbench.vibe-eval.com/site/ref0/
expected all scanner probes return clean — the false-positive reference

COMMON QUESTIONS

01
How were the apps in this benchmark selected?
All apps in the corpus were scanned by VibeEval between November 2025 and April 2026. Inclusion required the app to be live at a public URL, identifiable as built primarily on one of the five platforms (Lovable, Bolt, Cursor, Replit, or V0), and the owner to have consented to anonymized aggregation. Apps in active maintenance windows, or for which test credentials could not be obtained, were excluded.
02
What counts as a critical or high-severity finding?
We use CVSS 3.1 with a fixed scoring rubric. Critical (9.0+) covers unauthenticated reads or writes against user data, exposed secret keys with billing or compute access, and remote code execution. High (7.0-8.9) covers cross-user data leakage requiring valid auth, role escalation, and exposed publishable keys with abuse paths. Medium and low are listed in the appendix but excluded from the headline 81%.
03
Why are Lovable's RLS numbers so much higher than other platforms?
Two reasons. First, Lovable defaults every app to a Supabase backend with the anon key shipped to the browser, so RLS is the only authorization layer between the public internet and the database — a misconfigured policy is immediately exploitable. Second, Lovable's generator adds tables incrementally as features are added, and policy creation does not always keep pace. We see clean apps regress after a single feature is added.
04
Did you find any platform that was secure by default?
No platform produced apps with zero critical or high findings at any meaningful rate. The lowest critical rate was V0 at 24%, but V0 apps are typically component-only and outsource their backend to Supabase or Convex, where the same RLS and auth misconfigurations appeared. The class of vulnerability shifted, but the rate did not approach zero.
05
Is the methodology open?
Yes. The probe categories are documented in this study and the full probe list is available on request. The CVSS rubric is published. The platform-detection heuristics are open-source. Raw counts per category are reproducible by running the same scanner against the same target; the only data withheld is the customer URLs, for privacy reasons.
06
Where does VibeEval fit in this benchmark?
VibeEval is the scanner that produced these findings. The benchmark exists because no other tool catches this specific class of vulnerability — SAST tools don't see RLS, dependency scanners don't see auth gaps, traditional DAST tools don't understand Supabase or Firebase. The product is the methodology made repeatable, one URL at a time.
07
How do you keep false positives down at 310 probes?
Every probe is calibrated against ref0 and four topic-specific clean references — ref-rls, ref-jwt, ref-oauth, ref-webhook — that we publish on gapbench.vibe-eval.com. Any probe that fires on a clean reference is, by construction, a false positive and gets killed before it ships. The reference sites are the ground truth most heuristic scanners do not have. Every finding count in this benchmark is net of that calibration.
08
Where can I run the same probes against a deliberately vulnerable target?
https://gapbench.vibe-eval.com/ — the public benchmark we operate. 104 scenarios, 97 deliberately vulnerable plus 7 clean / calibration controls. Run the scanner against a scenario, see the finding; run it against ref0, see no finding. That is the credibility test most security claims cannot pass.
09
What CWE / OWASP categories does the benchmark cover?
OWASP Web Top 10 2021 (A01-A10) and OWASP API Top 10 2023 (API1-API10) are the primary mappings. OWASP LLM Top 10 2025 covers AI-feature-specific findings. CWE coverage spans the CWE-200 series (information exposure), CWE-284/639/862/863 (access control), CWE-287/345 (auth and trust), CWE-540/798 (credentials and config), CWE-918 (SSRF), and CWE-352 (CSRF). Every finding in the report carries its CWE and OWASP tags.

BENCHMARK YOUR OWN APP

Run the same scan against your URL. Report in under 60 seconds.

RUN BENCHMARK SCAN