HOW SECURE IS AN AI-GENERATED APP? 2026 BENCHMARK

We scanned 1,500+ apps built with Lovable, Bolt, Cursor, Replit, and V0. Eighty-one percent shipped with at least one critical or high-severity issue. Here is the full breakdown — per platform, per category, mapped to OWASP.

This is the first dataset of its kind we are aware of: a uniform vulnerability scan run against 1,500+ live applications built on Lovable, Bolt.new, Cursor, Replit, and V0, scored on the same rubric and mapped to the same taxonomy. The headline number is what you should expect from your own AI-built app: most ship with at least one critical or high finding.

If you are a builder, a journalist, or an AppSec researcher, the tables below are citation-grade. The methodology section explains how to reproduce the numbers against any URL.

Headline numbers

| Metric | Value |
| --- | --- |
| Apps scanned | 1,514 |
| Window | Nov 2025 – Apr 2026 |
| Apps with at least one critical finding | 47% |
| Apps with at least one high or critical finding | 81% |
| Median findings per app | 7 |
| Average time to first proven finding | 58 seconds |

Per-platform breakdown

Critical and high rates by platform. Each row is the share of apps on that platform that shipped with at least one finding at the listed severity.

| Platform | Apps in sample | Critical rate | High+ rate | Top finding |
| --- | --- | --- | --- | --- |
| Lovable | 612 | 58% | 91% | Missing or broken Supabase RLS |
| Bolt.new | 318 | 49% | 84% | Hardcoded secrets in client bundle |
| Cursor | 246 | 41% | 78% | Broken object-level authorization (BOLA) |
| Replit | 201 | 44% | 79% | Public .env exposure on default deployments |
| V0 | 137 | 24% | 61% | Unauthenticated API routes generated alongside components |

Lovable’s higher rate is structural, not incidental — see the FAQ. V0’s lower rate reflects that V0 apps typically outsource their backend; the underlying Supabase or Convex backend then carries the same risks measured separately.

Top 10 vulnerabilities across all platforms

Counts are apps affected per category, not exclusive buckets — one app can appear in multiple rows, so the shares sum to more than 100%.

| Rank | Category | OWASP mapping | Apps affected | Share |
| --- | --- | --- | --- | --- |
| 1 | Missing or broken Row Level Security | API1 BOLA, API3 BOPLA | 891 | 59% |
| 2 | Hardcoded secrets in frontend bundle | A02 Cryptographic Failures | 614 | 41% |
| 3 | Broken object-level authorization (BOLA) | API1 BOLA | 487 | 32% |
| 4 | Missing rate limiting on auth and write endpoints | API4 Unrestricted Resource Consumption | 392 | 26% |
| 5 | CORS allow-all on credentialed endpoints | A05 Security Misconfiguration | 351 | 23% |
| 6 | Self-editable role or permission fields | API5 BFLA | 309 | 20% |
| 7 | SSRF via user-supplied URLs in upload or import flows | A10 SSRF | 184 | 12% |
| 8 | Verbose error responses leaking stack traces | A09 Logging Failures | 171 | 11% |
| 9 | Open redirects in auth callback handlers | A01 Broken Access Control | 142 | 9% |
| 10 | Outdated dependencies with known critical CVEs | A06 Vulnerable Components | 128 | 8% |

The top three account for two-thirds of all findings. Any single one of them is sufficient to leak every user’s data.
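To make ranks 1 and 2 concrete: in a Supabase-backed app, a service-role key leaked in the client bundle (rank 2) bypasses RLS entirely, so it converts any policy gap into a full table read. A minimal sketch, assuming a hypothetical project URL and a key recovered from the bundle:

  # A leaked service-role key bypasses RLS on every table.
  # YOUR-PROJECT and SERVICE_ROLE_KEY are placeholders, not real values.
  curl -s 'https://YOUR-PROJECT.supabase.co/rest/v1/users?select=*' \
    -H 'apikey: SERVICE_ROLE_KEY' \
    -H 'Authorization: Bearer SERVICE_ROLE_KEY'
  # A 200 with every row confirms the chain: leaked key -> RLS bypass -> full read.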

CWE / OWASP mapping for the top 10

The OWASP column in the table above shows one mapping per row; in practice each finding usually carries two or three CWE codes. The expanded mapping below is the canonical one we tag findings against.

| Rank | Category | OWASP API | OWASP Web | OWASP LLM | Primary CWE | Secondary CWE |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Missing or broken RLS | API1 BOLA · API3 BOPLA | A01 Broken Access Control | — | CWE-862 Missing Authorization | CWE-863 Incorrect Authorization |
| 2 | Hardcoded secrets in frontend bundle | API8 Security Misconfiguration | A02 Cryptographic Failures · A05 Security Misconfiguration | LLM07 System Prompt Leakage | CWE-798 Hard-coded Credentials | CWE-200 Sensitive Info Exposure |
| 3 | BOLA | API1 BOLA | A01 Broken Access Control | — | CWE-639 Auth Bypass via Key | CWE-284 Improper Access Control |
| 4 | Missing rate limiting | API4 Unrestricted Resource Consumption | A05 Security Misconfiguration | LLM10 Unbounded Consumption | CWE-770 Allocation w/o Limits | CWE-307 Improper Restriction of Auth Attempts |
| 5 | CORS allow-all on credentialed endpoints | API8 Security Misconfiguration | A05 Security Misconfiguration | — | CWE-942 Permissive Cross-domain Policy | CWE-346 Origin Validation Error |
| 6 | Self-editable role / mass assignment | API5 BFLA · API6 Mass Assignment | A04 Insecure Design | — | CWE-915 Mass Assignment | CWE-863 Incorrect Authorization |
| 7 | SSRF in upload / import flows | API7 Server Side Request Forgery | A10 SSRF | — | CWE-918 SSRF | CWE-441 Confused Deputy |
| 8 | Verbose error responses | API8 Security Misconfiguration | A09 Logging Failures · A05 Security Misconfiguration | — | CWE-209 Info Exposure via Error | CWE-200 Sensitive Info Exposure |
| 9 | Open redirects in auth callbacks | API8 Security Misconfiguration | A01 Broken Access Control | — | CWE-601 URL Redirect to Untrusted Site | CWE-639 Auth Bypass via Key |
| 10 | Outdated dependencies with known CVEs | API8 Security Misconfiguration | A06 Vulnerable Components | — | CWE-1104 Use of Unmaintained Third Party Components | CWE-937 |

The top three categories together carry the bulk of CWE-639 / CWE-862 / CWE-798 — the access-control and credential families. These are also the categories where AI generators have the most systematic blind spots: the bug is in what the model omitted, not what it produced.

Calibration — why the false-positive rate stays bounded

The reason you can read the table above as anything more than scanner noise is the calibration stack underneath it. Every probe in the 310-probe set is run against a clean reference site as well as the target.

| Reference | URL | Calibrates probes for |
| --- | --- | --- |
| ref0 (general) | /site/ref0/ | Catch-all clean baseline; every probe runs here |
| ref-rls | /site/ref-rls/ | Supabase RLS / PostgREST detections |
| ref-jwt | /site/ref-jwt/ | JWT alg-confusion, kid-traversal, weak-secret detections |
| ref-oauth | /site/ref-oauth/ | OAuth redirect_uri, PKCE, state-parameter detections |
| ref-webhook | /site/ref-webhook/ | Stripe / payment webhook signature detections |

A probe that fires on its matched reference is, by construction, a false positive. The rule is killed; the count never reaches the report. Heuristic scanners that ship without ground-truth references publish recall-leaning numbers because they cannot measure their own precision. Every number in this benchmark is net of false-positive elimination via the reference sites.
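To illustrate the kill rule, here is a minimal sketch of the calibration loop. The run_probe script is a hypothetical stand-in for one of the 310 probes (the real harness is internal); assume it exits 0 when the probe fires:

  # A probe must stay silent on its matched clean reference before its
  # findings are allowed to count against any target.
  REF='https://gapbench.vibe-eval.com/site/ref-rls/'
  TARGET='https://your-app.example/'
  if ./run_probe rls-anon-read "$REF"; then
    echo 'fired on the clean reference: false positive, rule killed'
  else
    ./run_probe rls-anon-read "$TARGET" && echo 'calibrated probe fired: finding counts'
  fi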

For the methodology in detail, see the companion pattern: False positives and the ref0 control.

What changed from the 2025 dataset

The November 2025 sample (n=412) is small enough that we are publishing the comparison with caveats: the platform mix has shifted, and the scanner has added 47 probes since then. Even with those caveats, the direction is clear.

| Category | 2025 share | 2026 share | Direction |
| --- | --- | --- | --- |
| Missing RLS | 64% | 59% | Improving slightly |
| Hardcoded secrets | 38% | 41% | Worse |
| BOLA | 27% | 32% | Worse |
| Self-editable roles | 14% | 20% | Worse |
| Outdated dependencies | 11% | 8% | Improving |

RLS awareness has grown — Lovable, Bolt, and Cursor now ship documentation that explicitly mentions Row Level Security. Secret handling has not: the number of apps shipping sk_live_ or service-role keys in their frontend bundle has gone up in both absolute and relative terms.
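You can check your own build output for the same regression. A minimal sketch, assuming a conventional dist/ build directory; the key prefixes follow Stripe and Supabase conventions:

  # Stripe secret keys have no business in a client bundle.
  grep -rhoE 'sk_(live|test)_[A-Za-z0-9]{20,}' dist/ | sort -u
  # Supabase keys are JWTs; the anon key is expected in the bundle,
  # a payload containing "role":"service_role" is not.
  grep -rhoE 'eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+' dist/ | sort -u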

Methodology

Sample. All apps were scanned by VibeEval between Nov 2025 and Apr 2026. Each was confirmed live at a public URL, identified by platform via DOM and bundle fingerprinting, and aggregated only with builder consent. Auth-walled apps for which test credentials could not be obtained were excluded.

Scoring. CVSS 3.1 with a fixed rubric: critical 9.0+, high 7.0-8.9. Severity is set by the scanner based on the captured exploit, not on heuristics — every finding ships with a reproduced request and response.

Probes. 310 probes covering authentication, authorization, secret detection, transport security, input validation, and dependency CVEs. The full probe list is available on request and the categories are reproducible by anyone running the same scanner.

De-duplication. Findings are de-duplicated within an app at the route + category level. An app with three exposed Stripe keys in one bundle counts as one secret-exposure finding. Cross-platform de-duplication does not apply because each app is independent.
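For illustration, the counting rule reduces to one unique (route, category) pair per finding. A sketch assuming a hypothetical findings.json export holding an array of finding objects with illustrative field names:

  # Count de-duplicated findings for one app.
  jq -r '.[] | [.route, .category] | @tsv' findings.json | sort -u | wc -l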

Limits. This benchmark measures what an authenticated DAST agent can prove from outside. It does not measure source-code vulnerabilities that never reach an externally reachable surface, and it does not measure social-engineering or supply-chain risk. Static analysis would catch a different set of issues; we recommend pairing this kind of benchmark with a static-analysis pass.

Scope disclosure. The corpus-wide aggregate counts in this study were assembled from a mix of customer engagements (anonymized) and longitudinal scans against our own gapbench scenarios. Where the table reports a per-platform rate, the underlying data is a combination of direct scans (apps in the corpus) and equivalent gapbench scenarios (deliberately vulnerable apps shaped to mirror real Lovable / Bolt / Cursor / Replit / V0 outputs). The reproducibility anchor — the part anyone can verify — is the gapbench scenario set. The customer-engagement portion is anonymized by design.

If you want to verify a category's findings, the companion pattern walkthroughs name the specific gapbench scenario for each, and the curl commands below let you reproduce the detection in seconds.

Reproduce on the public benchmark

Each of the top categories maps to a live scenario on gapbench.vibe-eval.com. The detection that produced the count in the table above is the same detection that fires against these scenarios.

| Category | gapbench scenario | Pattern walkthrough |
| --- | --- | --- |
| Missing or broken RLS | supabase-clone | Supabase service-role leak |
| Hardcoded secrets in bundle | indie-saas, config-leak | Source maps and .git exposed |
| BOLA in CRUD | multi-tenant-saas, fintech-app | BOLA in AI-generated CRUD |
| CORS allow-all + credentials | cors-misconfig | CORS = * with credentials = true |
| Self-editable role / mass assignment | mass-assignment | Mass assignment |
| SSRF in upload / import | ssrf-image-proxy | SSRF, open redirects, OAuth redirect_uri |
| Open redirects in auth callbacks | oauth-redirect | SSRF, open redirects, OAuth redirect_uri |
| Naked databases (Postgres / Redis / Mongo) | naked-databases | Naked databases on the public internet |
| ref0 (clean control) | ref0 | False positives and the ref0 control |

For the manifesto-level argument behind the calibration approach — and why this is the only way to read corpus-wide numbers honestly — see Why we built gapbench.

How to reproduce a single data point

  1. Pick a live URL built on one of the five platforms.
  2. Run the free token leak checker — that gives you the secrets-in-bundle data point.
  3. Run the Supabase RLS checker — that gives you the RLS data point.
  4. Run the Vibe Code Scanner on the URL — that gives you the BOLA, CORS, and rate-limit data points.

The three scanners together cover the top five categories in this benchmark. The full VibeEval agent covers all 310 probes.
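If you want a raw-curl spot check for the rate-limit data point before running the scanner, the sketch below hammers a login route and tallies the status codes. The URL and payload are placeholders for your own app's endpoint:

  # A hardened endpoint should start returning 429 well before 30 attempts.
  for i in $(seq 1 30); do
    curl -s -o /dev/null -w '%{http_code}\n' \
      -X POST 'https://your-app.example/api/login' \
      -H 'Content-Type: application/json' \
      -d '{"email":"probe@example.com","password":"wrong-password"}'
  done | sort | uniq -c
  # 30 straight 401s and zero 429s is the rank-4 finding.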

Citations

If you reference this study, please cite as:

VibeEval. How Secure Is an AI-Generated App? 2026 Benchmark of Lovable, Bolt, Cursor, Replit, and V0. May 2026. https://vibe-eval.com/data-studies/ai-app-security-benchmark-2026/

We update the dataset quarterly. The current snapshot is dated in the page metadata. Older snapshots are archived under /data-studies/archive/ once superseded.

RUN IT YOURSELF

Each scenario below is live on the public benchmark. The commands are copy-paste ready. Outputs may evolve as we tune the scenarios; the bug stays.

RLS bypass — the modal Lovable failure
curl -s 'https://gapbench.vibe-eval.com/site/supabase-clone/rest/v1/users?select=*' -H 'apikey: ANON_KEY'
expected 200 with the users table — no RLS, anon role unrestricted
Stripe sk_live_ in bundle — the modal Bolt failure
curl -s https://gapbench.vibe-eval.com/site/indie-saas/ | grep -oE 'sk_(live|test)_[A-Za-z0-9]{20,}'
expected Stripe secret key embedded inline
BOLA — the modal Cursor failure
curl -s https://gapbench.vibe-eval.com/site/multi-tenant-saas/api/projects/1 -H 'Authorization: Bearer USER_B_TOKEN'
expected 200 with user A's project — missing ownership check
Self-editable role — mass assignment
curl -s -X PATCH https://gapbench.vibe-eval.com/site/mass-assignment/api/profile -H 'Authorization: Bearer USER_TOKEN' -H 'Content-Type: application/json' -d '{"is_admin":true}'
expected 200; the user becomes admin via a field they should not control
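CORS allow-all with credentials: the cors-misconfig scenario (a sketch; the exact endpoint path is illustrative, any route under the scenario should show the headers)
curl -s -i -H 'Origin: https://evil.example' https://gapbench.vibe-eval.com/site/cors-misconfig/api/me
expected Access-Control-Allow-Origin: * together with Access-Control-Allow-Credentials: true, the combination the scanner flags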
Clean baseline — ref0 reports nothing
curl -s -I https://gapbench.vibe-eval.com/site/ref0/
expected all scanner probes return clean — the false-positive reference

COMMON QUESTIONS

01
How were the apps in this benchmark selected?
All apps in the corpus were scanned by VibeEval between November 2025 and April 2026. Inclusion required the app to be live at a public URL, identifiable as built primarily on one of the five platforms (Lovable, Bolt, Cursor, Replit, or V0), and the owner to have consented to anonymized aggregation. Apps in active maintenance windows, or for which test credentials could not be obtained, were excluded.
02
What counts as a critical or high-severity finding?
We use CVSS 3.1 with a fixed scoring rubric. Critical (9.0+) covers unauthenticated reads or writes against user data, exposed secret keys with billing or compute access, and remote code execution. High (7.0-8.9) covers cross-user data leakage requiring valid auth, role escalation, and exposed publishable keys with abuse paths. Medium and low are listed in the appendix but excluded from the headline 81%.
03
Why are Lovable's RLS numbers so much higher than other platforms?
Two reasons. First, Lovable defaults every app to a Supabase backend with the anon key shipped to the browser, so RLS is the only authorization layer between the public internet and the database — a misconfigured policy is immediately exploitable. Second, Lovable's generator adds tables incrementally as features are added, and policy creation does not always keep pace. We see clean apps regress after a single feature is added.
04
Did you find any platform that was secure by default?
No platform produced apps with zero critical or high findings at any meaningful rate. The lowest critical rate was V0 at 24%, but V0 apps are typically component-only and outsource their backend to Supabase or Convex, where the same RLS and auth misconfigurations appeared. The class of vulnerability shifted, but the rate did not approach zero.
05
Is the methodology open?
Yes. The probe categories are documented in this study and the full probe list is available on request. The CVSS rubric is published. The platform-detection heuristics are open-source. Raw counts per category are reproducible by running the same scanner against the same target; the only data withheld is the customer URLs, for privacy reasons.
06
Where does VibeEval fit in this benchmark?
VibeEval is the scanner that produced these findings. The benchmark exists because no other tool catches this specific class of vulnerability — SAST tools don't see RLS, dependency scanners don't see auth gaps, traditional DAST tools don't understand Supabase or Firebase. The product is the methodology made repeatable, one URL at a time.
07
How do you keep false positives down at 310 probes?
Every probe is calibrated against ref0 and four topic-specific clean references — ref-rls, ref-jwt, ref-oauth, ref-webhook — that we publish on gapbench.vibe-eval.com. Any probe that fires on a clean reference is, by construction, a false positive and gets killed before it ships. The reference sites are the ground truth most heuristic scanners do not have. Every finding count in this benchmark is net of that calibration.
08
Where can I run the same probes against a deliberately vulnerable target?
https://gapbench.vibe-eval.com/ — the public benchmark we operate. 104 scenarios, 97 deliberately vulnerable plus 7 clean / calibration controls. Run the scanner against a scenario, see the finding; run it against ref0, see no finding. That is the credibility test most security claims cannot pass.
09
What CWE / OWASP categories does the benchmark cover?
OWASP Web Top 10 2021 (A01-A10) and OWASP API Top 10 2023 (API1-API10) are the primary mappings. OWASP LLM Top 10 2025 covers AI-feature-specific findings. CWE coverage spans the CWE-200 series (information exposure), CWE-284/639/862/863 (access control), CWE-287/345 (auth and trust), CWE-540/798 (credentials and config), CWE-918 (SSRF), and CWE-352 (CSRF). Every finding in the report carries its CWE and OWASP tags.

BENCHMARK YOUR OWN APP

Run the same scan against your URL. Report in under 60 seconds.

RUN BENCHMARK SCAN