fix(opensearch): escape wildcard metacharacters in search_labels to prevent DoS (CWE-89) #3026
Conversation
…arch queries

Escape OpenSearch wildcard special characters (`*`, `?`, `\`) in user-supplied queries passed to `search_labels()` to prevent denial-of-service via expensive wildcard patterns like `*???...???*`. Also harden `_escape_ppl()` to strip control characters (`\n`, `\r`, `\t`) that could interfere with PPL string literal parsing in graphLookup queries. CWE-89
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a326264316
```python
"""
value = value.replace("\\", "\\\\").replace("'", "\\'")
# Strip control characters that could break the PPL string literal
value = value.replace("\n", " ").replace("\r", " ").replace("\t", " ")
```
Preserve control characters when escaping PPL labels
When PPL graph lookup is enabled, a node label that legitimately contains `\n`, `\r`, or `\t` is first found by `get_node(start_label)`, but this replacement changes the value used in `where entity_id = '...'`. Since `upsert_node` indexes the original `node_id` unchanged, the PPL query searches for a different label and returns no connected edges instead of falling back to the BFS path. This affects any extracted or manually created entity names with internal tabs/newlines; escape these characters without changing their value, or force the fallback path for labels PPL cannot represent.
Summary
This PR fixes an unsanitized-input issue in `OpenSearchGraphStorage` where user-supplied query strings are interpolated directly into a Lucene wildcard query and a PPL (Piped Processing Language) string literal. An authenticated user can craft input that produces a pathological wildcard pattern, forcing OpenSearch into expensive term enumeration — a well-known Lucene wildcard DoS vector (CWE-89 / CWE-1333).

Affected surface:
- `lightrag/kg/opensearch_impl.py`: `OpenSearchGraphStorage.search_labels()` (wildcard sink) and `_escape_ppl()` (PPL hardening)
- `GET /graph/label/search?label=...` (the router calls into the OpenSearch backend when configured as graph storage)

Vulnerability details
In `search_labels`, the user-controlled `query` string was lower-cased and embedded into a wildcard clause like:

```python
{"wildcard": {"entity_id": {"value": f"*{query.lower()}*", "case_insensitive": True, "boost": 2}}}
```

Because `*` and `?` in the input are not escaped, an attacker can submit a value such as `*??????????????????????????*`, which expands into a wildcard pattern with leading wildcards and many single-character wildcards. This forces OpenSearch to enumerate a very large number of terms and is a classic source of cluster-wide latency / OOM. The endpoint is gated by `combined_auth`, but any logged-in user (or anyone with an API key) can trigger it; `limit` is capped at 100 results but does not bound the wildcard expansion cost.

A secondary issue exists in `_escape_ppl`, which escaped backslash and single quote but not control characters (`\n`, `\r`, `\t`). PPL string literals are line-sensitive, so embedded newlines/tabs in user-supplied node labels could affect query parsing in the graph traversal queries.
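To make the wildcard sink concrete, here is a minimal reconstruction of the pre-fix clause construction and the attack payload (`build_wildcard_clause` is an illustrative stand-in, not the PR's function name):

```python
def build_wildcard_clause(query: str) -> dict:
    # Pre-fix behaviour: user input is interpolated without escaping
    return {
        "wildcard": {
            "entity_id": {
                "value": f"*{query.lower()}*",
                "case_insensitive": True,
                "boost": 2,
            }
        }
    }

attack = "*" + "?" * 26 + "*"  # "*??????????????????????????*"
pattern = build_wildcard_clause(attack)["wildcard"]["entity_id"]["value"]
# pattern is "**??????????????????????????**": a leading wildcard plus 26
# single-character wildcards, each of which multiplies the candidate terms
# Lucene must enumerate during the rewrite.
```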
Two minimal changes in `lightrag/kg/opensearch_impl.py`:

- New `_escape_wildcard()` that escapes `\`, `*`, and `?` so they are treated as literal characters in the wildcard clause, applied at the single sink in `search_labels`.
- `_escape_ppl()` extended to also strip `\n`, `\r`, `\t` (replaced with spaces) as defense-in-depth for PPL string literals.

The diff is 23 lines of production code — no behavioural change for legitimate queries (a search for the literal string `foo*bar` now correctly matches nodes whose name contains `foo*bar`, instead of being interpreted as a wildcard).
Added `tests/test_cwe89_opensearch_injection.py` with 7 unit tests covering:

- `_escape_wildcard` correctly escapes `*`, `?`, `\` and is a no-op for benign input.
- `_escape_ppl` escapes backslash and single quote, and neutralizes `\n`/`\r`/`\t`.
- The payload `*??????????????????????????*` is rendered inert after escaping (no unescaped wildcard metacharacters remain in the user-controlled portion).
- Benign queries (`alice`, `node_42`) still produce the expected wrapped pattern `*alice*`.
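Representative assertions from such a test file might look like this sketch (the helpers are re-implemented locally for self-containment with assumed shapes; the real tests would import them from `lightrag.kg.opensearch_impl`):

```python
# Local stand-ins for the real helpers (assumed shapes, see the diff above).
def _escape_wildcard(value: str) -> str:
    return value.replace("\\", "\\\\").replace("*", "\\*").replace("?", "\\?")

def _escape_ppl(value: str) -> str:
    value = value.replace("\\", "\\\\").replace("'", "\\'")
    return value.replace("\n", " ").replace("\r", " ").replace("\t", " ")

def test_attack_payload_inert():
    escaped = _escape_wildcard("*" + "?" * 26 + "*")
    # Every user-supplied metacharacter is escaped after the fix
    assert escaped == "\\*" + "\\?" * 26 + "\\*"

def test_escape_ppl_neutralizes_controls():
    # Quote is escaped, newline replaced with a space
    assert _escape_ppl("a'b\nc") == "a\\'b c"

def test_benign_query_unchanged():
    assert _escape_wildcard("node_42") == "node_42"
```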
Why exploitable: The sink builds a Lucene wildcard query from raw user input with `case_insensitive=True`. Lucene's wildcard rewrite cost grows quickly with pattern complexity and indexed term count. Patterns with a leading `*` followed by many `?` are pathological because each `?` multiplies the number of candidate terms that must be evaluated. A few concurrent requests can saturate the OpenSearch cluster.

Preconditions: the attacker must be authenticated or hold an API key (the route is gated by `combined_auth`).

Mitigation provided: After the fix, `*` and `?` from user input are escaped to `\*` and `\?`, which Lucene treats as literal characters. The wrapping `*...*` added by the application is the only remaining wildcard expansion, which is bounded and well-behaved.

Adversarial review
Before submitting, we tried to disprove this finding. The endpoint is auth-gated, so we considered whether an authenticated user already has equivalent capability — they don't: nothing else in the API lets a normal authenticated user wedge the storage backend, and there is no per-user rate limit on this route that would blunt the impact. We also checked whether OpenSearch defaults would naturally cap wildcard cost — `indices.query.bool.max_clause_count` and similar limits do not apply to single-clause wildcard rewrites. Finally, we verified there is no parallel unfixed sink in `opensearch_impl.py` that re-introduces the issue.

cc @lewiswigmore