
fix(opensearch): escape wildcard metacharacters in search_labels to prevent DoS (CWE-89)#3026

Open
sebastiondev wants to merge 2 commits into HKUDS:main from sebastiondev:fix/cwe89-opensearch-impl-unsanitize-d709

Conversation

@sebastiondev
Contributor

Summary

This PR fixes an unsanitized-input issue in OpenSearchGraphStorage where user-supplied query strings are interpolated directly into a Lucene wildcard query and a PPL (Piped Processing Language) string literal. An authenticated user can craft input that produces a pathological wildcard pattern, forcing OpenSearch into expensive term enumeration — a well-known Lucene wildcard DoS vector (CWE-89 / CWE-1333).

  • File: lightrag/kg/opensearch_impl.py
  • Functions: OpenSearchGraphStorage.search_labels() (wildcard sink) and _escape_ppl() (PPL hardening)
  • Reachable via: GET /graph/label/search?label=... (router calls into the OpenSearch backend when configured as graph storage).
  • Severity: Medium — authenticated DoS against the search backend. Not RCE or data exfiltration.

Vulnerability details

In search_labels, the user-controlled query string was lower-cased and embedded into a wildcard clause like:

{"wildcard": {"entity_id": {"value": f"*{query.lower()}*", "case_insensitive": True, "boost": 2}}}

Because * and ? in the input are not escaped, an attacker can submit a value such as *??????????????????????????*, which expands into a wildcard pattern with leading wildcards and many single-character wildcards. This forces OpenSearch to enumerate a very large number of indexed terms — a classic source of cluster-wide latency and out-of-memory failures. The endpoint is gated by combined_auth, but any logged-in user (or anyone with an API key) can trigger it; the limit parameter caps the response at 100 results but does nothing to bound the cost of the wildcard expansion itself.
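For illustration, here is a minimal sketch of the vulnerable shape (build_wildcard_clause is a hypothetical stand-in for the clause-building logic inside search_labels, not the actual function name):

```python
# Hypothetical helper mirroring the pre-fix pattern in search_labels():
# the user-controlled string is interpolated verbatim, so "*" and "?"
# keep their wildcard meaning inside the Lucene pattern.
def build_wildcard_clause(query: str) -> dict:
    return {
        "wildcard": {
            "entity_id": {
                "value": f"*{query.lower()}*",
                "case_insensitive": True,
                "boost": 2,
            }
        }
    }

clause = build_wildcard_clause("*??????????????????????????*")
# The resulting pattern "**??...??**" is exactly the pathological input
# described above, handed to OpenSearch as a live wildcard expression.
print(clause["wildcard"]["entity_id"]["value"])
```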

A secondary issue exists in _escape_ppl, which escaped backslash and single quote but not control characters (\n, \r, \t). PPL string literals are line-sensitive, so embedded newlines/tabs in user-supplied node labels could affect query parsing in the graph traversal queries.

Fix

Two minimal changes in lightrag/kg/opensearch_impl.py:

  1. New helper _escape_wildcard() that escapes \, *, and ? so they are treated as literal characters in the wildcard clause. Applied at the single sink in search_labels.
  2. _escape_ppl() extended to also strip \n, \r, \t (replaced with spaces) as defense-in-depth for PPL string literals.

Diff is 23 lines of production code — no behavioural change for legitimate queries (a search for the literal string foo*bar now correctly matches nodes whose name contains foo*bar, instead of being interpreted as a wildcard).

Tests

Added tests/test_cwe89_opensearch_injection.py with 7 unit tests covering:

  • _escape_wildcard correctly escapes *, ?, \ and is a no-op for benign input.
  • _escape_ppl escapes backslash, single quote, and neutralizes \n/\r/\t.
  • The pathological pattern *??????????????????????????* is rendered inert after escaping (no unescaped wildcard metacharacters remain in the user-controlled portion).
  • Normal queries (alice, node_42) still produce the expected wrapped pattern *alice*.
$ python3 -m pytest tests/test_cwe89_opensearch_injection.py -q
.......                                                                  [100%]
7 passed in 0.18s

Security analysis

Why exploitable: The sink builds a Lucene wildcard query from raw user input with case_insensitive=True. Lucene's wildcard rewrite cost grows quickly with pattern complexity and indexed term count. Patterns with leading * followed by many ? are pathological because each ? multiplies the number of candidate terms that must be evaluated. A few concurrent requests can saturate the OpenSearch cluster.

Preconditions:

  • Authenticated API access (the route uses combined_auth).
  • OpenSearch is configured as the graph storage backend.

Mitigation provided: After the fix, * and ? from user input are escaped to \* and \?, which Lucene treats as literal characters. The wrapping *...* added by the application is the only remaining wildcard expansion, which is bounded and well-behaved.

Adversarial review

Before submitting, we tried to disprove this finding. The endpoint is auth-gated, so we considered whether an authenticated user already has equivalent capability — they don't: nothing else in the API lets a normal authenticated user wedge the storage backend, and there is no per-user rate limit on this route that would blunt the impact. We also checked whether OpenSearch defaults would naturally cap wildcard cost — indices.query.bool.max_clause_count and similar limits do not apply to single-clause wildcard rewrites. Finally, we verified there is no parallel unfixed sink in opensearch_impl.py that re-introduces the issue.

cc @lewiswigmore

…arch queries

Escape OpenSearch wildcard special characters (*, ?, \) in user-supplied
search queries to search_labels() to prevent denial-of-service via
expensive wildcard patterns like "*???...???*".

Also harden _escape_ppl() to strip control characters (\n, \r, \t) that
could interfere with PPL string literal parsing in graphLookup queries.

CWE-89
@danielaskdd
Collaborator

@codex review


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a326264316


"""
value = value.replace("\\", "\\\\").replace("'", "\\'")
# Strip control characters that could break the PPL string literal
value = value.replace("\n", " ").replace("\r", " ").replace("\t", " ")

P2: Preserve control characters when escaping PPL labels

When PPL graph lookup is enabled, a node label that legitimately contains \n, \r, or \t is first found by get_node(start_label), but this replacement changes the value used in where entity_id = '...'. Since upsert_node indexes the original node_id unchanged, the PPL query searches for a different label and returns no connected edges instead of falling back to the BFS path. This affects any extracted or manually created entity names with internal tabs/newlines; escape these characters without changing their value, or force the fallback path for labels PPL cannot represent.
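The reviewer's second option could be sketched like this (helper names are assumed, not from the PR): rather than rewriting \n/\r/\t to spaces, detect labels that a PPL single-quoted literal cannot represent faithfully and route them to the BFS fallback path.

```python
# Characters whose space-replacement would change the label's value,
# causing the PPL lookup to miss a node that get_node() already found.
_PPL_UNREPRESENTABLE = ("\n", "\r", "\t")

def ppl_can_represent(label: str) -> bool:
    """True when escaping preserves the label's exact value."""
    return not any(ch in label for ch in _PPL_UNREPRESENTABLE)

print(ppl_can_represent("Alice"))      # True
print(ppl_can_represent("tab\there"))  # False -> use the BFS fallback
```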


@danielaskdd danielaskdd added tracked Issue is tracked by project security labels May 7, 2026
