fix(opensearch): escape wildcard metacharacters in search_labels to prevent DoS (CWE-89) #3026
Conversation
…arch queries

Escape OpenSearch wildcard special characters (`*`, `?`, `\`) in user-supplied queries passed to `search_labels()` to prevent denial-of-service via expensive wildcard patterns like `*???...???*`. Also harden `_escape_ppl()` to strip control characters (`\n`, `\r`, `\t`) that could interfere with PPL string literal parsing in graphLookup queries. CWE-89
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a326264316
```python
"""
value = value.replace("\\", "\\\\").replace("'", "\\'")
# Strip control characters that could break the PPL string literal
value = value.replace("\n", " ").replace("\r", " ").replace("\t", " ")
```
Preserve control characters when escaping PPL labels
When PPL graph lookup is enabled, a node label that legitimately contains `\n`, `\r`, or `\t` is first found by `get_node(start_label)`, but this replacement changes the value used in `where entity_id = '...'`. Since `upsert_node` indexes the original `node_id` unchanged, the PPL query searches for a different label and returns no connected edges instead of falling back to the BFS path. This affects any extracted or manually created entity names with internal tabs/newlines; escape these characters without changing their value, or force the fallback path for labels PPL cannot represent.
Summary
This PR fixes an unsanitized-input issue in `OpenSearchGraphStorage` where user-supplied query strings are interpolated directly into a Lucene wildcard query and a PPL (Piped Processing Language) string literal. An authenticated user can craft input that produces a pathological wildcard pattern, forcing OpenSearch into expensive term enumeration — a well-known Lucene wildcard DoS vector (CWE-89 / CWE-1333).

Affected surface:
- `lightrag/kg/opensearch_impl.py`: `OpenSearchGraphStorage.search_labels()` (wildcard sink) and `_escape_ppl()` (PPL hardening)
- `GET /graph/label/search?label=...` (the router calls into the OpenSearch backend when configured as graph storage)

Vulnerability details
In `search_labels`, the user-controlled `query` string was lower-cased and embedded into a wildcard clause like:

```python
{"wildcard": {"entity_id": {"value": f"*{query.lower()}*", "case_insensitive": True, "boost": 2}}}
```

Because `*` and `?` in the input are not escaped, an attacker can submit a value such as `*??????????????????????????*`, which expands into a wildcard pattern with leading wildcards and many single-character wildcards. This forces OpenSearch to enumerate a very large number of terms and is a classic source of cluster-wide latency / OOM. The endpoint is gated by `combined_auth`, but any logged-in user (or anyone with an API key) can trigger it; `limit` is capped at 100 results but does not bound the wildcard expansion cost.

A secondary issue exists in `_escape_ppl`, which escaped backslash and single quote but not control characters (`\n`, `\r`, `\t`). PPL string literals are line-sensitive, so embedded newlines/tabs in user-supplied node labels could affect query parsing in the graph traversal queries.
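To make the wildcard sink concrete, here is a minimal reconstruction of the pre-fix clause construction and the attack payload (`build_wildcard_clause` is an illustrative stand-in, not the PR's function name):

```python
def build_wildcard_clause(query: str) -> dict:
    # Pre-fix behaviour: user input is interpolated without escaping
    return {
        "wildcard": {
            "entity_id": {
                "value": f"*{query.lower()}*",
                "case_insensitive": True,
                "boost": 2,
            }
        }
    }

attack = "*" + "?" * 26 + "*"  # "*??????????????????????????*"
pattern = build_wildcard_clause(attack)["wildcard"]["entity_id"]["value"]
# pattern is "**??????????????????????????**": a leading wildcard plus 26
# single-character wildcards, each of which multiplies the candidate terms
# Lucene must enumerate during the rewrite.
```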
Two minimal changes in `lightrag/kg/opensearch_impl.py`:

- New `_escape_wildcard()` that escapes `\`, `*`, and `?` so they are treated as literal characters in the wildcard clause, applied at the single sink in `search_labels`.
- `_escape_ppl()` extended to also strip `\n`, `\r`, `\t` (replaced with spaces) as defense-in-depth for PPL string literals.

The diff is 23 lines of production code — no behavioural change for legitimate queries (a search for the literal string `foo*bar` now correctly matches nodes whose name contains `foo*bar`, instead of being interpreted as a wildcard).
Added `tests/test_cwe89_opensearch_injection.py` with 7 unit tests covering:

- `_escape_wildcard` correctly escapes `*`, `?`, `\` and is a no-op for benign input.
- `_escape_ppl` escapes backslash and single quote, and neutralizes `\n`/`\r`/`\t`.
- The payload `*??????????????????????????*` is rendered inert after escaping (no unescaped wildcard metacharacters remain in the user-controlled portion).
- Benign queries (`alice`, `node_42`) still produce the expected wrapped pattern `*alice*`.
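Representative assertions from such a test file might look like this sketch (the helpers are re-implemented locally for self-containment with assumed shapes; the real tests would import them from `lightrag.kg.opensearch_impl`):

```python
# Local stand-ins for the real helpers (assumed shapes, see the diff above).
def _escape_wildcard(value: str) -> str:
    return value.replace("\\", "\\\\").replace("*", "\\*").replace("?", "\\?")

def _escape_ppl(value: str) -> str:
    value = value.replace("\\", "\\\\").replace("'", "\\'")
    return value.replace("\n", " ").replace("\r", " ").replace("\t", " ")

def test_attack_payload_inert():
    escaped = _escape_wildcard("*" + "?" * 26 + "*")
    # Every user-supplied metacharacter is escaped after the fix
    assert escaped == "\\*" + "\\?" * 26 + "\\*"

def test_escape_ppl_neutralizes_controls():
    # Quote is escaped, newline replaced with a space
    assert _escape_ppl("a'b\nc") == "a\\'b c"

def test_benign_query_unchanged():
    assert _escape_wildcard("node_42") == "node_42"
```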
Why exploitable: The sink builds a Lucene wildcard query from raw user input with `case_insensitive=True`. Lucene's wildcard rewrite cost grows quickly with pattern complexity and indexed term count. Patterns with a leading `*` followed by many `?` are pathological because each `?` multiplies the number of candidate terms that must be evaluated. A few concurrent requests can saturate the OpenSearch cluster.

Preconditions: the attacker must be authenticated or hold an API key (the route is gated by `combined_auth`).

Mitigation provided: After the fix, `*` and `?` from user input are escaped to `\*` and `\?`, which Lucene treats as literal characters. The wrapping `*...*` added by the application is the only remaining wildcard expansion, which is bounded and well-behaved.

Adversarial review
Before submitting, we tried to disprove this finding. The endpoint is auth-gated, so we considered whether an authenticated user already has equivalent capability — they don't: nothing else in the API lets a normal authenticated user wedge the storage backend, and there is no per-user rate limit on this route that would blunt the impact. We also checked whether OpenSearch defaults would naturally cap wildcard cost — `indices.query.bool.max_clause_count` and similar limits do not apply to single-clause wildcard rewrites. Finally, we verified there is no parallel unfixed sink in `opensearch_impl.py` that re-introduces the issue.

cc @lewiswigmore