
fix(snuba): distinguish upstream proxy errors from snuba errors#115252

Draft
JoshFerge wants to merge 2 commits into master from jferg/snuba-upstream-error-handling


@JoshFerge JoshFerge commented May 9, 2026

Summary

When envoy (or another upstream proxy) returns a 504 with the body "upstream request timeout" before the request reaches snuba, _bulk_snuba_query fails to parse the body as JSON and raises a generic SnubaError("Failed to parse snuba error response"). This obscures the real failure mode: the request never reached snuba, so:

  • there are no snuba traces for the failed query
  • the actual cause lives in the proxy's response body, not snuba logs
  • all proxy 5xx flavors get bucketed under the same opaque error
  • the API layer can't tell it's a timeout, so the frontend gets a generic 500 instead of a 504
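
A minimal reproduction of the parse failure described above may help reviewers. This is only a sketch: parse_error_body is a hypothetical stand-in for the parsing step inside _bulk_snuba_query, and SnubaError is redefined locally so the snippet runs without the Sentry codebase.

```python
import json


class SnubaError(Exception):
    """Local stand-in for sentry.utils.snuba.SnubaError (assumption)."""


def parse_error_body(body: str) -> dict:
    # Hypothetical helper approximating the pre-change behavior:
    # any body that is not valid JSON collapses into one generic error.
    try:
        return json.loads(body)
    except ValueError as err:  # json.JSONDecodeError subclasses ValueError
        raise SnubaError("Failed to parse snuba error response") from err


try:
    # Plain-text body as returned by the proxy, not by snuba.
    parse_error_body("upstream request timeout")
except SnubaError as err:
    # The HTTP status and the proxy's body are no longer visible here.
    message = str(err)
```

The resulting exception carries neither the status code nor the proxy's body, which is why every proxy 5xx ends up in the same bucket.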

Investigation context: issue 6969043325 is dominated by 504s from envoy in front of snuba-api, almost all with the literal body "upstream request timeout".

This change introduces two new exception classes in src/sentry/utils/snuba.py:

  • InvalidSnubaResponseError: raised when a non-2xx response can't be decoded as JSON. Captures the status and body for debugging.
  • SnubaUpstreamRequestTimeout: subclass of InvalidSnubaResponseError, raised for 504 responses or for any status with the envoy-default body "upstream request timeout".

Both subclass SnubaError, so existing handlers that catch SnubaError continue to work — this only refines what callers can choose to catch.
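
The hierarchy and the classification rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: classify_non_json_response and ENVOY_TIMEOUT_BODY are hypothetical names, the constructor signature is a guess, and the exact-match comparison on the stripped body is an assumption about how the envoy body is detected.

```python
class SnubaError(Exception):
    """Local stand-in for the existing base class (assumption)."""


class InvalidSnubaResponseError(SnubaError):
    """Non-2xx response whose body could not be decoded as JSON."""

    def __init__(self, status: int, body: str) -> None:
        # Keep the raw status and body so they are available for debugging.
        self.status = status
        self.body = body
        super().__init__(f"Invalid snuba response (status={status}): {body!r}")


class SnubaUpstreamRequestTimeout(InvalidSnubaResponseError):
    """The proxy timed out before the request reached snuba."""


# Hypothetical constant; the literal is the envoy-default body named above.
ENVOY_TIMEOUT_BODY = "upstream request timeout"


def classify_non_json_response(status: int, body: str) -> InvalidSnubaResponseError:
    # Hypothetical helper: a 504, or the envoy timeout body on any status,
    # is treated as an upstream timeout; everything else stays generic.
    if status == 504 or body.strip() == ENVOY_TIMEOUT_BODY:
        return SnubaUpstreamRequestTimeout(status, body)
    return InvalidSnubaResponseError(status, body)
```

Because both classes derive from SnubaError, an existing `except SnubaError` block catches either one unchanged.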

It also wires SnubaUpstreamRequestTimeout into handle_query_errors in src/sentry/api/utils.py alongside QueryExecutionTimeMaximum / ReadTimeoutError, so it surfaces as the existing TimeoutException (HTTP 504) — the frontend already knows what to do with that.
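
The shape of that wiring, as a simplified sketch: all classes below are local stand-ins, handle_query_errors_sketch is a hypothetical name, and the real handle_query_errors handles more exception types (including ReadTimeoutError) that are omitted here.

```python
from contextlib import contextmanager


class SnubaError(Exception):
    """Local stand-in for sentry.utils.snuba.SnubaError (assumption)."""


class QueryExecutionTimeMaximum(SnubaError):
    """Local stand-in for the existing query-timeout error (assumption)."""


class SnubaUpstreamRequestTimeout(SnubaError):
    """Local stand-in for the new proxy-timeout error (assumption)."""


class TimeoutException(Exception):
    """Local stand-in for the API exception rendered as HTTP 504."""

    status_code = 504


@contextmanager
def handle_query_errors_sketch():
    # Only the timeout branch is shown; other SnubaError subclasses
    # propagate unchanged in this sketch.
    try:
        yield
    except (QueryExecutionTimeMaximum, SnubaUpstreamRequestTimeout) as error:
        raise TimeoutException() from error
```

An endpoint that wraps its query in `with handle_query_errors_sketch():` would then report a proxy timeout as a 504 instead of a generic 500.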

Test plan

  • New unit tests in tests/sentry/utils/test_snuba.py::SnubaInvalidJsonResponseTest cover:
    • 504 with "upstream request timeout" body → SnubaUpstreamRequestTimeout
    • non-504 status with the same body → SnubaUpstreamRequestTimeout (defensive)
    • other non-JSON 5xx (e.g. 502 "no healthy upstream") → InvalidSnubaResponseError only
  • New test in tests/sentry/api/test_utils.py::HandleQueryErrorsTest::test_handle_snuba_upstream_request_timeout confirms SnubaUpstreamRequestTimeout is converted to TimeoutException by handle_query_errors
  • Full tests/sentry/utils/test_snuba.py and tests/sentry/api/test_utils.py::HandleQueryErrorsTest pass
  • prek run -q passes on touched files

When envoy (or another upstream proxy) returns a 504 with body
"upstream request timeout" before reaching snuba, json.loads fails and
we raise a generic SnubaError("Failed to parse snuba error response").
This obscures the real failure mode: the request never reached snuba,
so there are no snuba traces and the cause appears in the proxy's
response body, not in snuba logs.

Introduce two new exception classes:
- InvalidSnubaResponseError: non-2xx response that can't be parsed as
  JSON. Captures status and body for debugging.
- SnubaUpstreamRequestTimeout (subclass): 504, or any status with the
  envoy-default body "upstream request timeout".

Both subclass SnubaError so existing handlers continue to work.

Agent transcript: https://claudescope.sentry.dev/share/Issjz4igy1xszXogQRdyfKOkdGzESNNiXeq7QfRIlOI
github-actions bot added the "Scope: Backend" label (automatically applied to PRs that change backend components) on May 9, 2026
as TimeoutException (HTTP 504) — same path the frontend already handles
for QueryExecutionTimeMaximum and ReadTimeoutError.

Agent transcript: https://claudescope.sentry.dev/share/A7_ad-be8QOnVd4HEJMyDyxUsVB8nzpyGk-jPA6R3P8