fix: classify agent chat errors so upstream failures don't surface as HTTP 500 #5267

Draft
cursor[bot] wants to merge 1 commit into main from cursor/agent-cff834ce
Conversation

@cursor (bot, Contributor) commented Jun 10, 2026
Summary

A Sentry HTTP 500 alert was triggered on POST /api/v1/agents/chats/{chat_id}/messages (issue 7540987063). The handler currently returns codes.Internal (HTTP 500) for every non-gorm.ErrRecordNotFound error returned by the agent service, which means:

  • Anthropic rate-limiting (429) → HTTP 500
  • Anthropic transient outages (5xx) → HTTP 500
  • Caller cancellation (browser navigated away, context.Canceled) → HTTP 500
  • Caller deadline exceeded → HTTP 500

Each of these gets captured as an HTTP 500 message by the LoggingMiddleware's captureHTTPError and pollutes the Sentry 500 issue feed even though none of them is a SuperPlane server bug.

Changes

  • Add agents.ProviderHTTPError interface in pkg/agents/provider.go, implemented by anthropic.apiError, so the gRPC layer can classify upstream HTTP failures without taking a dependency on a specific provider's internal types.
  • Add translateAgentServiceError(err, fallback) in pkg/grpc/actions/agents/common.go that maps:
    • context.Canceled → codes.Canceled (HTTP 499)
    • context.DeadlineExceeded → codes.DeadlineExceeded (HTTP 504)
    • gorm.ErrRecordNotFound / agents.ErrSessionAlreadyTerminated → codes.NotFound
    • agents.ErrSessionForbidden → codes.PermissionDenied
    • Upstream 429 → codes.ResourceExhausted (HTTP 429)
    • Upstream 401/403 → codes.FailedPrecondition
    • Upstream 408/504 → codes.DeadlineExceeded
    • Upstream 5xx → codes.Unavailable (HTTP 503)
    • Upstream 4xx (other) → codes.InvalidArgument
    • Anything else → codes.Internal (kept for genuine bugs)
  • Wire the helper into SendAgentChatMessage, InterruptAgentChat, and DefineAgentOutcome. Internal errors keep their Error-level log so genuine bugs remain loud; everything else logs at Warn so the operator still sees the underlying error.
  • In agents.Service.SendMessage, stop flipping the session to AgentSessionStatusFailed when the only reason the provider call failed is that the caller cancelled the request — otherwise a navigated-away tab leaves the chat stuck in "failed".
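
The mapping listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Code` type is a local stand-in for `codes.Code` from `google.golang.org/grpc/codes` so the sketch runs with the standard library alone, the sentinel errors stand in for `gorm.ErrRecordNotFound` and the `agents` errors, the method name `HTTPStatusCode` on `ProviderHTTPError` is an assumption, and the `fallback` parameter of `translateAgentServiceError` is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Code is a local stand-in for codes.Code (google.golang.org/grpc/codes),
// used only to keep this sketch free of third-party dependencies.
type Code string

const (
	Canceled           Code = "Canceled"
	DeadlineExceeded   Code = "DeadlineExceeded"
	NotFound           Code = "NotFound"
	PermissionDenied   Code = "PermissionDenied"
	ResourceExhausted  Code = "ResourceExhausted"
	FailedPrecondition Code = "FailedPrecondition"
	Unavailable        Code = "Unavailable"
	InvalidArgument    Code = "InvalidArgument"
	Internal           Code = "Internal"
)

// Stand-ins for gorm.ErrRecordNotFound and the agents sentinel errors.
var (
	ErrRecordNotFound           = errors.New("record not found")
	ErrSessionAlreadyTerminated = errors.New("session already terminated")
	ErrSessionForbidden         = errors.New("session forbidden")
)

// ProviderHTTPError is a guess at the interface's shape; the method name
// HTTPStatusCode is an assumption, not taken from the PR.
type ProviderHTTPError interface {
	error
	HTTPStatusCode() int
}

// upstreamError is a hypothetical provider error used for the demo.
type upstreamError struct{ status int }

func (e *upstreamError) Error() string       { return fmt.Sprintf("upstream returned HTTP %d", e.status) }
func (e *upstreamError) HTTPStatusCode() int { return e.status }

// translate picks a status code for err, checking context and sentinel
// errors first, then the upstream HTTP status, and finally falling back
// to Internal.
func translate(err error) Code {
	switch {
	case errors.Is(err, context.Canceled):
		return Canceled
	case errors.Is(err, context.DeadlineExceeded):
		return DeadlineExceeded
	case errors.Is(err, ErrRecordNotFound), errors.Is(err, ErrSessionAlreadyTerminated):
		return NotFound
	case errors.Is(err, ErrSessionForbidden):
		return PermissionDenied
	}

	var upstream ProviderHTTPError
	if errors.As(err, &upstream) {
		// 408/504 must be checked before the generic 5xx and 4xx ranges.
		switch status := upstream.HTTPStatusCode(); {
		case status == 429:
			return ResourceExhausted
		case status == 401 || status == 403:
			return FailedPrecondition
		case status == 408 || status == 504:
			return DeadlineExceeded
		case status >= 500:
			return Unavailable
		case status >= 400:
			return InvalidArgument
		}
	}
	return Internal
}

func main() {
	fmt.Println(translate(fmt.Errorf("send message: %w", context.Canceled)))
	fmt.Println(translate(&upstreamError{status: 429}))
	fmt.Println(translate(errors.New("unexpected failure")))
}
```

Using `errors.Is` and `errors.As` rather than direct comparison means the classification still works when the service layer wraps the original error with extra context.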

Tests

pkg/grpc/actions/agents/error_translation_test.go covers each translation case for the three affected handlers using a stub service and a fake ProviderHTTPError.
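
As a rough illustration of the stub-plus-fake approach (not the contents of error_translation_test.go), the sketch below shows a fake error that satisfies an interface shaped like `ProviderHTTPError` and a stub service that wraps it, checked in a table-driven loop. The method name `HTTPStatusCode`, the `SendMessage` signature, and the helper `upstreamStatus` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ProviderHTTPError mirrors the interface named in the PR; the method name
// HTTPStatusCode is an assumption.
type ProviderHTTPError interface {
	error
	HTTPStatusCode() int
}

// fakeProviderError is a test double that carries only an HTTP status.
type fakeProviderError struct{ status int }

func (e fakeProviderError) Error() string       { return fmt.Sprintf("fake provider: HTTP %d", e.status) }
func (e fakeProviderError) HTTPStatusCode() int { return e.status }

// stubService returns a canned error, wrapped the way a service layer might
// wrap a provider failure. The method signature is hypothetical.
type stubService struct{ err error }

func (s stubService) SendMessage(chatID, text string) error {
	if s.err == nil {
		return nil
	}
	return fmt.Errorf("send message to chat %s: %w", chatID, s.err)
}

// upstreamStatus reports the upstream HTTP status found anywhere in err's
// chain, without referring to any concrete provider type.
func upstreamStatus(err error) (int, bool) {
	var upstream ProviderHTTPError
	if errors.As(err, &upstream) {
		return upstream.HTTPStatusCode(), true
	}
	return 0, false
}

func main() {
	cases := []struct {
		name       string
		err        error
		wantStatus int
		wantOK     bool
	}{
		{"rate limited", fakeProviderError{status: 429}, 429, true},
		{"outage", fakeProviderError{status: 503}, 503, true},
		{"plain error", errors.New("boom"), 0, false},
		{"no error", nil, 0, false},
	}
	for _, tc := range cases {
		err := stubService{err: tc.err}.SendMessage("chat-1", "hi")
		status, ok := upstreamStatus(err)
		fmt.Printf("%-12s pass=%v\n", tc.name, status == tc.wantStatus && ok == tc.wantOK)
	}
}
```

Because the handler only needs the interface, the fake can live in the test package and no Anthropic-specific type has to be imported.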

go test -count=1 -run "^(TestSendAgentChatMessage_Maps...|TestInterruptAgentChat_MapsUpstreamErrors|TestDefineAgentOutcome_MapsContextCanceled)$" ./pkg/grpc/actions/agents/

Also ran:

  • go vet ./pkg/grpc/... ./pkg/agents/... ./pkg/models/... ./pkg/workers/... ./pkg/server/... ./pkg/public/...
  • go test -count=1 ./pkg/agents/anthropic/...
  • gofmt -s -l pkg/agents pkg/grpc/actions/agents (clean)

Notes

Without Sentry credentials in the cloud agent environment I couldn't introspect the originating exception payload, but the only error path that could produce HTTP 500 on this endpoint without an explicit codes.NotFound mapping is the one fixed here. The fix is intentionally a defensive widening that also reduces Sentry noise from future upstream-only incidents.

… HTTP 500

The /api/v1/agents/chats/{chat_id}/messages handler returned codes.Internal
(HTTP 500) for every non-NotFound error, which generated Sentry alerts
whenever the upstream provider rate-limited (429), had a transient outage
(5xx), or the client cancelled the request mid-flight.

Map the most common transient/upstream failures to dedicated gRPC codes so
they surface with accurate HTTP status (codes.Unavailable, ResourceExhausted,
Canceled, DeadlineExceeded, FailedPrecondition) and stop polluting the
HTTP 500 issue feed. Internal errors keep their Error-level log so genuine
bugs remain visible.

Also stop flipping the session to 'failed' when the only reason
provider.SendMessage returned an error is that the caller cancelled the
request; otherwise a navigated-away tab leaves the chat looking broken.
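
The cancellation guard described above might look something like this sketch. The type `SessionStatus`, the constants, and the helpers `nextStatus` and `simulate` are hypothetical names chosen for illustration; the real change lives in agents.Service.SendMessage and may decide differently, for example by checking only the returned error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// SessionStatus and its values are illustrative stand-ins for the model's
// real status type.
type SessionStatus string

const (
	StatusActive SessionStatus = "active"
	StatusFailed SessionStatus = "failed"
)

// nextStatus decides what the session status should be after a provider call.
// A failure is ignored only when both the error and the request context say
// the caller cancelled; every other failure still marks the session failed.
func nextStatus(ctx context.Context, current SessionStatus, err error) SessionStatus {
	if err == nil {
		return current
	}
	callerCancelled := errors.Is(err, context.Canceled) && errors.Is(ctx.Err(), context.Canceled)
	if callerCancelled {
		return current
	}
	return StatusFailed
}

// simulate runs one provider failure, either caused by the caller cancelling
// the request or by an unrelated upstream error, and returns the status.
func simulate(callerCancels bool) SessionStatus {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var err error
	if callerCancels {
		cancel()
		err = fmt.Errorf("provider send: %w", ctx.Err())
	} else {
		err = errors.New("provider send: upstream returned HTTP 500")
	}
	return nextStatus(ctx, StatusActive, err)
}

func main() {
	fmt.Println(simulate(true))
	fmt.Println(simulate(false))
}
```

Requiring the request context itself to be cancelled keeps an internally cancelled sub-operation from being mistaken for a navigated-away tab.
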
@superplanehq-integration

👋 Commands for maintainers:

  • /sp start - Start an ephemeral machine (takes ~30s)
  • /sp stop - Stop a running machine (auto-executed on pr close)

@superplane-gh-integration-9000

OSS Guard found dependency licenses that are not permitted for this project.

Project license (from repository): Apache-2.0

Permitted dependency licenses: MIT,Apache-2.0,BSD-2-Clause,BSD-3-Clause,ISC,0BSD,Unlicense,CC0-1.0,CC-BY-4.0,Zlib,MPL-2.0,OpenSSL,BlueOak-1.0.0

Reason: One or more dependencies use licenses that are not compatible with the project license.

osv-scanner report:

argparse 2.0.1 (npm) - Python-2.0
csv 3.3.5 (RubyGems) - Ruby, BSD-2-Clause
elkjs 0.10.0 (npm) - EPL-2.0
khroma 2.1.0 (npm) - UNKNOWN
posthog-js 1.368.2 (npm) - non-standard
stdlib 1.26.2 (Go) - UNKNOWN

Add approved exceptions in your repository's osv-scanner.toml.
