fix: classify agent chat errors so upstream failures don't surface as HTTP 500#5267
Draft
cursor[bot] wants to merge 1 commit into
Draft
fix: classify agent chat errors so upstream failures don't surface as HTTP 500#5267cursor[bot] wants to merge 1 commit into
cursor[bot] wants to merge 1 commit into
Conversation
… HTTP 500
The /api/v1/agents/chats/{chat_id}/messages handler returned codes.Internal
(HTTP 500) for every non-NotFound error, which generated Sentry alerts
whenever the upstream provider rate-limited (429), had a transient outage
(5xx), or the client cancelled the request mid-flight.
Map the most common transient/upstream failures to dedicated gRPC codes so
they surface with accurate HTTP status (codes.Unavailable, ResourceExhausted,
Canceled, DeadlineExceeded, FailedPrecondition) and stop polluting the
HTTP 500 issue feed. Internal errors keep their Error-level log so genuine
bugs remain visible.
Also stop flipping the session to 'failed' when the only reason
provider.SendMessage returned an error is that the caller cancelled the
request; otherwise a navigated-away tab leaves the chat looking broken.
|
👋 Commands for maintainers:
|
|
❌ OSS Guard found dependency licenses that are not permitted for this project. Project license (from repository): Apache-2.0 Permitted dependency licenses: MIT,Apache-2.0,BSD-2-Clause,BSD-3-Clause,ISC,0BSD,Unlicense,CC0-1.0,CC-BY-4.0,Zlib,MPL-2.0,OpenSSL,BlueOak-1.0.0 Reason: One or more dependencies use licenses that are not compatible with the project license. osv-scanner report: Add approved exceptions in your repository's |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A Sentry HTTP 500 alert was triggered on
POST /api/v1/agents/chats/{chat_id}/messages(issue 7540987063). The handler currently returnscodes.Internal(HTTP 500) for every non-gorm.ErrRecordNotFounderror returned by the agent service, which means:context.Canceled) → HTTP 500Each of these gets captured as an HTTP 500 message by the
LoggingMiddleware'scaptureHTTPErrorand pollutes the Sentry 500 issue feed even though none of them is a SuperPlane server bug.Changes
agents.ProviderHTTPErrorinterface inpkg/agents/provider.go, implemented byanthropic.apiError, so the gRPC layer can classify upstream HTTP failures without taking a dependency on a specific provider's internal types.translateAgentServiceError(err, fallback)inpkg/grpc/actions/agents/common.gothat maps:context.Canceled→codes.Canceled(HTTP 499)context.DeadlineExceeded→codes.DeadlineExceeded(HTTP 504)gorm.ErrRecordNotFound/agents.ErrSessionAlreadyTerminated→codes.NotFoundagents.ErrSessionForbidden→codes.PermissionDeniedcodes.ResourceExhausted(HTTP 429)codes.FailedPreconditioncodes.DeadlineExceededcodes.Unavailable(HTTP 503)codes.InvalidArgumentcodes.Internal(kept for genuine bugs)SendAgentChatMessage,InterruptAgentChat, andDefineAgentOutcome. Internal errors keep theirError-level log so genuine bugs remain loud; everything else logs atWarnso the operator still sees the underlying error.agents.Service.SendMessage, stop flipping the session toAgentSessionStatusFailedwhen the only reason the provider call failed is that the caller cancelled the request — otherwise a navigated-away tab leaves the chat stuck in "failed".Tests
pkg/grpc/actions/agents/error_translation_test.gocovers each translation case for the three affected handlers using a stub service and a fakeProviderHTTPError.Also ran:
go vet ./pkg/grpc/... ./pkg/agents/... ./pkg/models/... ./pkg/workers/... ./pkg/server/... ./pkg/public/...go test -count=1 ./pkg/agents/anthropic/...gofmt -s -l pkg/agents pkg/grpc/actions/agents(clean)Notes
Without Sentry credentials in the cloud agent environment I couldn't introspect the originating exception payload, but the only error path that could produce
HTTP 500on this endpoint without an explicitcodes.NotFoundmapping is the one fixed here. The fix is intentionally a defensive widening that also reduces Sentry noise from future upstream-only incidents.