fix(agent): Langfuse 4.x flush — correct env vars + remove explicit flush#138
Merged
Conversation
#137 re-enabled Langfuse with OTEL_BSP_EXPORT_TIMEOUT + OTEL_EXPORTER_OTLP_TIMEOUT, which are documented-ignored by the OTLP HTTP exporter ("# Not used. No way currently to pass timeout to export.") and by Langfuse's own LangfuseSpanProcessor. Every /team request post-deploy hit Lambda's 60s timeout, forcing an emergency CLI flip of LANGFUSE_TRACING_ENABLED=false.

Deep source-code research on langfuse-python v4.0.6 identified two separate blocking surfaces:

1. OTLPSpanExporter retry loop — 6 retries with exponential backoff (~63s total). Bounded only by the exporter's own `timeout=` ctor arg, which Langfuse wires to the LANGFUSE_TIMEOUT env var (seconds, default 5). That's the knob, not the OTEL_* ones.

2. resource_manager.flush() → _score_ingestion_queue.join() — no timeout. If we emit scores (/chat does, via _emit_quality_scores) and Langfuse is slow, .join() blocks indefinitely on the handler thread. This is unfixable without the SDK's upcoming v4.1 span_exporter= kwarg.

Additionally, LANGFUSE_FLUSH_AT=1 was counterproductive: it sets max_export_batch_size=1, making BatchSpanProcessor.on_end() trigger synchronous export on the handler thread for every span — moving the blocking HTTP upload onto the user-facing request path, which is what the background thread exists to prevent.

Fix:

- lambda.tf: LANGFUSE_TIMEOUT=2 (caps retry loop); LANGFUSE_FLUSH_INTERVAL=1 (frequent background flush); remove OTEL_BSP_EXPORT_TIMEOUT, OTEL_EXPORTER_OTLP_TIMEOUT, LANGFUSE_FLUSH_AT. LANGFUSE_TRACING_ENABLED stays "false" in Terraform — flip to true via CLI after apply once production latency is verified.
- api.py: remove three explicit langfuse_flush() calls from /chat and /chat/sync handlers. The background batch processor drains spans + scores on its own thread; explicit flush was the path through the unbounded _score_ingestion_queue.join().

Tests: two tests that mocked langfuse_flush() were deleted — they asserted behaviour we deliberately removed. 150 tests pass; ruff and terraform validate are clean.

Docs: ADR-0005 revision expanded with source-level citations (SDK line numbers) so the next person who touches this doesn't redo the research.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
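The ~63s figure cited for the retry loop is plain arithmetic over the backoff schedule; a quick illustrative sketch (the schedule shape comes from the description above, not from reading the SDK here):

```python
# Worst-case wait for 6 retries with exponential backoff, as described
# above: attempts sleep 1, 2, 4, 8, 16, 32 seconds before giving up.
backoff = [2 ** attempt for attempt in range(6)]
total = sum(backoff)
print(backoff, total)  # → [1, 2, 4, 8, 16, 32] 63
```

With a 60s Lambda timeout, a single unreachable-endpoint export on the handler thread is already enough to kill the request.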
ikuzuki added a commit that referenced this pull request on Apr 21, 2026
Three rounds of env-var tuning (#133 / #137 / #138) failed to cap the request-path hang at <60s. A test on 2026-04-21 with LANGFUSE_TIMEOUT=2 confirmed live still hit Lambda's 60s timeout on /team; the actual blocking call was not root-caused.

- Drop LANGFUSE_TIMEOUT and LANGFUSE_FLUSH_INTERVAL from lambda.tf — both are no-ops when tracing is disabled; keeping them was cargo.
- Rewrite the comment next to LANGFUSE_TRACING_ENABLED="false" to reflect the parked decision rather than the stale "flip via CLI after apply" plan.

Enrichment services retain tracing; they run in a normal Lambda (no LWA, no streaming) and don't show this class of hang. The re-entry path is either a local reproduction with a debugger attached or switching to the ADOT Lambda Extension — not more env-var guessing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Corrects #137's Langfuse re-enablement, which post-deploy caused
`Task timed out after 60 seconds` on every `/team` request and forced an emergency kill-switch flip. The fix is two env-var corrections plus removing explicit `langfuse_flush()` from the request path.

Root cause (source-code level, see ADR-0005 revision)
Langfuse 4.x has two separate blocking surfaces on the response thread. #137 addressed neither.
1. OTLPSpanExporter retry loop. Exponential backoff (1+2+4+8+16+32 ≈ 63s), bounded only by the exporter's `timeout=` ctor arg, which Langfuse wires to `LANGFUSE_TIMEOUT` (seconds, default 5). The `OTEL_BSP_EXPORT_TIMEOUT` and `OTEL_EXPORTER_OTLP_TIMEOUT` env vars I set in #137 are documented-ignored by OpenTelemetry's OTLP HTTP exporter: `# Not used. No way currently to pass timeout to export.`
2. resource_manager.flush()'s queue joins. `Langfuse().flush()` calls `_score_ingestion_queue.join()` with no timeout. `/chat` emits scores via `_emit_quality_scores`, so if Langfuse Cloud is slow, `.join()` blocks the handler thread indefinitely. Not configurable in 4.0.6 (Langfuse PR #1618 ships a `span_exporter=` kwarg that works around this in 4.1+).

Bonus footgun: `LANGFUSE_FLUSH_AT=1` (which I thought reduced queue pressure) actually sets `max_export_batch_size=1` on `BatchSpanProcessor`, which makes `on_end()` trigger synchronous export on the handler thread for every span. That moves the blocking upload onto the user-facing request path, the exact behaviour the background thread exists to prevent.

Fix
lambda.tf
- `LANGFUSE_TIMEOUT="2"` — caps the retry loop at ~2s per flush.
- `LANGFUSE_FLUSH_INTERVAL="1"` — background thread drains the queue every 1s.
- Removed `OTEL_BSP_EXPORT_TIMEOUT`, `OTEL_EXPORTER_OTLP_TIMEOUT` (ignored).
- Removed `LANGFUSE_FLUSH_AT` (counterproductive).
- `LANGFUSE_TRACING_ENABLED="false"` — kill-switch stays while we verify the fix. Flip to `"true"` via CLI after apply; the Terraform default promotes only after confirmed.

api.py
- Removed the explicit `langfuse_flush()` calls from the `/chat` and `/chat/sync` handlers. The background batch processor drains spans + scores between requests on its own thread; explicit flush was what fell through into the unbounded `_score_ingestion_queue.join()`.
- Removed the `flush as langfuse_flush` import.
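For a sense of why the unbounded `.join()` is the dangerous part: the stdlib `queue.Queue.join()` has the same shape, with no timeout parameter at all. A bounded variant is possible as a sketch (it pokes at the stdlib internals `all_tasks_done` and `unfinished_tasks`; this is not an API the Langfuse SDK exposes):

```python
import queue
import time

def join_with_timeout(q: queue.Queue, timeout: float) -> bool:
    """Best-effort bounded join: True if every queued task was marked done
    within `timeout` seconds, False otherwise. Unlike q.join(), this
    returns instead of blocking the caller forever."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with q.all_tasks_done:  # condition guarding unfinished_tasks
            if q.unfinished_tasks == 0:
                return True
        time.sleep(0.05)
    return False

q = queue.Queue()
q.put("score-event")              # one pending task, never marked done
print(join_with_timeout(q, 0.2))  # → False (returns instead of hanging)
```

This is roughly the degree of freedom the `span_exporter=` kwarg restores in 4.1+, which is why removing the request-path flush is the only safe move on 4.0.6.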
Tests

- Deleted the two tests that mocked `langfuse_flush()` — we deliberately removed that behaviour. A comment where they used to live points to the ADR revision.
Docs

- ADR-0005 revision expanded with source-level citations (`langfuse/_client/client.py:269`, `span_processor.py:108-112`, `resource_manager.py:430`, the OTel `__init__.py:174-224` retry loop).
- Notes `span_exporter=` in Langfuse 4.1+ as the cleaner escape hatch.

Rollout plan
1. `terraform apply` — changes the env vars, Langfuse stays OFF. `/team` + `/chat` still work on the deployed (untraced) Lambda.
2. Set `LANGFUSE_TRACING_ENABLED=true` on the live Lambda via CLI.
3. Exercise `/team` + `/chat`, measure latency. Should be <3s even if Langfuse is slow.
4. Promote `LANGFUSE_TRACING_ENABLED="true"` in Terraform.

Why this time will work (confidence check)
The research for #137 was docs-level and got the wrong answer. This time the research was source-level — the agent traced the exact call chain in the actual installed `langfuse==4.0.6` and OpenTelemetry SDK. The OTel exporter source literally has a comment saying the timeout env var is ignored. `LANGFUSE_TIMEOUT=2` is wired straight into the exporter ctor — verifiable by reading the SDK. The unbounded `.join()` path on `_score_ingestion_queue` is also visible in the SDK source.

The remaining risk is the background batch processor dropping spans when the Lambda freezes mid-batch. That's the accepted trade-off at our scale (1-2 rpm); if the Langfuse UI shows a visible drop rate, ADOT is the next step.
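The claim that `LANGFUSE_TIMEOUT` lands in the exporter ctor can be pictured with a stand-in (`ExporterStub` is hypothetical; the real class is the OTLP HTTP exporter that the Langfuse SDK constructs internally):

```python
import os

# Hypothetical stand-in for the wiring described above: a seconds-valued
# LANGFUSE_TIMEOUT env var (SDK default 5) becomes the exporter's cap on
# each export attempt, so retries cannot run to the full ~63s schedule.
class ExporterStub:
    def __init__(self, timeout: float = 5.0) -> None:
        self.timeout = timeout  # seconds per export attempt

os.environ["LANGFUSE_TIMEOUT"] = "2"  # what lambda.tf now sets
exporter = ExporterStub(timeout=float(os.environ["LANGFUSE_TIMEOUT"]))
print(exporter.timeout)  # → 2.0
```

The point of the sketch: unlike the ignored `OTEL_*` vars, this knob reaches code that actually reads it.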
Test plan
- `/team` returns in <3s even with tracing on
- `/chat` streams end-to-end without handler hang

🤖 Generated with Claude Code