fix(openai): capture trailing streamed usage chunk after finish_reason #780

Open
valentin-ib wants to merge 2 commits into agentjido:main from valentin-ib:usage-after-finish-reason

Conversation

valentin-ib commented Jun 18, 2026

Description

With stream_options: {include_usage: true}, Azure OpenAI and OpenAI-compatible gateways (e.g. LiteLLM) send token usage in a separate SSE chunk that arrives after the finish_reason chunk, just before [DONE]:

data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}]}
data: {"choices":[{"index":0,"delta":{}}],"usage":{"prompt_tokens":12,"completion_tokens":7,...}}
data: [DONE]

default_decode_stream_event/2 flags the finish_reason chunk terminal?: true, so StreamServer finalizes the stream there. A consumer that reads ReqLLM.Response.usage/1 as soon as the stream "completes" then races the still-in-flight usage chunk and gets input_tokens: 0 / output_tokens: 0 (and therefore no cost). Non-streaming responses are unaffected because usage is in the single body. In practice the outcome is effectively deterministic: the usage chunk arrives in a separate network frame, so the read almost always wins the race.
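
To make the ordering problem concrete, here is a minimal sketch that replays the three-event sequence above (plus one content chunk) as already-decoded maps. It is not ReqLLM code: the module name UsageRaceSketch, its functions, and the chunk shapes are assumptions for illustration, and it models the race synchronously by simply stopping the fold at whichever event is treated as terminal.

```elixir
defmodule UsageRaceSketch do
  # Hypothetical, already-decoded SSE events; :done stands in for `data: [DONE]`.
  @events [
    %{"choices" => [%{"index" => 0, "delta" => %{"content" => "Hi"}}]},
    %{"choices" => [%{"finish_reason" => "stop", "index" => 0, "delta" => %{}}]},
    %{
      "choices" => [%{"index" => 0, "delta" => %{}}],
      "usage" => %{"prompt_tokens" => 12, "completion_tokens" => 7}
    },
    :done
  ]

  @zero %{input_tokens: 0, output_tokens: 0}

  def events, do: @events

  # Folds events into a usage map and halts at the first event for which
  # `terminal?` returns true (after merging that event).
  def collect(events, terminal?) do
    Enum.reduce_while(events, @zero, fn event, acc ->
      acc = merge_usage(acc, event)
      if terminal?.(event), do: {:halt, acc}, else: {:cont, acc}
    end)
  end

  # True when any choice in the event carries a finish_reason string.
  def finish_reason?(%{"choices" => choices}),
    do: Enum.any?(choices, &is_binary(&1["finish_reason"]))

  def finish_reason?(_event), do: false

  def done?(event), do: event == :done

  defp merge_usage(acc, %{"usage" => %{"prompt_tokens" => p, "completion_tokens" => c}}),
    do: %{acc | input_tokens: p, output_tokens: c}

  defp merge_usage(acc, _event), do: acc
end

# Finalizing on the finish_reason chunk stops before the usage chunk is seen.
UsageRaceSketch.collect(UsageRaceSketch.events(), &UsageRaceSketch.finish_reason?/1)
# => %{input_tokens: 0, output_tokens: 0}

# Finalizing on [DONE] sees the trailing usage chunk first.
UsageRaceSketch.collect(UsageRaceSketch.events(), &UsageRaceSketch.done?/1)
# => %{input_tokens: 12, output_tokens: 7}
```

In the real library the stream is consumed asynchronously, so the early read is a race rather than a guaranteed cut-off; the sketch only shows why the position of the terminal marker decides whether usage is visible.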

Fix

ReqLLM.Providers.OpenAI.ChatAPI.decode_stream_event/2 strips the terminal? flag off normal-completion finish_reason chunks, so the stream finalizes on [DONE] (or connection close) instead — by which point the trailing usage chunk has been accumulated into metadata.

Guards keep existing behavior intact:

  • Inline error chunks stay terminal. finish_reason: :error (and any meta carrying an :error key — how OpenAI-compatible gateways report mid-stream failures via data: {"error": ...}) keep terminal?, so failures still surface immediately instead of waiting for a [DONE] that won't come.
  • [DONE] and empty-choices usage chunks have no :finish_reason key, so they're untouched and keep their own terminal flag.

Scoped to the OpenAI ChatAPI driver; non-streaming and other providers are unaffected.
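
The guard rules above could be sketched roughly as follows. This is an illustration of the logic as described, not the actual ChatAPI code: the TerminalGuardSketch module, the strip_terminal/1 function, and the plain-map chunk shape (%{type: :meta, metadata: map}) are assumptions, and the real chunk struct may differ.

```elixir
defmodule TerminalGuardSketch do
  # Hypothetical chunk shape: %{type: :meta, metadata: map}.
  # Drops :terminal? only from normal-completion finish_reason chunks.
  def strip_terminal(%{type: :meta, metadata: meta} = chunk) do
    if normal_finish?(meta) do
      %{chunk | metadata: Map.delete(meta, :terminal?)}
    else
      chunk
    end
  end

  # Non-meta chunks (content, tool calls, ...) pass through unchanged.
  def strip_terminal(chunk), do: chunk

  # A "normal" finish has a :finish_reason key, is not :error,
  # and carries no :error key in its metadata.
  defp normal_finish?(meta) do
    Map.has_key?(meta, :finish_reason) and
      meta[:finish_reason] != :error and
      not Map.has_key?(meta, :error)
  end
end

# Normal completion: the flag is removed, so the stream keeps reading.
TerminalGuardSketch.strip_terminal(%{type: :meta, metadata: %{finish_reason: :stop, terminal?: true}})
# => %{type: :meta, metadata: %{finish_reason: :stop}}

# Error finish and [DONE]-style chunks (no :finish_reason key) keep their flag.
TerminalGuardSketch.strip_terminal(%{type: :meta, metadata: %{finish_reason: :error, terminal?: true}})
# => %{type: :meta, metadata: %{finish_reason: :error, terminal?: true}}
```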

Alternative considered

The deeper root cause is that StreamServer snapshots metadata at the terminal chunk, before the trailing usage chunk merges. A fix there (defer finalization/metadata until the stream truly ends) would be more general but a larger change to the finalization lifecycle. This PR is the minimal, decoder-local fix — happy to take the broader approach instead if you'd prefer.

Type of Change

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality)
  • Breaking change (fix or feature causing existing functionality to change)
  • Documentation update

Breaking Changes

None. The public API and the Response struct are unchanged; streamed Response.usage is simply populated where it was previously zero.

Testing

  • Tests pass (mix test)
  • Quality checks pass (mix format / compile clean; CI runs full mix quality)

New unit tests in test/providers/openai_chat_streaming_usage_test.exs (pure decode tests — no live API / fixtures needed):

  • finish_reason chunk is no longer terminal
  • the trailing usage chunk yields usage
  • a combined finish_reason + usage chunk still yields usage
  • an empty-choices usage chunk keeps its terminal flag
  • [DONE] stays terminal
  • an inline error chunk stays terminal

Checklist

  • My code follows the project's style guidelines
  • I have added tests that prove my fix works
  • All new and existing tests pass
  • My commits follow conventional commit format
  • I have NOT edited CHANGELOG.md (auto-generated by git_ops)

Related Issues

Closes #781

@valentin-ib (Author)

For context, I identified this whilst setting up a token and cost usage dashboard and noticed that the usage wasn't coming through (it was always 0), so I wanted to raise it and check whether anyone else thinks this is a reasonable amendment.

valentin-ib force-pushed the usage-after-finish-reason branch from 71c59b1 to 87e3551 on June 18, 2026 15:09
mikehostetler added the needs_work label (Changes requested before merge) on Jun 21, 2026
1Steamwork1 and others added 2 commits June 21, 2026 10:19
Some OpenAI-compatible providers — notably Azure OpenAI and gateways like
LiteLLM — stream token usage in a SEPARATE chunk that arrives AFTER the
finish_reason chunk and just before [DONE], when stream_options.include_usage
is set:

    data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}]}
    data: {"choices":[{"index":0,"delta":{}}],"usage":{...}}
    data: [DONE]

default_decode_stream_event flags the finish_reason chunk terminal?: true,
which finalizes the stream there. A consumer that reads Response.usage right
after the stream "completes" then races the still-in-flight usage chunk, so
input/output tokens (and any cost derived from them) come back as zero.
Non-streaming responses carry usage in the single body and are unaffected.

ChatAPI.decode_stream_event/2 strips the terminal flag off normal-completion
finish_reason chunks so the stream finalizes on [DONE] (or connection close)
instead — by which point the usage chunk has been accumulated. Inline error
chunks (finish_reason: :error, or any chunk carrying an :error key) keep their
terminal flag so failures still surface immediately. [DONE] and empty-choices
usage chunks have no :finish_reason key and are untouched.

Adds regression tests for the chunk ordering, the combined finish_reason+usage
chunk, empty-choices usage, and error-chunk termination.
mikehostetler force-pushed the usage-after-finish-reason branch from 87e3551 to 0d340dc on June 21, 2026 15:29
mikehostetler added the ready_to_merge label and removed the needs_work label (Changes requested before merge) on Jun 21, 2026

Contributor

Thanks for the detailed repro. I dug into this against the latest PR head: after rebasing, the decoder-level claim in the original description no longer matches current main because default_decode_stream_event/2 already emits the finish_reason meta chunk without terminal?.

I removed the redundant ChatAPI override and added an end-to-end regression that replays the ordering from #781: content, then finish_reason: "stop", then a separate usage chunk with non-empty choices, then [DONE]. That test verifies StreamResponse.to_response/1 returns the trailing usage as non-zero Response.usage.

Could you please test the latest PR branch against your original Azure OpenAI / LiteLLM setup or token/cost dashboard repro? The simulated SSE path is covered and CI is green, but I want confirmation that the real provider path that produced zero usage is fixed before we treat #781 as fully closed.

mikehostetler added the needs_work label (Changes requested before merge) and removed the ready_to_merge label on Jun 21, 2026
Labels

needs_work Changes requested before merge

Development

Successfully merging this pull request may close these issues.

[Bug]: Streamed token usage lost on Azure/LiteLLM (usage chunk arrives after finish_reason)

3 participants