fix(sync): correct Phoenix span extraction for multi-span traces and tool-calling agents#273
fix(sync): correct Phoenix span extraction for multi-span traces and tool-calling agents#273evduester wants to merge 1 commit into
Conversation
…tool-calling agents
Two independent fixes to `PhoenixSync` that affect guideline generation accuracy.
Fix 1 — trace-level dedup and representative span selection:
Each agent run emits one Phoenix span per LLM call, with every subsequent span
re-including all prior messages (cumulative context window). Processing all spans
caused the same conversation steps to be analysed multiple times with growing
context, inflating guideline counts and skewing results. The fix deduplicates at
trace level (`_get_processed_trace_ids`) and picks a single representative span
per trace — the last by `start_time`, which holds the most complete message
history (`_select_representative_spans`).
Fix 2 — complete message and tool extraction from OpenInference spans:
The Phoenix REST API returns OpenInference attributes as flat indexed keys
(`llm.input_messages.{i}.message.*`, `llm.tools.{i}.tool.json_schema`) rather
than nested lists. The previous extraction code missed `tool_call_id` on tool
messages and `tool_calls` on assistant messages, producing incomplete trajectories
where tool-calling steps appeared as `content: "None"` with no linkage between
assistant calls and tool results. Guidelines generated from such trajectories
lacked tool-use context. The fix adds:
- `tool_call_id` extraction in both message loops
- an indexed-attribute reader for `llm.input_messages.{i}.*` and
`llm.output_messages.{i}.*` (including nested `tool_calls.{j}.*`)
- `_extract_tools_from_span`: parses `llm.tools.{i}.tool.json_schema` and the
list-mode `{"tool.json_schema": "..."}` format into OpenAI tool dicts
- `_convert_openinference_tool_calls`: converts OpenInference tool_call dicts to
OpenAI format (`tool_call.function.name` → `function.name`, etc.)
- `_extract_trajectory`: propagates `tool_calls` and `tool_call_id` from the
extracted message dict, choosing the right assembly path per message format
Unit tests updated: entity metadata mocks now include `trace_id` to match the
new trace-level dedup logic.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthrough
ChangesPhoenixSync trace deduplication and tool handling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/unit/test_phoenix_sync.py (1)
726-766: ⚖️ Poor tradeoffConsider adding unit tests for new helper methods.
The existing
TestGetProcessedSpanIdsclass tests_get_processed_span_ids(), but the new analogous_get_processed_trace_ids()method lacks dedicated tests. Similarly,_select_representative_spans(),_extract_tools_from_span(), and_convert_openinference_tool_calls()are only covered indirectly throughsync()integration tests.Adding targeted unit tests would improve coverage for edge cases like:
_select_representative_spans()with missingstart_timeor multiple spans per trace_extract_tools_from_span()for each of the three attribute conventions_convert_openinference_tool_calls()format conversion and passthroughAs per coding guidelines: "All new features need tests."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_phoenix_sync.py` around lines 726 - 766, Add dedicated unit test classes for the new helper methods that currently lack direct test coverage. Create a TestGetProcessedTraceIds class similar to the existing TestGetProcessedSpanIds to test the _get_processed_trace_ids() method with scenarios for empty results, existing entities, and namespace not found exceptions. Add a TestSelectRepresentativeSpans class to test edge cases including missing start_time attributes and multiple spans per trace. Add a TestExtractToolsFromSpan class to test the method for each of the three supported attribute conventions. Add a TestConvertOpeninferenceToolCalls class to test format conversion logic and passthrough behavior. Ensure each test method follows the existing naming conventions and includes appropriate assertions and mocking setup.Source: Coding guidelines
altk_evolve/sync/phoenix_sync.py (1)
529-544: 💤 Low valueRedundant JSON schema parsing in fallback block.
Lines 537-543 attempt to parse
llm.tools.{i}.tool.json_schemaagain, but this same key was already read and parsed (or failed) at lines 521-527. If parsing succeeded, we alreadycontinued; if it failed or the key was absent, re-reading yields the same result.Consider removing this redundant block or clarifying the intent if the fallback is meant to handle a different attribute.
♻️ Proposed fix
# Fall back to building from name/description/parameters parts name = attrs.get(f"llm.tools.{i}.tool.name") if not name: continue tool: dict = {"type": "function", "function": {"name": name}} description = attrs.get(f"llm.tools.{i}.tool.description") if description: tool["function"]["description"] = description - json_schema = attrs.get(f"llm.tools.{i}.tool.json_schema") - if json_schema: - try: - schema = json.loads(json_schema) if isinstance(json_schema, str) else json_schema - tool["function"]["parameters"] = schema - except (json.JSONDecodeError, Exception): - pass + # parameters could come from a separate .parameters key if schema parsing failed above + parameters_str = attrs.get(f"llm.tools.{i}.tool.parameters") + if parameters_str: + try: + tool["function"]["parameters"] = json.loads(parameters_str) if isinstance(parameters_str, str) else parameters_str + except (json.JSONDecodeError, Exception): + pass tools.append(tool)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@altk_evolve/sync/phoenix_sync.py` around lines 529 - 544, The json_schema parsing block in the fallback section (lines 537-543) is redundant because the same llm.tools.{i}.tool.json_schema attribute was already read and parsed earlier in the code flow at lines 521-527. Since a successful parse would have already continued to the next iteration, re-attempting to parse the same attribute in the fallback will only produce the same result. Remove the redundant json_schema parsing block (the try-except block that reads and parses json_schema and assigns it to tool["function"]["parameters"]) from the fallback section to eliminate the duplicate logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@altk_evolve/sync/phoenix_sync.py`:
- Around line 529-544: The json_schema parsing block in the fallback section
(lines 537-543) is redundant because the same llm.tools.{i}.tool.json_schema
attribute was already read and parsed earlier in the code flow at lines 521-527.
Since a successful parse would have already continued to the next iteration,
re-attempting to parse the same attribute in the fallback will only produce the
same result. Remove the redundant json_schema parsing block (the try-except
block that reads and parses json_schema and assigns it to
tool["function"]["parameters"]) from the fallback section to eliminate the
duplicate logic.
In `@tests/unit/test_phoenix_sync.py`:
- Around line 726-766: Add dedicated unit test classes for the new helper
methods that currently lack direct test coverage. Create a
TestGetProcessedTraceIds class similar to the existing TestGetProcessedSpanIds
to test the _get_processed_trace_ids() method with scenarios for empty results,
existing entities, and namespace not found exceptions. Add a
TestSelectRepresentativeSpans class to test edge cases including missing
start_time attributes and multiple spans per trace. Add a
TestExtractToolsFromSpan class to test the method for each of the three
supported attribute conventions. Add a TestConvertOpeninferenceToolCalls class
to test format conversion logic and passthrough behavior. Ensure each test
method follows the existing naming conventions and includes appropriate
assertions and mocking setup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4fe9522f-7f2d-4812-bfe7-f592a2818e09
📒 Files selected for processing (2)
altk_evolve/sync/phoenix_sync.pytests/unit/test_phoenix_sync.py
Summary
Fix 1 — trace-level dedup: Each agent run emits one Phoenix span per LLM call, with every subsequent span re-including all prior messages (cumulative context window). Processing all spans caused the same conversation steps to be analysed multiple times, inflating guideline counts. Now deduplicates at trace level and picks a single representative span per trace — the last by
start_time, which has the most complete message history.Fix 2 — complete message and tool extraction: The Phoenix REST API returns OpenInference attributes as flat indexed keys (
llm.input_messages.{i}.message.*,llm.tools.{i}.tool.json_schema) rather than nested lists. The old extraction missedtool_call_idon tool messages andtool_callson assistant messages.Before the fix, a tool-calling agent trajectory looked like this — the assistant call and tool results are completely unlinked:
{"role": "assistant", "content": "None"}, {"role": "tool", "content": "20"}, {"role": "tool", "content": "25"}After the fix, the trajectory is complete and valid:
{"role": "assistant", "tool_calls": [{"id": "call_abc", "type": "function", "function": {"name": "multiply", "arguments": "{\"a\": 10, \"b\": 2}"}}]}, {"role": "tool", "tool_call_id": "call_abc", "content": "20"}, {"role": "tool", "tool_call_id": "call_def", "content": "25"}Guidelines generated from the broken trajectories lacked tool-use context entirely. Fixed by adding an indexed-attribute reader for
llm.input_messages.{i}.*/llm.output_messages.{i}.*(including nestedtool_calls.{j}.*),_extract_tools_from_spanto parsellm.tools.{i}.tool.json_schema, and_convert_openinference_tool_callsto convert OpenInference tool_call dicts to OpenAI format.Test plan
uv run pytest tests/unit/test_phoenix_sync.py -v— all 46 tests passevolve sync phoenixagainst a Phoenix project with a tool-calling agent and verify the trajectory debug file shows completetool_calls/tool_call_idfields and correct tool definitionsskippedcount in sync output reflects unique traces, not individual spans🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
Bug Fixes