feat: expose doc_id in query responses #3029
Carry full_doc_id from the storage layer through the retrieval pipeline so it appears as doc_id in query response references and chunks.

- operate.py: extract full_doc_id in _get_vector_context() and propagate through all three branches of _merge_all_chunks()
- utils.py: add doc_id to convert_to_user_format() chunks and to generate_reference_list_from_chunks() references
- query_routes.py: add doc_id field to ReferenceItem model

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a08d38342
    if not file_path_to_doc_id.get(file_path):
        file_path_to_doc_id[file_path] = chunk.get("full_doc_id", "")
Preserve separate doc IDs per reference
When query results contain chunks from multiple documents that share the same file_path, this stores only the first full_doc_id for that path. insert(..., file_paths=...) allows repeated paths for different document IDs, and the rest of this function still groups references and assigns reference_ids solely by file_path, so the returned reference can include chunks from several docs while advertising a single, sometimes wrong doc_id. Clients that use the new field to fetch or manage the referenced document will miss the other chunks; group references by (file_path, full_doc_id) or return all doc IDs for the path.
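The first option the review suggests, grouping by (file_path, full_doc_id), could look roughly like the sketch below. This is not the project's generate_reference_list_from_chunks(); the function name build_references and the output shape are hypothetical, and only the chunk keys file_path and full_doc_id are taken from the diff above.

```python
from typing import Dict, List, Tuple


def build_references(chunks: List[dict]) -> List[dict]:
    """Hypothetical helper: emit one reference per (file_path, full_doc_id) pair."""
    ref_ids: Dict[Tuple[str, str], str] = {}
    references: List[dict] = []
    for chunk in chunks:
        file_path = chunk.get("file_path", "")
        if not file_path:
            continue  # chunks without a path get no reference
        key = (file_path, chunk.get("full_doc_id", ""))
        if key not in ref_ids:
            # Assign reference ids in first-seen order, per composite key.
            ref_ids[key] = str(len(ref_ids) + 1)
            references.append(
                {"reference_id": ref_ids[key], "file_path": key[0], "doc_id": key[1]}
            )
    return references
```

With this keying, two documents inserted under the same file_path produce two references, each carrying its own doc_id, instead of one reference that reports only the first ID seen.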
@codex review
To use Codex here, create a Codex account and connect to github.
never mind |
Allow users to specify custom document IDs via REST API:

- /documents/upload: new doc_id query parameter
- /documents/text: new doc_id field in InsertTextRequest
- /documents/texts: new doc_ids field in InsertTextsRequest

Pydantic validators ensure doc_ids are non-empty, non-whitespace, unique within the request, and match texts length. Duplicate checks against doc_status storage use batch get_by_ids to avoid N+1 queries.

Also fix a 500 error when Pydantic model_validator errors contain unserializable objects in ctx['error'] by converting them to strings before JSON encoding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
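The four request-level rules for doc_ids can be illustrated in plain Python. This is a sketch under assumptions, not the validators in the PR: the function name validate_doc_ids and the error messages are hypothetical, and whether the real validators normalize the IDs (for example by stripping whitespace) is not stated here, so the sketch returns them unchanged.

```python
from typing import List


def validate_doc_ids(texts: List[str], doc_ids: List[str]) -> List[str]:
    """Hypothetical stand-in for the Pydantic validators described above."""
    # Rule: one doc_id per text.
    if len(doc_ids) != len(texts):
        raise ValueError("doc_ids length must match texts length")
    # Rules: non-empty and non-whitespace.
    if any(not doc_id.strip() for doc_id in doc_ids):
        raise ValueError("doc_ids must be non-empty and non-whitespace")
    # Rule: unique within the request.
    if len(set(doc_ids)) != len(doc_ids):
        raise ValueError("doc_ids must be unique within the request")
    return doc_ids
```

These checks only cover a single request; the duplicate check against doc_status storage is a separate step.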
Users can now bring their own doc_id via the API instead of relying on auto-generated MD5 hashes. This is especially useful when integrating an existing knowledge base, because document IDs stay consistent across systems.
Summary

Carry full_doc_id from the storage layer through the retrieval pipeline so it appears as doc_id in all query response references and chunks.

Problem

full_doc_id is already stored in both the KV store and the vector database, but the retrieval pipeline drops it at multiple points.

Solution

4-point pass-through (17 lines added, 3 files changed):

- operate.py: extract full_doc_id in _get_vector_context() and propagate through all three branches of _merge_all_chunks()
- utils.py: add doc_id to convert_to_user_format() chunks and generate_reference_list_from_chunks() references
- query_routes.py: add optional doc_id field to ReferenceItem model

doc_id is Optional[str], fully backward compatible. All three query endpoints (/query, /query/stream, /query/data) gain doc_id.
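A minimal sketch of why an optional field keeps the change backward compatible. The real ReferenceItem is a Pydantic model whose other fields are not shown in this description; the dataclass ReferenceItemSketch, the helper to_user_chunk, and the content key are hypothetical stand-ins used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceItemSketch:
    """Hypothetical stand-in for the ReferenceItem response model."""
    reference_id: str
    file_path: str
    doc_id: Optional[str] = None  # new field; defaults to None when absent


def to_user_chunk(chunk: dict) -> dict:
    """Hypothetical conversion: expose the stored full_doc_id as doc_id."""
    return {
        "content": chunk.get("content", ""),
        "file_path": chunk.get("file_path", ""),
        # Chunks that never carried full_doc_id simply yield None here.
        "doc_id": chunk.get("full_doc_id"),
    }
```

Because doc_id defaults to None, code that constructs a reference without it keeps working, and clients that ignore the new key are unaffected.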