feat: expose doc_id in query responses #3029

Open
eeeetttt wants to merge 3 commits into HKUDS:main from eeeetttt:feat/expose-doc-id-in-query-responses

@eeeetttt commented May 7, 2026

Summary

Carry full_doc_id from the storage layer through the retrieval pipeline
so it appears as doc_id in all query response references and chunks.

Problem

full_doc_id is already stored in both KV store and vector database,
but the retrieval pipeline drops it at multiple points.

Solution

4-point pass-through (17 lines added, 3 files changed):

  • operate.py: extract full_doc_id in _get_vector_context() and
    propagate through all three branches of _merge_all_chunks()
  • utils.py: add doc_id to convert_to_user_format() chunks and
    generate_reference_list_from_chunks() references
  • query_routes.py: add optional doc_id field to ReferenceItem model

doc_id is Optional[str], fully backward compatible.
All three query endpoints (/query, /query/stream, /query/data) gain doc_id.
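A minimal sketch of the pass-through described above, not the actual diff. The `ReferenceItem` dataclass stands in for the response model in query_routes.py, and `to_user_chunk` is a hypothetical helper name; only the `full_doc_id` to `doc_id` mapping and the optional field come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReferenceItem:
    """Stand-in for the response model (assumed shape, not the real class)."""
    reference_id: str
    file_path: str
    doc_id: Optional[str] = None  # new optional field; stays None for older data


def to_user_chunk(raw_chunk: dict) -> dict:
    """Hypothetical helper: expose the stored full_doc_id as doc_id."""
    return {
        "content": raw_chunk.get("content", ""),
        "file_path": raw_chunk.get("file_path", ""),
        "doc_id": raw_chunk.get("full_doc_id"),  # None if the chunk lacks it
    }


old_chunk = {"content": "text", "file_path": "a.pdf"}
new_chunk = {"content": "text", "file_path": "a.pdf", "full_doc_id": "doc-123"}

print(to_user_chunk(old_chunk)["doc_id"])
print(to_user_chunk(new_chunk)["doc_id"])
```

Because the field defaults to `None`, clients that ignore `doc_id` see no change in behavior, which is what the backward-compatibility claim relies on.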

Carry full_doc_id from the storage layer through the retrieval pipeline
so it appears as doc_id in query response references and chunks.

- operate.py: extract full_doc_id in _get_vector_context() and propagate
  through all three branches of _merge_all_chunks()
- utils.py: add doc_id to convert_to_user_format() chunks and to
  generate_reference_list_from_chunks() references
- query_routes.py: add doc_id field to ReferenceItem model

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@danielaskdd (Collaborator) commented

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a08d38342

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lightrag/utils.py
Comment on lines +3370 to +3371
    if not file_path_to_doc_id.get(file_path):
        file_path_to_doc_id[file_path] = chunk.get("full_doc_id", "")

P2: Preserve separate doc IDs per reference

When query results contain chunks from multiple documents that share the same file_path, this stores only the first full_doc_id for that path. insert(..., file_paths=...) allows repeated paths for different document IDs, and the rest of this function still groups references and assigns reference_ids solely by file_path, so the returned reference can include chunks from several docs while advertising a single, sometimes wrong doc_id. Clients that use the new field to fetch or manage the referenced document will miss the other chunks; group references by (file_path, full_doc_id) or return all doc IDs for the path.
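One way the first suggestion could look, as a sketch only: references are keyed by the pair (file_path, full_doc_id) instead of file_path alone. The function name `build_references` and the returned dictionary shapes are assumptions for illustration and do not mirror the real generate_reference_list_from_chunks().

```python
from typing import Dict, List, Tuple


def build_references(chunks: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Assign one reference_id per (file_path, full_doc_id) pair.

    Returns the reference list and copies of the chunks tagged with
    their reference_id. Input chunks are not modified.
    """
    ref_ids: Dict[Tuple[str, str], str] = {}
    references: List[dict] = []
    tagged_chunks: List[dict] = []

    for chunk in chunks:
        key = (chunk.get("file_path", ""), chunk.get("full_doc_id", ""))
        if key not in ref_ids:
            # First time this document is seen: allocate the next id.
            ref_ids[key] = str(len(ref_ids) + 1)
            references.append(
                {
                    "reference_id": ref_ids[key],
                    "file_path": key[0],
                    "doc_id": key[1] or None,
                }
            )
        tagged_chunks.append({**chunk, "reference_id": ref_ids[key]})

    return references, tagged_chunks


chunks = [
    {"file_path": "a.pdf", "full_doc_id": "doc-1"},
    {"file_path": "a.pdf", "full_doc_id": "doc-2"},  # same path, different doc
    {"file_path": "a.pdf", "full_doc_id": "doc-1"},
]
refs, tagged = build_references(chunks)
print(refs)
```

With this keying, two documents that share a path produce two references, so each `doc_id` matches the chunks listed under it.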

@eeeetttt (Author) commented May 7, 2026

@codex review

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect to GitHub.

@eeeetttt (Author) commented May 7, 2026

never mind

Allow users to specify custom document IDs via REST API:

- /documents/upload: new doc_id query parameter
- /documents/text: new doc_id field in InsertTextRequest
- /documents/texts: new doc_ids field in InsertTextsRequest

Pydantic validators ensure doc_ids are non-empty, non-whitespace,
unique within the request, and match texts length. Duplicate checks
against doc_status storage use batch get_by_ids to avoid N+1 queries.
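The checks listed above can be sketched as plain functions. This is a framework-free stand-in rather than the Pydantic validators in the commit; `validate_doc_ids` and `find_existing` are hypothetical names, and `get_by_ids` is assumed here to be a synchronous callable that returns one entry per requested ID, with `None` for IDs that are not stored.

```python
from typing import Callable, List, Optional


def validate_doc_ids(texts: List[str], doc_ids: Optional[List[str]]) -> Optional[List[str]]:
    """Apply the request-level checks: length, non-blank, unique."""
    if doc_ids is None:
        return None  # caller did not supply custom IDs
    if len(doc_ids) != len(texts):
        raise ValueError("doc_ids must have the same length as texts")
    if any(not doc_id.strip() for doc_id in doc_ids):
        raise ValueError("doc_ids must be non-empty and non-whitespace")
    if len(set(doc_ids)) != len(doc_ids):
        raise ValueError("doc_ids must be unique within the request")
    return doc_ids


def find_existing(
    doc_ids: List[str],
    get_by_ids: Callable[[List[str]], List[Optional[dict]]],
) -> List[str]:
    """Return the IDs already present, using one batched lookup."""
    records = get_by_ids(doc_ids)  # single call instead of one per ID
    return [doc_id for doc_id, rec in zip(doc_ids, records) if rec is not None]


# In-memory stand-in for the doc_status storage.
stored = {"doc-1": {"status": "processed"}}
lookup = lambda ids: [stored.get(i) for i in ids]

print(validate_doc_ids(["a", "b"], ["doc-1", "doc-2"]))
print(find_existing(["doc-1", "doc-2"], lookup))
```

The batched lookup is the part that avoids N+1 queries: the storage is hit once per request, however many IDs the request carries.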

Also fix a 500 error when Pydantic model_validator errors contain
unserializable objects in ctx['error'] by converting them to strings
before JSON encoding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
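A rough illustration of the 500-error fix, assuming validation errors arrive as a list of dicts whose `ctx['error']` may hold an exception object. The helper name `sanitize_errors` and the sample error entry are hypothetical; the actual handler in the PR may be structured differently.

```python
import json
from typing import List


def sanitize_errors(errors: List[dict]) -> List[dict]:
    """Return copies of the error dicts with ctx['error'] converted to str."""
    sanitized = []
    for err in errors:
        err = dict(err)  # shallow copy so the input list is left untouched
        ctx = err.get("ctx")
        if isinstance(ctx, dict) and "error" in ctx:
            err["ctx"] = {**ctx, "error": str(ctx["error"])}
        sanitized.append(err)
    return sanitized


raw_errors = [
    {
        "type": "value_error",
        "loc": ["body"],
        "msg": "Value error, doc_ids must be unique",
        "ctx": {"error": ValueError("doc_ids must be unique")},
    }
]

try:
    json.dumps(raw_errors)
except TypeError as exc:
    # The exception object inside ctx cannot be JSON-encoded.
    print("raw errors fail:", exc)

print(json.dumps(sanitize_errors(raw_errors)))
```

Converting to a string before encoding keeps the message visible to the client while making the error payload serializable.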

@eeeetttt (Author) commented May 8, 2026

Users can now supply their own doc_id via the API instead of relying on auto-generated MD5 hashes. This is especially useful when integrating an existing knowledge base, because document IDs stay consistent across systems.
