feat: expose doc_id in query responses by eeeetttt · Pull Request #3029 · HKUDS/LightRAG

eeeetttt · 2026-05-07T06:51:58Z

Summary

Carry full_doc_id from the storage layer through the retrieval pipeline
so it appears as doc_id in all query response references and chunks.

Problem

full_doc_id is already stored in both KV store and vector database,
but the retrieval pipeline drops it at multiple points.

Solution

4-point pass-through (17 lines added, 3 files changed):

- operate.py: extract full_doc_id in _get_vector_context() and
  propagate through all three branches of _merge_all_chunks()
- utils.py: add doc_id to convert_to_user_format() chunks and
  generate_reference_list_from_chunks() references
- query_routes.py: add optional doc_id field to ReferenceItem model

doc_id is Optional[str], fully backward compatible.
All three query endpoints (/query, /query/stream, /query/data) gain doc_id.
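A minimal sketch of the pass-through idea, assuming chunks are plain dicts that carry `full_doc_id` from storage. The helper name `to_user_chunk`, the `UserChunk` shape, and the `"unknown_source"` fallback are hypothetical and are not taken from the PR's actual code.

```python
from typing import Optional, TypedDict


class UserChunk(TypedDict):
    # Hypothetical user-facing chunk shape; the real response has more fields.
    content: str
    file_path: str
    doc_id: Optional[str]


def to_user_chunk(stored_chunk: dict) -> UserChunk:
    """Map a stored chunk record to a user-facing chunk (illustrative only)."""
    return {
        "content": stored_chunk.get("content", ""),
        "file_path": stored_chunk.get("file_path", "unknown_source"),
        # Records without full_doc_id yield None, so existing clients that
        # ignore doc_id keep working unchanged.
        "doc_id": stored_chunk.get("full_doc_id") or None,
    }
```

Keeping `doc_id` optional is what makes the change backward compatible: a chunk stored without `full_doc_id` still produces a valid response item.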

Carry full_doc_id from the storage layer through the retrieval pipeline so it appears as doc_id in query response references and chunks.

- operate.py: extract full_doc_id in _get_vector_context() and propagate through all three branches of _merge_all_chunks()
- utils.py: add doc_id to convert_to_user_format() chunks and to generate_reference_list_from_chunks() references
- query_routes.py: add doc_id field to ReferenceItem model

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

danielaskdd · 2026-05-07T07:51:01Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a08d38342


chatgpt-codex-connector · 2026-05-07T07:53:58Z

+            if not file_path_to_doc_id.get(file_path):
+                file_path_to_doc_id[file_path] = chunk.get("full_doc_id", "")


Preserve separate doc IDs per reference

When query results contain chunks from multiple documents that share the same file_path, this stores only the first full_doc_id for that path. insert(..., file_paths=...) allows repeated paths for different document IDs, and the rest of this function still groups references and assigns reference_ids solely by file_path, so the returned reference can include chunks from several docs while advertising a single, sometimes wrong doc_id. Clients that use the new field to fetch or manage the referenced document will miss the other chunks; group references by (file_path, full_doc_id) or return all doc IDs for the path.
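One way to read the first suggestion (grouping by `(file_path, full_doc_id)`) is sketched below. This is an illustration under assumptions, not the repository's `generate_reference_list_from_chunks()`: the function name `build_references`, the first-seen ordering of reference ids, and the output shape are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_references(chunks: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Assign one reference per (file_path, full_doc_id) pair.

    Returns (references, tagged_chunks). Input chunks are not mutated.
    """
    key_to_ref_id: Dict[Tuple[str, str], str] = {}
    references: List[dict] = []
    tagged_chunks: List[dict] = []

    for chunk in chunks:
        key = (chunk.get("file_path", ""), chunk.get("full_doc_id", ""))
        if key not in key_to_ref_id:
            # Assumption: ids are assigned in first-seen order.
            ref_id = str(len(key_to_ref_id) + 1)
            key_to_ref_id[key] = ref_id
            references.append(
                {
                    "reference_id": ref_id,
                    "file_path": key[0],
                    "doc_id": key[1] or None,
                }
            )
        tagged_chunks.append({**chunk, "reference_id": key_to_ref_id[key]})

    return references, tagged_chunks
```

With this keying, two documents inserted under the same `file_path` get separate references, so each reference's `doc_id` matches every chunk tagged with its `reference_id`.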


eeeetttt · 2026-05-07T08:00:47Z

@codex review

chatgpt-codex-connector · 2026-05-07T08:00:55Z

To use Codex here, create a Codex account and connect to github.

eeeetttt · 2026-05-07T08:01:24Z

never mind

Allow users to specify custom document IDs via REST API:

- /documents/upload: new doc_id query parameter
- /documents/text: new doc_id field in InsertTextRequest
- /documents/texts: new doc_ids field in InsertTextsRequest

Pydantic validators ensure doc_ids are non-empty, non-whitespace, unique within the request, and match texts length. Duplicate checks against doc_status storage use batch get_by_ids to avoid N+1 queries.

Also fix a 500 error when Pydantic model_validator errors contain unserializable objects in ctx['error'] by converting them to strings before JSON encoding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
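The validation rules and the error-serialization fix described in this commit message might look roughly like the following. It is written as plain functions rather than Pydantic validators so it runs with the standard library only. The names `validate_doc_ids` and `errors_to_json` are hypothetical, and returning the stripped ids is an assumption, not confirmed behaviour.

```python
import json
from typing import Any, Dict, List, Optional


def validate_doc_ids(texts: List[str], doc_ids: Optional[List[str]]) -> Optional[List[str]]:
    """Check the rules listed in the commit message (illustrative only)."""
    if doc_ids is None:
        return None  # caller falls back to auto-generated ids
    if len(doc_ids) != len(texts):
        raise ValueError("doc_ids must have the same length as texts")
    cleaned = [doc_id.strip() for doc_id in doc_ids]
    if any(not doc_id for doc_id in cleaned):
        raise ValueError("doc_ids must not be empty or whitespace-only")
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("doc_ids must be unique within the request")
    return cleaned


def errors_to_json(errors: List[Dict[str, Any]]) -> str:
    """Serialize validation errors, stringifying any object in ctx['error']."""
    sanitized = []
    for err in errors:
        err = dict(err)  # shallow copy so the caller's list is untouched
        ctx = err.get("ctx")
        if isinstance(ctx, dict) and "error" in ctx:
            # Exception instances are not JSON serializable; keep the message.
            err["ctx"] = {**ctx, "error": str(ctx["error"])}
        sanitized.append(err)
    return json.dumps(sanitized)
```

The batch duplicate check against doc_status storage is left out here because it depends on the storage interface.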

eeeetttt · 2026-05-08T01:45:59Z

Users can now bring their own doc_id via the API instead of relying on auto-generated MD5 hashes. This is especially useful when integrating an existing knowledge base — keeping document IDs consistent across systems.


fix: ruff-format style

f4dc963

eeeetttt wants to merge 3 commits into HKUDS:main from eeeetttt:feat/expose-doc-id-in-query-responses

2 participants
