
fix(pipeline): preserve process_options in doc_status metadata across transitions#3017

Merged
danielaskdd merged 4 commits into HKUDS:dev from danielaskdd:fix/preserve-process-options-metadata on May 5, 2026

Conversation

@danielaskdd
Collaborator

Summary

P2 fix: doc_status.metadata.process_options was only set at the initial PENDING insert and got clobbered by every subsequent state-machine upsert. doc_status storage backends replace the entire metadata blob on each upsert, so the moment the pipeline advanced the document to PARSING / ANALYZING / PROCESSING / PROCESSED / FAILED, admin / list APIs lost the per-document strategy that the new API contract says they can surface.

This stacks on top of #3013, #3014, #3015, and #3016; only the new fix(pipeline): preserve process_options ... commit is unique to this PR.

Fix

  • New helper doc_status_transition_metadata(status_doc, *, extra=None) in lightrag/utils_pipeline.py. Builds the metadata payload to upsert by:

    1. Carrying forward keys listed in _DOC_STATUS_METADATA_CARRY_OVER_KEYS (today only process_options) from the loaded status_doc.metadata.
    2. Layering in any transition-specific extra= fields (processing_start_time / processing_end_time / extraction meta).

    Future long-lived metadata fields can be added by extending the tuple — no per-site touch needed.

  • Wired the helper into every state-transition upsert in lightrag/pipeline.py:

    • PENDING-reset in _validate_and_fix_document_consistency
    • Inline path: PARSING, ANALYZING, PROCESSING, PROCESSED, FAILED-extract, FAILED-merge
    • Worker path: PARSING, ANALYZING, FAILED-parse
    • _mark_duplicate_after_parse (content-hash duplicate record)

    Sites that did not previously write a metadata field now do, so carry-over behaviour is consistent regardless of which state the document is in.

Why explicit carry-over instead of storage-layer merge?

Two alternatives were considered and rejected:

  • Storage-layer merge (treat metadata as a partial update on every upsert): would change the upsert contract globally and silently affect any future code that wants to clear a metadata field by upserting without it. Brittle.
  • Re-read existing record before each upsert: doubles the storage round-trips for a hot loop and leaves a TOCTOU window between the read and the write.

Explicit carry-over via the helper keeps the intent visible at every call site and adds zero extra storage reads — status_doc is already loaded at the top of process_document.
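The replace-not-merge upsert semantics that motivated this choice can be demonstrated with a toy doc_status store (a hypothetical stand-in, not any real storage backend):

```python
# Toy doc_status store: top-level fields are replaced wholesale on
# upsert, so metadata is one opaque blob, never merged key-by-key.
store = {}

def upsert(doc_id, fields):
    store.setdefault(doc_id, {}).update(fields)

# The PENDING insert seeds the per-document strategy...
upsert("doc-1", {"status": "pending", "metadata": {"process_options": "iet!"}})

# ...but any later transition that writes its own metadata clobbers it:
upsert("doc-1", {"status": "processing",
                 "metadata": {"processing_start_time": 1}})
assert "process_options" not in store["doc-1"]["metadata"]  # the bug

# The fix: carry long-lived keys forward explicitly at every transition.
upsert("doc-1", {"status": "pending", "metadata": {"process_options": "iet!"}})
carried = {k: store["doc-1"]["metadata"][k] for k in ("process_options",)}
upsert("doc-1", {"status": "processed",
                 "metadata": {**carried, "processing_end_time": 2}})
assert store["doc-1"]["metadata"]["process_options"] == "iet!"
```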

Test plan

  • ruff check lightrag tests passes
  • pytest tests: 1089 passed, 1 skipped, 1 xfailed (the 2 new regression tests pass; no existing failures)
  • New regression tests in tests/test_pipeline_release_closure.py:
    • test_doc_status_metadata_carry_over_helper exercises the helper in isolation: carry-over alone, carry-over + extras, missing metadata, empty / None process_options
    • test_doc_status_metadata_survives_processed_transition enqueues a document with process_options='iet!' and runs the full pipeline to PROCESSED, asserting the final doc_status.metadata.process_options is still 'iet!'

Compatibility

No HTTP / Python public-API breakage. Pure correctness fix that brings observed behaviour in line with the documented metadata contract.

🤖 Generated with Claude Code

danielaskdd and others added 3 commits May 5, 2026 02:41
…ocess_options

When ``apipeline_process_enqueue_documents`` picks up a half-processed
document whose content is already extracted into ``full_docs`` (raw
content or LightRAG blocks file present), redo the post-extraction
stages cleanly under the *current* ``process_options`` rather than
mixing stale and fresh chunks/entities.

- New ``LightRAG._purge_doc_chunks_and_kg(doc_id, chunk_ids)`` helper
  removes a document's chunks from ``chunks_vdb`` / ``text_chunks``,
  classifies its entity / relation contributions into delete-outright
  vs rebuild-from-remaining, applies the corresponding cleanup, and
  rebuilds entries that other documents still source.  Does NOT touch
  ``doc_status`` / ``full_docs`` / ``llm_response_cache`` / pipeline
  busy state — it is the focused KG-cleanup core suitable for both
  deletion and resume callers.  ``adelete_by_doc_id`` remains unchanged
  for now (deduplicating it can be a future PR).
- ``process_document`` gains a resume guard at the convergence point of
  the worker-driven and inline parse paths.  When content is already
  extracted, it warns on engine mismatch (extracted content is the
  source of truth — switching engines requires delete + re-upload),
  purges any stale chunks recorded in ``chunks_list`` via the new
  helper, and resets ``status_doc.chunks_list`` / ``chunks_count`` so
  subsequent state-machine upserts do not re-write stale IDs.
- ``parse_native`` already returns existing content for format=lightrag
  and format=raw without re-parsing, so the resume branch reuses the
  existing parse-stage dispatch unchanged.
- New regression tests:
  - ``_purge_doc_chunks_and_kg`` is a no-op for empty chunk_ids.
  - ``_purge_doc_chunks_and_kg`` clears chunks_vdb / text_chunks for a
    document with no graph contributions yet.
  - The pipeline calls the purge helper with the previous run's chunk
    IDs when resuming an already-extracted document.
  - The pipeline skips the purge when ``chunks_list`` is empty.
- ``test_extract_failure_before_chunking_preserves_previous_chunk_snapshot``
  renamed to ``..._clears_stale_chunk_snapshot`` and inverted: the
  previous snapshot is now intentionally not preserved across resume +
  failure, matching the documented "already-extracted documents always delete their old chunks and redo" rule.
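The delete-outright vs rebuild-from-remaining split described above can be modelled with plain sets. This is a simplified sketch: the real helper operates against the storage backends, and only its intent comes from this commit message; the function name and data shapes here are illustrative assumptions.

```python
def classify_kg_contributions(purged_chunk_ids, entity_sources):
    """Split a document's entity contributions into two buckets.

    entity_sources maps an entity name to the set of chunk IDs (across
    all documents) that mention it. An entity sourced only by the purged
    chunks is deleted outright; an entity that other chunks still source
    is rebuilt from the remaining sources.
    """
    purged = set(purged_chunk_ids)
    delete_outright, rebuild_from = [], {}
    for entity, sources in entity_sources.items():
        remaining = set(sources) - purged
        if remaining:
            rebuild_from[entity] = remaining
        else:
            delete_outright.append(entity)
    return delete_outright, rebuild_from

# A resumed document previously contributed chunks c1 and c2;
# c9 belongs to a different document.
delete, rebuild = classify_kg_contributions(
    {"c1", "c2"},
    {"OnlyHere": {"c1"}, "Shared": {"c2", "c9"}},
)
```

The same two-bucket idea applies to relations; cleanup then deletes the first bucket and re-derives entries in the second from their surviving source chunks.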

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ss transitions

doc_status storage backends treat the ``metadata`` field as an opaque
blob and **replace** it on every upsert, so the
``metadata.process_options`` mirror seeded at PENDING was getting
clobbered as soon as the doc transitioned to PARSING / ANALYZING /
PROCESSING / PROCESSED / FAILED.  Admin / list APIs that read
``doc_status.metadata`` per the new API contract were therefore
unable to surface the per-document strategy after processing started.

This fix carries ``process_options`` (and any future long-lived metadata
fields) explicitly through every state-machine transition by:

- Adding ``doc_status_transition_metadata(status_doc, *, extra=None)``
  in ``lightrag/utils_pipeline.py``.  It builds the metadata payload to
  upsert by carrying over the keys listed in
  ``_DOC_STATUS_METADATA_CARRY_OVER_KEYS`` (currently
  ``("process_options",)``) from the loaded ``status_doc.metadata``,
  then layering in any transition-specific ``extra=`` fields
  (``processing_start_time`` / ``processing_end_time`` / extraction
  meta).  Future long-lived fields can be added by extending the tuple.
- Replacing every state-transition upsert in
  ``apipeline_process_enqueue_documents`` (PENDING-reset, inline
  PARSING / ANALYZING / PROCESSING / PROCESSED / FAILED, worker-path
  PARSING / ANALYZING / FAILED, and ``_mark_duplicate_after_parse``'s
  content-hash duplicate record) to call the helper.  Sites that did
  not previously write a ``metadata`` field now do, so the carry-over
  is consistent regardless of state.
- Adding two regression tests:
  - ``test_doc_status_metadata_carry_over_helper`` exercises the helper
    in isolation: carry-over alone, carry-over + extras, missing
    metadata, empty / None process_options.
  - ``test_doc_status_metadata_survives_processed_transition`` enqueues
    a document with ``process_options='iet!'`` and runs the full
    pipeline to PROCESSED, asserting that the final
    ``doc_status.metadata.process_options`` is still ``'iet!'``.
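The isolation test's cases might be shaped like this. The sketch runs against a local stand-in with the carry-over semantics described above, since the real helper (in ``lightrag/utils_pipeline.py``) and test (in ``tests/test_pipeline_release_closure.py``) are not reproduced here:

```python
from types import SimpleNamespace

# Stand-in with the documented carry-over behaviour (assumed shape).
CARRY_KEYS = ("process_options",)

def transition_metadata(status_doc, *, extra=None):
    src = getattr(status_doc, "metadata", None) or {}
    out = {k: src[k] for k in CARRY_KEYS if k in src}
    out.update(extra or {})
    return out

# carry-over alone: transient keys are not carried
doc = SimpleNamespace(metadata={"process_options": "iet!", "transient": 1})
assert transition_metadata(doc) == {"process_options": "iet!"}

# carry-over + transition-specific extras
assert transition_metadata(doc, extra={"processing_start_time": 42}) == {
    "process_options": "iet!", "processing_start_time": 42}

# missing metadata degrades to an empty payload
assert transition_metadata(SimpleNamespace(metadata=None)) == {}
```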

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@danielaskdd force-pushed the fix/preserve-process-options-metadata branch from 204221c to 5280ac2 on May 4, 2026 19:17
@danielaskdd merged commit 5f525d0 into HKUDS:dev on May 5, 2026 (2 of 3 checks passed)
@danielaskdd deleted the fix/preserve-process-options-metadata branch on May 5, 2026 13:01