fix(pipeline): preserve process_options in doc_status metadata across transitions #3017
Merged
danielaskdd merged 4 commits into HKUDS:dev on May 5, 2026
…ocess_options
When ``apipeline_process_enqueue_documents`` picks up a half-processed
document whose content is already extracted into ``full_docs`` (raw
content or LightRAG blocks file present), redo the post-extraction
stages cleanly under the *current* ``process_options`` rather than
mixing stale and fresh chunks/entities.
- New ``LightRAG._purge_doc_chunks_and_kg(doc_id, chunk_ids)`` helper
removes a document's chunks from ``chunks_vdb`` / ``text_chunks``,
classifies its entity / relation contributions into delete-outright
vs rebuild-from-remaining, applies the corresponding cleanup, and
rebuilds entries that other documents still source. Does NOT touch
``doc_status`` / ``full_docs`` / ``llm_response_cache`` / pipeline
busy state — it is the focused KG-cleanup core suitable for both
deletion and resume callers. ``adelete_by_doc_id`` remains unchanged
for now (deduplicating it can be a future PR).
- ``process_document`` gains a resume guard at the convergence point of
the worker-driven and inline parse paths. When content is already
extracted, it warns on engine mismatch (extracted content is the
source of truth — switching engines requires delete + re-upload),
purges any stale chunks recorded in ``chunks_list`` via the new
helper, and resets ``status_doc.chunks_list`` / ``chunks_count`` so
subsequent state-machine upserts do not re-write stale IDs.
- ``parse_native`` already returns existing content for format=lightrag
and format=raw without re-parsing, so the resume branch reuses the
existing parse-stage dispatch unchanged.
- New regression tests:
- ``_purge_doc_chunks_and_kg`` is a no-op for empty chunk_ids.
- ``_purge_doc_chunks_and_kg`` clears chunks_vdb / text_chunks for a
document with no graph contributions yet.
- The pipeline calls the purge helper with the previous run's chunk
IDs when resuming an already-extracted document.
- The pipeline skips the purge when ``chunks_list`` is empty.
- ``test_extract_failure_before_chunking_preserves_previous_chunk_snapshot``
renamed to ``..._clears_stale_chunk_snapshot`` and inverted: the
previous snapshot is now intentionally not preserved across resume +
failure, matching the documented "already-extracted documents always
delete the old chunks and redo" rule.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
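A minimal runnable sketch of the two pieces above: the delete-outright vs rebuild-from-remaining classification inside ``_purge_doc_chunks_and_kg``, and the resume guard in ``process_document``. All names and data shapes here are illustrative assumptions; the real helpers operate against the chunk, vector, and graph storages, not plain dicts.

```python
from types import SimpleNamespace


def classify_contributions(source_chunks, purged_chunk_ids):
    """Split KG entries into delete-outright vs rebuild-from-remaining.

    source_chunks maps an entity/relation key to the set of chunk IDs
    that contributed it; purged_chunk_ids are this doc's stale chunks.
    (Assumed shapes -- the real helper reads these from storage.)
    """
    purged = set(purged_chunk_ids)
    delete, rebuild = [], []
    for key, sources in source_chunks.items():
        if not sources & purged:
            continue  # entry has no contribution from the purged chunks
        remaining = sources - purged
        # Other documents still source it -> rebuild; otherwise delete.
        (rebuild if remaining else delete).append(key)
    return sorted(delete), sorted(rebuild)


def resume_extracted_doc(status_doc, purge_fn):
    """Resume guard sketch: drop the previous run's chunks, reset bookkeeping.

    purge_fn(doc_id, chunk_ids) stands in for _purge_doc_chunks_and_kg.
    """
    stale = list(status_doc.chunks_list or [])
    if stale:
        purge_fn(status_doc.doc_id, stale)  # remove stale chunks + KG rows
    # Reset so subsequent state-machine upserts do not re-write stale IDs.
    status_doc.chunks_list = []
    status_doc.chunks_count = 0
    return stale
```

The classification is what lets the purge stay scoped to one document: an entity sourced by chunks of other documents is rebuilt from the remaining sources rather than deleted.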
…ss transitions
doc_status storage backends treat the ``metadata`` field as an opaque
blob and **replace** it on every upsert, so the
``metadata.process_options`` mirror seeded at PENDING was getting
clobbered as soon as the doc transitioned to PARSING / ANALYZING /
PROCESSING / PROCESSED / FAILED. Admin / list APIs that read
``doc_status.metadata`` per the new API contract were therefore
unable to surface the per-document strategy after processing started.
This fix carries ``process_options`` (and any future long-lived metadata
fields) explicitly through every state-machine transition by:
- Adding ``doc_status_transition_metadata(status_doc, *, extra=None)``
in ``lightrag/utils_pipeline.py``. It builds the metadata payload to
upsert by carrying over the keys listed in
``_DOC_STATUS_METADATA_CARRY_OVER_KEYS`` (currently
``("process_options",)``) from the loaded ``status_doc.metadata``,
then layering in any transition-specific ``extra=`` fields
(``processing_start_time`` / ``processing_end_time`` / extraction
meta). Future long-lived fields can be added by extending the tuple.
- Replacing every state-transition upsert in
``apipeline_process_enqueue_documents`` (PENDING-reset, inline
PARSING / ANALYZING / PROCESSING / PROCESSED / FAILED, worker-path
PARSING / ANALYZING / FAILED, and ``_mark_duplicate_after_parse``'s
content-hash duplicate record) to call the helper. Sites that did
not previously write a ``metadata`` field now do, so the carry-over
is consistent regardless of state.
- Adding two regression tests:
- ``test_doc_status_metadata_carry_over_helper`` exercises the helper
in isolation: carry-over alone, carry-over + extras, missing
metadata, empty / None process_options.
- ``test_doc_status_metadata_survives_processed_transition`` enqueues
a document with ``process_options='iet!'`` and runs the full
pipeline to PROCESSED, asserting that the final
``doc_status.metadata.process_options`` is still ``'iet!'``.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
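A self-contained sketch of the carry-over helper and of the survival property the regression tests assert, assuming (as the commit states) replace-on-upsert semantics for the metadata blob. The helper body is illustrative and its signature is simplified to take the raw metadata dict; the actual implementation lives in lightrag/utils_pipeline.py.

```python
# Keys that must survive every state-machine transition (per the commit).
_DOC_STATUS_METADATA_CARRY_OVER_KEYS = ("process_options",)


def doc_status_transition_metadata(metadata, *, extra=None):
    """Build the metadata payload for a state-transition upsert (sketch)."""
    old = metadata or {}
    payload = {k: old[k] for k in _DOC_STATUS_METADATA_CARRY_OVER_KEYS if k in old}
    if extra:
        payload.update(extra)  # transition-specific fields layered on top
    return payload


# Toy doc_status store that, like the real backends, replaces the whole
# record (metadata included) on every upsert.
store = {"doc-1": {"status": "PENDING",
                   "metadata": {"process_options": "iet!"}}}
for state, extra in [("PARSING", {"processing_start_time": 1}),
                     ("PROCESSING", None),
                     ("PROCESSED", {"processing_end_time": 2})]:
    store["doc-1"] = {
        "status": state,
        "metadata": doc_status_transition_metadata(
            store["doc-1"]["metadata"], extra=extra),
    }
```

After the loop, ``process_options`` is still present despite three full replacements of the metadata blob, which is exactly what the end-to-end regression test checks.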
Summary
P2 fix: ``doc_status.metadata.process_options`` was only set at the initial
PENDING insert and got clobbered by every subsequent state-machine upsert.
``doc_status`` storage backends replace the entire ``metadata`` blob on each
``upsert``, so the moment the pipeline advanced the document to PARSING /
ANALYZING / PROCESSING / PROCESSED / FAILED, admin / list APIs lost the
per-document strategy that the new API contract says they can surface.
This stacks on top of #3013, #3014, #3015, and #3016; only the new
``fix(pipeline): preserve process_options ...`` commit is unique to this PR.
Fix
New helper ``doc_status_transition_metadata(status_doc, *, extra=None)`` in
lightrag/utils_pipeline.py. It builds the ``metadata`` payload to upsert by:
- carrying over the keys listed in ``_DOC_STATUS_METADATA_CARRY_OVER_KEYS``
  (today only ``process_options``) from the loaded ``status_doc.metadata``,
- then layering in any transition-specific ``extra=`` fields
  (``processing_start_time`` / ``processing_end_time`` / extraction meta).
Future long-lived metadata fields can be added by extending the tuple; no
per-site changes needed.
Wired the helper into every state-transition upsert in lightrag/pipeline.py:
- ``_validate_and_fix_document_consistency``
- ``_mark_duplicate_after_parse`` (content-hash duplicate record)
Sites that did not previously write a ``metadata`` field now do, so
carry-over behaviour is consistent regardless of which state the document
is in.
Why explicit carry-over instead of storage-layer merge?
Two alternatives were considered and rejected, among them a storage-layer
merge (treating ``metadata`` as a partial update on every upsert): it would
change the upsert contract globally and silently affect any future code
that wants to clear a metadata field by upserting without it. Brittle.
Explicit carry-over via the helper keeps the intent visible at every call
site and adds zero extra storage reads: ``status_doc`` is already loaded at
the top of ``process_document``.
Test plan
- ``ruff check lightrag tests`` passes
- ``pytest tests``: 1089 passed, 1 skipped, 1 xfailed (2 new regression
  tests, no existing failures)
- ``test_doc_status_metadata_carry_over_helper`` exercises the helper in
  isolation: carry-over alone, carry-over + extras, missing metadata,
  empty / None ``process_options``
- ``test_doc_status_metadata_survives_processed_transition`` enqueues a
  document with ``process_options='iet!'`` and runs the full pipeline to
  PROCESSED, asserting the final ``doc_status.metadata.process_options``
  is still ``'iet!'``
Compatibility
No HTTP / Python public-API breakage. Pure correctness fix that brings observed behaviour in line with the documented metadata contract.
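To make the rejected storage-layer-merge alternative concrete, here is a toy illustration (assumed semantics, not LightRAG code) of why merge-on-upsert is brittle: a caller can no longer clear a metadata field by simply upserting without it.

```python
def merge_upsert(store, doc_id, record):
    """Rejected alternative (sketch): merge metadata instead of replacing."""
    merged = dict(store.get(doc_id, {}).get("metadata", {}))
    merged.update(record.get("metadata", {}))
    store[doc_id] = {**record, "metadata": merged}


store = {}
merge_upsert(store, "doc-1", {"status": "FAILED",
                              "metadata": {"error_msg": "boom"}})
# A later transition tries to clear error_msg by omitting it...
merge_upsert(store, "doc-1", {"status": "PROCESSED", "metadata": {}})
# ...but under merge semantics the stale field silently survives.
```

Explicit carry-over avoids this by construction: each transition writes exactly the carried-over keys plus its own extras, nothing more.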
🤖 Generated with Claude Code