
fix(pipeline): preserve process_options in doc_status metadata across transitions#3017

Merged
danielaskdd merged 4 commits into HKUDS:dev from danielaskdd:fix/preserve-process-options-metadata on May 5, 2026

Conversation

@danielaskdd
Collaborator

Summary

P2 fix: doc_status.metadata.process_options was only set at the initial PENDING insert and got clobbered by every subsequent state-machine upsert. doc_status storage backends replace the entire metadata blob on each upsert, so the moment the pipeline advanced the document to PARSING / ANALYZING / PROCESSING / PROCESSED / FAILED, admin / list APIs lost the per-document strategy that the new API contract says they can surface.

This stacks on top of #3013, #3014, #3015, and #3016; only the new fix(pipeline): preserve process_options ... commit is unique to this PR.

Fix

  • New helper doc_status_transition_metadata(status_doc, *, extra=None) in lightrag/utils_pipeline.py. Builds the metadata payload to upsert by:

    1. Carrying forward keys listed in _DOC_STATUS_METADATA_CARRY_OVER_KEYS (today only process_options) from the loaded status_doc.metadata.
    2. Layering in any transition-specific extra= fields (processing_start_time / processing_end_time / extraction meta).

    Future long-lived metadata fields can be added by extending the tuple — no per-site touch needed.

  • Wired the helper into every state-transition upsert in lightrag/pipeline.py:

    • PENDING-reset in _validate_and_fix_document_consistency
    • Inline path: PARSING, ANALYZING, PROCESSING, PROCESSED, FAILED-extract, FAILED-merge
    • Worker path: PARSING, ANALYZING, FAILED-parse
    • _mark_duplicate_after_parse (content-hash duplicate record)

    Sites that did not previously write a metadata field now do, so carry-over behaviour is consistent regardless of which state the document is in.

Why explicit carry-over instead of storage-layer merge?

Two alternatives were considered and rejected:

  • Storage-layer merge (treat metadata as a partial update on every upsert): would change the upsert contract globally and silently affect any future code that wants to clear a metadata field by upserting without it. Brittle.
  • Re-read existing record before each upsert: doubles the storage round-trips for a hot loop and leaves a TOCTOU window between the read and the write.

Explicit carry-over via the helper keeps the intent visible at every call site and adds zero extra storage reads — status_doc is already loaded at the top of process_document.
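The replace-not-merge upsert semantics that motivated this choice can be demonstrated with a toy doc_status store (a hypothetical stand-in, not any real storage backend):

```python
# Toy doc_status store: top-level fields are replaced wholesale on
# upsert, so metadata is one opaque blob, never merged key-by-key.
store = {}

def upsert(doc_id, fields):
    store.setdefault(doc_id, {}).update(fields)

# The PENDING insert seeds the per-document strategy...
upsert("doc-1", {"status": "pending", "metadata": {"process_options": "iet!"}})

# ...but any later transition that writes its own metadata clobbers it:
upsert("doc-1", {"status": "processing",
                 "metadata": {"processing_start_time": 1}})
assert "process_options" not in store["doc-1"]["metadata"]  # the bug

# The fix: carry long-lived keys forward explicitly at every transition.
upsert("doc-1", {"status": "pending", "metadata": {"process_options": "iet!"}})
carried = {k: store["doc-1"]["metadata"][k] for k in ("process_options",)}
upsert("doc-1", {"status": "processed",
                 "metadata": {**carried, "processing_end_time": 2}})
assert store["doc-1"]["metadata"]["process_options"] == "iet!"
```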

Test plan

  • ruff check lightrag tests passes
  • pytest tests: 1089 passed, 1 skipped, 1 xfailed (the 2 new regression tests pass; no existing failures)
  • New regression tests in tests/test_pipeline_release_closure.py:
    • test_doc_status_metadata_carry_over_helper exercises the helper in isolation: carry-over alone, carry-over + extras, missing metadata, empty / None process_options
    • test_doc_status_metadata_survives_processed_transition enqueues a document with process_options='iet!' and runs the full pipeline to PROCESSED, asserting the final doc_status.metadata.process_options is still 'iet!'

Compatibility

No HTTP / Python public-API breakage. Pure correctness fix that brings observed behaviour in line with the documented metadata contract.

🤖 Generated with Claude Code

danielaskdd and others added 3 commits May 5, 2026 02:41
…ocess_options

When ``apipeline_process_enqueue_documents`` picks up a half-processed
document whose content is already extracted into ``full_docs`` (raw
content or LightRAG blocks file present), redo the post-extraction
stages cleanly under the *current* ``process_options`` rather than
mixing stale and fresh chunks/entities.

- New ``LightRAG._purge_doc_chunks_and_kg(doc_id, chunk_ids)`` helper
  removes a document's chunks from ``chunks_vdb`` / ``text_chunks``,
  classifies its entity / relation contributions into delete-outright
  vs rebuild-from-remaining, applies the corresponding cleanup, and
  rebuilds entries that other documents still source.  Does NOT touch
  ``doc_status`` / ``full_docs`` / ``llm_response_cache`` / pipeline
  busy state — it is the focused KG-cleanup core suitable for both
  deletion and resume callers.  ``adelete_by_doc_id`` remains unchanged
  for now (deduplicating it can be a future PR).
- ``process_document`` gains a resume guard at the convergence point of
  the worker-driven and inline parse paths.  When content is already
  extracted, it warns on engine mismatch (extracted content is the
  source of truth — switching engines requires delete + re-upload),
  purges any stale chunks recorded in ``chunks_list`` via the new
  helper, and resets ``status_doc.chunks_list`` / ``chunks_count`` so
  subsequent state-machine upserts do not re-write stale IDs.
- ``parse_native`` already returns existing content for format=lightrag
  and format=raw without re-parsing, so the resume branch reuses the
  existing parse-stage dispatch unchanged.
- New regression tests:
  - ``_purge_doc_chunks_and_kg`` is a no-op for empty chunk_ids.
  - ``_purge_doc_chunks_and_kg`` clears chunks_vdb / text_chunks for a
    document with no graph contributions yet.
  - The pipeline calls the purge helper with the previous run's chunk
    IDs when resuming an already-extracted document.
  - The pipeline skips the purge when ``chunks_list`` is empty.
- ``test_extract_failure_before_chunking_preserves_previous_chunk_snapshot``
  renamed to ``..._clears_stale_chunk_snapshot`` and inverted: the
  previous snapshot is now intentionally not preserved across resume +
  failure, matching the documented "already-extracted documents always delete their old chunks and redo" rule.
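The delete-outright vs rebuild-from-remaining split described above can be modelled with plain sets. This is a simplified sketch: the real helper operates against the storage backends, and only its intent comes from this commit message; the function name and data shapes here are illustrative assumptions.

```python
def classify_kg_contributions(purged_chunk_ids, entity_sources):
    """Split a document's entity contributions into two buckets.

    entity_sources maps an entity name to the set of chunk IDs (across
    all documents) that mention it. An entity sourced only by the purged
    chunks is deleted outright; an entity that other chunks still source
    is rebuilt from the remaining sources.
    """
    purged = set(purged_chunk_ids)
    delete_outright, rebuild_from = [], {}
    for entity, sources in entity_sources.items():
        remaining = set(sources) - purged
        if remaining:
            rebuild_from[entity] = remaining
        else:
            delete_outright.append(entity)
    return delete_outright, rebuild_from

# A resumed document previously contributed chunks c1 and c2;
# c9 belongs to a different document.
delete, rebuild = classify_kg_contributions(
    {"c1", "c2"},
    {"OnlyHere": {"c1"}, "Shared": {"c2", "c9"}},
)
```

The same two-bucket idea applies to relations; cleanup then deletes the first bucket and re-derives entries in the second from their surviving source chunks.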

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ss transitions

doc_status storage backends treat the ``metadata`` field as an opaque
blob and **replace** it on every upsert, so the
``metadata.process_options`` mirror seeded at PENDING was getting
clobbered as soon as the doc transitioned to PARSING / ANALYZING /
PROCESSING / PROCESSED / FAILED.  Admin / list APIs that read
``doc_status.metadata`` per the new API contract were therefore
unable to surface the per-document strategy after processing started.

This fix carries ``process_options`` (and any future long-lived metadata
fields) explicitly through every state-machine transition by:

- Adding ``doc_status_transition_metadata(status_doc, *, extra=None)``
  in ``lightrag/utils_pipeline.py``.  It builds the metadata payload to
  upsert by carrying over the keys listed in
  ``_DOC_STATUS_METADATA_CARRY_OVER_KEYS`` (currently
  ``("process_options",)``) from the loaded ``status_doc.metadata``,
  then layering in any transition-specific ``extra=`` fields
  (``processing_start_time`` / ``processing_end_time`` / extraction
  meta).  Future long-lived fields can be added by extending the tuple.
- Replacing every state-transition upsert in
  ``apipeline_process_enqueue_documents`` (PENDING-reset, inline
  PARSING / ANALYZING / PROCESSING / PROCESSED / FAILED, worker-path
  PARSING / ANALYZING / FAILED, and ``_mark_duplicate_after_parse``'s
  content-hash duplicate record) to call the helper.  Sites that did
  not previously write a ``metadata`` field now do, so the carry-over
  is consistent regardless of state.
- Adding two regression tests:
  - ``test_doc_status_metadata_carry_over_helper`` exercises the helper
    in isolation: carry-over alone, carry-over + extras, missing
    metadata, empty / None process_options.
  - ``test_doc_status_metadata_survives_processed_transition`` enqueues
    a document with ``process_options='iet!'`` and runs the full
    pipeline to PROCESSED, asserting that the final
    ``doc_status.metadata.process_options`` is still ``'iet!'``.
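The isolation test's cases might be shaped like this. The sketch runs against a local stand-in with the carry-over semantics described above, since the real helper (in ``lightrag/utils_pipeline.py``) and test (in ``tests/test_pipeline_release_closure.py``) are not reproduced here:

```python
from types import SimpleNamespace

# Stand-in with the documented carry-over behaviour (assumed shape).
CARRY_KEYS = ("process_options",)

def transition_metadata(status_doc, *, extra=None):
    src = getattr(status_doc, "metadata", None) or {}
    out = {k: src[k] for k in CARRY_KEYS if k in src}
    out.update(extra or {})
    return out

# carry-over alone: transient keys are not carried
doc = SimpleNamespace(metadata={"process_options": "iet!", "transient": 1})
assert transition_metadata(doc) == {"process_options": "iet!"}

# carry-over + transition-specific extras
assert transition_metadata(doc, extra={"processing_start_time": 42}) == {
    "process_options": "iet!", "processing_start_time": 42}

# missing metadata degrades to an empty payload
assert transition_metadata(SimpleNamespace(metadata=None)) == {}
```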

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@danielaskdd force-pushed the fix/preserve-process-options-metadata branch from 204221c to 5280ac2 on May 4, 2026 19:17
@danielaskdd merged commit 5f525d0 into HKUDS:dev on May 5, 2026 (2 of 3 checks passed)
@danielaskdd deleted the fix/preserve-process-options-metadata branch on May 5, 2026 13:01