fix(pipeline): apply S chunking option to native parsed output #3016
Closed
danielaskdd wants to merge 5 commits into HKUDS:dev
Conversation
…sename dedup

- introduce process_options string (i/t/e/!/F/R/S) for per-document multimodal, KG and chunking control
- add filename hint support for the [ENGINE-OPTIONS], [OPTIONS] and [ENGINE] forms
- extend LIGHTRAG_PARSER rules with an engine-options suffix for default processing options
- add canonical_basename field for stable dedup and doc_id generation while preserving the user-visible file_path with hints
- deprecate addon_params["enable_multimodal_pipeline"] in favor of per-document process_options
- update analyze_multimodal to gate VLM analysis by process_options and log opt-in/sidecar mismatches
- skip entity/relation extraction when the process_options "!" flag is set, keeping chunks for naive/mix retrieval
- add chunking strategy selection (F/R/S) with fallback logging for unstructured legacy paths
- persist process_options and canonical_basename through full_docs and doc_status metadata
- update document routing, pipeline enqueue, storage backends and tests for the new fields and dedup logic
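To make the hint forms concrete, here is a hypothetical parser for the three filename-hint shapes named above. This is a sketch only: the exact syntax LightRAG accepts, the `HINT_RE` pattern, and the `ParsedName` container are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical illustration: assumes a bracketed hint suffix before the
# extension, e.g. "report[MINERU-it!S].pdf", "report[itS].pdf" or
# "report[DOCLING].pdf". The real LightRAG parser may differ.
HINT_RE = re.compile(r"^(?P<stem>.+?)\[(?P<hint>[^\]]+)\](?P<ext>\.[^.]+)$")
OPTION_CHARS = set("ite!FRS")  # i/t/e: modalities, !: skip KG, F/R/S: chunking

@dataclass
class ParsedName:
    canonical_basename: str  # hint stripped: stable key for dedup / doc_id
    engine: str | None       # e.g. "MINERU" / "DOCLING", None if absent
    process_options: str     # e.g. "it!S", empty if absent

def parse_filename_hint(file_name: str) -> ParsedName:
    m = HINT_RE.match(file_name)
    if not m:
        return ParsedName(file_name, None, "")
    stem, hint, ext = m.group("stem"), m.group("hint"), m.group("ext")
    engine, _, options = hint.partition("-")
    if not options and set(engine) <= OPTION_CHARS:
        # Bare [OPTIONS] form: the whole hint is an options string.
        engine, options = "", engine
    return ParsedName(stem + ext, engine or None, options)

print(parse_filename_hint("report[MINERU-it!S].pdf"))
# ParsedName(canonical_basename='report.pdf', engine='MINERU', process_options='it!S')
```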
…lyze

Tightens upload/scan/enqueue concurrency rules and makes analyze_multimodal idempotent, so users can incrementally enable i/t/e modalities without re-running the VLM on already-analyzed sidecar items.

- pipeline_status gains a ``scanning`` flag; the /documents/scan endpoint acquires it synchronously before scheduling the background task and refuses with status="scanning_skipped_pipeline_busy" when the pipeline is busy or another scan is already in flight
- /documents/upload, /documents/text and /documents/texts now reject with HTTP 409 while pipeline_status['busy'] or ['scanning'] is set
- strict name pre-check on upload: a same-canonical-basename match in the INPUT directory or doc_status now raises 409 instead of returning a status="duplicated" 200 payload; clients must DELETE the existing record before re-uploading
- apipeline_enqueue_documents adds a last-line RuntimeError guard for busy/scanning state; the reprocess_existing_non_processed parameter is removed from it and from pipeline_enqueue_file / pipeline_index_files (recovery of half-finished documents will be handled by a future pipeline-resume branch instead of re-enqueueing)
- analyze_multimodal drops the meta.analyze_time early return; per-item llm_analyze_result presence is checked instead, so re-running with new i/t/e options only analyzes the newly enabled modalities; analyze_time becomes the timestamp of the most recent successful pass
- WebUI UploadDocumentsDialog maps HTTP 409 with "already contains" / "Status:" detail back to the duplicate-file UI affordance, surfacing other 409 reasons (busy/scanning) verbatim from the server
- WebUI lightrag.ts types aligned: DocActionResponse drops 'duplicated', ScanResponse adds 'scanning_skipped_pipeline_busy'
- InsertResponse Pydantic Literal narrowed to remove the now-unreachable "duplicated" value
- docs/FileProcessingConfiguration-zh.md adds a "并发与重入约束" (Concurrency and Re-entrancy Constraints) section and a "流水线启动时的续跑规则" (Resume Rules at Pipeline Startup) section
- new regression tests for busy/scanning rejection at every layer, the scanning-flag acquire/release lifecycle, and analyze_multimodal per-item idempotency

BREAKING CHANGES

- HTTP: same-name conflicts on upload/text/texts now return 409 instead of a 200 status="duplicated" payload; clients reading the response status field must catch the 409 error path
- Python API: apipeline_enqueue_documents / pipeline_enqueue_file / pipeline_index_files no longer accept reprocess_existing_non_processed
- Python API: apipeline_enqueue_documents raises RuntimeError when called while another pipeline run or scan is in progress

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
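A minimal sketch of the rejection semantics above, written against FastAPI purely for illustration; the actual LightRAG endpoint code, state container, and error details differ.

```python
# Hypothetical sketch: a shared busy/scanning guard and a scan endpoint that
# acquires the scanning flag synchronously before any background work starts.
from fastapi import FastAPI, HTTPException

app = FastAPI()
pipeline_status = {"busy": False, "scanning": False}  # shared pipeline state

def reject_if_pipeline_active() -> None:
    """Guard used by /documents/upload, /documents/text and /documents/texts."""
    if pipeline_status["busy"] or pipeline_status["scanning"]:
        raise HTTPException(
            status_code=409,
            detail="Pipeline is busy or a scan is in flight; retry later.",
        )

@app.post("/documents/scan")
async def scan_documents() -> dict:
    # Acquire the scanning flag before scheduling the background task, so a
    # concurrent scan request immediately sees the flag and is refused.
    if pipeline_status["busy"] or pipeline_status["scanning"]:
        return {"status": "scanning_skipped_pipeline_busy"}
    pipeline_status["scanning"] = True
    # ... schedule the background scan, which clears the flag when done ...
    return {"status": "scanning_started"}
```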
…ocess_options
When ``apipeline_process_enqueue_documents`` picks up a half-processed
document whose content is already extracted into ``full_docs`` (raw
content or LightRAG blocks file present), redo the post-extraction
stages cleanly under the *current* ``process_options`` rather than
mixing stale and fresh chunks/entities.
- New ``LightRAG._purge_doc_chunks_and_kg(doc_id, chunk_ids)`` helper
removes a document's chunks from ``chunks_vdb`` / ``text_chunks``,
classifies its entity / relation contributions into delete-outright
vs rebuild-from-remaining, applies the corresponding cleanup, and
rebuilds entries that other documents still source. Does NOT touch
``doc_status`` / ``full_docs`` / ``llm_response_cache`` / pipeline
busy state — it is the focused KG-cleanup core suitable for both
deletion and resume callers. ``adelete_by_doc_id`` remains unchanged
for now (deduplicating it can be a future PR).
- ``process_document`` gains a resume guard at the convergence point of
the worker-driven and inline parse paths. When content is already
extracted, it warns on engine mismatch (extracted content is the
source of truth — switching engines requires delete + re-upload),
purges any stale chunks recorded in ``chunks_list`` via the new
helper, and resets ``status_doc.chunks_list`` / ``chunks_count`` so
subsequent state-machine upserts do not re-write stale IDs.
- ``parse_native`` already returns existing content for format=lightrag
and format=raw without re-parsing, so the resume branch reuses the
existing parse-stage dispatch unchanged.
- New regression tests:
- ``_purge_doc_chunks_and_kg`` is a no-op for empty chunk_ids.
- ``_purge_doc_chunks_and_kg`` clears chunks_vdb / text_chunks for a
document with no graph contributions yet.
- The pipeline calls the purge helper with the previous run's chunk
IDs when resuming an already-extracted document.
- The pipeline skips the purge when ``chunks_list`` is empty.
- ``test_extract_failure_before_chunking_preserves_previous_chunk_snapshot``
renamed to ``..._clears_stale_chunk_snapshot`` and inverted: the
previous snapshot is now intentionally not preserved across resume +
failure, matching the documented "已抽取文档一律删旧 chunks 重做" rule
(already-extracted documents always delete their old chunks and redo).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
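For orientation, the classification step described above can be rendered as a small self-contained function. The data shapes here (`entity_sources` as a name-to-chunk-ids map) are hypothetical stand-ins for LightRAG's real vector and KV storages, and the function name is invented for this sketch.

```python
# Illustrative only: an entity (or relation) sourced *only* by the purged
# chunks is deleted outright, while one that other documents still source is
# rebuilt from the remaining chunks.
def classify_graph_contributions(
    entity_sources: dict[str, set[str]],  # entity name -> chunk ids sourcing it
    purged_chunk_ids: set[str],
) -> tuple[list[str], list[str]]:
    delete_outright: list[str] = []
    rebuild_from_remaining: list[str] = []
    for entity, sources in entity_sources.items():
        if not sources & purged_chunk_ids:
            continue  # untouched by this document's chunks
        if sources - purged_chunk_ids:
            rebuild_from_remaining.append(entity)  # still sourced elsewhere
        else:
            delete_outright.append(entity)         # only this doc sourced it
    return delete_outright, rebuild_from_remaining

sources = {"Alice": {"c1", "c9"}, "Bob": {"c1"}, "Carol": {"c7"}}
print(classify_graph_contributions(sources, {"c1"}))
# (['Bob'], ['Alice']); Carol is untouched
```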
Previously the documented ``S`` (heading-driven) chunking option never
took effect for the native / mineru / docling path: ``parse_native()``
writes a structured ``*.blocks.jsonl`` and returns the merged plain
text as ``parsed_data['content']``, so ``parse_interchange_jsonl``
returned ``None`` against that merged text and the chunking branch
unconditionally fell through to fixed chunking with a "no structured
interchange output is available" warning.
This change adds a heading-aware chunker over the already-written
blocks file and wires the ``S`` mode to it when ``parsed_data`` carries
a ``blocks_path``.
- New ``chunk_lightrag_blocks_by_heading`` helper in
``lightrag/utils_pipeline.py``: groups consecutive content blocks by
their ``(level, heading, parent_headings)`` key into a chunk with the
heading rendered as a "Heading: <title>" preface, then splits any
group whose accumulated tokens exceed ``max_tokens`` along a sliding
window with overlap (heading boundaries are hard splits).
- ``process_document`` chunking branch in ``lightrag/pipeline.py``
gains a new path: when ``parse_interchange_jsonl`` declines AND
``process_options.chunking == 'S'`` AND ``parsed_data['blocks_path']``
is non-empty, invoke the heading chunker and tag
``extraction_meta['chunking_method'] = 'heading_driven'``. If the
heading chunker returns no chunks (corrupt or trivially empty blocks
file) the warning + fixed-chunking fallback is preserved.
- ``R`` mode and ``S`` mode without a structured blocks file remain on
the existing fixed-chunking fallback with the original warning.
- New regression tests in ``tests/test_pipeline_release_closure.py``:
- heading boundary creates a new chunk; consecutive blocks under the
same heading concatenate into a single chunk with the heading
preface
- oversize single-heading group is split into multiple chunks with
overlap, never crossing into another heading
- missing blocks file returns an empty list (caller falls back to F)
- end-to-end: pipeline invokes the heading chunker for a native
``S``-tagged document with a real blocks file
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
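A condensed sketch of the grouping and splitting behaviour the tests above pin down. It is illustrative only: blocks are assumed to be plain dicts with `level` / `heading` / `parent_headings` / `content` keys and tokens are approximated by whitespace splitting, whereas the real ``chunk_lightrag_blocks_by_heading`` reads the ``*.blocks.jsonl`` file and uses the configured tokenizer.

```python
def chunk_blocks_by_heading(blocks, max_tokens=1024, overlap_tokens=128):
    """Group blocks by heading key; hard-split at heading boundaries."""
    chunks, group, key = [], [], None

    def flush():
        if not group:
            return
        heading = key[1] or ""
        body = "\n".join(group)
        text = f"Heading: {heading}\n\n{body}" if heading else body
        tokens = text.split()
        if len(tokens) <= max_tokens:
            chunks.append(text)
            return
        # Oversize group: sliding window with overlap. Splitting happens
        # inside the group, so no piece ever crosses a heading boundary.
        step = max_tokens - overlap_tokens
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start : start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break

    for block in blocks:
        block_key = (block["level"], block["heading"],
                     tuple(block["parent_headings"]))
        if block_key != key:
            flush()  # heading change is a hard chunk boundary
            group, key = [], block_key
        group.append(block["content"])
    flush()
    return chunks
```

Fed with blocks parsed from the JSONL file, this yields one chunk per run of same-heading blocks, with oversize runs split into overlapping windows, which is exactly the shape the regression tests above assert.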
Summary
P2 fix: the documented `S` (heading-driven) chunking option never took effect for the native / mineru / docling structured-output path. `parse_native()` writes a `*.blocks.jsonl` with full heading metadata and returns the merged plain text as `parsed_data['content']` for backwards compatibility. The chunking code attempted to re-discover structure by passing that merged text through `parse_interchange_jsonl`, which only succeeds for documents that ship JSONL inline as their content. For the native path (the very case `S` was designed for), detection always failed, `S` always logged `no structured interchange output is available; falling back to fixed chunking ('F')`, and heading-driven chunking was effectively dead code.

This stacks on top of #3013, #3014 and #3015; only the new `fix(pipeline): apply S chunking option to native parsed output` commit is unique to this PR.

Fix
New helper `chunk_lightrag_blocks_by_heading` in lightrag/utils_pipeline.py. Reads the `*.blocks.jsonl` directly:

- groups consecutive content blocks by `(level, heading, parent_headings)`, so a heading change is a hard chunk boundary
- renders each chunk as `Heading: <title>\n\n<body>` so retrieval contexts stay self-describing
- splits any group whose accumulated tokens exceed `max_tokens` along a sliding window with `overlap_tokens` overlap, never crossing into another heading
- emits chunks shaped like `parse_interchange_jsonl` output (`chunk_id`, `chunk_order_index`, `content`, `content_type='body'`, `tokens`, `table_chunk_role='none'`, plus `heading` / `parent_headings` / `level`)

`process_document` chunking branch in lightrag/pipeline.py. Added a new path between the existing interchange-detection branch and the fixed-chunking fallback: when `parse_interchange_jsonl` declines AND `process_options.chunking == 'S'` AND `parsed_data['blocks_path']` is non-empty, invoke the heading chunker and tag `extraction_meta['chunking_method'] = 'heading_driven'`.

Fallback semantics preserved:

- `R` mode and `S` without a structured blocks file still fall through to fixed chunking with the original warning
- `F` (default) behaviour unchanged
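The dispatch order is easiest to see as control flow. A minimal sketch, with the three chunkers injected as callables and hypothetical signatures; the real branch lives inside `process_document`:

```python
import logging

logger = logging.getLogger(__name__)
FALLBACK_WARNING = (
    "no structured interchange output is available; "
    "falling back to fixed chunking ('F')"
)

def select_chunks(parsed_data, chunking_mode, extraction_meta,
                  parse_interchange, heading_chunker, fixed_chunker):
    # 1) Existing interchange path: inline JSONL content wins.
    structured = parse_interchange(parsed_data["content"])
    if structured is not None:
        return structured
    # 2) New path: S mode with a blocks file written by parse_native().
    if chunking_mode == "S" and parsed_data.get("blocks_path"):
        chunks = heading_chunker(parsed_data["blocks_path"])
        if chunks:
            extraction_meta["chunking_method"] = "heading_driven"
            return chunks
        # Corrupt or trivially empty blocks file: fall through below.
    # 3) R mode, S without blocks, or an empty heading-chunker result:
    #    original warning + fixed-chunking fallback, unchanged.
    logger.warning(FALLBACK_WARNING)
    return fixed_chunker(parsed_data["content"])
```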
Test plan

- `ruff check lightrag tests` passes
- `pytest tests`: 1087 passed, 1 skipped, 1 xfailed (4 new regression tests, no existing failures)
- `test_chunk_lightrag_blocks_by_heading_groups_consecutive_blocks`: a heading boundary creates a new chunk; consecutive blocks under the same heading concatenate with the heading preface
- `test_chunk_lightrag_blocks_by_heading_splits_oversize_group`: an oversize single-heading group is split into multiple chunks with overlap, never crossing into another heading
- `test_chunk_lightrag_blocks_by_heading_returns_empty_for_missing_file`: a missing blocks file returns `[]` (caller falls back to `F`)
- `test_pipeline_uses_heading_chunker_for_native_S_with_blocks`: end-to-end, the pipeline invokes the heading chunker for a native `S`-tagged document with a real blocks file

Compatibility
No HTTP / Python public-API breakage. This is a pure correctness fix for an option that was already documented but didn't work for its primary use case.
🤖 Generated with Claude Code