
fix(pipeline): apply S chunking option to native parsed output #3016

Closed

danielaskdd wants to merge 5 commits into HKUDS:dev from danielaskdd:fix/native-s-chunking

Conversation

@danielaskdd
Collaborator

Summary

P2 fix: the documented S (heading-driven) chunking option never took effect for the native / mineru / docling structured-output path.

parse_native() writes a *.blocks.jsonl with full heading metadata and returns the merged plain text as parsed_data['content'] for backwards compatibility. The chunking code attempted to re-discover structure by passing that merged text through parse_interchange_jsonl, which only succeeds for documents that ship JSONL inline as their content. For the native path (the very case S was designed for), detection always failed: S always logged "no structured interchange output is available; falling back to fixed chunking" and fell back to 'F', so heading-driven chunking was effectively dead code.

This stacks on top of #3013, #3014, and #3015; only the new "fix(pipeline): apply S chunking option to native parsed output" commit is unique to this PR.

Fix

  • New helper chunk_lightrag_blocks_by_heading in lightrag/utils_pipeline.py. Reads the *.blocks.jsonl directly (a sketch of the approach follows this list):

    • Groups consecutive content blocks by (level, heading, parent_headings), so a heading change is a hard chunk boundary.
    • Renders Heading: <title>\n\n<body> so retrieval contexts stay self-describing.
    • Splits any group whose tokens exceed max_tokens along a sliding window with overlap_tokens of overlap, never crossing into another heading.
    • Returns chunks shaped like parse_interchange_jsonl output (chunk_id, chunk_order_index, content, content_type='body', tokens, table_chunk_role='none', plus heading / parent_headings / level).
  • process_document chunking branch in lightrag/pipeline.py. Added a new path between the existing interchange-detection branch and the fixed-chunking fallback:
    parse_interchange_jsonl declines AND process_options.chunking == 'S' AND parsed_data['blocks_path'] is non-empty → invoke the heading chunker; tag extraction_meta['chunking_method'] = 'heading_driven'.

  • Fallback semantics preserved:

    • R mode and S without a structured blocks file still fall through to fixed chunking with the original warning.
    • F (default) behaviour unchanged.
    • If the heading chunker returns no chunks (corrupt / trivially empty blocks file), the pipeline emits a more specific warning and falls back to fixed chunking.
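A minimal sketch of the heading chunker, reconstructed from this description rather than from the actual lightrag/utils_pipeline.py source; the block-record field names, the whitespace token counter, and the placeholder chunk_id scheme are all assumptions:

```python
# Sketch only: assumes each *.blocks.jsonl line looks roughly like
#   {"type": "content", "level": 2, "heading": "Results",
#    "parent_headings": ["Paper"], "content": "..."}
# and stands in a whitespace split for the pipeline's real tokenizer.
import json
from pathlib import Path

def chunk_blocks_by_heading_sketch(blocks_path, max_tokens=1024, overlap_tokens=128):
    path = Path(blocks_path)
    if not path.is_file():
        return []  # caller falls back to fixed chunking ('F')
    # Group consecutive blocks by heading key: a heading change is a hard boundary.
    groups, current_key = [], object()
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        block = json.loads(line)
        key = (block.get("level"), block.get("heading"),
               tuple(block.get("parent_headings") or ()))
        if key != current_key:
            groups.append((key, []))
            current_key = key
        groups[-1][1].append(block.get("content", ""))
    chunks, step = [], max(max_tokens - overlap_tokens, 1)
    for (level, heading, parents), bodies in groups:
        preface = f"Heading: {heading}\n\n" if heading else ""
        words = " ".join(bodies).split()
        if not words:
            continue
        # Sliding window over an oversize group; windows never cross into
        # the next heading because each group is windowed independently.
        start = 0
        while start < len(words):
            window = words[start:start + max_tokens]
            chunks.append({
                "chunk_id": f"chunk-{len(chunks)}",  # placeholder id scheme
                "chunk_order_index": len(chunks),
                "content": preface + " ".join(window),
                "content_type": "body",
                "tokens": len(window),
                "table_chunk_role": "none",
                "heading": heading,
                "parent_headings": list(parents),
                "level": level,
            })
            if start + max_tokens >= len(words):
                break
            start += step
    return chunks
```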

Test plan

  • ruff check lightrag tests passes
  • pytest tests: 1087 passed, 1 skipped, 1 xfailed (4 new regression tests, no existing failures)
  • New regression tests in tests/test_pipeline_release_closure.py (the first is sketched after this list):
    • test_chunk_lightrag_blocks_by_heading_groups_consecutive_blocks — heading boundary creates a new chunk; consecutive blocks under the same heading concatenate with the heading preface
    • test_chunk_lightrag_blocks_by_heading_splits_oversize_group — oversize single-heading group is split into multiple chunks with overlap, never crossing into another heading
    • test_chunk_lightrag_blocks_by_heading_returns_empty_for_missing_file — missing blocks file returns [] (caller falls back to F)
    • test_pipeline_uses_heading_chunker_for_native_S_with_blocks — end-to-end: pipeline invokes the heading chunker for a native S-tagged document with a real blocks file
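As a usage illustration, the first regression test might look roughly like this; it is written against the sketch above, and the file layout and assertions are assumptions rather than the real test body:

```python
import json

def test_heading_boundary_creates_new_chunk(tmp_path):
    blocks = tmp_path / "doc.blocks.jsonl"
    records = [
        {"type": "content", "level": 1, "heading": "Intro",
         "parent_headings": [], "content": "first paragraph"},
        {"type": "content", "level": 1, "heading": "Intro",
         "parent_headings": [], "content": "second paragraph"},
        {"type": "content", "level": 1, "heading": "Methods",
         "parent_headings": [], "content": "third paragraph"},
    ]
    blocks.write_text("\n".join(json.dumps(r) for r in records), encoding="utf-8")

    chunks = chunk_blocks_by_heading_sketch(blocks)

    # A heading change opens a new chunk; same-heading blocks concatenate
    # under one "Heading: ..." preface.
    assert len(chunks) == 2
    assert chunks[0]["content"].startswith("Heading: Intro")
    assert "second paragraph" in chunks[0]["content"]
    assert chunks[1]["content"].startswith("Heading: Methods")
```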

Compatibility

No HTTP / Python public-API breakage. This is a pure correctness fix for an option that was already documented but didn't work for its primary use case.

🤖 Generated with Claude Code

danielaskdd and others added 5 commits May 5, 2026 00:41
…sename dedup

- introduce process_options string (i/t/e/!/F/R/S) for per-document multimodal, KG and chunking control
- add filename hint support for [ENGINE-OPTIONS], [OPTIONS] and [ENGINE] forms (a hypothetical parse is sketched after this list)
- extend LIGHTRAG_PARSER rules with engine-options suffix for default processing options
- add canonical_basename field for stable dedup and doc_id generation while preserving user-visible file_path with hints
- deprecate addon_params["enable_multimodal_pipeline"] in favor of per-document process_options
- update analyze_multimodal to gate VLM analysis by process_options and log opt-in/sidecar mismatches
- skip entity/relation extraction when process_options "!" is set, keeping chunks for naive/mix retrieval
- add chunking strategy selection (F/R/S) with fallback logging for unstructured legacy paths
- persist process_options and canonical_basename through full_docs and doc_status metadata
- update document routing, pipeline enqueue, storage backends and tests for new fields and dedup logic
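The filename-hint grammar above might be parsed along these lines; this is a sketch inferred from the commit text only, so the regex, the tie-breaking between [OPTIONS] and [ENGINE], and the parse_filename_hint helper itself are hypothetical, not the project's real parser:

```python
import re

# Bracket hint at the end of the stem: name[HINT].ext
HINT_RE = re.compile(r"^(?P<base>.*?)\[(?P<hint>[^\[\]]+)\](?P<ext>\.[^.]+)?$")
OPTION_CHARS = set("ite!FRS")  # option alphabet from the commit text

def parse_filename_hint(filename):
    m = HINT_RE.match(filename)
    if not m:
        return filename, None, None  # no hint: canonical name is the filename
    base, hint, ext = m.group("base"), m.group("hint"), m.group("ext") or ""
    if "-" in hint:                      # [ENGINE-OPTIONS]
        engine, options = hint.split("-", 1)
    elif set(hint) <= OPTION_CHARS:      # [OPTIONS]
        engine, options = None, hint
    else:                                # [ENGINE]
        engine, options = hint, None
    canonical = base + ext               # stable basename for dedup / doc_id
    return canonical, engine, options

print(parse_filename_hint("report[MINERU-itS].pdf"))  # ('report.pdf', 'MINERU', 'itS')
```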
…lyze

Tightens upload/scan/enqueue concurrency rules and makes analyze_multimodal
idempotent so users can incrementally enable i/t/e modalities without
re-running the VLM on already-analyzed sidecar items.

- pipeline_status gains a ``scanning`` flag; the /documents/scan endpoint
  acquires it synchronously before scheduling the background task and
  refuses with status="scanning_skipped_pipeline_busy" when the pipeline
  is busy or another scan is already in flight
- /documents/upload, /documents/text, /documents/texts now reject with
  HTTP 409 while pipeline_status['busy'] or ['scanning'] is set (the
  guard is sketched after this list)
- Strict name pre-check on upload: same-canonical-basename in INPUT
  directory or doc_status now raises 409 instead of returning a
  status="duplicated" 200 payload; clients must DELETE the existing
  record before re-uploading
- apipeline_enqueue_documents adds a last-line RuntimeError guard for
  busy/scanning state; the reprocess_existing_non_processed parameter is
  removed from this and pipeline_enqueue_file / pipeline_index_files
  (recovery of half-finished documents will be handled by a future
  pipeline-resume branch instead of re-enqueueing)
- analyze_multimodal drops the meta.analyze_time early-return; per-item
  llm_analyze_result presence is checked instead so re-running with new
  i/t/e options only analyzes the newly enabled modalities; analyze_time
  becomes the timestamp of the most recent successful pass
- WebUI UploadDocumentsDialog maps HTTP 409 with "already contains" /
  "Status:" detail back to the duplicate-file UI affordance, surfacing
  other 409 reasons (busy/scanning) verbatim from the server
- WebUI lightrag.ts type aligned: DocActionResponse drops 'duplicated',
  ScanResponse adds 'scanning_skipped_pipeline_busy'
- InsertResponse Pydantic Literal narrowed to remove the now-unreachable
  "duplicated" value
- docs/FileProcessingConfiguration-zh.md adds a "并发与重入约束"
  (concurrency and re-entrancy constraints) section and a
  "流水线启动时的续跑规则" (resume rules at pipeline startup) section
- New regression tests for busy/scanning rejection at every layer,
  scanning flag acquire/release lifecycle, and analyze_multimodal
  per-item idempotency
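A minimal sketch of the busy/scanning rejection described above, assuming a FastAPI app and a dict-like pipeline_status; the endpoint paths and status string come from the commit text, everything else is illustrative:

```python
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()
pipeline_status = {"busy": False, "scanning": False}  # stand-in for shared state

def reject_if_pipeline_active() -> None:
    # upload/text/texts endpoints refuse outright while a run or scan is active
    if pipeline_status["busy"] or pipeline_status["scanning"]:
        raise HTTPException(status_code=409,
                            detail="Pipeline busy or scan in flight; retry later.")

@app.post("/documents/upload")
async def upload(file: UploadFile):
    reject_if_pipeline_active()
    # (strict same-canonical-basename pre-check would also raise 409 here)
    return {"status": "accepted"}

@app.post("/documents/scan")
async def scan():
    # The scan endpoint degrades gracefully instead of erroring.
    if pipeline_status["busy"] or pipeline_status["scanning"]:
        return {"status": "scanning_skipped_pipeline_busy"}
    pipeline_status["scanning"] = True  # acquired before scheduling the task
    # (release when the background scan finishes; omitted in this sketch)
    return {"status": "scanning_started"}
```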

BREAKING CHANGES
- HTTP: same-name conflicts on upload/text/texts now return 409 instead
  of a 200 status="duplicated" payload; clients reading the response
  status field must catch the 409 error path
- Python API: apipeline_enqueue_documents / pipeline_enqueue_file /
  pipeline_index_files no longer accept reprocess_existing_non_processed
- Python API: apipeline_enqueue_documents raises RuntimeError when called
  while another pipeline run or scan is in progress

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ocess_options

When ``apipeline_process_enqueue_documents`` picks up a half-processed
document whose content is already extracted into ``full_docs`` (raw
content or LightRAG blocks file present), redo the post-extraction
stages cleanly under the *current* ``process_options`` rather than
mixing stale and fresh chunks/entities.

- New ``LightRAG._purge_doc_chunks_and_kg(doc_id, chunk_ids)`` helper
  removes a document's chunks from ``chunks_vdb`` / ``text_chunks``,
  classifies its entity / relation contributions into delete-outright
  vs rebuild-from-remaining, applies the corresponding cleanup, and
  rebuilds entries that other documents still source (this split is
  sketched after this list).  Does NOT touch
  ``doc_status`` / ``full_docs`` / ``llm_response_cache`` / pipeline
  busy state — it is the focused KG-cleanup core suitable for both
  deletion and resume callers.  ``adelete_by_doc_id`` remains unchanged
  for now (deduplicating it can be a future PR).
- ``process_document`` gains a resume guard at the convergence point of
  the worker-driven and inline parse paths.  When content is already
  extracted, it warns on engine mismatch (extracted content is the
  source of truth — switching engines requires delete + re-upload),
  purges any stale chunks recorded in ``chunks_list`` via the new
  helper, and resets ``status_doc.chunks_list`` / ``chunks_count`` so
  subsequent state-machine upserts do not re-write stale IDs.
- ``parse_native`` already returns existing content for format=lightrag
  and format=raw without re-parsing, so the resume branch reuses the
  existing parse-stage dispatch unchanged.
- New regression tests:
  - ``_purge_doc_chunks_and_kg`` is a no-op for empty chunk_ids.
  - ``_purge_doc_chunks_and_kg`` clears chunks_vdb / text_chunks for a
    document with no graph contributions yet.
  - The pipeline calls the purge helper with the previous run's chunk
    IDs when resuming an already-extracted document.
  - The pipeline skips the purge when ``chunks_list`` is empty.
- ``test_extract_failure_before_chunking_preserves_previous_chunk_snapshot``
  renamed to ``..._clears_stale_chunk_snapshot`` and inverted: the
  previous snapshot is now intentionally not preserved across resume +
  failure, matching the documented rule that already-extracted documents
  always drop their old chunks and redo them ("已抽取文档一律删旧
  chunks 重做").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously the documented ``S`` (heading-driven) chunking option never
took effect for the native / mineru / docling path: ``parse_native()``
writes a structured ``*.blocks.jsonl`` and returns the merged plain
text as ``parsed_data['content']``, so ``parse_interchange_jsonl``
returned ``None`` against that merged text and the chunking branch
unconditionally fell through to fixed chunking with a "no structured
interchange output is available" warning.

This change adds a heading-aware chunker over the already-written
blocks file and wires the ``S`` mode to it when ``parsed_data`` carries
a ``blocks_path``.

- New ``chunk_lightrag_blocks_by_heading`` helper in
  ``lightrag/utils_pipeline.py``: groups consecutive content blocks by
  their ``(level, heading, parent_headings)`` key into a chunk with the
  heading rendered as a "Heading: <title>" preface, then splits any
  group whose accumulated tokens exceed ``max_tokens`` along a sliding
  window with overlap (heading boundaries are hard splits).
- ``process_document`` chunking branch in ``lightrag/pipeline.py``
  gains a new path: when ``parse_interchange_jsonl`` declines AND
  ``process_options.chunking == 'S'`` AND ``parsed_data['blocks_path']``
  is non-empty, invoke the heading chunker and tag
  ``extraction_meta['chunking_method'] = 'heading_driven'``.  If the
  heading chunker returns no chunks (corrupt or trivially empty blocks
  file), the warning + fixed-chunking fallback is preserved (the branch
  ordering is sketched after this list).
- ``R`` mode and ``S`` mode without a structured blocks file remain on
  the existing fixed-chunking fallback with the original warning.
- New regression tests in ``tests/test_pipeline_release_closure.py``:
  - heading boundary creates a new chunk; consecutive blocks under the
    same heading concatenate into a single chunk with the heading
    preface
  - oversize single-heading group is split into multiple chunks with
    overlap, never crossing into another heading
  - missing blocks file returns an empty list (caller falls back to F)
  - end-to-end: pipeline invokes the heading chunker for a native
    ``S``-tagged document with a real blocks file
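The branch ordering, as this commit describes it, reduces to roughly the following; the collaborators are passed in as parameters for illustration, and the names and keyword arguments are assumptions about the call site, not the actual pipeline.py code:

```python
def choose_chunks_sketch(parsed_data, process_options, extraction_meta,
                         parse_interchange_jsonl, chunk_by_heading,
                         fixed_chunking, logger,
                         max_tokens=1024, overlap_tokens=128):
    # Existing branch: inline interchange JSONL detection.
    chunks = parse_interchange_jsonl(parsed_data["content"])
    if chunks is None:
        blocks_path = parsed_data.get("blocks_path") or ""
        if getattr(process_options, "chunking", "F") == "S" and blocks_path:
            # New path: chunk the already-written blocks file by heading.
            chunks = chunk_by_heading(blocks_path, max_tokens=max_tokens,
                                      overlap_tokens=overlap_tokens)
            if chunks:
                extraction_meta["chunking_method"] = "heading_driven"
            else:
                logger.warning("blocks file yielded no chunks; "
                               "falling back to fixed chunking")
                chunks = None
    if chunks is None:
        # R mode, S without a blocks file, and plain F all land here, as before.
        chunks = fixed_chunking(parsed_data["content"])
    return chunks
```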

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@danielaskdd danielaskdd marked this pull request as draft May 5, 2026 12:30
@danielaskdd danielaskdd closed this May 7, 2026
@danielaskdd danielaskdd deleted the fix/native-s-chunking branch May 7, 2026 18:28