
fix(pipeline): apply S chunking option to native parsed output #3016

Closed

danielaskdd wants to merge 5 commits into HKUDS:dev from danielaskdd:fix/native-s-chunking

Conversation

@danielaskdd
Collaborator

Summary

P2 fix: the documented S (heading-driven) chunking option never took effect for the native / mineru / docling structured-output path.

parse_native() writes a *.blocks.jsonl with full heading metadata and returns the merged plain text as parsed_data['content'] for backwards compatibility. The chunking code attempted to re-discover structure by passing that merged text through parse_interchange_jsonl, which only succeeds for documents that ship JSONL inline as their content. For the native path (the very case S was designed for), detection always failed: S always logged "no structured interchange output is available; falling back to fixed chunking" and fell back to 'F', so heading-driven chunking was effectively dead code.

This stacks on top of #3013, #3014, and #3015; only the new "fix(pipeline): apply S chunking option to native parsed output" commit is unique to this PR.

Fix

  • New helper chunk_lightrag_blocks_by_heading in lightrag/utils_pipeline.py. Reads the *.blocks.jsonl directly (a sketch of the approach follows this list):

    • Groups consecutive content blocks by (level, heading, parent_headings), so a heading change is a hard chunk boundary.
    • Renders Heading: <title>\n\n<body> so retrieval contexts stay self-describing.
    • Splits any group whose tokens exceed max_tokens along a sliding window with overlap_tokens of overlap, never crossing into another heading.
    • Returns chunks shaped like parse_interchange_jsonl output (chunk_id, chunk_order_index, content, content_type='body', tokens, table_chunk_role='none', plus heading / parent_headings / level).
  • process_document chunking branch in lightrag/pipeline.py. Added a new path between the existing interchange-detection branch and the fixed-chunking fallback:
    parse_interchange_jsonl declines AND process_options.chunking == 'S' AND parsed_data['blocks_path'] is non-empty → invoke the heading chunker; tag extraction_meta['chunking_method'] = 'heading_driven'.

  • Fallback semantics preserved:

    • R mode and S without a structured blocks file still fall through to fixed chunking with the original warning.
    • F (default) behaviour unchanged.
    • If the heading chunker returns no chunks (corrupt / trivially empty blocks file), the pipeline emits a more specific warning and falls back to fixed chunking.
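A minimal sketch of the heading chunker, reconstructed from this description rather than from the actual lightrag/utils_pipeline.py source; the block-record field names, the whitespace token counter, and the placeholder chunk_id scheme are all assumptions:

```python
# Sketch only: assumes each *.blocks.jsonl line looks roughly like
#   {"type": "content", "level": 2, "heading": "Results",
#    "parent_headings": ["Paper"], "content": "..."}
# and stands in a whitespace split for the pipeline's real tokenizer.
import json
from pathlib import Path

def chunk_blocks_by_heading_sketch(blocks_path, max_tokens=1024, overlap_tokens=128):
    path = Path(blocks_path)
    if not path.is_file():
        return []  # caller falls back to fixed chunking ('F')
    # Group consecutive blocks by heading key: a heading change is a hard boundary.
    groups, current_key = [], object()
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        block = json.loads(line)
        key = (block.get("level"), block.get("heading"),
               tuple(block.get("parent_headings") or ()))
        if key != current_key:
            groups.append((key, []))
            current_key = key
        groups[-1][1].append(block.get("content", ""))
    chunks, step = [], max(max_tokens - overlap_tokens, 1)
    for (level, heading, parents), bodies in groups:
        preface = f"Heading: {heading}\n\n" if heading else ""
        words = " ".join(bodies).split()
        if not words:
            continue
        # Sliding window over an oversize group; windows never cross into
        # the next heading because each group is windowed independently.
        start = 0
        while start < len(words):
            window = words[start:start + max_tokens]
            chunks.append({
                "chunk_id": f"chunk-{len(chunks)}",  # placeholder id scheme
                "chunk_order_index": len(chunks),
                "content": preface + " ".join(window),
                "content_type": "body",
                "tokens": len(window),
                "table_chunk_role": "none",
                "heading": heading,
                "parent_headings": list(parents),
                "level": level,
            })
            if start + max_tokens >= len(words):
                break
            start += step
    return chunks
```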

Test plan

  • ruff check lightrag tests passes
  • pytest tests: 1087 passed, 1 skipped, 1 xfailed (4 new regression tests, no existing failures)
  • New regression tests in tests/test_pipeline_release_closure.py (the first is sketched after this list):
    • test_chunk_lightrag_blocks_by_heading_groups_consecutive_blocks — heading boundary creates a new chunk; consecutive blocks under the same heading concatenate with the heading preface
    • test_chunk_lightrag_blocks_by_heading_splits_oversize_group — oversize single-heading group is split into multiple chunks with overlap, never crossing into another heading
    • test_chunk_lightrag_blocks_by_heading_returns_empty_for_missing_file — missing blocks file returns [] (caller falls back to F)
    • test_pipeline_uses_heading_chunker_for_native_S_with_blocks — end-to-end: pipeline invokes the heading chunker for a native S-tagged document with a real blocks file
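As a usage illustration, the first regression test might look roughly like this; it is written against the sketch above, and the file layout and assertions are assumptions rather than the real test body:

```python
import json

def test_heading_boundary_creates_new_chunk(tmp_path):
    blocks = tmp_path / "doc.blocks.jsonl"
    records = [
        {"type": "content", "level": 1, "heading": "Intro",
         "parent_headings": [], "content": "first paragraph"},
        {"type": "content", "level": 1, "heading": "Intro",
         "parent_headings": [], "content": "second paragraph"},
        {"type": "content", "level": 1, "heading": "Methods",
         "parent_headings": [], "content": "third paragraph"},
    ]
    blocks.write_text("\n".join(json.dumps(r) for r in records), encoding="utf-8")

    chunks = chunk_blocks_by_heading_sketch(blocks)

    # A heading change opens a new chunk; same-heading blocks concatenate
    # under one "Heading: ..." preface.
    assert len(chunks) == 2
    assert chunks[0]["content"].startswith("Heading: Intro")
    assert "second paragraph" in chunks[0]["content"]
    assert chunks[1]["content"].startswith("Heading: Methods")
```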

Compatibility

No HTTP / Python public-API breakage. This is a pure correctness fix for an option that was already documented but didn't work for its primary use case.

🤖 Generated with Claude Code

danielaskdd and others added 5 commits May 5, 2026 00:41
…sename dedup

- introduce process_options string (i/t/e/!/F/R/S) for per-document multimodal, KG and chunking control
- add filename hint support for [ENGINE-OPTIONS], [OPTIONS] and [ENGINE] forms (a hypothetical parse is sketched after this list)
- extend LIGHTRAG_PARSER rules with engine-options suffix for default processing options
- add canonical_basename field for stable dedup and doc_id generation while preserving user-visible file_path with hints
- deprecate addon_params["enable_multimodal_pipeline"] in favor of per-document process_options
- update analyze_multimodal to gate VLM analysis by process_options and log opt-in/sidecar mismatches
- skip entity/relation extraction when process_options "!" is set, keeping chunks for naive/mix retrieval
- add chunking strategy selection (F/R/S) with fallback logging for unstructured legacy paths
- persist process_options and canonical_basename through full_docs and doc_status metadata
- update document routing, pipeline enqueue, storage backends and tests for new fields and dedup logic
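The filename-hint grammar above might be parsed along these lines; this is a sketch inferred from the commit text only, so the regex, the tie-breaking between [OPTIONS] and [ENGINE], and the parse_filename_hint helper itself are hypothetical, not the project's real parser:

```python
import re

# Bracket hint at the end of the stem: name[HINT].ext
HINT_RE = re.compile(r"^(?P<base>.*?)\[(?P<hint>[^\[\]]+)\](?P<ext>\.[^.]+)?$")
OPTION_CHARS = set("ite!FRS")  # option alphabet from the commit text

def parse_filename_hint(filename):
    m = HINT_RE.match(filename)
    if not m:
        return filename, None, None  # no hint: canonical name is the filename
    base, hint, ext = m.group("base"), m.group("hint"), m.group("ext") or ""
    if "-" in hint:                      # [ENGINE-OPTIONS]
        engine, options = hint.split("-", 1)
    elif set(hint) <= OPTION_CHARS:      # [OPTIONS]
        engine, options = None, hint
    else:                                # [ENGINE]
        engine, options = hint, None
    canonical = base + ext               # stable basename for dedup / doc_id
    return canonical, engine, options

print(parse_filename_hint("report[MINERU-itS].pdf"))  # ('report.pdf', 'MINERU', 'itS')
```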
…lyze

Tightens upload/scan/enqueue concurrency rules and makes analyze_multimodal
idempotent so users can incrementally enable i/t/e modalities without
re-running the VLM on already-analyzed sidecar items.

- pipeline_status gains a ``scanning`` flag; the /documents/scan endpoint
  acquires it synchronously before scheduling the background task and
  refuses with status="scanning_skipped_pipeline_busy" when the pipeline
  is busy or another scan is already in flight
- /documents/upload, /documents/text, /documents/texts now reject with
  HTTP 409 while pipeline_status['busy'] or ['scanning'] is set (the
  guard is sketched after this list)
- Strict name pre-check on upload: same-canonical-basename in INPUT
  directory or doc_status now raises 409 instead of returning a
  status="duplicated" 200 payload; clients must DELETE the existing
  record before re-uploading
- apipeline_enqueue_documents adds a last-line RuntimeError guard for
  busy/scanning state; the reprocess_existing_non_processed parameter is
  removed from this and pipeline_enqueue_file / pipeline_index_files
  (recovery of half-finished documents will be handled by a future
  pipeline-resume branch instead of re-enqueueing)
- analyze_multimodal drops the meta.analyze_time early-return; per-item
  llm_analyze_result presence is checked instead so re-running with new
  i/t/e options only analyzes the newly enabled modalities; analyze_time
  becomes the timestamp of the most recent successful pass
- WebUI UploadDocumentsDialog maps HTTP 409 with "already contains" /
  "Status:" detail back to the duplicate-file UI affordance, surfacing
  other 409 reasons (busy/scanning) verbatim from the server
- WebUI lightrag.ts type aligned: DocActionResponse drops 'duplicated',
  ScanResponse adds 'scanning_skipped_pipeline_busy'
- InsertResponse Pydantic Literal narrowed to remove the now-unreachable
  "duplicated" value
- docs/FileProcessingConfiguration-zh.md adds a "并发与重入约束"
  (concurrency and re-entrancy constraints) section and a
  "流水线启动时的续跑规则" (resume rules at pipeline startup) section
- New regression tests for busy/scanning rejection at every layer,
  scanning flag acquire/release lifecycle, and analyze_multimodal
  per-item idempotency
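A minimal sketch of the busy/scanning rejection described above, assuming a FastAPI app and a dict-like pipeline_status; the endpoint paths and status string come from the commit text, everything else is illustrative:

```python
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()
pipeline_status = {"busy": False, "scanning": False}  # stand-in for shared state

def reject_if_pipeline_active() -> None:
    # upload/text/texts endpoints refuse outright while a run or scan is active
    if pipeline_status["busy"] or pipeline_status["scanning"]:
        raise HTTPException(status_code=409,
                            detail="Pipeline busy or scan in flight; retry later.")

@app.post("/documents/upload")
async def upload(file: UploadFile):
    reject_if_pipeline_active()
    # (strict same-canonical-basename pre-check would also raise 409 here)
    return {"status": "accepted"}

@app.post("/documents/scan")
async def scan():
    # The scan endpoint degrades gracefully instead of erroring.
    if pipeline_status["busy"] or pipeline_status["scanning"]:
        return {"status": "scanning_skipped_pipeline_busy"}
    pipeline_status["scanning"] = True  # acquired before scheduling the task
    # (release when the background scan finishes; omitted in this sketch)
    return {"status": "scanning_started"}
```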

BREAKING CHANGES
- HTTP: same-name conflicts on upload/text/texts now return 409 instead
  of a 200 status="duplicated" payload; clients reading the response
  status field must catch the 409 error path
- Python API: apipeline_enqueue_documents / pipeline_enqueue_file /
  pipeline_index_files no longer accept reprocess_existing_non_processed
- Python API: apipeline_enqueue_documents raises RuntimeError when called
  while another pipeline run or scan is in progress

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ocess_options

When ``apipeline_process_enqueue_documents`` picks up a half-processed
document whose content is already extracted into ``full_docs`` (raw
content or LightRAG blocks file present), redo the post-extraction
stages cleanly under the *current* ``process_options`` rather than
mixing stale and fresh chunks/entities.

- New ``LightRAG._purge_doc_chunks_and_kg(doc_id, chunk_ids)`` helper
  removes a document's chunks from ``chunks_vdb`` / ``text_chunks``,
  classifies its entity / relation contributions into delete-outright
  vs rebuild-from-remaining, applies the corresponding cleanup, and
  rebuilds entries that other documents still source (this split is
  sketched after this list).  Does NOT touch
  ``doc_status`` / ``full_docs`` / ``llm_response_cache`` / pipeline
  busy state — it is the focused KG-cleanup core suitable for both
  deletion and resume callers.  ``adelete_by_doc_id`` remains unchanged
  for now (deduplicating it can be a future PR).
- ``process_document`` gains a resume guard at the convergence point of
  the worker-driven and inline parse paths.  When content is already
  extracted, it warns on engine mismatch (extracted content is the
  source of truth — switching engines requires delete + re-upload),
  purges any stale chunks recorded in ``chunks_list`` via the new
  helper, and resets ``status_doc.chunks_list`` / ``chunks_count`` so
  subsequent state-machine upserts do not re-write stale IDs.
- ``parse_native`` already returns existing content for format=lightrag
  and format=raw without re-parsing, so the resume branch reuses the
  existing parse-stage dispatch unchanged.
- New regression tests:
  - ``_purge_doc_chunks_and_kg`` is a no-op for empty chunk_ids.
  - ``_purge_doc_chunks_and_kg`` clears chunks_vdb / text_chunks for a
    document with no graph contributions yet.
  - The pipeline calls the purge helper with the previous run's chunk
    IDs when resuming an already-extracted document.
  - The pipeline skips the purge when ``chunks_list`` is empty.
- ``test_extract_failure_before_chunking_preserves_previous_chunk_snapshot``
  renamed to ``..._clears_stale_chunk_snapshot`` and inverted: the
  previous snapshot is now intentionally not preserved across resume +
  failure, matching the documented rule that already-extracted documents
  always drop their old chunks and redo them ("已抽取文档一律删旧
  chunks 重做").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously the documented ``S`` (heading-driven) chunking option never
took effect for the native / mineru / docling path: ``parse_native()``
writes a structured ``*.blocks.jsonl`` and returns the merged plain
text as ``parsed_data['content']``, so ``parse_interchange_jsonl``
returned ``None`` against that merged text and the chunking branch
unconditionally fell through to fixed chunking with a "no structured
interchange output is available" warning.

This change adds a heading-aware chunker over the already-written
blocks file and wires the ``S`` mode to it when ``parsed_data`` carries
a ``blocks_path``.

- New ``chunk_lightrag_blocks_by_heading`` helper in
  ``lightrag/utils_pipeline.py``: groups consecutive content blocks by
  their ``(level, heading, parent_headings)`` key into a chunk with the
  heading rendered as a "Heading: <title>" preface, then splits any
  group whose accumulated tokens exceed ``max_tokens`` along a sliding
  window with overlap (heading boundaries are hard splits).
- ``process_document`` chunking branch in ``lightrag/pipeline.py``
  gains a new path: when ``parse_interchange_jsonl`` declines AND
  ``process_options.chunking == 'S'`` AND ``parsed_data['blocks_path']``
  is non-empty, invoke the heading chunker and tag
  ``extraction_meta['chunking_method'] = 'heading_driven'``.  If the
  heading chunker returns no chunks (corrupt or trivially empty blocks
  file), the warning + fixed-chunking fallback is preserved (the branch
  ordering is sketched after this list).
- ``R`` mode and ``S`` mode without a structured blocks file remain on
  the existing fixed-chunking fallback with the original warning.
- New regression tests in ``tests/test_pipeline_release_closure.py``:
  - heading boundary creates a new chunk; consecutive blocks under the
    same heading concatenate into a single chunk with the heading
    preface
  - oversize single-heading group is split into multiple chunks with
    overlap, never crossing into another heading
  - missing blocks file returns an empty list (caller falls back to F)
  - end-to-end: pipeline invokes the heading chunker for a native
    ``S``-tagged document with a real blocks file
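The branch ordering, as this commit describes it, reduces to roughly the following; the collaborators are passed in as parameters for illustration, and the names and keyword arguments are assumptions about the call site, not the actual pipeline.py code:

```python
def choose_chunks_sketch(parsed_data, process_options, extraction_meta,
                         parse_interchange_jsonl, chunk_by_heading,
                         fixed_chunking, logger,
                         max_tokens=1024, overlap_tokens=128):
    # Existing branch: inline interchange JSONL detection.
    chunks = parse_interchange_jsonl(parsed_data["content"])
    if chunks is None:
        blocks_path = parsed_data.get("blocks_path") or ""
        if getattr(process_options, "chunking", "F") == "S" and blocks_path:
            # New path: chunk the already-written blocks file by heading.
            chunks = chunk_by_heading(blocks_path, max_tokens=max_tokens,
                                      overlap_tokens=overlap_tokens)
            if chunks:
                extraction_meta["chunking_method"] = "heading_driven"
            else:
                logger.warning("blocks file yielded no chunks; "
                               "falling back to fixed chunking")
                chunks = None
    if chunks is None:
        # R mode, S without a blocks file, and plain F all land here, as before.
        chunks = fixed_chunking(parsed_data["content"])
    return chunks
```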

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@danielaskdd danielaskdd marked this pull request as draft May 5, 2026 12:30
@danielaskdd danielaskdd closed this May 7, 2026
@danielaskdd danielaskdd deleted the fix/native-s-chunking branch May 7, 2026 18:28