From 8ba66dd62a77bbd95c7e0262e53d61fd4f0bcaa9 Mon Sep 17 00:00:00 2001
From: liuxiaoyusky <49766325+liuxiaoyusky@users.noreply.github.com>
Date: Mon, 2 Mar 2026 15:31:40 +0800
Subject: [PATCH] Fix: respect user-configured chunk_token_num for MinerU/docling/paddleocr parsers (#13234)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

When using MinerU, docling, TCADP, or paddleocr as the PDF parser with the General (naive) chunk method, the user-configured `chunk_token_num` is **unconditionally overwritten to 0** at [rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859), effectively disabling chunk merging regardless of what the user sets in the UI.

### Problem

A user sets `chunk_token_num = 2048` in the dataset configuration UI, expecting small parser blocks to be merged into larger chunks. However, this line:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    parser_config["chunk_token_num"] = 0
```

silently overrides the user's setting. As a result, every MinerU output block becomes its own chunk. For short documents (e.g. a 3-page PDF fund factsheet parsed by MinerU), this produces **47 tiny chunks**, some as small as 11 characters (`"July 2025"`) or 15 characters (`"CIES Eligible"`). This severely degrades retrieval quality: vector embeddings of such short fragments have minimal semantic value, and keyword search produces excessive noise.

### Fix

Only apply the `chunk_token_num = 0` override when the user has **not** explicitly configured a positive value:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    if int(parser_config.get("chunk_token_num", 0)) <= 0:
        parser_config["chunk_token_num"] = 0
```

This preserves the original default behavior (no merging) while respecting the user's explicit configuration.

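
To make the guard's behavior easier to check in isolation, here is a minimal sketch that wraps the same condition in a standalone function. The helper name `apply_chunk_token_override` and the constant `LAYOUT_PARSERS` are hypothetical and exist only for illustration; they are not part of `rag/app/naive.py`.

```python
# Hypothetical constant: the parser names listed in the patched condition.
LAYOUT_PARSERS = ["tcadp", "docling", "mineru", "paddleocr"]


def apply_chunk_token_override(name, parser_config):
    """Hypothetical helper mirroring the guarded override from this patch."""
    if name in LAYOUT_PARSERS:
        # Force "no merging" only when no positive value was configured.
        if int(parser_config.get("chunk_token_num", 0)) <= 0:
            parser_config["chunk_token_num"] = 0
    return parser_config


# A positive user setting is kept.
print(apply_chunk_token_override("mineru", {"chunk_token_num": 2048}))
# A missing or non-positive setting falls back to 0 (no merging).
print(apply_chunk_token_override("mineru", {}))
print(apply_chunk_token_override("mineru", {"chunk_token_num": -1}))
# Parsers outside the list are left untouched.
print(apply_chunk_token_override("deepdoc", {"chunk_token_num": 512}))
```

As in the patch itself, a value that `int()` cannot convert (for example `None`) would raise an exception rather than fall back to 0; the sketch does not try to handle that case.
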
### Before / After (MinerU, 3-page PDF, chunk_token_num=2048)

| | Before | After |
|---|---|---|
| Chunks produced | 47 | ~8 (merged by token limit) |
| Smallest chunk | 11 chars | ~500 chars |
| User setting respected | No | Yes |

## Test plan

- [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify chunks are merged up to token limit
- [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) → verify original behavior (no merging)
- [ ] Parse a PDF with DeepDOC parser → verify no change in behavior (not affected by this code path)
- [ ] Repeat with docling/paddleocr if available

---
 rag/app/naive.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rag/app/naive.py b/rag/app/naive.py
index ef84fa69cb..22606c3b32 100644
--- a/rag/app/naive.py
+++ b/rag/app/naive.py
@@ -881,7 +881,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", ca
             tables = append_context2table_image4pdf(sections, tables, image_context_size)
 
         if name in ["tcadp", "docling", "mineru", "paddleocr"]:
-            parser_config["chunk_token_num"] = 0
+            if int(parser_config.get("chunk_token_num", 0)) <= 0:
+                parser_config["chunk_token_num"] = 0
 
         res = tokenize_table(tables, doc, is_english)
         callback(0.8, "Finish parsing.")