From 8ba66dd62a77bbd95c7e0262e53d61fd4f0bcaa9 Mon Sep 17 00:00:00 2001
From: liuxiaoyusky <49766325+liuxiaoyusky@users.noreply.github.com>
Date: Mon, 2 Mar 2026 15:31:40 +0800
Subject: [PATCH] Fix: respect user-configured chunk_token_num for
 MinerU/docling/paddleocr parsers (#13234)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

When using MinerU, docling, TCADP, or paddleocr as the PDF parser with
the General (naive) chunk method, the user-configured `chunk_token_num`
is **unconditionally overwritten to 0** at
[rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859),
effectively disabling chunk merging regardless of what the user sets in
the UI.

### Problem

A user sets `chunk_token_num = 2048` in the dataset configuration UI,
expecting small parser blocks to be merged into larger chunks. However,
this line:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    parser_config["chunk_token_num"] = 0
```

silently overrides the user's setting. As a result, every MinerU output
block becomes its own chunk. For short documents (e.g. a 3-page PDF fund
factsheet parsed by MinerU), this produces **47 tiny chunks**, some as
small as 11 characters (`"July 2025"`) or 15 characters (`"CIES
Eligible"`).

This severely degrades retrieval quality: vector embeddings of such
short fragments carry little semantic signal, and keyword search over
them returns many noisy matches.

### Fix

Only apply the `chunk_token_num = 0` override when the user has **not**
explicitly configured a positive value:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    if int(parser_config.get("chunk_token_num", 0)) <= 0:
        parser_config["chunk_token_num"] = 0
```

This preserves the original behavior (no merging) whenever
`chunk_token_num` is missing, zero, or negative, while respecting any
positive value the user has explicitly configured.
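
The guard can be exercised in isolation. The sketch below is illustrative
only: `apply_override` and `PASSTHROUGH_PARSERS` are hypothetical names
introduced here (the patch performs the same check inline in `chunk()`),
and the `"deepdoc"` parser name is an assumption used to stand in for a
parser outside the list.

```python
# Hypothetical standalone version of the check added by this patch.
PASSTHROUGH_PARSERS = ("tcadp", "docling", "mineru", "paddleocr")


def apply_override(name, parser_config):
    """Force chunk_token_num to 0 only when no positive value is configured."""
    if name in PASSTHROUGH_PARSERS:
        # int() is used for the comparison only; a positive value is left untouched.
        if int(parser_config.get("chunk_token_num", 0)) <= 0:
            parser_config["chunk_token_num"] = 0
    return parser_config


if __name__ == "__main__":
    for cfg in ({"chunk_token_num": 2048}, {"chunk_token_num": 0},
                {"chunk_token_num": -1}, {}):
        before = dict(cfg)
        after = apply_override("mineru", cfg)
        print(before, "->", after["chunk_token_num"])
```

A configured value of 2048 passes through unchanged, while 0, a negative
value, or a missing key all resolve to 0, matching the pre-patch behavior
for those cases. A stored value of `None` would make `int()` raise a
`TypeError`; the sketch does not handle that case.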

### Before / After (MinerU, 3-page PDF, chunk_token_num=2048)

| Metric | Before | After |
|---|---|---|
| Chunks produced | 47 | ~8 (merged by token limit) |
| Smallest chunk | 11 chars | ~500 chars |
| User setting respected | No | Yes |
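
To illustrate why a budget of 0 leaves every block as its own chunk, here
is a minimal greedy-merge sketch. It is not RAGFlow's actual merge
routine: `merge_blocks` is a hypothetical helper, whitespace splitting
stands in for the real tokenizer, and the third sample block is made-up
filler text.

```python
def merge_blocks(blocks, chunk_token_num):
    """Greedily append blocks to the current chunk until it exceeds the budget.

    Assumption: tokens are approximated by whitespace-separated words.
    """
    chunks = []   # merged chunk texts
    counts = []   # approximate token count per chunk
    for block in blocks:
        if not block.strip():
            continue  # skip empty blocks
        n = len(block.split())
        # Start a new chunk once the current one is over budget.
        if not chunks or counts[-1] > chunk_token_num:
            chunks.append(block)
            counts.append(n)
        else:
            chunks[-1] += "\n" + block
            counts[-1] += n
    return chunks


if __name__ == "__main__":
    blocks = [
        "July 2025",
        "CIES Eligible",
        "Illustrative filler text standing in for a longer parser block.",
    ]
    print(len(merge_blocks(blocks, 0)))     # budget 0: no merging
    print(len(merge_blocks(blocks, 2048)))  # large budget: blocks merge
```

With a budget of 0 any non-empty chunk is already over budget, so each
block starts a new chunk; with a budget of 2048 these three short blocks
end up in a single chunk.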

## Test plan

- [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify
chunks are merged up to the token limit
- [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) →
verify original behavior (no merging)
- [ ] Parse a PDF with DeepDOC parser → verify no change in behavior
(not affected by this code path)
- [ ] Repeat with docling/paddleocr if available
---
 rag/app/naive.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rag/app/naive.py b/rag/app/naive.py
index ef84fa69cb..22606c3b32 100644
--- a/rag/app/naive.py
+++ b/rag/app/naive.py
@@ -881,7 +881,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", ca
             tables = append_context2table_image4pdf(sections, tables, image_context_size)
 
         if name in ["tcadp", "docling", "mineru", "paddleocr"]:
-            parser_config["chunk_token_num"] = 0
+            if int(parser_config.get("chunk_token_num", 0)) <= 0:
+                parser_config["chunk_token_num"] = 0
 
         res = tokenize_table(tables, doc, is_english)
         callback(0.8, "Finish parsing.")