Fix: respect user-configured chunk_token_num for MinerU/docling/paddleocr parsers (#13234)

## Summary

When using MinerU, docling, TCADP, or paddleocr as the PDF parser with
the General (naive) chunk method, the user-configured `chunk_token_num`
is **unconditionally overwritten to 0** at
[rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859),
effectively disabling chunk merging regardless of what the user sets in
the UI.

### Problem

A user sets `chunk_token_num = 2048` in the dataset configuration UI,
expecting small parser blocks to be merged into larger chunks. However,
this line:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    parser_config["chunk_token_num"] = 0
```

silently overrides the user's setting. As a result, every MinerU output
block becomes its own chunk. For short documents (e.g. a 3-page PDF fund
factsheet parsed by MinerU), this produces **47 tiny chunks** — some as
small as 11 characters (`"July 2025"`) or 15 characters (`"CIES
Eligible"`).

This severely degrades retrieval quality: vector embeddings of such
short fragments have minimal semantic value, and keyword search produces
excessive noise.

### Fix

Only apply the `chunk_token_num = 0` override when the user has **not**
explicitly configured a positive value:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    if int(parser_config.get("chunk_token_num", 0)) <= 0:
        parser_config["chunk_token_num"] = 0
```

This preserves the original default behavior (no merging) while
respecting the user's explicit configuration.

### Before / After (MinerU, 3-page PDF, chunk_token_num=2048)

| | Before | After |
|---|---|---|
| Chunks produced | 47 | ~8 (merged by token limit) |
| Smallest chunk | 11 chars | ~500 chars |
| User setting respected | No | Yes |

## Test plan

- [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify
chunks are merged up to token limit
- [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) →
verify original behavior (no merging)
- [ ] Parse a PDF with DeepDOC parser → verify no change in behavior
(not affected by this code path)
- [ ] Repeat with docling/paddleocr if available
This commit is contained in:
liuxiaoyusky
2026-03-02 15:31:40 +08:00
committed by GitHub
parent d430446e69
commit 8ba66dd62a

View File

@@ -881,7 +881,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", ca
tables = append_context2table_image4pdf(sections, tables, image_context_size)
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
parser_config["chunk_token_num"] = 0
if int(parser_config.get("chunk_token_num", 0)) <= 0:
parser_config["chunk_token_num"] = 0
res = tokenize_table(tables, doc, is_english)
callback(0.8, "Finish parsing.")