Fix: respect user-configured chunk_token_num for MinerU/docling/paddleocr parsers (#13234)

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-02 16:55:42 +08:00

## Summary

When using MinerU, docling, TCADP, or paddleocr as the PDF parser with
the General (naive) chunk method, the user-configured `chunk_token_num`
is **unconditionally overwritten to 0** at
[rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859),
effectively disabling chunk merging regardless of what the user sets in
the UI.

### Problem

A user sets `chunk_token_num = 2048` in the dataset configuration UI,
expecting small parser blocks to be merged into larger chunks. However,
this line:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    parser_config["chunk_token_num"] = 0
```

silently overrides the user's setting. As a result, every MinerU output
block becomes its own chunk. For short documents (e.g. a 3-page PDF fund
factsheet parsed by MinerU), this produces **47 tiny chunks** — some as
small as 11 characters (`"July 2025"`) or 15 characters (`"CIES
Eligible"`).

This severely degrades retrieval quality: vector embeddings of such
short fragments have minimal semantic value, and keyword search produces
excessive noise.
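The effect of a zero token budget can be illustrated with a toy merger. This is a minimal sketch, not RAGFlow's actual merge routine: `merge_blocks` is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
def merge_blocks(blocks, chunk_token_num):
    """Toy greedy merger (hypothetical; not RAGFlow's implementation).

    Whitespace-separated words are used as a stand-in for tokens.
    """
    if chunk_token_num <= 0:
        # A non-positive budget means no merging: one chunk per parser block.
        return list(blocks)

    chunks, current, used = [], [], 0
    for block in blocks:
        n = len(block.split())
        # Flush the running chunk when adding this block would exceed the budget.
        if current and used + n > chunk_token_num:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(block)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


blocks = ["July 2025", "CIES Eligible", "Fund objective text here"]
unmerged = merge_blocks(blocks, 0)     # every block stays its own chunk
merged = merge_blocks(blocks, 2048)    # all blocks fit into a single chunk
```

With a budget of 0 the short fragments survive as standalone chunks, which is the behavior the override forces on every MinerU, docling, TCADP, and paddleocr document.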

### Fix

Only apply the `chunk_token_num = 0` override when the user has **not**
explicitly configured a positive value:

```python
if name in ["tcadp", "docling", "mineru", "paddleocr"]:
    if int(parser_config.get("chunk_token_num", 0)) <= 0:
        parser_config["chunk_token_num"] = 0
```

When `chunk_token_num` is missing, zero, or negative, it is still set to
0, which preserves the original default behavior (no merging). Any
positive value the user configured is left untouched.
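The guard can be exercised in isolation. The sketch below wraps the patched condition in a hypothetical function, `apply_chunk_token_override`, which does not exist in RAGFlow; the parser list and the condition itself are copied from the change.

```python
# Parser names copied from the patched condition in rag/app/naive.py.
LAYOUT_PARSERS = ["tcadp", "docling", "mineru", "paddleocr"]


def apply_chunk_token_override(name, parser_config):
    """Hypothetical wrapper around the patched guard, for illustration only."""
    if name in LAYOUT_PARSERS:
        # Force 0 only when no positive value has been configured.
        if int(parser_config.get("chunk_token_num", 0)) <= 0:
            parser_config["chunk_token_num"] = 0
    return parser_config


explicit = apply_chunk_token_override("mineru", {"chunk_token_num": 2048})
missing = apply_chunk_token_override("mineru", {})
negative = apply_chunk_token_override("docling", {"chunk_token_num": -1})
other = apply_chunk_token_override("deepdoc", {"chunk_token_num": 512})
```

Note that `int()` raises on `None` or a non-numeric string, so this sketch assumes the stored value is always an integer or a numeric string.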

### Before / After (MinerU, 3-page PDF, chunk_token_num=2048)

| Metric | Before | After |
|---|---|---|
| Chunks produced | 47 | ~8 (merged by token limit) |
| Smallest chunk | 11 chars | ~500 chars |
| User setting respected | No | Yes |

## Test plan

- [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify
chunks are merged up to token limit
- [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) →
verify original behavior (no merging)
- [ ] Parse a PDF with the DeepDoc parser → verify no change in behavior
(not affected by this code path)
- [ ] Repeat with docling/paddleocr if available

Author: liuxiaoyusky
Date: 2026-03-02 15:31:40 +08:00
Committed by: GitHub
Parent: d430446e69
Commit: 8ba66dd62a

1 changed file with 2 additions and 1 deletion

									
										3

rag/app/naive.py
									
												View File
												
				@@ -881,7 +881,8 @@ def chunk(filename, binary=None, from_page=0, to_page=100000, lang="Chinese", ca

				            tables = append_context2table_image4pdf(sections, tables, image_context_size)

				        if name in ["tcadp", "docling", "mineru", "paddleocr"]:

				            parser_config["chunk_token_num"] = 0

				            if int(parser_config.get("chunk_token_num", 0)) <= 0:

				                parser_config["chunk_token_num"] = 0

				        res = tokenize_table(tables, doc, is_english)

				        callback(0.8, "Finish parsing.")
