### Bug
`RAGFlowHtmlParser.chunk_block()` splits an oversized block by slicing
the **tokenized** string and storing the joined tokens:
```python
tks_str = rag_tokenizer.tokenize(block)
...
tokens = tks_str.split(" ")
while start < len(tokens):
chunks.append(" ".join(tokens[start:start + chunk_token_num])) # tokenized form, not source
```
On the default (Elasticsearch) backend `rag_tokenizer.tokenize`
transforms text: it lowercases/stems Latin words and inserts spaces
between CJK characters. So any text block longer than `chunk_token_num`
is stored as garbled, lowercased, space-segmented text instead of the
source content. The small-block branch correctly stores the original
`block`, so only oversized blocks are corrupted. Affects HTML and EPUB
ingestion (both go through `chunk_block`), degrading retrieved chunks
and the answers generated from them.
### Real tokenizer behavior (infinity-sdk 0.7.0, ES backend)
```
tokenize("Hello World FOO Bar Baz Qux Jumps") -> "hello world foo bar baz qux jump" # lowercased + stemmed
tokenize("你好世界这是一个测试") -> "你好世界 这 是 一个 测试" # spaces inserted
```
### Fix
Split the **original** text: break it into atoms (whitespace-delimited
runs for space-separated scripts, per-character for spaceless scripts
such as Chinese) and pack them into pieces of at most `chunk_token_num`
tokens. This preserves the source characters and still splits scripts
that have no whitespace — a plain whitespace split would leave CJK as
one un-splittable chunk.
### Proof (real tokenizer, before/after)
Running the old vs new split against the real `infinity.rag_tokenizer`:
```
ENGLISH "Hello World FOO Bar Baz Qux Lazy Dogs" (chunk_token_num=4)
OLD: ['hello world foo bar', 'baz qux jump over', 'lazi dog'] # lowercased + stemmed
NEW: ['Hello World FOO Bar ', 'Baz Qux Jumps Over ', 'Lazy Dogs'] # preserved; each <= 4 tokens
NEW preserves text exactly: True
CHINESE "你好世界这是一个测试用例需要被切分成多个块" (chunk_token_num=3)
OLD: ['你好世界 这 是', '一个 测试用例 需要', ...] # spurious spaces
NEW: ['你好世', '界这是', '一个测', ...] # preserved; each <= 3 tokens
NEW preserves text exactly: True
```
### Tests
Added `test/unit_test/deepdoc/parser/test_html_parser.py` (English +
Chinese oversized blocks, plus small-block merge). Before the fix the
two oversized tests fail (English shows lowercasing, Chinese shows
inserted spaces); after the fix all pass. `ruff check` clean.
## Summary
Fixes#15487 — lone markdown headers are no longer isolated as empty
chunks when a custom `delimiter` is set.
- Merge consecutive lone headers before attaching to the following prose
body
- Skip code fences, tables, lists, and blockquotes via
`_is_attachable_body()`
- Unit tests include the `# Title / ## Intro / Body` regression from
CodeRabbit review
## Validation
- `pytest test/unit_test/deepdoc/parser/test_markdown_parser.py` (11
passed locally)
Closes#15487
### What problem does this PR solve?
`RAGFlowExcelParser.html()` iterates `(len(rows) - 1) // chunk_rows + 1`
times. `rows[0]` is the header, so `len(rows) - 1` is the data-row
count. When that count is an exact multiple of `chunk_rows`, the `+ 1`
over-counts by one: the final iteration's data slice is empty, but the
header row is still appended — producing a chunk that contains only the
table header and no data.
This is reachable via `rag/app/naive.py` (`html4excel`, `chunk_rows=12`)
and `rag/app/one.py`. A sheet with 12/24/36… data rows (or 256/512… with
the default `chunk_rows=256`) produces an extra
`<table><caption>…</caption><tr><th>…</th></tr></table>` chunk. It is
non-empty, so it passes the `if _` filter and gets indexed as a real
(empty) chunk.
| data rows (chunk_rows=12) | before | after |
|---|---|---|
| 12 | 2 chunks (1 header-only) | 1 |
| 24 | 3 chunks (1 header-only) | 2 |
| 13 | 2 (unchanged) | 2 |
### Fix
Iterate `ceil(n_data / chunk_rows)` times instead of `n_data //
chunk_rows + 1`. Adds
`test/unit_test/deepdoc/parser/test_excel_parser.py`; the
header-only-chunk cases fail before this change and pass after.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Used the Claude CLI while working on this.
### What problem does this PR solve?
Markdown extraction can split tables row by row when delimiter-based
extraction uses a newline delimiter. That loses table structure during
chunking even though delimiters should still split normally outside
tables.
This PR keeps the follow-up to #15482 intentionally narrow:
- preserve Markdown pipe tables during delimiter-based extraction
- preserve borderless pipe tables during delimiter-based extraction
- preserve multiline HTML tables during delimiter-based extraction
- keep delimiter splitting unchanged outside protected table ranges
Refs #15482
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Testing
- `ruff check deepdoc/parser/markdown_parser.py
test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `python3 run_tests.py -t
test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `git diff --check`
## Summary
- keep the native Docling chunking path when it returns usable chunks
- fall back to the standard Docling response parser when a chunked
request gets HTTP 200 but returns no usable chunks
- add a regression test for older Docling servers that accept the
chunking request but return a standard conversion payload
## Why
Older external Docling servers can accept a request containing
`do_chunking: true` and still return the standard conversion response
shape. The current code treats any HTTP 200 from the chunked request as
a native chunk response, finds no chunk entries, and returns zero
sections without trying the standard response parser.
Fixes#15569.
## Validation
- `python -m pytest
test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py -q`
- `python -m py_compile deepdoc\\parser\\docling_parser.py
test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py`
- `python -m ruff check deepdoc\\parser\\docling_parser.py
test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py`
- `git diff --check`
### What problem does this PR solve?
Markdown extraction currently applies custom delimiters before
respecting fenced code blocks. When a delimiter such as a newline is
configured, fenced code can be split into separate chunks, and longer
outer fences can be closed incorrectly by shorter nested fences.
This PR keeps the fix intentionally narrow for the Markdown chunking
discussion in #15482:
- preserve fenced code blocks when delimiter-based extraction is used
- support both backtick and tilde fences
- respect fence length so longer outer fences can contain shorter inner
fences
- keep delimiter splitting unchanged outside fenced blocks
Refs #15482
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Testing
- `ruff check deepdoc/parser/markdown_parser.py
test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `python3 run_tests.py -t
test/unit_test/deepdoc/parser/test_markdown_parser.py`
## Summary
- Skip MinerU `header`, `footer`, and `page_number` blocks when
converting `content_list.json` into sections.
- Ignore unsupported block types explicitly so future MinerU output
types cannot re-emit the previous text block.
Fixes duplicate text in General/naive chunks when parsing PDFs via
MinerU (reported with repeated page headers and body text in slices).
Closes#15335
## Test plan
- [x] `pytest test/unit_test/deepdoc/parser/test_mineru_parser.py -v`
(4/4 passed)
## Summary
This change fixes ingestion quality issues where MinerU parser output
may contain HTML fragments (for example, table-related tags like `<tr>`,
`<td>`, `<br>`), which were previously passed directly into
chunking/tokenization and degraded chunk quality.
The fix adds a sanitization step in the MinerU parser path so parsed
sections are normalized to clean text before chunking.
## Change Type (select all)
- [x] Bug fix
- [x] Ingestion pipeline improvement
- [x] Parser/chunking quality fix
## Related Issue
- https://github.com/infiniflow/ragflow/issues/14831
Closes#1398
### What problem does this PR solve?
Adds native support for EPUB files. EPUB content is extracted in spine
(reading) order and parsed using the existing HTML parser. No new
dependencies required.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
To check this parser manually:
```python
uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser
with open('$HOME/some_epub_book.epub', 'rb') as f:
data = f.read()
sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
print(f'\n--- Section {i} ---')
print(s[:200])
"
```
## Problem
When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer)
cannot map CIDs to correct Unicode characters, outputting PUA characters
(U+E000~U+F8FF) or `(cid:xxx)` placeholders. The original code fully
trusted pdfplumber text without any garbled detection, causing garbled
output in the final parsed result.
Relates to #13366
## Solution
### 1. Garbled text detection functions
- `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16),
replacement character U+FFFD, control characters, and
unassigned/surrogate codepoints
- `_is_garbled_text(text, threshold)`: Calculates garbled ratio and
detects `(cid:xxx)` patterns
### 2. Box-level fallback (in `__ocr()`)
When a text box has ≥50% garbled characters, discard pdfplumber text and
fallback to OCR recognition.
### 3. Page-level detection (in `__images__()`)
Sample characters from each page; if garbled rate ≥30%, clear all
pdfplumber characters for that page, forcing full OCR.
### 4. Layout recognizer CID filtering
Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text
processing to prevent them from polluting layout analysis.
## Testing
- 29 unit tests covering: normal CJK/English text, PUA characters, CID
patterns, mixed text, boundary thresholds, edge cases
- All 85 existing project unit tests pass without regression