ragflow/test/fixtures/mineru
monsterDavid d398d617ca fix(mineru): skip page chrome blocks to prevent duplicate chunks (#15387)
## Summary
- Skip MinerU `header`, `footer`, and `page_number` blocks when
converting `content_list.json` into sections.
- Ignore unsupported block types explicitly, so that an unrecognized type
in future MinerU output cannot re-emit the previous text block.

Fixes duplicate text in General/naive chunks when parsing PDFs via
MinerU (reported with repeated page headers and body text in slices).
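
The skip logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper name `content_list_to_sections`, the sample data, and the assumption that each block is a dict with `type` and `text` keys are all hypothetical.

```python
# Block types named in the summary as page chrome.
PAGE_CHROME_TYPES = {"header", "footer", "page_number"}


def content_list_to_sections(blocks):
    """Hypothetical helper: collect body text from MinerU-style blocks.

    Assumes each block is a dict with a "type" key and, for text
    blocks, a "text" key. Only "text" blocks are handled here.
    """
    sections = []
    for block in blocks:
        block_type = block.get("type")

        # Page chrome repeats on every page and would duplicate chunks.
        if block_type in PAGE_CHROME_TYPES:
            continue

        # Skip unsupported types explicitly, so no value left over from
        # an earlier iteration can be appended a second time.
        if block_type != "text":
            continue

        text = block.get("text", "").strip()
        if text:
            sections.append(text)
    return sections


# Illustrative input only; not taken from a real content_list.json.
sample_blocks = [
    {"type": "header", "text": "Annual Report"},
    {"type": "text", "text": "First paragraph."},
    {"type": "page_number", "text": "1"},
    {"type": "mystery_type", "text": "ignored"},
    {"type": "text", "text": "Second paragraph."},
    {"type": "footer", "text": "Confidential"},
]

sections = content_list_to_sections(sample_blocks)
```

Computing `text` inside the loop body, only after both checks pass, is what keeps a skipped block from reusing the previous block's text.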

Closes #15335

## Test plan
- [x] `pytest test/unit_test/deepdoc/parser/test_mineru_parser.py -v`
(4/4 passed)
2026-06-01 20:15:04 +08:00