ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 15:31:05 +08:00

Author	SHA1	Message	Date
Yash Raj Pandey	dd2c88b768	fix(excel_parser): keep zero-valued cells when building Excel text chunks (#16287)	2026-06-26 09:30:09 +08:00
Wang Qi	5defb4e7d6	Revert "fix(deepdoc): keep zero and false Excel cells in __call__" (#16366) Reverts infiniflow/ragflow#16318	2026-06-25 19:56:47 +08:00
Harsh Kashyap	66d86154ab	fix(deepdoc): accept GFM table separators with one or more dashes (#16319)	2026-06-25 19:25:57 +08:00
Harsh Kashyap	0af5d43e8d	fix(deepdoc): keep zero and false Excel cells in __call__ (#16318)	2026-06-25 19:12:57 +08:00
Yash Raj Pandey	091417980e	fix(html_parser): preserve original text when splitting oversized blocks (#16052 ) ### Bug `RAGFlowHtmlParser.chunk_block()` splits an oversized block by slicing the tokenized string and storing the joined tokens: ```python tks_str = rag_tokenizer.tokenize(block) ... tokens = tks_str.split(" ") while start < len(tokens): chunks.append(" ".join(tokens[start:start + chunk_token_num])) # tokenized form, not source ``` On the default (Elasticsearch) backend `rag_tokenizer.tokenize` transforms text: it lowercases/stems Latin words and inserts spaces between CJK characters. So any text block longer than `chunk_token_num` is stored as garbled, lowercased, space-segmented text instead of the source content. The small-block branch correctly stores the original `block`, so only oversized blocks are corrupted. Affects HTML and EPUB ingestion (both go through `chunk_block`), degrading retrieved chunks and the answers generated from them. ### Real tokenizer behavior (infinity-sdk 0.7.0, ES backend) ``` tokenize("Hello World FOO Bar Baz Qux Jumps") -> "hello world foo bar baz qux jump" # lowercased + stemmed tokenize("你好世界这是一个测试") -> "你好世界这是一个测试" # spaces inserted ``` ### Fix Split the original text: break it into atoms (whitespace-delimited runs for space-separated scripts, per-character for spaceless scripts such as Chinese) and pack them into pieces of at most `chunk_token_num` tokens. This preserves the source characters and still splits scripts that have no whitespace — a plain whitespace split would leave CJK as one un-splittable chunk. 
### Proof (real tokenizer, before/after) Running the old vs new split against the real `infinity.rag_tokenizer`: ``` ENGLISH "Hello World FOO Bar Baz Qux Jumps Over Lazy Dogs" (chunk_token_num=4) OLD: ['hello world foo bar', 'baz qux jump over', 'lazi dog'] # lowercased + stemmed NEW: ['Hello World FOO Bar ', 'Baz Qux Jumps Over ', 'Lazy Dogs'] # preserved; each <= 4 tokens NEW preserves text exactly: True CHINESE "你好世界这是一个测试用例需要被切分成多个块" (chunk_token_num=3) OLD: ['你好世界这是', '一个测试用例需要', ...] # spurious spaces NEW: ['你好世', '界这是', '一个测', ...] # preserved; each <= 3 tokens NEW preserves text exactly: True ``` ### Tests Added `test/unit_test/deepdoc/parser/test_html_parser.py` (English + Chinese oversized blocks, plus small-block merge). Before the fix the two oversized tests fail (English shows lowercasing, Chinese shows inserted spaces); after the fix all pass. `ruff check` clean.	2026-06-25 16:43:35 +08:00
helloxjade	1b2da645c3	fix: deduplicate markdown table chunks (#16143)	2026-06-24 13:22:57 +08:00
jaso0n0818	a70c7e8cc7	fix(deepdoc): attach lone header lines to the following section when delimiter is set (#16109 ) ## Summary Fixes #15487 — lone markdown headers are no longer isolated as empty chunks when a custom `delimiter` is set. - Merge consecutive lone headers before attaching to the following prose body - Skip code fences, tables, lists, and blockquotes via `_is_attachable_body()` - Unit tests include the `# Title / ## Intro / Body` regression from CodeRabbit review ## Validation - `pytest test/unit_test/deepdoc/parser/test_markdown_parser.py` (11 passed locally) Closes #15487	2026-06-18 14:24:09 +08:00
Yingfeng	cf5cca5cbb	Fix wrong unit test path (#15864)	2026-06-09 22:48:33 +08:00
Yash Raj Pandey	14c460a525	Fix: Excel parser emits a spurious header-only chunk at exact chunk_rows multiples (#15490 ) ### What problem does this PR solve? `RAGFlowExcelParser.html()` iterates `(len(rows) - 1) // chunk_rows + 1` times. `rows[0]` is the header, so `len(rows) - 1` is the data-row count. When that count is an exact multiple of `chunk_rows`, the `+ 1` over-counts by one: the final iteration's data slice is empty, but the header row is still appended — producing a chunk that contains only the table header and no data. This is reachable via `rag/app/naive.py` (`html4excel`, `chunk_rows=12`) and `rag/app/one.py`. A sheet with 12/24/36… data rows (or 256/512… with the default `chunk_rows=256`) produces an extra `<table><caption>…</caption><tr><th>…</th></tr></table>` chunk. It is non-empty, so it passes the `if _` filter and gets indexed as a real (empty) chunk. \| data rows (chunk_rows=12) \| before \| after \| \|---\|---\|---\| \| 12 \| 2 chunks (1 header-only) \| 1 \| \| 24 \| 3 chunks (1 header-only) \| 2 \| \| 13 \| 2 (unchanged) \| 2 \| ### Fix Iterate `ceil(n_data / chunk_rows)` times instead of `n_data // chunk_rows + 1`. Adds `test/unit_test/deepdoc/parser/test_excel_parser.py`; the header-only-chunk cases fail before this change and pass after. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Used the Claude CLI while working on this.	2026-06-08 17:16:45 +08:00
bitloi	9f3e289b78	Fix: preserve markdown tables during delimiter extraction (#15632 ) ### What problem does this PR solve? Markdown extraction can split tables row by row when delimiter-based extraction uses a newline delimiter. That loses table structure during chunking even though delimiters should still split normally outside tables. This PR keeps the follow-up to #15482 intentionally narrow: - preserve Markdown pipe tables during delimiter-based extraction - preserve borderless pipe tables during delimiter-based extraction - preserve multiline HTML tables during delimiter-based extraction - keep delimiter splitting unchanged outside protected table ranges Refs #15482 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Testing - `ruff check deepdoc/parser/markdown_parser.py test/unit_test/deepdoc/parser/test_markdown_parser.py` - `python3 run_tests.py -t test/unit_test/deepdoc/parser/test_markdown_parser.py` - `git diff --check`	2026-06-05 10:35:33 +08:00
Yufeng He	5db1b296fb	fix: fall back from empty Docling native chunks (#15601 ) ## Summary - keep the native Docling chunking path when it returns usable chunks - fall back to the standard Docling response parser when a chunked request gets HTTP 200 but returns no usable chunks - add a regression test for older Docling servers that accept the chunking request but return a standard conversion payload ## Why Older external Docling servers can accept a request containing `do_chunking: true` and still return the standard conversion response shape. The current code treats any HTTP 200 from the chunked request as a native chunk response, finds no chunk entries, and returns zero sections without trying the standard response parser. Fixes #15569. ## Validation - `python -m pytest test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py -q` - `python -m py_compile deepdoc\\parser\\docling_parser.py test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py` - `python -m ruff check deepdoc\\parser\\docling_parser.py test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py` - `git diff --check`	2026-06-04 13:42:58 +08:00
bitloi	01a5598aa5	Fix: markdown fenced code block extraction (#15630 ) ### What problem does this PR solve? Markdown extraction currently applies custom delimiters before respecting fenced code blocks. When a delimiter such as a newline is configured, fenced code can be split into separate chunks, and longer outer fences can be closed incorrectly by shorter nested fences. This PR keeps the fix intentionally narrow for the Markdown chunking discussion in #15482: - preserve fenced code blocks when delimiter-based extraction is used - support both backtick and tilde fences - respect fence length so longer outer fences can contain shorter inner fences - keep delimiter splitting unchanged outside fenced blocks Refs #15482 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Testing - `ruff check deepdoc/parser/markdown_parser.py test/unit_test/deepdoc/parser/test_markdown_parser.py` - `python3 run_tests.py -t test/unit_test/deepdoc/parser/test_markdown_parser.py`	2026-06-04 13:33:46 +08:00
monsterDavid	d398d617ca	fix(mineru): skip page chrome blocks to prevent duplicate chunks (#15387 ) ## Summary - Skip MinerU `header`, `footer`, and `page_number` blocks when converting `content_list.json` into sections. - Ignore unsupported block types explicitly so future MinerU output types cannot re-emit the previous text block. Fixes duplicate text in General/naive chunks when parsing PDFs via MinerU (reported with repeated page headers and body text in slices). Closes #15335 ## Test plan - [x] `pytest test/unit_test/deepdoc/parser/test_mineru_parser.py -v` (4/4 passed)	2026-06-01 20:15:04 +08:00
Jonathan Chang	9d1006e4ec	fix: The output of the parser in the ingestion pipeline contains HTML tags (#14920 ) ## Summary This change fixes ingestion quality issues where MinerU parser output may contain HTML fragments (for example, table-related tags like `<tr>`, `<td>`, `<br>`), which were previously passed directly into chunking/tokenization and degraded chunk quality. The fix adds a sanitization step in the MinerU parser path so parsed sections are normalized to clean text before chunking. ## Change Type (select all) - [x] Bug fix - [x] Ingestion pipeline improvement - [x] Parser/chunking quality fix ## Related Issue - https://github.com/infiniflow/ragflow/issues/14831	2026-05-25 16:06:36 +08:00
wdeveloper16	78188ce9e9	Feat: add OpenDataLoader PDF parser backend (#14058 ) (#14097 ) ### What problem does this PR solve? Closes #14058. RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU, Docling, TCADP, PaddleOCR). This PR adds OpenDataLoader ([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)) as a new optional backend, giving users a deterministic, local-first alternative with competitive table extraction accuracy. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --- ### Changes #### Backend - `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser` class inheriting `RAGFlowPdfParser`. Implements `check_installation()` (guards Python package + Java 11+ runtime), `parse_pdf()` with JSON-first extraction (heading/paragraph/table/list/image/formula) and Markdown fallback, position-tag generation compatible with the shared `@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup. - `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in `PARSERS` dict, added to `chunk_token_num=0` override list. - `rag/flow/parser/parser.py` — `"opendataloader"` branch in the pipeline PDF handler + check validation list. #### Infrastructure - `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH. #### Frontend - `web/src/components/layout-recognize-form-field.tsx` — `OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown. Cascades automatically to the pipeline editor's Parser component. #### Docs - `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader entry and full env-var reference. 
--- ### Environment variables \| Variable \| Default \| Description \| \|---\|---\|---\| \| `USE_OPENDATALOADER` \| `false` \| Set `true` to install `opendataloader-pdf` on container startup \| \| `OPENDATALOADER_VERSION` \| latest \| Pin the PyPI release (e.g. `==2.2.1`) \| \| `OPENDATALOADER_HYBRID` \| _(unset)_ \| Enable hybrid AI mode (e.g. `docling-fast`) \| \| `OPENDATALOADER_IMAGE_OUTPUT` \| _(unset)_ \| `off` / `embedded` / `external` \| \| `OPENDATALOADER_OUTPUT_DIR` \| _(tmp)_ \| Persistent output dir; temp dir used + cleaned if unset \| \| `OPENDATALOADER_DELETE_OUTPUT` \| `1` \| `0` to retain intermediate files for debugging \| \| `OPENDATALOADER_SANITIZE` \| _(unset)_ \| `1` to filter prompt-injection patterns from output \| --- ### Dependencies - Runtime: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not added to `pyproject.toml` core deps. Installed by `ensure_opendataloader()` at container startup when `USE_OPENDATALOADER=true`. - System: Java 11+ on PATH (JVM is the underlying engine). The installer skips with a warning if `java` is not found. --- ### How to test Standalone parser: ```bash source .venv/bin/activate uv pip install opendataloader-pdf python3 -c " import sys; sys.path.insert(0, '.') from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser p = OpenDataLoaderParser() print('available:', p.check_installation()) s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline') print(f'sections={len(s)} tables={len(t)}') " ``` ### Benchmark vs Docling ``` file parser secs sections tables ---------------------------------------------------------------------- text-heavy.pdf docling 45.29 148 10 text-heavy.pdf opendataloader 3.14 559 0 table-heavy.pdf docling 7.05 76 3 table-heavy.pdf opendataloader 3.71 90 0 complex.pdf docling 42.67 114 8 complex.pdf opendataloader 3.51 180 0 ```	2026-04-25 00:33:02 +08:00
Daniil Sivak	60ad32a0c2	Feat: support epub parsing (#13650 ) Closes #1398 ### What problem does this PR solve? Adds native support for EPUB files. EPUB content is extracted in spine (reading) order and parsed using the existing HTML parser. No new dependencies required. ### Type of change - [x] New Feature (non-breaking change which adds functionality) To check this parser manually: ```python uv run --python 3.12 python -c " from deepdoc.parser import EpubParser with open('$HOME/some_epub_book.epub', 'rb') as f: data = f.read() sections = EpubParser()(None, binary=data, chunk_token_num=512) print(f'Got {len(sections)} sections') for i, s in enumerate(sections[:5]): print(f'\n--- Section {i} ---') print(s[:200]) " ```	2026-03-17 20:14:06 +08:00
tunsuy	292a1a8566	fix: detect and fallback garbled PDF text to OCR (#13366 ) (#13404 ) ## Problem When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer) cannot map CIDs to correct Unicode characters, outputting PUA characters (U+E000~U+F8FF) or `(cid:xxx)` placeholders. The original code fully trusted pdfplumber text without any garbled detection, causing garbled output in the final parsed result. Relates to #13366 ## Solution ### 1. Garbled text detection functions - `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16), replacement character U+FFFD, control characters, and unassigned/surrogate codepoints - `_is_garbled_text(text, threshold)`: Calculates garbled ratio and detects `(cid:xxx)` patterns ### 2. Box-level fallback (in `__ocr()`) When a text box has ≥50% garbled characters, discard pdfplumber text and fallback to OCR recognition. ### 3. Page-level detection (in `__images__()`) Sample characters from each page; if garbled rate ≥30%, clear all pdfplumber characters for that page, forcing full OCR. ### 4. Layout recognizer CID filtering Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text processing to prevent them from polluting layout analysis. ## Testing - 29 unit tests covering: normal CJK/English text, PUA characters, CID patterns, mixed text, boundary thresholds, edge cases - All 85 existing project unit tests pass without regression	2026-03-10 11:20:31 +08:00
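
The garbled-text check described in commit 292a1a8566 can be illustrated with a short sketch. This is not the code from that commit: the function names `is_garbled_char` and `is_garbled_text`, the regular expression, and the choice to treat any `(cid:N)` placeholder as garbled are assumptions made for illustration. Only the character classes (Private Use Areas, U+FFFD, control, unassigned and surrogate code points) and the 50% and 30% thresholds come from the commit message.

```python
import re
import unicodedata

# Assumed pattern for pdfminer-style placeholders such as "(cid:123)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def is_garbled_char(ch: str) -> bool:
    """Return True for characters that typically signal a failed CID-to-Unicode mapping."""
    cp = ord(ch)
    # Private Use Areas: BMP, Plane 15, Plane 16.
    if 0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD or 0x100000 <= cp <= 0x10FFFD:
        return True
    if cp == 0xFFFD:  # Unicode replacement character
        return True
    category = unicodedata.category(ch)
    if category in ("Cn", "Cs"):  # unassigned or surrogate
        return True
    # Control characters, except ordinary whitespace controls.
    return category == "Cc" and ch not in "\t\n\r"


def is_garbled_text(text: str, threshold: float = 0.5) -> bool:
    """Return True if text has a (cid:N) placeholder or a garbled ratio >= threshold."""
    if CID_PATTERN.search(text):
        return True
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    garbled = sum(1 for ch in chars if is_garbled_char(ch))
    return garbled / len(chars) >= threshold
```

Under these assumptions, a box-level check would call `is_garbled_text(box_text, 0.5)` and a page-level check would call it on sampled page characters with `threshold=0.3`, falling back to OCR when it returns `True`.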

17 Commits
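
The off-by-one described in commit 14c460a525 (the spurious header-only Excel chunk) comes down to the difference between `n_data // chunk_rows + 1` and `ceil(n_data / chunk_rows)`. The sketch below reproduces the before/after counts from that commit message. The helper names `chunk_count_old`, `chunk_count_new`, and `split_rows` are hypothetical and are not taken from `RAGFlowExcelParser`.

```python
import math


def chunk_count_old(n_data: int, chunk_rows: int) -> int:
    """Iteration count before the fix: over-counts at exact multiples."""
    return n_data // chunk_rows + 1


def chunk_count_new(n_data: int, chunk_rows: int) -> int:
    """Iteration count after the fix: ceiling division."""
    return math.ceil(n_data / chunk_rows)


def split_rows(rows: list, chunk_rows: int) -> list:
    """Split rows (rows[0] is the header) into chunks that each repeat the header."""
    header, data = rows[0], rows[1:]
    n_chunks = chunk_count_new(len(data), chunk_rows)
    return [
        [header] + data[i * chunk_rows:(i + 1) * chunk_rows]
        for i in range(n_chunks)
    ]


if __name__ == "__main__":
    for n_data in (12, 24, 13):
        print(n_data, chunk_count_old(n_data, 12), chunk_count_new(n_data, 12))
```

With `chunk_rows=12`, the old formula gives 2, 3, and 2 iterations for 12, 24, and 13 data rows, while the ceiling gives 1, 2, and 2, which matches the table in the commit message. The extra iteration in the old formula is the one whose data slice is empty, leaving only the header.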