ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
helloxjade	1b2da645c3	fix: deduplicate markdown table chunks (#16143 )	2026-06-24 13:22:57 +08:00
Rander	017adf841f	fix(paddleocr): support PP-OCRv6 ocrResults fallback and integrate image parsing (#16150 ) ## Summary This PR fixes two issues discovered during testing of the PaddleOCR async API refactoring: ### 1. PP-OCRv6 returns `ocrResults` instead of `layoutParsingResults` Models like PP-OCRv6 are pure text recognition models that return results in `ocrResults.prunedResult.rec_texts` format rather than the `layoutParsingResults.prunedResult.parsing_res_list` format used by layout-aware models (PaddleOCR-VL series). Changes: - `deepdoc/parser/paddleocr_parser.py`: Extract `ocrResults` alongside `layoutParsingResults` in `_send_request()`, add fallback logic in `_transfer_to_sections()` and `parse_image()` - `internal/entity/models/paddleocr.go`: Add `ocrResults` struct and fallback extraction in Go OCR handler ### 2. Image parsing not integrated into picture chunker The `parse_image()` method existed in PaddleOCRParser but was never called from `rag/app/picture.py` (the module that handles image file uploads). Users configuring PaddleOCR as their layout recognizer would still get local deepdoc OCR for images. Changes: - `rag/app/picture.py`: When `layout_recognize` is set to PaddleOCR, use `PaddleOCROcrModel.parse_image()` instead of local OCR. Falls back gracefully to local OCR on failure. ## Testing Verified end-to-end in Docker: - PaddleOCR-VL-1.6 PDF parsing: ✅ (10 text blocks with bbox) - PaddleOCR-VL-1.6 image parsing: ✅ (219 chars) - PP-OCRv6 PDF parsing with ocrResults fallback: ✅ (10 text blocks) - PP-OCRv6 image parsing with ocrResults fallback: ✅ (136 chars) ## Related PRs - #15967 (merged) - PaddleOCR async Job API refactoring + new models - #16086 (merged) - PaddleOCR image parsing support	2026-06-23 22:02:54 +08:00
Manan Bansal	70c0121b78	Fix: preserve tables when parsing DOCX with the laws parser (#16008 ) (#16155 ) ## What Fixes #16008 — tables contained in a DOCX are silently dropped when the document is parsed with the laws chunking method. ## Root cause `Docx.__call__` in `rag/app/laws.py` iterated `self.doc.paragraphs`, which only yields paragraph elements. Tables are separate `tbl` blocks in the document body, so they were never visited and were lost from the output. (The `naive` parser already handles tables by iterating the document body.) ## Changes - Iterate `self.doc._element.body` so tables are visited in document order alongside paragraphs. - Add a `__table_to_html` helper that renders each table to HTML, including merged-cell `colspan` detection (mirrors the `naive` parser's logic). - Inject each table into the section tree with a sentinel level deeper than any heading, so `Node.build_tree` merges it into its enclosing section — keeping the chapter/article title path as retrieval context rather than producing an orphaned chunk. - Guard the `h2_level` computation against an empty heading set, so a tables-only or empty DOCX no longer raises `IndexError`. This keeps the laws parser's hierarchical chunking and adds table extraction, so users no longer have to choose between losing structure (naive) or losing tables (laws). ## Tests Adds `test/unit_test/rag/test_laws_docx_tables.py` covering: - table content is preserved and carries its section title path, - merged adjacent cells collapse to `colspan`, - tables-only document does not crash, - empty document returns `[]`. All four pass; `ruff check` / `ruff format` are clean.	2026-06-22 09:46:44 +08:00
galuis116	6bfaa3f21e	Fix: SSRF in markdown parser remote image fetch (#15438 ) ### What problem does this PR solve? `rag/app/naive.py` `Markdown.load_images_from_urls` fetched image URLs parsed straight out of an untrusted uploaded markdown document via a raw `requests.get`, with no SSRF validation. Markdown chunking always reaches this path (`return_section_images=True`), so any authenticated user who uploads a `.md`/`.markdown`/`.mdx` file to a knowledge base could make the server issue requests to internal services or cloud-metadata endpoints, e.g. `![x](http://169.254.169.254/latest/meta-data/...)`. The `image/` Content-Type check only gates decoding — the outbound request (the SSRF) always fires. This was the one user-controlled fetch site missed by the project's existing SSRF-hardening (`common/ssrf_guard.py`, already applied to the crawler, SearXNG, RSS connector, MCP/document APIs, and OAuth avatar download). The fix validates and DNS-pins every hop with `common.ssrf_guard.assert_url_is_safe` before connecting, and follows redirects manually so each redirect target is re-validated (closing the DNS-rebinding / redirect-bypass window), mirroring `common/data_source/rss_connector.py`. Blocked URLs are skipped and logged like any other unreachable image, so legitimate public images are unaffected. Adds a regression test at `test/unit_test/rag/app/test_markdown_image_ssrf.py`. Closes #15437 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Ubuntu <ubuntu@ubuntu-2204.linuxvmimages.local> Co-authored-by: galuis116 <galuis116@users.noreply.github.com>	2026-06-16 18:54:55 +08:00
Lynn	7355db183f	Fix: model list (#15905 ) ### What problem does this PR solve? Set OpenDataLoader and call in parser and naive ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 17:44:50 +08:00
Lynn	478c9846a1	Fix: model list (#15860 ) ### What problem does this PR solve? Remove tenant_llm call in rag. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 14:59:57 +08:00
Wang Qi	9aa81e7cad	Fix paddle ocr / minerU cannot add (#15858 ) Fix paddle ocr / minerU cannot add	2026-06-10 13:04:13 +08:00
VictorECDSA	ff5971448b	[Fix] naive: force-merge short markdown headers to prevent separate chunks (#15488 ) ## Problem When uploading `.md` files with `parser=naive` and `delimiter="\n"`, markdown headers (e.g., `## Quick Travel`) become separate chunks with very short content (16-18 characters). This causes retrieval issues: when the header is matched, the corresponding body text is not included in the chunk. ## Related Issues Closes #15487 ## Checklist - [x] Code changes are minimal and focused - [x] Unit tests added (12/12 passed) - [x] No breaking changes	2026-06-03 10:49:28 +08:00
Lynn	dc4b82523b	Feat: tenant llm provider (#14595 ) ### What problem does this PR solve? Python implementation of the Go-based model_provider API suite. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: bill <yibie_jingnian@163.com>	2026-05-29 17:39:41 +08:00
buua436	04bdb41909	Fix: guard missing task language (#15136 ) ### What problem does this PR solve? guard missing task language ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-22 11:46:38 +08:00
Rene Arredondo	f58e0b3eca	Feat: VLM image descriptions in MinerU parser (#14869 ) (#14946 ) ## Summary Closes #14869. Adds VLM-based semantic descriptions to image chunks produced by the MinerU parser, closing a long-standing parity gap with the deepdoc parser's `VisionFigureParser`. A maintainer flagged this in #13342 ("We may add the VLM enhancement to MinerU parser as well") and an earlier proposal exists in #13824; this PR lands the change end-to-end inside the existing parser plumbing. ## Why Today the MinerU parser returns image chunks containing only the native `image_caption` and `image_footnote` strings from MinerU's JSON. When neither is present (or when both are sparse), the chunk carries effectively no searchable content for the figure and retrieval misses it entirely. Users who configured a local VLM (reporter's case: Gemma-4-31B) had to post-process MinerU's `tmp/.json` themselves. The deepdoc parser already solves this via [`VisionFigureParser`](deepdoc/parser/figure_parser.py): when the tenant has an `IMAGE2TEXT` model configured, each figure gets a semantic description merged into its chunk. This PR brings the same behavior to MinerU. ## What changed ### `deepdoc/parser/mineru_parser.py` - New method `_enhance_images_with_vlm(outputs, vision_model, callback=None)`* — collects every `IMAGE` block with a readable `img_path`, runs `rag.app.picture.vision_llm_chunk` in a 10-worker `ThreadPoolExecutor` using the existing `vision_llm_figure_describe_prompt`, and writes the result back as `vlm_description`. Per-image failures are logged and skipped — they never abort the run. - `_transfer_to_sections` (IMAGE branch) — folds `vlm_description` into the section text alongside caption + footnote, so the description becomes part of the chunk and is searchable / retrievable. - `parse_pdf` — after `_read_output`, calls `_enhance_images_with_vlm(outputs, vision_model, callback=callback)` when a `vision_model` kwarg is supplied. Wrapped in `try / except` so a VLM outage cannot break parsing. ### `rag/app/naive.py` (`by_mineru`) After successfully resolving the MinerU OCR parser, also resolves the tenant's default `LLMType.IMAGE2TEXT` model via `get_tenant_default_model_by_type`, wraps it in an `LLMBundle`, and injects it as `kwargs["vision_model"]` before delegating to `parse_pdf`. ## Behavior \| Tenant config \| Behavior \| \|---\|---\| \| `IMAGE2TEXT` model configured \| MinerU image chunks contain `caption + footnote + VLM description`. Retrieval against figures now actually works. \| \| No `IMAGE2TEXT` model configured \| Exact same output as today (caption + footnote only). Lookup fails silently with an info log; no error, no regression. \| \| VLM call fails for a single image \| That image silently falls back to caption + footnote; other images proceed. \| \| Caller already passes `vision_model` in kwargs \| We don't override it — `if "vision_model" not in kwargs` guards the lookup. \| ## Files - `deepdoc/parser/mineru_parser.py` (+56) - `rag/app/naive.py` (+13)	2026-05-19 16:08:10 +08:00
Ricardo-M-L	cb606e1c38	fix: correct attribute name typo model_speciess to model_species (#13929 ) ## Summary - Rename misspelled attribute `model_speciess` to `model_species` across 4 files - The extra `s` is a typo — `species` is already plural ## Test plan - [ ] Verify PDF parsing with laws/manual/paper parser types still works correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuj <yuj@ztjzsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-15 14:19:41 +08:00
Paul Yao	c34c81e8e6	fix: remove duplicate .wav and .aac in audio supported extensions list (#14791 ) What problem does this PR solve? In rag/app/audio.py, the supported audio extensions list contains duplicate entries: .wav appears twice (positions 3 and 5) and .aac appears twice (positions 6 and 14). While this does not affect runtime behavior, it is redundant and makes the code harder to maintain. This PR removes the duplicate entries to keep the list clean and consistent. Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 09:42:31 +08:00
FPlust	0734fd793a	fix: scope pending_cell_images by sheet in excel parser (#14120 ) pending_cell_images should be scoped by sheet ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 13:17:14 +08:00
Ahmad Intisar	3c4d1da98f	Feature/table parser column roles (#13710 ) ### What problem does this PR solve? The table file parser (CSV/Excel) currently treats all columns identically — every column is both vectorized (embedded in chunk text) and stored as filterable metadata. There's no way for users to control which columns should be searchable by semantic meaning versus which should only be filterable attributes. For example, when ingesting a news articles CSV with columns like title, content, country, category, source, etc., the embedding includes metadata fields like country: Brazil and source: Reuters in the chunk text, which dilutes the semantic quality of the embedding without adding retrieval value. The RDBMS connector (MySQL/PostgreSQL) already supports content_columns / metadata_columns, but this capability was missing for file-based table ingestion. This PR adds column-level control (vectorize / metadata / both) for the table file parser, following RAGFlow's existing patterns. Backward compatible: Datasets without table_column_roles or with table_column_mode: auto behave exactly as before (all columns = both). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 10:06:04 +08:00
Idriss Sbaaoui	38f6484e98	Fix OpenDataLoader naive parsing by normalizing `@OpenDataLoader` and filtering unsupported parser kwargs (#14581 ) ### What problem does this PR solve? This PR fixes a bug where `layout_recognize="<name>@OpenDataLoader"` was misrouted and then failed during parsing in the naive parser path. It now routes correctly to OpenDataLoader and avoids passing unsupported arguments that caused runtime errors. fixes #14572 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-06 15:00:55 +08:00
Idriss Sbaaoui	9075872435	Fix: Manual/Naive outline tuple unpack crash (#14518 ) ### What problem does this PR solve? This fixes a crash in Manual and Naive parsing when PDF outlines include page numbers as a third tuple value. It makes outline unpacking accept extra values so parsing no longer fails. fixes #14411 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 11:55:02 +08:00
Idriss Sbaaoui	2a37562791	Fix manual naive parser position extraction fallback (#14420 ) ### What problem does this PR solve? This PR fixes a regression where Manual pipeline + Naive (Plain Text) PDF parsing crashed with `AttributeError: 'PlainParser' object has no attribute 'extract_positions'` in `rag/app/manual.py`. fixes #14411 ### Type of change: - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 14:21:30 +08:00
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. Replace all hardcoded sentinel values \| Layer \| Files Changed \| Old Values \| New Value \| \|---\|---\|---\|---\| \| Deepdoc parsers \| `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` \| `299`, `600`, `109`, `100000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Chunk parsers \| `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` \| `100000`, `10000`, `10000000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Task/DB layer** \| `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` \| `100000000` \| `MAXIMUM_TASK_PAGE_NUMBER` \| ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
yuch85	0d87cecae2	feat: persist PDF bookmark outline as document metadata (#13287 ) ## Summary PDF files often contain a bookmark/outline tree (table of contents built into the file by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples via pypdf, but they are used ephemerally during chunking (`manual` parser uses them for hierarchy detection) and then discarded. This PR persists the outline as `doc.meta_fields["outline"]` — a JSON array of `{"title": str, "depth": int}` objects — so downstream features can use the structural information. ### Why this matters - Complementary to `toc_extraction` — the existing `toc_extraction` feature uses LLM calls to generate a TOC and only works for the `naive` parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure. - Document navigation — frontends can render a clickable TOC from the outline - Entity extraction — the outline provides a structural map for identifying document sections and key topics - Search result context — knowing which section a chunk belongs to helps users evaluate relevance ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/app/naive.py` \| Attach `pdf_parser.outlines` as `__outline__` on first chunk dict \| ~7 \| \| `rag/app/manual.py` \| Same for the manual parser \| ~5 \| \| `rag/svr/task_executor.py` \| Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` \| ~12 \| ### Design decisions - Transient key pattern: The outline is passed from parser → task_executor via `__outline__` on the first chunk dict, then removed before indexing. This follows the same pattern as `metadata_obj` for LLM-generated metadata. - No schema changes: Uses the existing `meta_fields` JSON column on the document table. - Graceful degradation: If a PDF has no outline (common for scanned docs), nothing is stored. If persistence fails, it logs a warning and continues — parsing is not interrupted. ### Backward compatibility - Fully backward compatible — no existing fields, behavior, or schemas changed - PDFs without outlines are unaffected - Existing `meta_fields` data is preserved (merged, not overwritten) ## Test plan - [ ] Parse a PDF with bookmarks (e.g. any multi-chapter document), verify `meta_fields["outline"]` is populated - [ ] Parse a PDF without bookmarks, verify no errors and no outline key in meta_fields - [ ] Verify existing `meta_fields` data is preserved (not overwritten) when outline is added - [ ] Verify `manual` parser also persists outlines - [ ] Verify outline JSON structure: `[{"title": "Chapter 1", "depth": 0}, ...]` Related: #9921 (Deterministic Document Access Layer) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 11:57:06 +08:00
wdeveloper16	78188ce9e9	Feat: add OpenDataLoader PDF parser backend (#14058 ) (#14097 ) ### What problem does this PR solve? Closes #14058. RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU, Docling, TCADP, PaddleOCR). This PR adds OpenDataLoader ([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)) as a new optional backend, giving users a deterministic, local-first alternative with competitive table extraction accuracy. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --- ### Changes #### Backend - `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser` class inheriting `RAGFlowPdfParser`. Implements `check_installation()` (guards Python package + Java 11+ runtime), `parse_pdf()` with JSON-first extraction (heading/paragraph/table/list/image/formula) and Markdown fallback, position-tag generation compatible with the shared `@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup. - `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in `PARSERS` dict, added to `chunk_token_num=0` override list. - `rag/flow/parser/parser.py` — `"opendataloader"` branch in the pipeline PDF handler + check validation list. #### Infrastructure - `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH. #### Frontend - `web/src/components/layout-recognize-form-field.tsx` — `OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown. Cascades automatically to the pipeline editor's Parser component. #### Docs - `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader entry and full env-var reference. --- ### Environment variables \| Variable \| Default \| Description \| \|---\|---\|---\| \| `USE_OPENDATALOADER` \| `false` \| Set `true` to install `opendataloader-pdf` on container startup \| \| `OPENDATALOADER_VERSION` \| latest \| Pin the PyPI release (e.g. `==2.2.1`) \| \| `OPENDATALOADER_HYBRID` \| _(unset)_ \| Enable hybrid AI mode (e.g. `docling-fast`) \| \| `OPENDATALOADER_IMAGE_OUTPUT` \| _(unset)_ \| `off` / `embedded` / `external` \| \| `OPENDATALOADER_OUTPUT_DIR` \| _(tmp)_ \| Persistent output dir; temp dir used + cleaned if unset \| \| `OPENDATALOADER_DELETE_OUTPUT` \| `1` \| `0` to retain intermediate files for debugging \| \| `OPENDATALOADER_SANITIZE` \| _(unset)_ \| `1` to filter prompt-injection patterns from output \| --- ### Dependencies - Runtime: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not added to `pyproject.toml` core deps. Installed by `ensure_opendataloader()` at container startup when `USE_OPENDATALOADER=true`. - System: Java 11+ on PATH (JVM is the underlying engine). The installer skips with a warning if `java` is not found. --- ### How to test Standalone parser: ```bash source .venv/bin/activate uv pip install opendataloader-pdf python3 -c " import sys; sys.path.insert(0, '.') from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser p = OpenDataLoaderParser() print('available:', p.check_installation()) s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline') print(f'sections={len(s)} tables={len(t)}') " ``` ### Benchmark vs Docling ``` file parser secs sections tables ---------------------------------------------------------------------- text-heavy.pdf docling 45.29 148 10 text-heavy.pdf opendataloader 3.14 559 0 table-heavy.pdf docling 7.05 76 3 table-heavy.pdf opendataloader 3.71 90 0 complex.pdf docling 42.67 114 8 complex.pdf opendataloader 3.51 180 0 ```	2026-04-25 00:33:02 +08:00
Wang Qi	8aab158942	OpenSource Resume is supported only with Elasticsearch. (#14233 ) ### What problem does this PR solve? OpenSource Resume is supported only with Elasticsearch. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-21 10:05:47 +08:00
Magicbook1108	27329b40ed	Refact: refact on parser structure (#14012 ) ### What problem does this PR solve? Refact: refact on parser structure ### Type of change - [x] Refactoring	2026-04-10 10:03:44 +08:00
Magicbook1108	69264b3a70	Feat: Refact pipeline (#13826 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 19:26:45 +08:00
Stephen Hu	d32967eda8	refactor: let excel use lazy image loader (#13558 ) ### What problem does this PR solve? let excel use lazy image loader ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-23 21:24:40 +08:00
Magicbook1108	f991cd362e	Fix: type check in resume parsing method (#13740 ) ### What problem does this PR solve? Fix: type check in resume parsing method ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-23 21:19:09 +08:00
Daniil Sivak	60ad32a0c2	Feat: support epub parsing (#13650 ) Closes #1398 ### What problem does this PR solve? Adds native support for EPUB files. EPUB content is extracted in spine (reading) order and parsed using the existing HTML parser. No new dependencies required. ### Type of change - [x] New Feature (non-breaking change which adds functionality) To check this parser manually: ```python uv run --python 3.12 python -c " from deepdoc.parser import EpubParser with open('$HOME/some_epub_book.epub', 'rb') as f: data = f.read() sections = EpubParser()(None, binary=data, chunk_token_num=512) print(f'Got {len(sections)} sections') for i, s in enumerate(sections[:5]): print(f'\n--- Section {i} ---') print(s[:200]) " ```	2026-03-17 20:14:06 +08:00
NeedmeFordev	387b0b27c4	feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527 ) ### What problem does this PR solve? This PR adds support for parsing PDFs through an external Docling server, so RAGFlow can connect to remote `docling serve` deployments instead of relying only on local in-process Docling. It addresses the feature request in [#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns with the external-server usage pattern already used by MinerU. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### What is changed? - Add external Docling server support in `DoclingParser`: - Use `DOCLING_SERVER_URL` to enable remote parsing mode. - Try `POST /v1/convert/source` first, and fallback to `/v1alpha/convert/source`. - Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not set. - Wire Docling env settings into parser invocation paths: - `rag/app/naive.py` - `rag/flow/parser/parser.py` - Add Docling env hints in constants and update docs: - `docs/guides/dataset/select_pdf_parser.md` - `docs/guides/agent/agent_component_reference/parser.md` - `docs/faq.mdx` ### Why this approach? This keeps the change focused on one issue and one capability (external Docling connectivity), without introducing unrelated provider-model plumbing. ### Validation - Static checks: - `python -m py_compile` on changed Python files - `python -m ruff check` on changed Python files - Functional checks: - Remote v1 endpoint path works - v1alpha fallback works - Local Docling path remains available when server URL is unset ### Related links - Feature request: [Support external Docling server (issue #13426)](https://github.com/infiniflow/ragflow/issues/13426) - Compare view for this branch: [main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1) ##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)	2026-03-12 17:09:03 +08:00
eviaaaaa	d0ca388bec	Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329 ) ## Summary This PR is the direct successor to the previous `docx` lazy-loading implementation. It addresses the technical debt intentionally left out in the last PR by fully migrating the `qa` and `manual` parsing strategies to the new lazy-loading model. Additionally, this PR comprehensively refactors the underlying `docx` parsing pipeline to eliminate significant code redundancy and introduces robust fallback mechanisms to handle completely corrupted image streams safely. ## What's Changed * Centralized Abstraction (`docx_parser.py`): Moved the `get_picture` extraction logic up to the `RAGFlowDocxParser` base class. Previously, `naive`, `qa`, and `manual` parsers maintained separate, redundant copies of this method. All downstream strategies now natively gather raw blobs and return `LazyDocxImage` objects automatically. * Robust Corrupted Image Fallback (`docx_parser.py`): Handled edge cases where `python-docx` encounters critically malformed magic headers. Implemented an explicit `try-except` structure that safely intercepts `UnrecognizedImageError` (and similar exceptions) and seamlessly falls back to retrieving the raw binary via `getattr(related_part, "blob", None)`, preventing parser crashes on damaged documents. * Legacy Code & Redundancy Purge: * Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`, and `manual.py`. * Removed the standalone, immediate-decoding `concat_img` method in `manual.py`. It has been completely replaced by the globally unified, lazy-loading-compatible `rag.nlp.concat_img`. * Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception packages) across all updated strategy files. ## Scope To keep this PR focused, I have restricted these changes strictly to the unification of `docx` extraction logic and the lazy-load migration of `qa` and `manual`. ## Validation & Testing I've tested this to ensure no regressions and validated the fallback logic: * Output Consistency: Compared identical `.docx` inputs using `qa` and `manual` strategies before and after this branch: chunk counts, extracted text, table HTML, and attached images match perfectly. * Memory Footprint Drop: Confirmed a noticeable drop in peak memory usage when processing image-dense documents through the `qa` and `manual` pipelines, bringing them up to parity with the `naive` strategy's performance gains. ## Breaking Changes * None.	2026-03-11 10:00:07 +08:00
guptas6est	32d31284cc	Fix: upgrade pypdf to 6.7.5 and migrate from deprecated pypdf2 to fix CVE-2026-28804 and CVE-2023-36464 (#13454 ) ### What problem does this PR solve? This PR addresses security vulnerabilities in PDF processing dependencies identified by Trivy security scan: 1. CVE-2026-28804 (MEDIUM): pypdf 6.7.4 vulnerable to inefficient decoding of ASCIIHexDecode streams 2. CVE-2023-36464 (MEDIUM): pypdf2 3.0.1 susceptible to infinite loop when parsing malformed comments Since pypdf2 is deprecated with no available fixes, this PR migrates all pypdf2 usage to the actively maintained pypdf library (version 6.7.5), which resolves both vulnerabilities. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-09 12:06:00 +08:00
Lynn	62cb292635	Feat/tenant model (#13072 ) ### What problem does this PR solve? Add id for table tenant_llm and apply in LLMBundle. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:27:17 +08:00
Yao Wei	c99b53064d	fix: remove company info from resume_summary to prevent over-retrieval (#13358 ) ### What problem does this PR solve? Problem: When searching for a specific company name like(Daofeng Technology), the search would incorrectly return unrelated resumes containing generic terms like (Technology) in their company names Root Cause: The `corporation_name_tks` field was included in the identity fields that are redundantly written to every chunk. This caused common words like "科技" to match across all chunks, leading to over-retrieval of irrelevant resumes. Solution: Remove `corporation_name_tks` from the `_IDENTITY_FIELDS` list. Company information is still preserved in the "Work Overview" chunk where it belongs, allowing proper company-based searches while preventing false positives from generic terms. --------- Co-authored-by: Aron.Yao <yaowei@192.168.1.68> Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local> Co-authored-by: Liu An <asiro@qq.com>	2026-03-04 19:24:49 +08:00
Magicbook1108	93d621a666	Fix: Correct PDF chunking parameter name in naive (#13357 ) ### What problem does this PR solve? Fix: Correct PDF chunking parameter name in naive #13325 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-04 11:51:10 +08:00
Yao Wei	48755a3352	Fix: (resume) Cross-verify project experience and work experience, and remove duplicate text (#13323 ) Cross-verify project experience and work experience, and remove duplicate text --------- Co-authored-by: Aron.Yao <yaowei@192.168.1.68> Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>	2026-03-03 14:53:46 +08:00
Yongteng Lei	707de2461a	Fix: use async_chat with sync wrapper in resume parser (#13320 ) ### What problem does this PR solve? Fix AttributeError when calling llm.chat() in resume parser. LLMBundle only has async_chat method, not chat method. Use `_run_coroutine_sync` wrapper to call async_chat synchronously. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-02 19:51:06 +08:00
Yao Wei	f8c91e8854	Refa: Resume parsing module (architectural optimizations based on SmartResume Pipeline) (#13255 ) Core optimizations (refer to arXiv:2510.09722): 1. PDF text fusion: Metadata + OCR dual-path extraction and fusion 2. Page-aware reconstruction: YOLOv10 page segmentation + hierarchical sorting + line number indexing 3. Parallel task decomposition: Basic information/work experience/educational background three-way parallel LLM extraction 4. Index pointer mechanism: LLM returns a range of line numbers instead of generating the full text, reducing the illusion of full text. --------- Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local> Co-authored-by: Aron.Yao <yaowei@192.168.1.68> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-02 19:05:50 +08:00
liuxiaoyusky	8ba66dd62a	Fix: respect user-configured chunk_token_num for MinerU/docling/paddleocr parsers (#13234 ) ## Summary When using MinerU, docling, TCADP, or paddleocr as the PDF parser with the General (naive) chunk method, the user-configured `chunk_token_num` is unconditionally overwritten to 0 at [rag/app/naive.py#L858-L859](https://github.com/infiniflow/ragflow/blob/main/rag/app/naive.py#L858-L859), effectively disabling chunk merging regardless of what the user sets in the UI. ### Problem A user sets `chunk_token_num = 2048` in the dataset configuration UI, expecting small parser blocks to be merged into larger chunks. However, this line: ```python if name in ["tcadp", "docling", "mineru", "paddleocr"]: parser_config["chunk_token_num"] = 0 ``` silently overrides the user's setting. As a result, every MinerU output block becomes its own chunk. For short documents (e.g. a 3-page PDF fund factsheet parsed by MinerU), this produces 47 tiny chunks — some as small as 11 characters (`"July 2025"`) or 15 characters (`"CIES Eligible"`). This severely degrades retrieval quality: vector embeddings of such short fragments have minimal semantic value, and keyword search produces excessive noise. ### Fix Only apply the `chunk_token_num = 0` override when the user has not explicitly configured a positive value: ```python if name in ["tcadp", "docling", "mineru", "paddleocr"]: if int(parser_config.get("chunk_token_num", 0)) <= 0: parser_config["chunk_token_num"] = 0 ``` This preserves the original default behavior (no merging) while respecting the user's explicit configuration. ### Before / After (MinerU, 3-page PDF, chunk_token_num=2048) \| \| Before \| After \| \|---\|---\|---\| \| Chunks produced \| 47 \| ~8 (merged by token limit) \| \| Smallest chunk \| 11 chars \| ~500 chars \| \| User setting respected \| No \| Yes \| ## Test plan - [ ] Parse a PDF with MinerU and `chunk_token_num = 2048` → verify chunks are merged up to token limit - [ ] Parse a PDF with MinerU and `chunk_token_num = 0` (or default) → verify original behavior (no merging) - [ ] Parse a PDF with DeepDOC parser → verify no change in behavior (not affected by this code path) - [ ] Repeat with docling/paddleocr if available	2026-03-02 15:31:40 +08:00
Attili-sys	21bc1ab7ec	Feature rtl support (#13118 ) ### What problem does this PR solve? This PR adds comprehensive Right-to-Left (RTL) language support, primarily targeting Arabic and other RTL scripts (Hebrew, Persian, Urdu, etc.). Previously, RTL content had multiple rendering issues: - Incorrect sentence splitting for Arabic punctuation in citation logic - Misaligned text in chat messages and markdown components - Improper positioning of blockquotes and “think” sections - Incorrect table alignment - Citation placement ambiguity in RTL prompts - UI layout inconsistencies when mixing LTR and RTL text This PR introduces backend and frontend improvements to properly detect, render, and style RTL content while preserving existing LTR behavior. #### Backend - Updated sentence boundary regex in `rag/nlp/search.py` to include Arabic punctuation: - `،` (comma) - `؛` (semicolon) - `؟` (question mark) - `۔` (Arabic full stop) - Ensures citation insertion works correctly in RTL sentences. - Updated citation prompt instructions to clarify citation placement rules for RTL languages. #### Frontend - Introduced a new utility: `text-direction.ts` - Detects text direction based on Unicode ranges. - Supports Arabic, Hebrew, Syriac, Thaana, and related scripts. - Provides `getDirAttribute()` for automatic `dir` assignment. - Applied dynamic `dir` attributes across: - Markdown rendering - Chat messages - Search results - Tables - Hover cards and reference popovers - Added proper RTL styling in LESS: - Text alignment adjustments - Blockquote border flipping - Section indentation correction - Table direction switching - Use of `<bdi>` for figure labels to prevent bidirectional conflicts #### DevOps / Environment - Added Windows backend launch script with retry handling. - Updated dependency metadata. - Adjusted development-only React debugging behavior. --- ### Type of change - [x] Bug Fix (non-breaking change which fixes RTL rendering and citation issues) - [x] New Feature (non-breaking change which adds RTL detection and dynamic direction handling) --------- Co-authored-by: 6ba3i <isbaaoui09@gmail.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Ahmad Intisar <168020872+ahmadintisar@users.noreply.github.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-02 13:03:44 +08:00
eviaaaaa	fa71f8d0c7	refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233 ) Summary This PR tackles a significant memory bottleneck when processing image-heavy Word documents. Previously, our pipeline eagerly decoded DOCX images into `PIL.Image` objects, which caused high peak memory usage. To solve this, I've introduced a lazy-loading approach: images are now stored as raw blobs and only decoded exactly when and where they are consumed. This successfully reduces the memory footprint while keeping the parsing output completely identical to before. What's Changed Instead of a dry file-by-file list, here is the logical breakdown of the updates: * The Core Abstraction (`lazy_image.py`): Introduced `LazyDocxImage` along with helper APIs to handle lazy decoding, image-type checks, and NumPy compatibility. It also supports `.close()` and detached PIL access to ensure safe lifecycle management and prevent memory leaks. * Pipeline Integration (`naive.py`, `figure_parser.py`, etc.): Updated the general DOCX picture extraction to return these new lazy images. Downstream consumers (like the figure/VLM flow and base64 encoding paths) now decode images right at the use site using detached PIL instances, avoiding shared-instance side effects. * Compatibility Hooks (`operators.py`, `book.py`, etc.): Added necessary compatibility conversions so these lazy images flow smoothly through existing merging, filtering, and presentation steps without breaking. Scope & What is Intentionally Left Out To keep this PR focused, I have restricted these changes strictly to the general Word pipeline and its downstream consumers. The `QA` and `manual` Word parsing pipelines are explicitly not modified in this PR. They can be safely migrated to this new lazy-load model in a subsequent, standalone PR. Design Considerations I briefly considered adding image compression during processing, but decided against it to avoid any potential quality degradation in the derived outputs. I also held off on a massive pipeline re-architecture to avoid overly invasive changes right now. Validation & Testing I've tested this to ensure no regressions: * Compared identical DOCX inputs before and after this branch: chunk counts, extracted text, table HTML, and image descriptions match perfectly. * Confirmed a noticeable drop in peak memory usage when processing image-dense documents. For a 30MB Word document containing 243 1080p screenshots, memory consumption is reduced by approximately 1.5GB. Breaking Changes None.	2026-02-28 11:22:31 +08:00
Magicbook1108	158503a1aa	Feat: optimize ingestion pipeline with preprocess (#13211 ) ### What problem does this PR solve? Feat: optimize ingestion pipeline with preprocess ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-02-26 10:24:13 +08:00
Magicbook1108	109441628b	Fix: upload image files (#13071 ) ### What problem does this PR solve? Fix: upload image files ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-11 09:47:33 +08:00
qinling0210	4bc622b409	Fix parameter of calling self.dataStore.get() and warning info during parser (#13068 ) ### What problem does this PR solve? Fix parameter of calling self.dataStore.get() and warning info during parser https://github.com/infiniflow/ragflow/issues/13036 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-09 17:56:59 +08:00
yH	5333e764fc	fix: optimize Excel row counting for files with abnormal max_row (#13018 ) ### What problem does this PR solve? Some Excel files have abnormal `max_row` metadata (e.g., `max_row=1,048,534` with only 300 actual data rows). This causes: - `row_number()` returns incorrect count, creating 350+ tasks instead of 1 - `list(ws.rows)` iterates through millions of empty rows, causing system hang This PR uses binary search to find the actual last row with data. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Performance Improvement Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-06 14:43:52 +08:00
Stephen Hu	11703d957d	Refactor: Improve Picture.py resource usage (#13011 ) ### What problem does this PR solve? Improve Picture.py resource usage ### Type of change - [x] Refactoring	2026-02-06 09:50:53 +08:00
Stephen Hu	6c9ca45b30	Refactor: improve close for presentation (#12957 ) ### What problem does this PR solve? improve close for presentation ### Type of change - [x] Refactoring	2026-02-03 10:24:27 +08:00
eviaaaaa	1a2d69edc4	feat: Implement legacy .ppt parsing via Tika (alternative to Aspose) (#12932 ) ## What problem does this PR solve? This PR implements parsing support for legacy PowerPoint files (`.ppt`, 97-2003 format). Currently, parsing these files fails because `python-pptx` natively lacks support for the legacy OLE2 binary format. ## Context: I originally using `aspose-slides` for this purpose. However, since `aspose-slides` is no longer a project dependency, I implemented a fallback mechanism using the existing `tika-server` to ensure compatibility and stability. ## Key Changes: - Fallback Logic: Modified `rag/app/presentation.py` to catch `python-pptx` failures and automatically fall back to Tika parsing. - No New Dependencies: Utilizes the `tika` service that is already part of the RAGFlow stack. - Note: Since Tika focuses on text extraction, this implementation extracts text content but does not generate slide thumbnails . ## 🧪 Test / Verification Results ### 1. Before (The Issue) I have verified the fix using a legacy `.ppt` file (`math(1).ppt`, ~8MB). <img width="963" height="970" alt="image" src="https://github.com/user-attachments/assets/468c4ba8-f90b-4d7b-b969-9c5f5e42c474" /> ### 2. After (The Fix) With this PR, the system detects the failure in python-pptx and successfully falls back to Tika. The text is extracted correctly. <img width="1467" height="1121" alt="image" src="https://github.com/user-attachments/assets/fa0fed3b-b923-4c86-ba2c-24b3ce6ee7a6" /> Type of change - [x] New Feature (non-breaking change which adds functionality) Signed-off-by: evilhero <2278596667@qq.com> Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-02-02 13:40:51 +08:00
Carve_	23bdf25a1f	feature:Add OceanBase Storage Support for Table Parser (#12923 ) ### What problem does this PR solve? close #12770 This PR adds OceanBase as a storage backend for the Table Parser. It enables dynamic table schema storage via JSON and implements OceanBase SQL execution for text-to-SQL retrieval. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes - Table Parser stores row data into `chunk_data` when doc engine is OceanBase. (table.py) - OceanBase table schema adds `chunk_data` JSON column and migrates if needed. - Implemented OceanBase `sql()` to execute text-to-SQL results. (ob_conn.py) - Add `DOC_ENGINE_OCEANBASE` flag for engine detection (setting.py) ### Test 1. Set `DOC_ENGINE=oceanbase` (e.g. in `docker/.env`) <img width="1290" height="783" alt="doc_engine_ob" src="https://github.com/user-attachments/assets/7d1c609f-7bf2-4b2e-b4cc-4243e72ad4f1" /> 2. Upload an Excel file to Knowledge Base.(for test, we use as below) <img width="786" height="930" alt="excel" src="https://github.com/user-attachments/assets/bedf82f2-cd00-426b-8f4d-6978a151231a" /> 3. Choose Table as parsing method. <img width="2550" height="1134" alt="parse_excel" src="https://github.com/user-attachments/assets/aba11769-02be-4905-97e1-e24485e24cd0" /> 4.Ask a natural language query in chat. <img width="2550" height="1134" alt="query" src="https://github.com/user-attachments/assets/26a910a6-e503-4ac7-b66a-f5754bbb0e91" />	2026-01-31 15:11:54 +08:00
Kevin Hu	f262d416fe	Refa: remove aspose dependency. (#12910 ) ### Type of change - [x] Refactoring	2026-01-30 14:06:19 +08:00
Kevin Hu	f1c2fac03e	Refa: remove ppt image. (#12909 ) ### What problem does this PR solve? remove `aspose` ### Type of change - [x] Refactoring	2026-01-30 13:35:42 +08:00
Kevin Hu	32c0161ff1	Refa: Clean the folders. (#12890 ) ### Type of change - [x] Refactoring	2026-01-29 14:23:26 +08:00

1 2 3 4 5 ...

318 Commits