ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-03 17:21:59 +08:00

Author	SHA1	Message	Date
Lynn	bc54903bf6	Fix: display model_id in memory_list (#16567 )	2026-07-02 20:28:27 +08:00
Lynn	400476f0b3	Feat: SoMark (#16482 ) Follow #15486 Co-authored-by: limuting <limuting233@gmail.com> Co-authored-by: lutianyi <lutianyi233@163.com> Co-authored-by: justinychuang <huangyicheng@soulcode.cn> Co-authored-by: maybehokori <138367708+maybehokori@users.noreply.github.com>	2026-07-01 13:29:28 +08:00
Rander	62698725ca	feat(paddleocr): add image parsing support with async Job API (#16086 ) ## Summary Add image parsing capability to PaddleOCR integration, building on top of #15967 (async Job API migration). ## Changes ### `deepdoc/parser/paddleocr_parser.py` - Add `parse_image()` method that uses the same async Job API flow as `parse_pdf()` - Extracts text from `layoutParsingResults` → `prunedResult` → `parsing_res_list` - Returns concatenated block content as a single string ### `rag/llm/ocr_model.py` - Add `parse_image()` wrapper to `PaddleOCROcrModel` with availability check and logging ## Relationship to other PRs - Depends on: #15967 (async Job API migration) — this PR is based on that branch - Replaces: #14826 (original image processing PR based on old sync API) ## Notes This PR uses `base_url` and the async Job API (submit → poll → fetch) consistent with #15967, rather than the old `api_url` + sync POST pattern from #14826.	2026-06-16 19:34:38 +08:00
Rander	1235da7093	refactor(paddleocr): migrate from sync API to async Job API (#15967 ) ## Summary Migrate PaddleOCR integration from the deprecated synchronous HTTP API to the new asynchronous Job API (`submit → poll → fetch`), aligning with PaddleOCR 3.6.0+ architecture. ## Changes ### Python (`deepdoc/parser/paddleocr_parser.py`) - Replace synchronous `requests.post()` with async Job API flow (submit → poll → fetch) - Authentication: `token {token}` → `Bearer {token}` - File transfer: base64 JSON body → multipart file upload - Polling: exponential backoff (initial 3s, ×1.5, max 15s, timeout controlled by `request_timeout`) - Result: fetch full JSONL from result URL, preserving `prunedResult` with bbox info for crop functionality - Rename `api_url` → `base_url` (backward compatible: `api_url` still accepted as fallback) ### Python (`rag/llm/ocr_model.py`) - Prefer `paddleocr_base_url` / `PADDLEOCR_BASE_URL`, fallback to `paddleocr_api_url` / `PADDLEOCR_API_URL` ### Go (`internal/entity/models/paddleocr.go`) - Add `Client-Platform: ragflow` header to submit and poll requests - Change polling from fixed 3s to exponential backoff (initial 3s, ×1.5, max 15s) ### Python (`common/constants.py`) - Add `PADDLEOCR_BASE_URL` to env keys and default config ## Backward Compatibility - Old env var `PADDLEOCR_API_URL` still works (used as fallback) - Frontend field `paddleocr_api_url` still works (backend reads it as fallback) - No user-facing configuration changes required for existing setups ## Why not use the `paddleocr` SDK package directly? RAGFlow's `_transfer_to_sections()` relies on `prunedResult` (containing `block_bbox`, `block_label`, `parsing_res_list`) from the raw API response for PDF crop functionality. The SDK's public `parse_document()` API only returns `DocParsingResult` with `markdown_text`, discarding the bbox data. Therefore we implement the async Job API flow directly via HTTP, following the same logic as the SDK internally.	2026-06-16 19:34:21 +08:00
wdeveloper16	78188ce9e9	Feat: add OpenDataLoader PDF parser backend (#14058 ) (#14097 ) ### What problem does this PR solve? Closes #14058. RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU, Docling, TCADP, PaddleOCR). This PR adds OpenDataLoader ([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)) as a new optional backend, giving users a deterministic, local-first alternative with competitive table extraction accuracy. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --- ### Changes #### Backend - `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser` class inheriting `RAGFlowPdfParser`. Implements `check_installation()` (guards Python package + Java 11+ runtime), `parse_pdf()` with JSON-first extraction (heading/paragraph/table/list/image/formula) and Markdown fallback, position-tag generation compatible with the shared `@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup. - `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in `PARSERS` dict, added to `chunk_token_num=0` override list. - `rag/flow/parser/parser.py` — `"opendataloader"` branch in the pipeline PDF handler + check validation list. #### Infrastructure - `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH. #### Frontend - `web/src/components/layout-recognize-form-field.tsx` — `OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown. Cascades automatically to the pipeline editor's Parser component. #### Docs - `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader entry and full env-var reference. --- ### Environment variables \| Variable \| Default \| Description \| \|---\|---\|---\| \| `USE_OPENDATALOADER` \| `false` \| Set `true` to install `opendataloader-pdf` on container startup \| \| `OPENDATALOADER_VERSION` \| latest \| Pin the PyPI release (e.g. `==2.2.1`) \| \| `OPENDATALOADER_HYBRID` \| _(unset)_ \| Enable hybrid AI mode (e.g. `docling-fast`) \| \| `OPENDATALOADER_IMAGE_OUTPUT` \| _(unset)_ \| `off` / `embedded` / `external` \| \| `OPENDATALOADER_OUTPUT_DIR` \| _(tmp)_ \| Persistent output dir; temp dir used + cleaned if unset \| \| `OPENDATALOADER_DELETE_OUTPUT` \| `1` \| `0` to retain intermediate files for debugging \| \| `OPENDATALOADER_SANITIZE` \| _(unset)_ \| `1` to filter prompt-injection patterns from output \| --- ### Dependencies - Runtime: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not added to `pyproject.toml` core deps. Installed by `ensure_opendataloader()` at container startup when `USE_OPENDATALOADER=true`. - System: Java 11+ on PATH (JVM is the underlying engine). The installer skips with a warning if `java` is not found. --- ### How to test Standalone parser: ```bash source .venv/bin/activate uv pip install opendataloader-pdf python3 -c " import sys; sys.path.insert(0, '.') from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser p = OpenDataLoaderParser() print('available:', p.check_installation()) s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline') print(f'sections={len(s)} tables={len(t)}') " ``` ### Benchmark vs Docling ``` file parser secs sections tables ---------------------------------------------------------------------- text-heavy.pdf docling 45.29 148 10 text-heavy.pdf opendataloader 3.14 559 0 table-heavy.pdf docling 7.05 76 3 table-heavy.pdf opendataloader 3.71 90 0 complex.pdf docling 42.67 114 8 complex.pdf opendataloader 3.51 180 0 ```	2026-04-25 00:33:02 +08:00
Lin Manhui	2e09db02f3	feat: add paddleocr parser (#12513 ) ### What problem does this PR solve? Add PaddleOCR as a new PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-09 17:48:45 +08:00
Magicbook1108	82d4e5fb87	Ref: update loggings (#11987 ) ### What problem does this PR solve? Ref: update loggins ### Type of change - [x] Refactoring	2025-12-17 15:43:25 +08:00
Yongteng Lei	03f9be7cbb	Refa: only support MinerU-API now (#11977 ) ### What problem does this PR solve? Only support MinerU-API now, still need to complete frontend for pipeline to allow the configuration of MinerU options. ### Type of change - [x] Refactoring	2025-12-17 12:58:48 +08:00
concertdictate	49c74d08e8	Feature/mineru improvements (#11938 ) 我已在下面的评论中用中文重复说明。 ### What problem does this PR solve? ## Summary This PR enhances the MinerU document parser with additional configuration options, giving users more control over PDF parsing behavior and improving support for multilingual documents. ## Changes ### Backend (`deepdoc/parser/mineru_parser.py`) - Added configurable parsing options: - Parse Method: `auto`, `txt`, or `ocr` — allows users to choose the extraction strategy - Formula Recognition: Toggle for enabling/disabling formula extraction (useful to disable for Cyrillic documents where it may cause issues) - Table Recognition: Toggle for enabling/disabling table extraction - Added language code mapping (`LANGUAGE_TO_MINERU_MAP`) to translate RAGFlow language settings to MinerU-compatible language codes for better OCR accuracy - Improved parser configuration handling to pass these options through the processing pipeline ### Frontend (`web/`) - Created new `MinerUOptionsFormField` component that conditionally renders when MinerU is selected as the layout recognition engine - Added UI controls for: - Parse method selection (dropdown) - Formula recognition toggle (switch) - Table recognition toggle (switch) - Added i18n translations for English and Chinese - Integrated the options into both the dataset creation dialog and dataset settings page ### Integration - Updated `rag/app/naive.py` to forward MinerU options to the parser - Updated task service to handle the new configuration parameters ## Why MinerU is a powerful document parser, but the default settings don't work well for all document types. This PR allows users to: 1. Choose the best parsing method for their documents 2. Disable formula recognition for Cyrillic/non-Latin scripts where it causes issues 3. Control table extraction based on document needs 4. Benefit from automatic language detection for better OCR results ## Testing - [x] Tested MinerU parsing with different parse methods - [x] Verified UI renders correctly when MinerU is selected/deselected - [x] Confirmed settings persist correctly in dataset configuration ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [x] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: user210 <user210@rt> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-12-16 13:15:25 +08:00
Yongteng Lei	ad6f7fd4b0	Fix: pipeline ignore MinerU backend config and vllm module is missing (#11955 ) ### What problem does this PR solve? Fix pipeline ignore MinerU backend config and vllm module is missing. #11944, #11947. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-15 18:03:34 +08:00
Yongteng Lei	e9710b7aa9	Refa: treat MinerU as an OCR model 2 (#11905 ) ### What problem does this PR solve? Treat MinerU as an OCR model 2. #11903 ### Type of change - [x] Refactoring	2025-12-11 17:33:12 +08:00
Yongteng Lei	a94b3b9df2	Refa: treat MinerU as an OCR model (#11849 ) ### What problem does this PR solve? Treat MinerU as an OCR model. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2025-12-09 18:54:14 +08:00

12 Commits