ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 15:31:05 +08:00

Author	SHA1	Message	Date
FuturMix	2548c28d65	feat: add FuturMix as model provider (#14419 ) ## Summary Add [FuturMix](https://futurmix.ai) as a new model provider. FuturMix is an OpenAI-compatible unified AI gateway that provides access to 22+ models (GPT, Claude, Gemini, DeepSeek, and more) through a single API endpoint and key. - API Base: `https://futurmix.ai/v1` (OpenAI-compatible) - Supported capabilities: Chat, Embedding, Image2Text, TTS, Speech2Text, Rerank ### Changes \| File \| Change \| \|------\|--------\| \| `rag/llm/__init__.py` \| Add `FuturMix` to `SupportedLiteLLMProvider` enum, `FACTORY_DEFAULT_BASE_URL`, and `LITELLM_PROVIDER_PREFIX` \| \| `rag/llm/chat_model.py` \| Add `FuturMixChat(Base)` — follows Astraflow/Avian pattern \| \| `rag/llm/embedding_model.py` \| Add `FuturMixEmbed(OpenAIEmbed)` — follows Astraflow pattern \| \| `rag/llm/cv_model.py` \| Add `FuturMixCV(GptV4)` — follows SILICONFLOW/OpenRouter pattern \| \| `rag/llm/tts_model.py` \| Add `FuturMixTTS(OpenAITTS)` — follows CometAPI/DeerAPI pattern \| \| `rag/llm/sequence2txt_model.py` \| Add `FuturMixSeq2txt(GPTSeq2txt)` — follows StepFun pattern \| \| `rag/llm/rerank_model.py` \| Add `FuturMixRerank(OpenAI_APIRerank)` \| \| `conf/llm_factories.json` \| Add factory config with 8 chat, 2 embedding, 1 image2text, 2 TTS, 1 speech2text models \| \| `docs/guides/models/supported_models.mdx` \| Add FuturMix to supported models table \| ### Models included - Chat: claude-sonnet-4-20250514, claude-3.5-haiku, gpt-4o, gpt-4o-mini, gemini-2.5-flash, gemini-2.0-flash, deepseek-chat, deepseek-reasoner - Embedding: text-embedding-3-small, text-embedding-3-large - Image2Text: gpt-4o - TTS: tts-1, tts-1-hd - Speech2Text: whisper-1 ## Test plan - [ ] Verify FuturMix appears in the model provider list in RAGFlow UI - [ ] Configure FuturMix with API key and test chat completion - [ ] Test embedding model with document indexing - [ ] Test image2text with a sample image 🤖 Generated with [Claude 
Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-30 10:59:37 +08:00
Magicbook1108	de8c6ad0f3	Feat: enable sync deleted file for Discord (#14451 ) ### What problem does this PR solve? Feat: enable sync deleted file for Discord ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:40 +08:00
bitloi	2bc8c6d35e	feat(dropbox): support deleted-file sync (#14476 ) ### What problem does this PR solve? Partially addresses #14362 by adding deleted-file sync support for the Dropbox data source. Dropbox previously did not provide the slim current-file snapshot required by stale document reconciliation, and its sync runner returned only document batches. As a result, enabling deleted-file sync could not remove local documents that had been deleted from Dropbox. This PR: - Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`. - Reuses Dropbox metadata traversal to collect current remote file IDs without downloading file contents. - Wires incremental Dropbox sync to return `(document_generator, file_list)` when `sync_deleted_files` is enabled. - Enables the deleted-file sync toggle for Dropbox in the data source settings UI. - Adds regression coverage for slim snapshots, nested folders, paginated listings, duplicate filenames, and full reindex behavior. Tests: - `uv run pytest test/unit_test/common/test_dropbox_connector.py -q` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `uv run pytest test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py -q` - `uv run ruff check common/data_source/dropbox_connector.py rag/svr/sync_data_source.py test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:11 +08:00
Magicbook1108	db1a73b255	Feat: enable sync deleted files in gitlab (#14481 ) ### What problem does this PR solve? Feat: enable sync deleted files in gitlab ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:04:10 +08:00
Magicbook1108	e0b3070012	Feat: enable sync deleted files for Gmail && fix google drive issues (#14462 ) ### What problem does this PR solve? Feat: enable sync deleted files for Gmail && fix google drive issues ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: bill <yibie_jingnian@163.com> Co-authored-by: balibabu <assassin_cike@163.com>	2026-04-29 17:03:56 +08:00
buua436	c08ced09a7	Fix: add retrieval fallback comments (#14457 ) ### What problem does this PR solve? add retrieval fallback comments ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 14:44:31 +08:00
buua436	a7ce1b1677	Fix: prune deleted doc chunks from retrieval (#14454 ) ### What problem does this PR solve? prune deleted doc chunks from retrieval ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 13:03:09 +08:00
Magicbook1108	3b7a6eaa6c	Feat: sync deleted files in Bitbucket (#14450 ) ### What problem does this PR solve? Feat: sync deleted files in Bitbucket ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 11:29:17 +08:00
Paras Sondhi	74fa54f122	feat(google-drive): optimize memory payload and enable sync deletion (#14372 ) Addresses the Google Drive integration for #14362 This PR completely overhauls the Google Drive sync logic to accurately detect remote deletions, while drastically reducing the memory footprint during the snapshot phase. ### What changed under the hood: * Killed the memory bloat: Swapped out the massive document dictionary objects for a lightweight `collections.namedtuple` (`SlimDoc = namedtuple('SlimDoc', ['id'])`). This prevents RAM spikes during `retrieve_all_slim_docs_perm_sync` on massive enterprise drives. * Flawless downstream integration: The `SlimDoc` object relies on simple duck typing. It perfectly delivers the `.id` attribute required by `ConnectorService.cleanup_stale_documents_for_task`, meaning your core `hash128` vector cleanup logic runs natively without modification. * Fixed the Shared Drive blindspot: The standard API query was missing team folders. Injected the `corpora="allDrives"` and `includeItemsFromAllDrives=True` override flags so the connector now accurately maps state across both personal workspaces and organizational Shared Drives. ### Testing: Isolated the Google API retrieval logic locally to prove the `SlimDoc` mapping works and correctly registers state drops when a file is trashed remotely. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Performance Improvement	2026-04-29 10:04:36 +08:00
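The slim-snapshot idea in the commit above can be sketched roughly as follows. This is a minimal illustration, not the connector's code: `SlimDoc` mirrors the namedtuple quoted in the message, while `find_stale_ids` and the sample IDs are hypothetical names introduced here.

```python
from collections import namedtuple

# Mirrors the namedtuple quoted in the commit message: only the ID is kept.
SlimDoc = namedtuple("SlimDoc", ["id"])


def find_stale_ids(indexed_ids, remote_snapshot):
    """Hypothetical helper: IDs that are indexed locally but no longer remote.

    Any object exposing an `.id` attribute works here (duck typing).
    """
    remote_ids = {doc.id for doc in remote_snapshot}
    return sorted(set(indexed_ids) - remote_ids)


# "b" was deleted remotely, so it is the only stale ID.
snapshot = [SlimDoc(id="a"), SlimDoc(id="c")]
stale = find_stale_ids(["a", "b", "c"], snapshot)
print(stale)
```

Keeping only the ID per remote file is what keeps the snapshot small; the cleanup step needs nothing else to compute the set difference.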
Stephen Hu	345bec812d	refactor: improve QwenRerank logic (#14388 ) ### What problem does this PR solve? improve QwenRerank logic ### Type of change - [x] Refactoring	2026-04-28 20:17:34 +08:00
Magicbook1108	0d18b293f5	Fix: enable sync deleted file in airtable (#14438 ) ### What problem does this PR solve? Fix: enable sync deleted file in airtable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 20:09:08 +08:00
buua436	e6e80041f5	Fix: agent toolcall null response & schema validation & DeepSeek think history (#14425 ) ### What problem does this PR solve? agent toolcall null response & schema validation & DeepSeek think history ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 17:09:08 +08:00
Magicbook1108	18fbfafca6	Feat: enable sync deleted files for more connectors (#14353 ) ### What problem does this PR solve? Feat: enable sync deleted files for connectors ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-28 15:07:14 +08:00
Idriss Sbaaoui	2a37562791	Fix manual naive parser position extraction fallback (#14420 ) ### What problem does this PR solve? This PR fixes a regression where Manual pipeline + Naive (Plain Text) PDF parsing crashed with `AttributeError: 'PlainParser' object has no attribute 'extract_positions'` in `rag/app/manual.py`. fixes #14411 ### Type of change: - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 14:21:30 +08:00
Jack	872ff08304	Fix: add executor.shutdown (#14403 ) ### What problem does this PR solve? Add executor shutdown in finally clause to free resources. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 22:38:43 +08:00
Idriss Sbaaoui	4303be223f	Fix metadata parsing regression for upgraded v0.24 datasets (#14383 ) ### What problem does this PR solve? This PR fixes issue #14371 where file parsing failed after upgrading from v0.24.0 to v0.25.0, because metadata config could be a JSON Schema object but was handled like a list and later caused `KeyError: 'properties'`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 16:18:06 +08:00
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. 
Replace all hardcoded sentinel values \| Layer \| Files Changed \| Old Values \| New Value \| \|---\|---\|---\|---\| \| Deepdoc parsers \| `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` \| `299`, `600`, `109`, `100000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Chunk parsers \| `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` \| `100000`, `10000`, `10000000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Task/DB layer \| `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` \| `100000000` \| `MAXIMUM_TASK_PAGE_NUMBER` \| ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
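The fix described in the commit above (shared constants plus explicit forwarding of the page range) might look roughly like this. It is a sketch under assumptions: the two constants are copied from the message, but `render_pages` is a hypothetical stand-in for the real rendering step, and the half-open page range is an assumption made for the example.

```python
# Values as stated in the commit message.
MAXIMUM_PAGE_NUMBER = 100000                           # parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000  # task/DB layer


def render_pages(total_pages, page_from=0, page_to=MAXIMUM_PAGE_NUMBER):
    """Hypothetical renderer: returns the page indices that would be rendered.

    The range is treated as half-open here; the real parser may differ.
    """
    return list(range(page_from, min(page_to, total_pages)))


def parse_into_bboxes(total_pages, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
    # Forward the range explicitly instead of relying on the callee's default.
    return render_pages(total_pages, page_from=from_page, page_to=to_page)


print(len(parse_into_bboxes(1200)))   # every page of a 1200-page file
print(parse_into_bboxes(1200, 10, 12))
```

With a single named sentinel, a caller that omits the range gets "all pages" rather than a silently truncated prefix.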
yuch85	0d87cecae2	feat: persist PDF bookmark outline as document metadata (#13287 ) ## Summary PDF files often contain a bookmark/outline tree (table of contents built into the file by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples via pypdf, but they are used ephemerally during chunking (`manual` parser uses them for hierarchy detection) and then discarded. This PR persists the outline as `doc.meta_fields["outline"]` — a JSON array of `{"title": str, "depth": int}` objects — so downstream features can use the structural information. ### Why this matters - Complementary to `toc_extraction` — the existing `toc_extraction` feature uses LLM calls to generate a TOC and only works for the `naive` parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure. - Document navigation — frontends can render a clickable TOC from the outline - Entity extraction — the outline provides a structural map for identifying document sections and key topics - Search result context — knowing which section a chunk belongs to helps users evaluate relevance ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/app/naive.py` \| Attach `pdf_parser.outlines` as `__outline__` on first chunk dict \| ~7 \| \| `rag/app/manual.py` \| Same for the manual parser \| ~5 \| \| `rag/svr/task_executor.py` \| Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` \| ~12 \| ### Design decisions - Transient key pattern: The outline is passed from parser → task_executor via `__outline__` on the first chunk dict, then removed before indexing. This follows the same pattern as `metadata_obj` for LLM-generated metadata. - No schema changes: Uses the existing `meta_fields` JSON column on the document table. - Graceful degradation: If a PDF has no outline (common for scanned docs), nothing is stored. 
If persistence fails, it logs a warning and continues — parsing is not interrupted. ### Backward compatibility - Fully backward compatible — no existing fields, behavior, or schemas changed - PDFs without outlines are unaffected - Existing `meta_fields` data is preserved (merged, not overwritten) ## Test plan - [ ] Parse a PDF with bookmarks (e.g. any multi-chapter document), verify `meta_fields["outline"]` is populated - [ ] Parse a PDF without bookmarks, verify no errors and no outline key in meta_fields - [ ] Verify existing `meta_fields` data is preserved (not overwritten) when outline is added - [ ] Verify `manual` parser also persists outlines - [ ] Verify outline JSON structure: `[{"title": "Chapter 1", "depth": 0}, ...]` Related: #9921 (Deterministic Document Access Layer) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 11:57:06 +08:00
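The transient-key pattern described in the commit above can be illustrated with plain dictionaries. This is only a sketch: the `__outline__` key and the `{"title", "depth"}` shape come from the message, while the three helper functions are hypothetical and do not reproduce `DocMetadataService`.

```python
def attach_outline(chunks, outlines):
    """Parser side: stash (title, depth) tuples on the first chunk dict."""
    if chunks and outlines:
        chunks[0]["__outline__"] = [
            {"title": title, "depth": depth} for title, depth in outlines
        ]
    return chunks


def pop_outline(chunks):
    """Executor side: remove the transient key so it is never indexed."""
    outline = None
    for chunk in chunks:
        found = chunk.pop("__outline__", None)
        if found:
            outline = found
    return outline


def merge_outline(meta_fields, outline):
    """Merge into existing metadata without overwriting other keys."""
    if not outline:
        return dict(meta_fields)  # no outline: store nothing new
    return {**meta_fields, "outline": outline}


chunks = attach_outline(
    [{"text": "intro"}, {"text": "body"}],
    [("Chapter 1", 0), ("Section 1.1", 1)],
)
outline = pop_outline(chunks)
meta = merge_outline({"author": "someone"}, outline)
print(meta)
```

Popping the key before indexing keeps the chunk schema unchanged, which is why the approach needs no schema migration.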
euvre	f3b7d55a1e	fix: handle Infinity table-not-exist error (3022) in update() methods (#14153 ) ### What problem does this PR solve? ## Summary Closes #6102 When using Infinity as the document store engine (GPU version), calling `update()` on a non-existent table throws an unhandled `InfinityException` with error code 3022 (`TABLE_NOT_EXIST`). This causes users to see a raw "3022" error when clicking on a parsed document. ## Root Cause The `update()` methods in both `rag/utils/infinity_conn.py` and `memory/utils/infinity_conn.py` call `db_instance.get_table(table_name)` without catching `InfinityException`. In contrast, other CRUD methods (`insert`, `delete`, `search`) all handle this exception gracefully: \| Method \| Handles table-not-exist? \| Behavior \| \|----------\|--------------------------\|----------\| \| `insert` \| ✅ Yes \| Auto-creates the table \| \| `search` \| ✅ Yes \| Skips the table \| \| `delete` \| ✅ Yes \| Returns 0 \| \| `update` \| ❌ No \| Crashes with 3022 \| Additionally, `api/apps/document_app.py` worked around this with a fragile string match (`"3022" in msg`) to detect the error. ## Changes - `rag/utils/infinity_conn.py`: Catch `InfinityException` in `update()`. When `TABLE_NOT_EXIST` is detected, log a warning and return `False` — consistent with `delete()`. - `memory/utils/infinity_conn.py`: Apply the same fix to its `update()` method. - `api/apps/document_app.py`: Remove the fragile `"3022"` string-matching workaround. Table-not-exist is now handled by the `if not ok` path with an improved error message. ### Type of change - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 11:52:22 +08:00
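The error-handling change in the commit above can be sketched without the Infinity client. Everything below is a stand-in: `FakeInfinityException` and `FakeDb` are hypothetical classes invented for the example, and only the error code 3022 and the "log a warning and return `False`" behavior come from the message.

```python
import logging

TABLE_NOT_EXIST = 3022  # error code quoted in the commit message


class FakeInfinityException(Exception):
    """Hypothetical stand-in for the client's exception type."""

    def __init__(self, error_code, message=""):
        super().__init__(message)
        self.error_code = error_code


class FakeDb:
    """Hypothetical in-memory database keyed by table name."""

    def __init__(self, tables):
        self.tables = tables

    def get_table(self, name):
        if name not in self.tables:
            raise FakeInfinityException(TABLE_NOT_EXIST, f"table {name} not found")
        return self.tables[name]


def update(db, table_name, row_id, new_values):
    try:
        table = db.get_table(table_name)
    except FakeInfinityException as exc:
        if exc.error_code == TABLE_NOT_EXIST:
            logging.warning("update skipped: table %s does not exist", table_name)
            return False
        raise  # any other error is still surfaced
    table.setdefault(row_id, {}).update(new_values)
    return True


db = FakeDb({"chunks": {}})
print(update(db, "chunks", "1", {"text": "hello"}))
print(update(db, "missing", "1", {"text": "hello"}))
```

Matching on the error code rather than on a substring of the message is what lets the caller drop the fragile `"3022" in msg` check.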
yuch85	3ad3241ae0	feat: persist RAPTOR layer metadata on summary chunks (#13286 ) ## Summary RAPTOR's recursive clustering builds a `layers` list tracking `(start_idx, end_idx)` boundaries per level, but currently discards this information — only the flat `chunks` list is returned. This makes it impossible to distinguish leaf-level summaries from top-level ones. This PR: - Returns `(chunks, layers)` tuple from `raptor.py`'s `__call__` - Annotates each RAPTOR summary chunk with `raptor_layer_int` (1 = first summary level, 2 = summary-of-summaries, etc.) - Adds `raptor_layer_int` to `infinity_mapping.json` (Elasticsearch handles it via existing `_int` dynamic template) ### Why this matters Downstream features need to know which RAPTOR layer a summary belongs to: - Retrieving the top-level document summary* for entity extraction, search snippets, or document comparison - Filtering by abstraction level — users may want only high-level summaries or only leaf-level cluster summaries - RAPTOR recall quality — #10951 reports summaries not being recalled for definition queries; layer metadata enables targeted retrieval ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/raptor.py` \| Return `(chunks, layers)` tuple \| ~3 \| \| `rag/svr/task_executor.py` \| Build `chunk_layer` mapping, set `raptor_layer_int` \| ~12 \| \| `conf/infinity_mapping.json` \| Add `raptor_layer_int` integer field \| ~1 \| ### Backward compatibility - Additive only — no existing fields or behavior changed - Existing RAPTOR chunks continue to work (they'll have `raptor_layer_int = 0` by default) - New RAPTOR chunks get layer metadata automatically ## Test plan - [ ] Parse a document with RAPTOR enabled, verify `raptor_layer_int` is set on indexed chunks - [ ] Verify `raptor_layer_int` values increase with abstraction level (layer 1 < layer 2 < ...) 
- [ ] Verify existing RAPTOR deletion (`delete by raptor_kwd`) still works - [ ] Verify Infinity backend accepts the new field Fixes #7488 Related: #4104, #11191, #10951 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 10:20:46 +08:00
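One way to picture the `chunk_layer` mapping mentioned in the commit above is shown below. This sketch assumes that `layers` holds half-open `(start_idx, end_idx)` ranges into the flat chunk list and that position 0 covers the original leaf chunks; neither detail is confirmed by the message, and `build_chunk_layer` is a hypothetical name.

```python
def build_chunk_layer(layers):
    """Map each chunk index to the layer whose range contains it.

    Assumes half-open (start, end) ranges, with layer 0 = original chunks.
    """
    chunk_layer = {}
    for layer_no, (start, end) in enumerate(layers):
        for idx in range(start, end):
            chunk_layer[idx] = layer_no
    return chunk_layer


# 4 leaf chunks, 2 first-level summaries, 1 summary-of-summaries.
layers = [(0, 4), (4, 6), (6, 7)]
chunk_layer = build_chunk_layer(layers)

# A summary chunk would then carry its layer number, e.g. as raptor_layer_int.
summary_chunk = {"content": "top-level summary", "raptor_layer_int": chunk_layer[6]}
print(summary_chunk)
```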
wdeveloper16	78188ce9e9	Feat: add OpenDataLoader PDF parser backend (#14058 ) (#14097 ) ### What problem does this PR solve? Closes #14058. RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU, Docling, TCADP, PaddleOCR). This PR adds OpenDataLoader ([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)) as a new optional backend, giving users a deterministic, local-first alternative with competitive table extraction accuracy. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --- ### Changes #### Backend - `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser` class inheriting `RAGFlowPdfParser`. Implements `check_installation()` (guards Python package + Java 11+ runtime), `parse_pdf()` with JSON-first extraction (heading/paragraph/table/list/image/formula) and Markdown fallback, position-tag generation compatible with the shared `@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup. - `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in `PARSERS` dict, added to `chunk_token_num=0` override list. - `rag/flow/parser/parser.py` — `"opendataloader"` branch in the pipeline PDF handler + check validation list. #### Infrastructure - `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH. #### Frontend - `web/src/components/layout-recognize-form-field.tsx` — `OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown. Cascades automatically to the pipeline editor's Parser component. #### Docs - `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader entry and full env-var reference. 
--- ### Environment variables \| Variable \| Default \| Description \| \|---\|---\|---\| \| `USE_OPENDATALOADER` \| `false` \| Set `true` to install `opendataloader-pdf` on container startup \| \| `OPENDATALOADER_VERSION` \| latest \| Pin the PyPI release (e.g. `==2.2.1`) \| \| `OPENDATALOADER_HYBRID` \| _(unset)_ \| Enable hybrid AI mode (e.g. `docling-fast`) \| \| `OPENDATALOADER_IMAGE_OUTPUT` \| _(unset)_ \| `off` / `embedded` / `external` \| \| `OPENDATALOADER_OUTPUT_DIR` \| _(tmp)_ \| Persistent output dir; temp dir used + cleaned if unset \| \| `OPENDATALOADER_DELETE_OUTPUT` \| `1` \| `0` to retain intermediate files for debugging \| \| `OPENDATALOADER_SANITIZE` \| _(unset)_ \| `1` to filter prompt-injection patterns from output \| --- ### Dependencies - Runtime: `opendataloader-pdf` (PyPI, Apache 2.0) — opt-in, not added to `pyproject.toml` core deps. Installed by `ensure_opendataloader()` at container startup when `USE_OPENDATALOADER=true`. - System: Java 11+ on PATH (JVM is the underlying engine). The installer skips with a warning if `java` is not found. --- ### How to test Standalone parser: ```bash source .venv/bin/activate uv pip install opendataloader-pdf python3 -c " import sys; sys.path.insert(0, '.') from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser p = OpenDataLoaderParser() print('available:', p.check_installation()) s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline') print(f'sections={len(s)} tables={len(t)}') " ``` ### Benchmark vs Docling ``` file parser secs sections tables ---------------------------------------------------------------------- text-heavy.pdf docling 45.29 148 10 text-heavy.pdf opendataloader 3.14 559 0 table-heavy.pdf docling 7.05 76 3 table-heavy.pdf opendataloader 3.71 90 0 complex.pdf docling 42.67 114 8 complex.pdf opendataloader 3.51 180 0 ```	2026-04-25 00:33:02 +08:00
Lynn	e22cf333ed	Fix: allow search id or _id (#14356 ) ### What problem does this PR solve? Allow search id or _id when using es as doc_engine. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-24 21:38:19 +08:00
Magicbook1108	25089600d0	Feat: introduce minimum type check for pipeline (#14354 ) ### What problem does this PR solve? Feat: introduce minimum type check for pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-24 21:12:50 +08:00
Idriss Sbaaoui	ca01c7a745	Fix blob sync: skip unsupported files before download (#14357 ) ### What problem does this PR solve? Blob storage sync was downloading unsupported files first and rejecting them later, which wasted bandwidth and made sync slower. This PR skips unsupported extensions before download and applies `allow_images` in blob sync. fixes #14338 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-24 19:22:32 +08:00
qinling0210	1473000135	Implement retrieval_test in GO (#14231 ) ### What problem does this PR solve? Implement retrieval_test in GO ### Type of change - [x] Refactoring	2026-04-24 15:30:14 +08:00
newyangyang	d84438fd53	fix azure blob put method param (#14329 ) ### What problem does this PR solve? When Azure Blob is used as the file container, clicking to parse a file calls: ```python partial(settings.STORAGE_IMPL.put, tenant_id=task["tenant_id"]) ``` So any storage backend used there must accept `tenant_id` as a keyword argument. `RAGFlowAzureSasBlob.put()` did not, causing: ``` TypeError: ... got an unexpected keyword argument 'tenant_id' ``` Now it does, so parsing should proceed past this point. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-23 20:40:54 +08:00
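The failure mode in the commit above is easy to reproduce with `functools.partial`. The two storage classes below are hypothetical stand-ins (they are not `RAGFlowAzureSasBlob`), and the `put` signature is assumed for the example.

```python
from functools import partial


class LegacyStorage:
    """Hypothetical backend whose put() does not accept tenant_id."""

    def put(self, bucket, name, data):
        return f"{bucket}/{name}"


class TenantAwareStorage:
    """Hypothetical backend that accepts (and here ignores) tenant_id."""

    def put(self, bucket, name, data, tenant_id=None):
        return f"{bucket}/{name}"


task = {"tenant_id": "t1"}

# Works: the bound keyword is accepted by the backend.
store = partial(TenantAwareStorage().put, tenant_id=task["tenant_id"])
result = store("kb", "doc.pdf", b"bytes")

# Fails only when called, because partial does not validate keywords up front.
try:
    partial(LegacyStorage().put, tenant_id=task["tenant_id"])("kb", "doc.pdf", b"bytes")
    legacy_error = None
except TypeError as exc:
    legacy_error = str(exc)

print(result)
print(legacy_error)
```

Because `partial` defers the call, the mismatch surfaces at parse time rather than at startup, which matches the symptom reported in the commit.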
Magicbook1108	75a5548b85	Feat: optimize title chunk (#14325 ) ### What problem does this PR solve? Feat: optimize title chunk 1. Add a new button to enable "Use root chunk as H0 heading", so that the first chunk is carried on to all remaining chunks. 2. Update resume agent template ### Type of change - [x] New Feature (non-breaking change which adds functionality) <img width="700" alt="img_v3_02111_63b04951-b3d7-4001-a08b-539db6d5298g" src="https://github.com/user-attachments/assets/4179ac4d-90e7-4353-9b93-d649a455e634" /> <img width="700" alt="image" src="https://github.com/user-attachments/assets/c0ba0f3c-05aa-4f2c-b418-e808ca1a2641" />	2026-04-23 18:55:55 +08:00
Wang Qi	224574831c	Add REDIS zcard (#14316 ) ### What problem does this PR solve? As description. ### Type of change - [x] Refactoring	2026-04-23 12:51:55 +08:00
NeedmeFordev	38e45a1117	Fix: serialize GraphRAG entity resolution merges to avoid graph mutation races (#14237 ) ### What problem does this PR solve? This PR fixes the merge-phase crash reported in #14236 during GraphRAG entity resolution. The issue happens after candidate pair resolution completes, when multiple merge coroutines mutate the same shared `networkx` graph concurrently. In `_merge_graph_nodes`, the code iterates over `graph.neighbors(node1)` and also awaits during edge/description merging. That allows another coroutine to modify the graph adjacency structure in between, which can trigger `RuntimeError: dictionary keys changed during iteration` and can also lead to unsafe shared-graph mutation. This change keeps the PR scoped to that single issue by: - serializing merge-time graph mutations with a dedicated merge lock - snapshotting `graph.neighbors(node1)` with `list(...)` before iteration Together, these changes prevent concurrent mutation of the shared graph during the merge phase and make the merge loop safe against live-view invalidation. Fixes #14236 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-22 16:42:53 +08:00
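The two measures named in the commit above (a dedicated merge lock and a `list(...)` snapshot of the neighbors) can be sketched with a plain adjacency dictionary instead of `networkx`. The graph shape, `merge_nodes`, and the node names are all hypothetical; the sketch only shows why the lock and the snapshot matter once the loop body awaits.

```python
import asyncio


async def merge_nodes(graph, lock, keep, drop):
    """Fold node `drop` into node `keep` in an adjacency dict of sets."""
    async with lock:  # only one merge mutates the graph at a time
        for neighbor in list(graph[drop]):  # snapshot, not a live view
            await asyncio.sleep(0)  # stands in for awaited merge work
            if neighbor != keep:
                graph[keep].add(neighbor)
                graph[neighbor].add(keep)
            graph[neighbor].discard(drop)
        del graph[drop]


async def main():
    graph = {
        "a": {"b", "c"},
        "b": {"a", "d"},
        "c": {"a", "d"},
        "d": {"b", "c"},
    }
    lock = asyncio.Lock()
    # Two merges touching the same nodes are scheduled concurrently.
    await asyncio.gather(
        merge_nodes(graph, lock, "a", "b"),
        merge_nodes(graph, lock, "a", "c"),
    )
    return graph


merged = asyncio.run(main())
print(merged)
```

Without the lock, the second coroutine could run during the first one's `await` and change the adjacency sets mid-merge; without the snapshot, mutating a set while iterating over it raises a `RuntimeError`.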
ucloudnb666	f853a39b40	feat: Add Astraflow provider support (global + China endpoints) (#14270 ) ## Add Astraflow Provider Support This PR integrates [Astraflow](https://astraflow.ucloud.cn/) (by UCloud / 优刻得) as a new AI model provider in RAGFlow, with support for both global and China endpoints. ### About Astraflow Astraflow is an OpenAI-compatible AI model aggregation platform supporting 200+ models from major providers including DeepSeek, Qwen, GPT, Claude, Gemini, Llama, Mistral, and more. \| Variant \| Factory Name \| Endpoint \| Env Var \| \|---------\|-------------\|----------\|---------\| \| Global \| `Astraflow` \| `https://api-us-ca.umodelverse.ai/v1` \| `ASTRAFLOW_API_KEY` \| \| China \| `Astraflow-CN` \| `https://api.modelverse.cn/v1` \| `ASTRAFLOW_CN_API_KEY` \| - API key signup: https://astraflow.ucloud.cn/ --- ### Files Changed \| File \| Change \| \|------\|--------\| \| `rag/llm/__init__.py` \| Register `Astraflow` and `Astraflow-CN` in `SupportedLiteLLMProvider` enum, `FACTORY_DEFAULT_BASE_URL`, and `LITELLM_PROVIDER_PREFIX` \| \| `rag/llm/chat_model.py` \| Add `AstraflowChat` and `AstraflowCNChat` (OpenAI-compatible `Base` subclass) \| \| `rag/llm/embedding_model.py` \| Add `AstraflowEmbed` and `AstraflowCNEmbed` (subclasses of `OpenAIEmbed`) \| \| `rag/llm/rerank_model.py` \| Add `AstraflowRerank` and `AstraflowCNRerank` (subclasses of `OpenAI_APIRerank`) \| \| `rag/llm/cv_model.py` \| Add `AstraflowCV` and `AstraflowCNCV` (subclasses of `GptV4`) \| \| `rag/llm/tts_model.py` \| Add `AstraflowTTS` and `AstraflowCNTTS` (subclasses of `OpenAITTS`) \| \| `rag/llm/sequence2txt_model.py` \| Add `AstraflowSeq2txt` and `AstraflowCNSeq2txt` (subclasses of `GPTSeq2txt`) \| \| `conf/llm_factories.json` \| Register `Astraflow` and `Astraflow-CN` factories with a curated list of popular models \| --- ### Supported Model Types - ✅ Chat / LLM — DeepSeek-V3/R1, Qwen3, GPT-4o/4.1, Claude 3.5/3.7, Gemini 2.0/2.5 Flash, Llama 3.3/4, Mistral, and 200+ more - ✅ 
Text Embedding — text-embedding-3-small/large - ✅ Image / Vision (IMAGE2TEXT) — GPT-4o, GPT-4.1, Claude, Gemini, Llama-4, etc. - ✅ Text Re-Rank - ✅ TTS — tts-1 - ✅ Speech-to-Text (SPEECH2TEXT) — whisper-1 ### Implementation Notes - Uses the `openai/` LiteLLM prefix — consistent with other OpenAI-compatible aggregation platforms (SILICONFLOW, DeerAPI, CometAPI, OpenRouter, n1n, Avian, etc.) - `Astraflow` (global, rank 250) and `Astraflow-CN` (China, rank 249) are separate factory entries, allowing users to choose the optimal endpoint based on their region. - All model classes cleanly subclass existing base classes (`Base`, `OpenAIEmbed`, `OpenAI_APIRerank`, `GptV4`, `OpenAITTS`, `GPTSeq2txt`) with no custom logic needed — the provider is fully OpenAI-compatible. --------- Co-authored-by: user <user@xzaaaMacBook-Air.local>	2026-04-22 15:38:34 +08:00
Lynn	afdf0814d7	Fix: get metadata conf (#14250 ) ### What problem does this PR solve? Get metadata configuration from union of custom metadata and built_in_metadata. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-21 17:22:42 +08:00
Liu An	6e33d8722f	Revert "Fix: forwarding highlight param" (#14249 ) Reverts infiniflow/ragflow#14112	2026-04-21 15:23:18 +08:00
Magicbook1108	b3891ba6a4	Fix audio/video in pipeline (#14241 ) ### What problem does this PR solve? Fix audio/video in pipeline ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-21 12:17:57 +08:00
Wang Qi	8aab158942	OpenSource Resume is supported only with Elasticsearch. (#14233 ) ### What problem does this PR solve? OpenSource Resume is supported only with Elasticsearch. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-21 10:05:47 +08:00
Magicbook1108	19eedeec61	Fix: accept empty value as 0 chunk (#14220 ) ### What problem does this PR solve? Fix: accept empty value as 0 chunk ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-20 12:53:47 +08:00
rhinoceros.xn	4e992de91f	Add tongyi gte-rerank-v2 (#14215 ) https://bailian.console.aliyun.com/cn-beijing?tab=api#/api/?type=model&url=2780056 ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Other (please describe): add gte-rerank-v2 and qwen3-rerank	2026-04-20 11:39:17 +08:00
Daniil Sivak	22c6648348	Fix: forwarding highlight param (#14112) Closes #9078 ### What problem does this PR solve? The `retrieval_test` endpoint in `chunk_app.py` never forwarded the `highlight` request parameter to `retriever.retrieval()`, so the search engine never produced highlight snippets. Additionally, the frontend always rendered `content_with_weight` instead of preferring the `highlight` field, and the CSS rule color `var(--accent-primary)` didn't work because the variable stores an RGB triplet `(45,212,191)` requiring the `rgb()` wrapper. ### Before - Search page: displayed raw `content_with_weight` as a wall of plain white text with no term highlighting, including markdown headings rendered as literal text - Retrieval testing page: showed `content_with_weight` in a plain `<p>` tag, no `<em>` tags rendered, no highlight coloring - Children chunks: when child chunks were consolidated into a parent via `retrieval_by_children`, any highlight data from children was discarded - TOC chunks: chunks fetched via `retrieval_by_toc` had no `highlight` field, appearing as plain text while other chunks had highlights Retrieval testing: [screenshot: before-retrieval-no-highlight-cropped] Search: [screenshot: before-search-no-highlight-cropped] ### After - Search page: displays the `highlight` field with search terms rendered in teal/cyan color (`rgb(var(--accent-primary))`) - Retrieval testing page: sends `highlight: true` in the request, uses `HighLightMarkdown` component to render `<em>` tags with proper coloring - Children chunks: highlights from child chunks are joined and preserved on the parent - TOC chunks: when other chunks have highlights, TOC-fetched chunks use `content_with_weight` as a highlight fallback Retrieval testing: [screenshot: 05-retrieval-testing-results] Search: [screenshot: 03-search-highlight-results] ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-17 20:59:20 +08:00
Yongteng Lei	fac46ef67f	Refa: change Minimax base URL to mainland by default to align with UI (#14195) ### What problem does this PR solve? Change the Minimax base URL to the mainland endpoint by default to align with the UI. ### Type of change - [x] Refactoring	2026-04-17 19:08:57 +08:00
euvre	0cd49e14dd	fix: make Infinity connection pool size configurable and add retry logic for GraphRAG write bursts (#14143) ### What problem does this PR solve? Resolve #14137. ### Problem Graph resolution succeeds (nodes/edges merged, pagerank updated), but the subsequent burst of Infinity write operations in `set_graph` exhausts the connection pool with `TOO_MANY_CONNECTIONS` errors. Root causes: 1. Hardcoded pool size: `infinity_conn_pool.py` hardcoded `ConnectionPool(max_size=4)` on initial creation and `max_size=32` on refresh. Operators cannot tune this without patching code. 2. No retry on transient failures: a single `TOO_MANY_CONNECTIONS` on edge deletes or chunk inserts kills the entire resolution+community pipeline with no retry. ### Changes #### `common/doc_store/infinity_conn_pool.py` - Read `ConnectionPool` `max_size` from the `INFINITY_POOL_MAX_SIZE` environment variable (default: `4`), applied consistently to both initial creation and refresh paths. - Log the actual pool size on startup for easier debugging. #### `rag/graphrag/utils.py`, `set_graph()` - Edge deletes: add exponential-backoff retry (3 attempts, 1s/2s/4s delays) so transient `TOO_MANY_CONNECTIONS` errors are retried instead of failing the entire job. Concurrency continues to be gated by the existing `chat_limiter`. - Batch inserts: add exponential-backoff retry (3 attempts, 1s/2s/4s delays) for the same reason. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-16 15:40:54 +08:00
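The two changes described in the entry above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `pool_max_size` and `retry_with_backoff` are hypothetical, the string check for `TOO_MANY_CONNECTIONS` is an assumption about how a transient error might be recognised, and "3 attempts, 1s/2s/4s delays" is read here as up to three retries after the first call.

```python
import os
import time


def pool_max_size(default=4):
    """Read the pool size from INFINITY_POOL_MAX_SIZE, falling back to the default."""
    raw = os.environ.get("INFINITY_POOL_MAX_SIZE", "")
    try:
        size = int(raw)
    except ValueError:
        return default
    return size if size > 0 else default


def retry_with_backoff(op, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call op(); on a transient error wait 1s, 2s, 4s, ... and try again.

    Assumption: an error counts as transient when its message contains
    TOO_MANY_CONNECTIONS. Any other error is re-raised immediately.
    """
    attempt = 0
    while True:
        try:
            return op()
        except Exception as exc:
            if attempt >= retries or "TOO_MANY_CONNECTIONS" not in str(exc):
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real waiting; a caller would wrap each edge delete or batch insert in a small function and hand it to `retry_with_backoff`.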
Qi Wang	969ce3a79f	[Bug fix #14133] Fix GraphRAG, RAPTOR, and mindmap logs not displaying correctly in the UI (#14136) ### What problem does this PR solve? Fix #14133: knowledge graph, RAPTOR, and mindmap logs are not displayed correctly in the UI. [screenshot] After fix: [screenshot] [screenshot] ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-16 13:08:36 +08:00
Magicbook1108	944a90d645	Feat: add button to turn off VLM parsing (#14125) ### What problem does this PR solve? Feat: add button to turn off VLM parsing ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: chanx <1243304602@qq.com>	2026-04-15 19:06:00 +08:00
Magicbook1108	d51789e2be	Feat: update templates && add resume template (#14124) ### What problem does this PR solve? Feat: update templates && add resume template ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-15 18:42:29 +08:00
Minal Mahala	f930389311	Refact: improve task resume mechanism for graphrag (#14096) ### What problem does this PR solve? Addresses review feedback on #14074 (Checkpoint mechanism for long-running workflow jobs, issue #12494). Changes based on @yuzhichang's review: 1. Renamed `checkpoint_service.py` → `task_checkpoint.py` as suggested. 2. Replaced Redis with direct docEngine queries as suggested: the subgraph already gets persisted to the doc store by `generate_subgraph()`, so we just query for it instead of maintaining a separate checkpoint in Redis. This is simpler, has no extra dependency, and uses a single source of truth. Changes based on CodeRabbit review: 3. Fixed `source_id` query format mismatch: subgraphs are stored with `source_id: [doc_id]` (list), but the original query used `source_id: doc_id` (string). Now follows the same pattern as `does_graph_contains()` in `rag/graphrag/utils.py`: filter by `knowledge_graph_kwd` only, then match `source_id` in Python. This avoids ambiguity across Elasticsearch / Infinity / OceanBase backends. ### Changes \| File \| Change \| \|---\|---\| \| `api/db/services/task_checkpoint.py` (new) \| `load_subgraph_from_store()` and `has_raptor_chunks()`: docEngine-based checkpoint queries \| \| `rag/graphrag/general/index.py` \| `build_one()` calls `load_subgraph_from_store()` before running LLM extraction \| \| `rag/svr/task_executor.py` \| RAPTOR per-doc loop calls `has_raptor_chunks()` before processing \| \| `test/unit_test/rag/graphrag/test_checkpoint_resume.py` (new) \| 10 unit tests covering subgraph loading, source_id filtering, edge cases \| ### How it works - GraphRAG: Before running expensive LLM entity/relation extraction for a doc, checks the doc store for an existing subgraph (saved by a previous interrupted run). If found, loads it directly and skips LLM calls. - RAPTOR: Before processing a doc, checks if RAPTOR chunks (`raptor_kwd="raptor"`) already exist for it. If yes, skips.
### Testing - 10 new unit tests — all passing - Full existing suite: 617 passed ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-04-15 17:37:28 +08:00
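The `source_id` matching step from the entry above might look something like the sketch below. It is illustrative only: `find_subgraph` is a hypothetical name, and `rows` stands in for whatever the doc store returns once results have already been filtered by `knowledge_graph_kwd`. The point is that matching in Python can accept both the list form (`[doc_id]`) and a plain string without relying on backend-specific query semantics.

```python
def find_subgraph(rows, doc_id):
    """Return the first stored row whose source_id refers to doc_id, else None.

    rows: iterable of dicts already filtered by knowledge_graph_kwd
    (a stand-in for the doc store result; the real shape is an assumption).
    """
    for row in rows:
        source = row.get("source_id")
        if isinstance(source, str):
            # Wrap a bare string so "in" is an exact match, not a substring test.
            source = [source]
        if isinstance(source, (list, tuple)) and doc_id in source:
            return row
    return None
```

When a row is found, the expensive extraction step for that document can be skipped; when `None` comes back, processing proceeds as usual.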
Ea001	38cefd88e2	Fix tag_feas code injection in retrieval ranking (#13923) ## Summary - remove eval-based parsing from retrieval rank feature scoring - validate `tag_feas` at write time in chunk APIs and SDK routes - add regression tests for safe parsing and malicious payload rejection ## Details `tag_feas` is intended to be structured rank-feature data, but the retrieval ranking path was evaluating stored values as Python expressions. This change treats `tag_feas` strictly as data. ### What changed - replace `eval()` in `rag/nlp/search.py` with safe parsing via `json.loads()` and optional `ast.literal_eval()` compatibility for legacy Python-dict strings - strictly filter parsed values down to `dict[str, finite number]` - reject invalid `tag_feas` payloads at write time in web chunk routes and SDK document chunk routes - add focused regression tests to prove executable strings are ignored and invalid payloads are rejected ## Validation - `python -m pytest test/unit_test/common/test_tag_feature_utils.py test/unit_test/rag/test_rank_feature_scores.py -q` --------- Co-authored-by: unknown <zhenglinkai@CCN.Local> Co-authored-by: Yingfeng Zhang <yingfeng.zhang@gmail.com>	2026-04-15 16:31:11 +08:00
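A rough sketch of the parsing approach described in the entry above, for readers who want to see the shape of it. The function name `parse_tag_feas` is hypothetical and the details (for example, dropping booleans and returning an empty dict on failure) are assumptions, not necessarily what the PR does. It tries `json.loads()` first, falls back to `ast.literal_eval()` for legacy Python-dict strings, and keeps only string keys with finite numeric values.

```python
import ast
import json
import math


def parse_tag_feas(raw):
    """Parse tag_feas as data only; return a dict[str, finite number]."""
    if isinstance(raw, dict):
        parsed = raw
    elif isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except ValueError:
            try:
                # literal_eval accepts only Python literals, never calls or names.
                parsed = ast.literal_eval(raw)
            except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
                return {}
    else:
        return {}
    if not isinstance(parsed, dict):
        return {}

    clean = {}
    for key, value in parsed.items():
        if not isinstance(key, str):
            continue
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        try:
            if not math.isfinite(value):
                continue
        except OverflowError:
            # Integers too large to convert to float are dropped.
            continue
        clean[key] = value
    return clean
```

A string such as `"__import__('os').system('echo hi')"` is neither valid JSON nor a Python literal, so it yields an empty dict instead of being executed.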
NeedmeFordev	1a1b5aa53e	Fix: respect the internet toggle before running Tavily web search (#14051) (#14052) ### What problem does this PR solve? Fixes #14051. The chat UI already sends an `internet` flag with each request, but the backend previously triggered Tavily web retrieval whenever `prompt_config.tavily_api_key` was configured. As a result, web search could still run even when the internet toggle was off. This PR makes web search an explicit opt-in at request time: - `tavily_api_key` only indicates that web search is available - Tavily retrieval runs only when `internet` is explicitly enabled - the same behavior now applies to both the normal retrieval path and the deep-research / reasoning path This also fixes the no-KB fallback case so chats without KBs fall back to normal solo chat when `internet` is off. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-14 19:55:20 +08:00
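The opt-in rule from the entry above boils down to a two-part condition, sketched below. `should_run_web_search` is a hypothetical helper and `prompt_config` is modelled as a plain dict for illustration. Treating only a literal `True` as "explicitly enabled" is an assumption about how strict the check should be.

```python
def should_run_web_search(prompt_config, internet):
    """Web search needs both a configured key and an explicit per-request opt-in."""
    has_key = bool(prompt_config.get("tavily_api_key"))
    # A configured key means web search is available, not that it is requested.
    return has_key and internet is True
```

With this shape, a missing or falsy `internet` flag never triggers web retrieval, regardless of whether a key is configured.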
Idriss Sbaaoui	de6a8e789a	Fix: rerank overflow by enforcing top_k and 64 cap (#14084) ### What problem does this PR solve? This fixes rerank overflow where retrieval could send more documents than allowed (for example 66 when `page_size=6`), causing provider 400 errors and bypassing the user’s `top_k` intent in rerank-enabled paths. This PR fixes #14081. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-14 10:47:25 +08:00
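As a minimal sketch of the clamping idea in the entry above (not the PR's code): the number of documents handed to the reranker is bounded by both the user's `top_k` and a hard cap of 64. The names `MAX_RERANK_DOCS`, `rerank_limit`, and `select_for_rerank` are hypothetical, and returning an empty selection for a non-positive `top_k` is an assumption.

```python
MAX_RERANK_DOCS = 64  # hard cap mentioned in the commit title


def rerank_limit(top_k, cap=MAX_RERANK_DOCS):
    """Largest number of documents that may be sent to the reranker."""
    return max(0, min(top_k, cap))


def select_for_rerank(candidates, top_k):
    """Trim the candidate list so it respects both top_k and the cap."""
    return candidates[:rerank_limit(top_k)]
```

For example, 66 candidates with a large `top_k` would be trimmed to 64 before the provider call, while a `top_k` of 10 keeps only the first 10.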
Tong Liu	6fdca2d212	[Security] Fix jinja2 SSTI vulnerability using SandboxedEnvironment (#14068)	2026-04-13 19:24:13 +08:00
Zhichang Yu	a9ca4ea1a1	Disable flask and quart debug (#14042) ### What problem does this PR solve? Visiting `http://127.0.0.1:9381/?__debugger__=yes&cmd=resource&f=debugger.js` exposes the Flask debugger code: ``` docReady(() => { if (!EVALEX_TRUSTED) { initPinBox(); } // if we are in console mode, show the console. if (CONSOLE_MODE && EVALEX) { createInteractiveConsole(); } const frames = document.querySelectorAll("div.traceback div.frame"); if (EVALEX) { addConsoleIconToFrames(frames); } addEventListenersToElements(document.querySelectorAll("div.detail"), "click", () => document.querySelector("div.traceback").scrollIntoView(false) ); addToggleFrameTraceback(frames); addToggleTraceTypesOnClick(document.querySelectorAll("h2.traceback")); addInfoPrompt(document.querySelectorAll("span.nojavascript")); wrapPlainTraceback(); }); function addToggleFrameTraceback(frames) { frames.forEach((frame) => { frame.addEventListener("click", () => { frame.getElementsByTagName("pre")[0].parentElement.classList.toggle("expanded"); }); }) } ``` ### Type of change - [x] Other (please describe): Fix security risk	2026-04-10 18:01:49 +08:00
Magicbook1108	18cafff790	Fix: markdown parser in pipeline (#14032) ### What problem does this PR solve? Fix: markdown parser in pipeline ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-10 14:11:14 +08:00
Magicbook1108	87a87a7122	Feat: pipeline support ONE chunking method (#14024) ### What problem does this PR solve? Feat: pipeline support ONE chunking method ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-04-10 13:11:22 +08:00

1550 Commits