ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-04 01:29:35 +08:00

Author	SHA1	Message	Date
Lynn	794c1f4b25	Fix: volc engine and other json key factories (#15653 ) ### What problem does this PR solve? Fix: - VolcEngine adapt to new api_key format - Save dict api_key as json ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-05 09:45:44 +08:00
web-dev0521	98f2a2e60b	feat(connectors): add Azure Blob Storage data source connector (#15466 ) ### What problem does this PR solve? Closes #15465. RAGFlow supports S3, Google Cloud Storage, R2, and OCI as data sources but not Azure Blob Storage, leaving Azure users without a way to index container objects into a knowledge base. This adds a first-class Azure Blob Storage data-source connector — distinct from RAGFlow's existing Azure storage backends (`rag/utils/azure_sas_conn.py`, `rag/utils/azure_spn_conn.py`) which store RAGFlow's own files. Highlights - `common/data_source/azure_blob_connector.py`: new `AzureBlobConnector` (`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`). - Uses the existing `azure-storage-blob` dependency (already in `pyproject.toml`). - Three auth modes, tried in order of precedence: 1. Account key — `account_name` + `account_key` + `container_name`. 2. Connection string — `connection_string` + `container_name`. 3. SAS token — `container_url` + `sas_token` (same shape as `RAGFlowAzureSasBlob`). - ETag fingerprint stored per blob in `AzureBlobCheckpoint.etags` — unchanged blobs (same ETag as last run) are skipped without a download. Only new/modified blobs are fetched. - Optional `prefix` scopes indexing to a virtual folder. - `validate_connector_settings()` probes `get_container_properties()` and maps `AuthenticationFailed / 403 / ContainerNotFound` to typed connector exceptions. - Slim-doc IDs are blob names so prune reconciles correctly. - `common/constants.py`, `common/data_source/config.py`, `common/data_source/__init__.py`: register `azure_blob` in `FileSource` / `DocumentSource` and export `AzureBlobConnector`. - `rag/svr/sync_data_source.py`: new `AzureBlob(SyncBase)` class routed through `load_from_checkpoint` (ETag fingerprint owns change-detection) and added to `func_factory`. 
- Frontend: - `web/src/pages/user-setting/data-source/constant/index.tsx`: new `DataSourceKey.AZURE_BLOB`, auth-mode selector (account key / connection string / SAS token), all credential fields, prefix + batch-size, `syncDeletedFiles` capability, default form values, tile entry with icon. - `web/src/locales/{en,zh}.ts`: description + per-field tooltips for all 9 new keys. - `web/src/assets/svg/data-source/azure-blob.svg`: Azure-branded stacked-cylinders icon. Verification - `npm run build` (vite + esbuild) passes (37 s). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-04 21:06:01 +08:00
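The ETag-based change detection described in the Azure Blob commit above can be sketched as follows. This is a minimal illustration, not the connector's code: `EtagCheckpoint` and `select_changed_blobs` are hypothetical names, and the listing is assumed to arrive as plain `(blob_name, etag)` pairs rather than SDK objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class EtagCheckpoint:
    """Hypothetical stand-in for AzureBlobCheckpoint: blob name -> last seen ETag."""
    etags: Dict[str, str] = field(default_factory=dict)


def select_changed_blobs(
    listing: Iterable[Tuple[str, str]], checkpoint: EtagCheckpoint
) -> List[str]:
    """Return the names of new or modified blobs and record their ETags."""
    changed = []
    for name, etag in listing:
        if checkpoint.etags.get(name) == etag:
            continue  # same ETag as the last run: skip without a download
        changed.append(name)
        checkpoint.etags[name] = etag
    return changed
```

On a second run with the same checkpoint, only blobs whose ETag changed are returned, so unchanged blobs cost a listing call but no download.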
Jack	b363146997	refactor: overhaul task executor with layered architecture and comprehensive test suite (#15471 ) ## Summary Decomposes the monolithic `task_executor.py` (1945 lines) into a 6-layer architecture with clear separation of concerns. The refactored code is functionally equivalent to the original, verified through 400 passing tests and a production-vs-dry-run comparison framework. ## Architecture ``` entry (task_manager) └─ orchestration (task_handler) ├─ services (chunk_service, embedding_service, dataflow_service, raptor_service, post_processor) │ └─ utilities (chunk_builder, chunk_post_processor, embedding_utils) └─ infrastructure (task_context, recording_context, interceptor) ``` Key design decisions: - TaskContext: typed facade over the raw task dict; injects rate limiters and callbacks via composition - RecordingContext + Comparator: enables side-by-side production vs dry-run execution for safe migration - NullRecordingContext: zero-allocation no-op for production, uses `__slots__` - WriteOperationInterceptor: FIFO replay of the previous run's function return values for comparison mode ## Migration Strategy The original `handle_task()` in `task_executor.py` uses a 3-way switch via `TE_RUN_MODE`: - `TE_RUN_MODE=0` (default) → runs refactored code - `TE_RUN_MODE=1` → runs both original + refactored, compares all intermediate results - `TE_RUN_MODE=2` → runs original code (fallback) The comparison mode (`TE_RUN_MODE=1`) records ~40 intermediate values (chunks, vectors, token counts, function return values) from the production run and replays them during the dry run, then uses `ContextComparator` to report mismatches. 
## Functional Equivalence Fixes All divergences between original and refactored code were identified and fixed: - Timeout decorators (handle/build_chunks/raptor/embedding) - NullRecordingContext leak in finally block causing RuntimeError - MinIO None-binary check with proper FileNotFoundError - Dataflow dispatch after embedding binding + init_kb - Memory task missing return after processing - RAPTOR checkpoint progress reporting - Tag cache (get_tags_from_cache/set_tags_to_cache) restoration - dataflow_id correction in _load_dsl - Language default Chinese, dead code guard removal - embed_chunks made async with proper thread_pool_exec - Full GraphRAG default configuration (10 parameters) - Hardcoded q_768_vec fallback removal in RAPTOR ## Test Changes - 20 new tests covering table parser manual mode, tag cache, embedding edge cases, RAPTOR checkpoint, dataflow_id correction, storage binary None, cancel cleanup, metadata=None boundary - Unified `make_task_context`/`make_task_dict` factories eliminated 10+ duplicated helpers - DataflowService tests migrated from internal method mocks to IO boundary mocks (real orchestration code executes) - Parametrized duplicate build_chunks post-processor tests - 7 raptor tests modernized to @pytest.mark.asyncio - Mock count per test reduced through boundary-level mocking strategy Test count: 400 passing, 0 warnings, 0 skips ## Files Changed \| File \| Change \| \|------\|--------\| \| `rag/svr/task_executor.py` \| +1 line (NullRecordingContext fix) \| \| `rag/svr/task_executor_refactor/task_handler.py` \| Orchestration layer, 8 logic fixes \| \| `rag/svr/task_executor_refactor/chunk_service.py` \| +timeout + None-check \| \| `rag/svr/task_executor_refactor/embedding_service.py` \| sync→async rewrite \| \| `rag/svr/task_executor_refactor/dataflow_service.py` \| dataflow_id fix + timeout \| \| `rag/svr/task_executor_refactor/raptor_service.py` \| checkpoint fix + assert \| \| `rag/svr/task_executor_refactor/chunk_post_processor.py` 
\| tag cache restore \| \| `rag/svr/task_executor_refactor/task_context.py` \| language default fix \| \| `test/.../conftest.py` \| +294 lines shared helpers \| \| `test/.../*.py` \| 15 test files refactored, 20 new tests \| --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 17:18:31 +08:00
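The FIFO replay idea attributed to `WriteOperationInterceptor` in the commit above can be illustrated with a small sketch. It assumes return values are keyed by function name. `ReplayInterceptor`, `record`, and `replay` are hypothetical names chosen for this example and are not taken from the PR.

```python
from collections import defaultdict, deque


class ReplayInterceptor:
    """Hypothetical stand-in: records return values per function, replays them in FIFO order."""

    def __init__(self):
        self._returns = defaultdict(deque)

    def record(self, func_name, value):
        # Production run: remember what the real write operation returned.
        self._returns[func_name].append(value)
        return value

    def replay(self, func_name):
        # Dry run: hand back the oldest recorded value instead of writing again.
        queue = self._returns.get(func_name)
        if not queue:
            raise LookupError(f"no recorded return value left for {func_name!r}")
        return queue.popleft()
```

Replaying in the order the values were recorded lets a dry run observe the same results as the production run without repeating its side effects.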
VictorECDSA	ff5971448b	[Fix] naive: force-merge short markdown headers to prevent separate chunks (#15488 ) ## Problem When uploading `.md` files with `parser=naive` and `delimiter="\n"`, markdown headers (e.g., `## Quick Travel`) become separate chunks with very short content (16-18 characters). This causes retrieval issues: when the header is matched, the corresponding body text is not included in the chunk. ## Related Issues Closes #15487 ## Checklist - [x] Code changes are minimal and focused - [x] Unit tests added (12/12 passed) - [x] No breaking changes	2026-06-03 10:49:28 +08:00
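A minimal sketch of the header force-merge described in the commit above, assuming chunks are plain strings produced by splitting on `"\n"`. The function name and the `max_header_len` threshold are assumptions made for illustration; the PR's actual merge rule is not shown in this log.

```python
import re
from typing import List

# A single-line ATX markdown header such as "## Quick Travel".
HEADER_RE = re.compile(r"^#{1,6}\s+\S")


def merge_short_headers(chunks: List[str], max_header_len: int = 64) -> List[str]:
    """Attach short header-only chunks to the chunk that follows them."""
    merged, pending = [], []
    for chunk in chunks:
        text = chunk.strip()
        is_short_header = (
            HEADER_RE.match(text) and "\n" not in text and len(text) <= max_header_len
        )
        if is_short_header:
            pending.append(text)  # hold the header until its body arrives
            continue
        if pending:
            chunk = "\n".join(pending + [chunk])
            pending = []
        merged.append(chunk)
    if pending:
        merged.append("\n".join(pending))  # trailing headers with no body are kept
    return merged
```

With this rule a match on the header text retrieves the body as well, because both live in the same chunk.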
Wang Qi	d41373cfa9	Feature: Add the new anthropic and voyage models (#15516 ) Add the new Anthropic and Voyage models. Strip certain unsupported keys from Opus 4.7 and 4.8. Co-authored-by: Idriss Sbaaoui <112825897+6ba3i@users.noreply.github.com>	2026-06-02 17:29:18 +08:00
Aeovy	600590cd18	Fix: disable thinking to avoid potential infinite loops in Qwen3.5/Qwen3.6 models (#15101 ) ### What problem does this PR solve? This PR fixes the issue where Qwen3.5/Qwen3.6 series models may spend excessive time on simple document-parsing tasks, such as Auto Metadata extraction, keyword extraction, question generation, and image description when using the MinerU parser. For these tasks, Qwen3.5/Qwen3.6 models may perform unnecessary reasoning by default, which can lead to very long response times, high token consumption, and, in some cases, potential infinite output loops. Since Qwen3.5/Qwen3.6 multimodal models are instantiated as `CvModel` when configured as `image2text`, the existing `enable_thinking=False` logic in `chat_model.py` does not apply to them. This PR adds the corresponding handling for the CV/image-to-text model path as well. This helps reduce unnecessary thinking time, avoid potential infinite loops, and improve parsing efficiency without noticeably affecting output quality for these simple extraction and image-description tasks. Fixes #15083.	2026-06-02 13:21:35 +08:00
kpdev	a4bc066f74	fix(rag): id2image parsing for hyphenated storage object keys (#15117 ) (#15118 ) ### What problem does this PR solve? Fixes #15117. Chunk images are stored with `img_id = f"{bucket}-{objname}"` in `image2id()` (`rag/utils/base64_image.py`). When loading via `id2image()`, the code used `image_id.split("-")` and required exactly two segments. Object keys that contain hyphens (e.g. `page-1.jpg`) produce more than two segments, so `id2image` returns `None` and chunk image previews fail even though the blob exists. This is the same parsing issue as #15115 (HTTP thumbnail route); this PR fixes the indexing/retrieval path. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Test plan - [x] `pytest test/unit_test/rag/utils/test_base64_image.py` - [ ] Manual: index a chunk with an `objname` containing hyphens and confirm `img_id` resolves to an image in retrieval Fixes #15117.	2026-06-02 10:52:51 +08:00
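The parsing problem in the commit above can be reproduced and avoided with a short sketch. It assumes the bucket name itself contains no hyphen, so the first hyphen is the separator. `make_image_id` and `parse_image_id` are hypothetical helpers, not the functions in `rag/utils/base64_image.py`, and the PR may resolve the ambiguity differently.

```python
from typing import Optional, Tuple


def make_image_id(bucket: str, objname: str) -> str:
    # Same shape as the img_id described above: f"{bucket}-{objname}".
    return f"{bucket}-{objname}"


def parse_image_id(image_id: str) -> Optional[Tuple[str, str]]:
    """Split on the first hyphen only, so hyphens inside the object key survive."""
    bucket, sep, objname = image_id.partition("-")
    if not sep or not bucket or not objname:
        return None
    return bucket, objname
```

A plain `image_id.split("-")` on `"kb123-page-1.jpg"` yields three segments, which is the case the original two-segment check rejected.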
nickmopen	bebf6ed244	fix(llm): strip non-generation keys from gen_conf for LiteLLM providers (#15427 ) (#15432 ) ### What problem does this PR solve? Fixes #15427. All LiteLLM-routed chats fail with: - Anthropic: `litellm.BadRequestError: AnthropicException - {"type":"invalid_request_error","message":"model_type: Extra inputs are not permitted"}` - OpenAI: `litellm.BadRequestError: OpenAIException - Unknown parameter: 'model_type'` This is a regression from v0.25.4. #### Root cause A chat assistant's `llm_setting` is forwarded to the model as `gen_conf`. `llm_setting` can legitimately carry RAGFlow-internal metadata such as `model_type` (the chat REST APIs in `api/apps/restful_apis/` read it back out of `llm_setting`), so that key ends up inside `gen_conf`. `Base._clean_conf` (OpenAI-compatible providers) already whitelists the keys it forwards, so direct-OpenAI providers were unaffected. `LiteLLMBase._clean_conf` only dropped `max_tokens` and passed everything else straight through to `litellm.acompletion`, which forwarded `model_type` to the upstream provider — and Anthropic / OpenAI reject it. Because both Claude and GPT route through LiteLLM, every chat broke. #### Fix - Extract the allowed-key set into a shared `ALLOWED_GEN_CONF_KEYS` constant and reuse it in `Base._clean_conf`. - Apply the same whitelist in `LiteLLMBase._clean_conf`, plus the LiteLLM-specific reasoning params (`thinking`, `reasoning_effort`, `extra_body`) that the model-family policies inject for reasoning models. This covers all four LiteLLM completion paths (`async_chat`, `async_chat_streamly`, `async_chat_with_tools`, `async_chat_streamly_with_tools`), since they all route through `_clean_conf`. #### Tests Adds `test/unit_test/rag/llm/test_clean_conf_whitelist.py` covering both backends: `model_type` (and other stray keys) are dropped, genuine generation params and `thinking` survive, `max_tokens` is removed, and the whitelist invariants hold. 
### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Added test cases	2026-06-02 10:04:11 +08:00
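The whitelist approach described in the commit above can be sketched as a simple key filter. The constant name `ALLOWED_GEN_CONF_KEYS` and the three LiteLLM reasoning parameters come from the commit message. The members listed in the base set here are illustrative assumptions, and `clean_conf` is a free function written for this example rather than the real `_clean_conf` method.

```python
# Illustrative members only; the real constant's contents are not shown in this log.
ALLOWED_GEN_CONF_KEYS = frozenset(
    {"temperature", "top_p", "presence_penalty", "frequency_penalty", "stop"}
)
# Reasoning parameters the commit message says are additionally allowed for LiteLLM.
LITELLM_EXTRA_KEYS = frozenset({"thinking", "reasoning_effort", "extra_body"})


def clean_conf(gen_conf: dict, extra_allowed: frozenset = frozenset()) -> dict:
    """Keep only whitelisted generation keys; everything else is dropped."""
    allowed = ALLOWED_GEN_CONF_KEYS | extra_allowed
    return {key: value for key, value in gen_conf.items() if key in allowed}
```

Because the filter is an allow-list rather than a deny-list, a stray key such as `model_type` never reaches the upstream provider, and `max_tokens` is dropped simply by not being listed.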
Wang Qi	1a6df01b53	Bug fix: Enhance embedding model to give better error message (#15346 ) Resolves https://github.com/infiniflow/ragflow/issues/15343 by enhancing the embedding model's error message so that the customer sees the exact failure message. # QWen ## Retrieval <img width="3321" height="1033" alt="image" src="https://github.com/user-attachments/assets/6b82921a-a3a7-4a33-a383-1cf316398ee2" /> ## Chat <img width="2241" height="311" alt="image" src="https://github.com/user-attachments/assets/ec311365-62d5-407a-8915-5c8d72be9716" /> # SiliconFlow ## Retrieval <img width="3321" height="1033" alt="image" src="https://github.com/user-attachments/assets/ee2cd191-a27d-4729-b53d-2fbdb4e352cd" /> ## Chat <img width="1562" height="210" alt="image" src="https://github.com/user-attachments/assets/10376a8e-a3f4-422f-bc2e-96f2a8a96448" /> # Baichuan ## Retrieval <img width="3321" height="1107" alt="image" src="https://github.com/user-attachments/assets/dcb5409d-f7fc-4804-b186-5e1ee11e09c4" /> ## Chat <img width="2241" height="311" alt="image" src="https://github.com/user-attachments/assets/ec311365-62d5-407a-8915-5c8d72be9716" /> # Zhipu Zhipu is good.	2026-06-01 19:18:16 +08:00
euvre	1e80419c21	fix: restore TitleChunker output for json/chunks upstream formats (#15396 ) fix: restore TitleChunker output for json/chunks upstream formats ## Summary The refactor commit `e194027b` (#14247) introduced two regressions that caused `TitleChunker` to produce zero chunks when the upstream Parser node outputs `json` or `chunks` format (e.g. PDF parsing). ## Root Cause ### 1. Dead code in `extract_line_records` (critical) After refactor, when `payload` is `None` (which is the case for `json` and `chunks` output formats), the method returns an empty list immediately via `return []`, so no records are ever extracted from structured upstream output. The original `json`/`chunks` handling code became unreachable dead code. ### 2. Unconditional overwrite in `build_chunks_from_record_groups` The `chunks` variable assigned in the `if` branch for markdown/text/html formats was unconditionally overwritten by the statement below it, due to a missing `else` keyword. ## Fix - Remove the premature `return []` so the `json`/`chunks` branch is reachable again. - Add `else` branch in `build_chunks_from_record_groups` so the two format families are handled independently. ## Test Plan - [x] Verified no lint errors on the changed file - [ ] Tested with a PDF document parsed via DeepDOC → TitleChunker pipeline - [ ] Tested with markdown input through TitleChunker - [ ] Tested hierarchy and group chunking modes ## Impact - Fixes the regression where documents parsed with `json`/`chunks` output format produced no chunks from `TitleChunker`. - No API or configuration changes. Fully backward compatible. Signed-off-by: noob <yixiao121314@outlook.com>	2026-06-01 17:14:22 +08:00
Wang Qi	10e8690890	GraphRAG - NER - spacy - fix spacy extraction (#14783 ) Fix spacy extraction	2026-06-01 13:05:54 +08:00
web-dev0521	cd18cfab79	feat(connector): implement Outlook data source connector (issue #15332 ) (#15333 ) ### What problem does this PR solve? Closes #15332. RAGFlow can index Gmail and generic IMAP mailboxes but had no native connector for Outlook / Microsoft 365 mail. Organisations on Microsoft 365 had no way to bring mailbox content into a knowledge base through Microsoft Graph. This PR adds a net-new Outlook data source that: - Authenticates against Microsoft Graph with the same MSAL client-credentials flow already used by the SharePoint and Teams connectors (no new auth primitives). - Pages over `/users/{id}/mailFolders/{folder}/messages/delta` per mailbox and persists `@odata.deltaLink` values in `OutlookCheckpoint.delta_links`, so incremental syncs only fetch changed messages. - Supports two scoping modes: - Tenant-wide (default): enumerates every user in the tenant via `/users` and syncs each mailbox. Requires `User.Read.All`. - Targeted: when `user_ids` is provided (comma-separated UPNs or object IDs), only those mailboxes are synced. `User.Read.All` is not needed in this mode. - Lets the caller pick the mail folder (`inbox`, `sentitems`, `archive`, ...). Defaults to `inbox`. - Maps each message to a `Document` shaped after the Gmail connector: one `TextSection` carrying `From/To/Cc/Subject` headers + body, with HTML bodies stripped to text inline (no extra dependency). - Surfaces typed errors on the validation probe: 401 → `ConnectorMissingCredentialError`, 403 → `InsufficientPermissionsError` (with `Mail.Read` / `User.Read.All` hint), 404 on a configured mailbox → `ConnectorValidationError`, 5xx → `UnexpectedValidationError`. - Skips messages flagged `@removed` by the delta semantics and messages whose `receivedDateTime` is older than `poll_range_start`. 
#### Files \| File \| Change \| \|------\|--------\| \| `common/data_source/outlook_connector.py` \| New — `OutlookConnector` (`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`) + `OutlookCheckpoint` + tiny `_strip_html` helper. \| \| `common/data_source/config.py` \| `DocumentSource.OUTLOOK = "outlook"`. \| \| `common/constants.py` \| `FileSource.OUTLOOK = "outlook"`. \| \| `common/data_source/__init__.py` \| Export `OutlookConnector`. \| \| `rag/svr/sync_data_source.py` \| `Outlook(SyncBase)` with `batch_size` normalisation, CSV/list parsing of `user_ids`; registered in `func_factory`. \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `DataSourceKey.OUTLOOK`, visibility map (`syncDeletedFiles: true`), info entry, form fields (tenant_id, client_id, client_secret, folder, user_ids, batch_size), default values. \| \| `web/src/locales/en.ts`, `web/src/locales/zh.ts` \| `outlookDescription` + 5 tooltip keys (EN + ZH). \| \| `test/unit_test/data_source/test_outlook_connector_unit.py` \| New — 19 unit tests (`p1`/`p2`/`p3`) covering auth, validation (tenant-wide vs specific user vs error paths), checkpoint helpers, user enumeration pagination, message filtering, HTML body stripping. \| #### Required Azure AD permissions - `Mail.Read` (Application, admin-granted) — always. - `User.Read.All` (Application, admin-granted) — only when `user_ids` is left blank so the connector can enumerate mailboxes. #### Out of scope - Attachment indexing. The current connector emits message body + headers; binary attachments are flagged via `metadata.has_attachments` but not pulled. Adding attachment hydration is straightforward but scoped out per the issue's "decide whether attachments are indexed in the first version" note. - Delegated (per-user) OAuth. The connector uses app-only credentials, consistent with the SharePoint / Teams precedent in this codebase. 
### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-29 21:52:29 +08:00
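Stripping HTML bodies to text without an extra dependency, as the Outlook commit above describes, can be done with the standard library. The sketch below is one possible approach and is not the PR's `_strip_html` helper. `strip_html` and `_TextExtractor` are names chosen for this example, and skipping `script`/`style` content is an added assumption.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._parts).split())


def strip_html(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `strip_html("<p>Hello <b>world</b></p>")` yields `"Hello world"`.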
Rintaro	11af34a895	fix(opensearch): repair document-metadata path broken by #14577 (#15393 ) ### What problem does this PR solve? Document metadata is completely broken on the OpenSearch backend (`DOC_ENGINE=opensearch`). Both failures were introduced by #14577, which added a doc-metadata dispatch surface but only validated it against Elasticsearch. 1. Index creation rejected (`mapper_parsing_exception`). `OSConnection.create_doc_meta_idx` feeds `conf/doc_meta_es_mapping.json` verbatim to OpenSearch. That file declares a top-level `"dynamic": "runtime"`. Runtime fields are Elasticsearch-only; OpenSearch cannot parse the value: mapper_parsing_exception: Could not convert [dynamic.dynamic] to boolean (400) 2. `search()` signature mismatch (`TypeError`). `DocMetadataService` (added by #14577) calls `docStoreConn.search(...)` with snake_case kwargs (`select_fields=`, `index_names=`, `knowledgebase_ids=`, …), matching `ESConnection.search`. But `OSConnection.search` still uses camelCase parameters (`selectFields`, `indexNames`, `knowledgebaseIds`, …): TypeError: OSConnection.search() got an unexpected keyword argument 'select_fields' The UI then shows "0 fields" for every document on OpenSearch. ### Fix 1. In `OSConnection.create_doc_meta_idx`, normalize a top-level `"dynamic": "runtime"` to `True` for the OpenSearch request only. The shared mapping file is left untouched, so the Elasticsearch backend keeps its runtime-field behavior. Dynamic field discovery is preserved on OpenSearch. 2. Rename the `OSConnection.search()` parameters (and their in-method local uses) from camelCase to snake_case so they match `ESConnection.search()` and the `DocMetadataService` call sites. The change is confined to `search()`; `get/insert/update/delete` keep their existing positional signatures (they are called positionally from `rag/nlp/search.py`). ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Affected backends OpenSearch only. 
Elasticsearch, Infinity and OceanBase are untouched. ### How to reproduce 1. `DOC_ENGINE=opensearch`, restart the stack. 2. Upload/parse a document, then open the dataset's document list / set metadata. - Before: index creation 400s (`Could not convert [dynamic.dynamic]`), and/or `TypeError ... 'select_fields'`; document metadata shows 0 fields. ### Risk & backward compatibility - ES default deployment: no change. `doc_meta_es_mapping.json` is not modified, so ES still receives `"dynamic": "runtime"`. - `search()` rename is internal; the only kwarg caller (`DocMetadataService`) already uses the snake_case names this PR aligns to. ### Test plan - [ ] `DOC_ENGINE=opensearch`: per-tenant `ragflow_doc_meta_*` index is created (no `mapper_parsing_exception`); document metadata reads/writes work. - [ ] `DOC_ENGINE=elasticsearch` regression: doc-meta index still created with runtime mapping; metadata unchanged.	2026-05-29 21:49:36 +08:00
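The first fix in the commit above, normalizing `"dynamic": "runtime"` for OpenSearch only, can be sketched as a copy-and-patch step. The sketch assumes the function receives the mapping dict whose top level carries the `dynamic` key. `normalize_mapping_for_opensearch` is a hypothetical name, not the code in `OSConnection.create_doc_meta_idx`.

```python
import copy


def normalize_mapping_for_opensearch(mapping: dict) -> dict:
    """Return a copy of the mapping with a top-level "dynamic": "runtime" turned into True.

    The input is left untouched so a shared mapping can still be sent
    unchanged to Elasticsearch.
    """
    normalized = copy.deepcopy(mapping)
    if normalized.get("dynamic") == "runtime":
        normalized["dynamic"] = True
    return normalized
```

Working on a deep copy mirrors the commit's stated goal of leaving the shared mapping file as-is while changing only the request sent to OpenSearch.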
Rintaro	3dfc16973c	fix(opensearch): implement get_scores for KNN second-pass scoring (#15390 ) ### What problem does this PR solve? On the OpenSearch backend (`DOC_ENGINE=opensearch`), every retrieval that performs the KNN second-pass scoring crashes with: AttributeError: 'OSConnection' object has no attribute 'get_scores' Root cause. #14970 ("Refactor: Drop the vector fetch for ES") added a `get_scores()` helper to `ESConnectionBase` (`common/doc_store/es_conn_base.py`) and introduced `Dealer._knn_scores()` in `rag/nlp/search.py`, which calls `self.dataStore.get_scores(res)`. `search.py` routes Infinity and OceanBase to their own similarity paths via `DOC_ENGINE_INFINITY` / `DOC_ENGINE_OCEANBASE`, but OpenSearch sets neither flag, so it falls into the Elasticsearch branch and calls `get_scores`. `OSConnection` (which subclasses `DocStoreConnection` directly, not `ESConnectionBase`) never received that method, so any vector-search hit triggers the crash. It reproduces with any normal embedding (e.g. 1024-dim mistral-embed) as soon as a KNN query returns hits. ### Fix Add `OSConnection.get_scores()`, mirroring `ESConnectionBase.get_scores()`. OpenSearch hit headers expose `_score` exactly like Elasticsearch (the existing `OSConnection.__getSource` already reads `d["_score"]`), so the implementation is identical. Scope note: Infinity and OceanBase deliberately do not use `get_scores` (#14970 routes them elsewhere), so this fix is intentionally limited to the OpenSearch backend, which is the only one reaching the ES KNN-score path. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Affected backends OpenSearch only. Elasticsearch already implements `get_scores`; Infinity / OceanBase are routed away from it. ### How to reproduce 1. `DOC_ENGINE=opensearch` (docker `.env`), restart the stack. 2. Create a knowledge base with any dense embedding model and parse a document. 3. 
Run a retrieval / chat over that KB -> 500 with the AttributeError above. ### Risk & backward compatibility None for the default Elasticsearch deployment -- the change only adds a method to `OSConnection`. No default values or ES/Infinity/OceanBase behavior change. ### Test plan - [ ] With `DOC_ENGINE=opensearch`, retrieval over a KB returns scored chunks (no AttributeError). - [ ] `DOC_ENGINE=elasticsearch` regression: retrieval unchanged. - [ ] Empty-result path: `_knn_scores` early-returns `{}` (guarded), get_scores handles an empty `hits` list gracefully.	2026-05-29 21:49:15 +08:00
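A helper that reads `_score` from search hits, as the commit above describes, might look like the sketch below. The response layout (`hits.hits[]` entries carrying `_id` and `_score`) is the standard Elasticsearch/OpenSearch search response shape. Returning a dict from document id to score is an assumption, since the actual return type of `get_scores` is not shown in this log.

```python
from typing import Dict, Optional


def get_scores(res: Optional[dict]) -> Dict[str, float]:
    """Map each hit's _id to its _score; an empty or missing hit list gives {}."""
    hits = (res or {}).get("hits", {}).get("hits", [])
    return {hit["_id"]: hit["_score"] for hit in hits}
```

The empty-result case from the test plan is covered by the defaults: a response with no hits produces an empty dict rather than an error.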
呆萌闷油瓶	658ff06ca4	feat: add 4 new models for siliconflow (#15383 ) ### What problem does this PR solve? Added 4 new models: deepseek-ai/DeepSeek-V4-Pro deepseek-ai/DeepSeek-V4-Flash Pro/moonshotai/Kimi-K2.6 Pro/zai-org/GLM-5.1 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-29 19:28:29 +08:00
web-dev0521	bda2117a25	feat(connector): implement OneDrive data source connector (issue #15330 ) (#15331 ) ### What problem does this PR solve? Closes #15330. RAGFlow had no connector for OneDrive / OneDrive for Business. Users who store working documents in OneDrive could not index them into a knowledge base without manually downloading and re-uploading files. This PR adds a net-new OneDrive data source that: - Authenticates against Microsoft Graph with the same MSAL client-credentials flow already used by the SharePoint and Teams connectors (no new auth primitives). - Enumerates every drive visible to the service principal and pages through `/drives/{id}/root/delta`, persisting `@odata.deltaLink` values per drive so subsequent syncs only fetch changed items. - Optionally narrows ingestion to a sub-folder (`folder_path`) without needing a separate code path. - Surfaces typed errors on the validation probe (`GET /drives?$top=1`): 401 → `ConnectorMissingCredentialError`, 403 → `InsufficientPermissionsError` (with a `Files.Read.All` hint), 5xx → `UnexpectedValidationError`. - Filters folders, soft-deleted items, and unsupported extensions (`.pdf .docx .doc .xlsx .xls .pptx .ppt .txt .md .csv`). #### Files \| File \| Change \| \|------\|--------\| \| `common/data_source/onedrive_connector.py` \| New — `OneDriveConnector` + `OneDriveCheckpoint`. \| \| `common/data_source/config.py` \| `DocumentSource.ONEDRIVE = "onedrive"`. \| \| `common/constants.py` \| `FileSource.ONEDRIVE = "onedrive"`. \| \| `common/data_source/__init__.py` \| Export `OneDriveConnector`. \| \| `rag/svr/sync_data_source.py` \| `OneDrive(SyncBase)` with `batch_size` normalisation; registered in `func_factory`. \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `DataSourceKey.ONEDRIVE`, visibility map (`syncDeletedFiles: true`), info entry, form fields (tenant_id, client_id, client_secret, folder_path, batch_size), default values. 
\| \| `web/src/locales/en.ts`, `web/src/locales/zh.ts` \| `onedriveDescription` + 4 tooltip keys (EN + ZH). \| \| `test/unit_test/data_source/test_onedrive_connector_unit.py` \| New — 13 unit tests (`p1`/`p2`) covering auth, validation, checkpoint helpers, and document filtering. \| #### Required Azure AD permission `Files.Read.All` (Application, admin-granted). #### Out of scope - Interactive end-user OAuth (delegated permissions) — the connector uses app-only credentials, consistent with the SharePoint / Teams precedent. - Binary download of file contents — the sync layer emits `Document`s carrying `webUrl` + metadata; bytes are hydrated downstream by the parse pipeline. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-29 19:26:06 +08:00
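The item filter described in the OneDrive commit above can be sketched against the shape of a Microsoft Graph `driveItem`, where folders carry a `folder` facet and delta-reported deletions carry a `deleted` facet. The extension list is the one quoted in the commit message. `is_indexable` is a hypothetical name and the real connector may apply further checks.

```python
import os

# Extension list as quoted in the commit message.
SUPPORTED_EXTENSIONS = frozenset(
    {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt", ".txt", ".md", ".csv"}
)


def is_indexable(item: dict) -> bool:
    """Reject folders, soft-deleted items, and unsupported file extensions."""
    if "folder" in item or "deleted" in item:
        return False
    extension = os.path.splitext(item.get("name", ""))[1].lower()
    return extension in SUPPORTED_EXTENSIONS
```

Lower-casing the extension keeps files such as `Report.PDF` in scope.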
Lynn	dc4b82523b	Feat: tenant llm provider (#14595 ) ### What problem does this PR solve? Python implementation of the Go-based model_provider API suite. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: bill <yibie_jingnian@163.com>	2026-05-29 17:39:41 +08:00
web-dev0521	98bc9ca6ac	feat: implement Microsoft Teams data source connector (#15193 ) ### What problem does this PR solve? Closes #15191. RAGFlow shipped a Microsoft Teams connector stub (`common/data_source/teams_connector.py`) whose document-loading methods all returned `[]`, `Teams._generate()` was a `pass`, and Teams was commented out of the data-source settings UI. As a result there was no way to index Teams channel conversations into a knowledge base. This PR implements the connector end to end on top of Microsoft Graph (Office365-REST-Python-Client). It shares the MSAL client-credentials auth shape with the SharePoint connector. Backend - `common/data_source/teams_connector.py` - `load_credentials()` now builds the Graph client using an MSAL client-credentials token callback — the form `GraphClient` actually expects. (The previous stub passed a raw access-token string to `GraphClient(...)`, which is not how that client is driven.) Token acquisition is lazy, so credential loading performs no network call. - `validate_connector_settings()` lists teams via Graph. - `load_from_checkpoint()` is now a generator that pages teams → channels → messages, flattens each top-level post together with its replies into one blob-based `Document` (`extension` `.txt`/`.html`, `blob`, `size_bytes`, `doc_updated_at`). Incremental syncs are bounded by message `lastModifiedDateTime` (falling back to `createdDateTime`). Per-message errors surface as `ConnectorFailure` instead of aborting the run. - `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument` batches and the checkpoint helpers return proper `TeamsCheckpoint`s. - ACL → `ExternalAccess` mapping is intentionally left best-effort (`load_from_checkpoint_with_perm_sync` delegates to the standard load) because the sync pipeline does not currently persist `ExternalAccess`. 
- `rag/svr/sync_data_source.py` - Implemented `Teams._generate()` using the existing `CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google Drive), supporting full reindex and incremental polling from `poll_range_start`. - `TeamsConnector` is already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `TEAMS` data-source enum and added its form fields (`tenant_id`, `client_id`, `client_secret`), default values, display metadata, and a Teams icon. - Added `teamsDescription` / `teamsTenantIdTip` to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_teams_connector_unit.py`: mock-based unit tests covering credential loading (incomplete creds raise, happy path sets the Graph client, fetch-without-creds raises), post/reply flattening (incl. the HTML vs text extension), incremental `lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass; `ruff check` is clean. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 17:10:38 +08:00
web-dev0521	5de021ebb4	feat: implement Slack data source connector (#15188 ) ### What problem does this PR solve? Closes #15187. RAGFlow shipped a Slack connector (`common/data_source/slack_connector.py`) but it was never usable: `Slack._generate()` in the sync worker was a `pass` stub, the connector's document-generating code was incompatible with the current data model, and Slack was commented out of the data-source settings UI. As a result, teams had no way to index Slack channels/threads into a knowledge base. This PR completes the connector end to end. Backend - `common/data_source/slack_connector.py` - Rewrote `thread_to_doc` to produce a blob-based `Document` (`extension`/`blob`/`size_bytes`). The previous implementation built the doc with a `sections=[...]` argument and omitted the now-required `blob`/`extension`/ `size_bytes` fields, so it raised a validation error against the current `Document` model. Thread messages are now cleaned and flattened into a single UTF-8 text blob. - Added `load_from_state()` / `poll_source(start, end)` generators. The connector's checkpoint interface is a no-op stub, so both full and incremental syncs run through a single channel-iterating generator built on the existing module helpers (`get_channels`, `filter_channels`, `get_channel_messages`, `_process_message`), with per-channel thread de-duplication. - `rag/svr/sync_data_source.py` - Implemented `Slack._generate()`. Credentials are loaded via `StaticCredentialsProvider` (the connector requires `slack_bot_token` and does not support `load_credentials`). Supports full reindex and incremental polling from `poll_range_start`, plus the optional channel filter. Modeled on the Confluence/Dropbox wrappers. - `SlackConnector` was already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `SLACK` data-source enum and added its form fields (Slack bot token + optional channel filter), default values, display metadata, and a Slack icon. 
- Added `slackDescription` / `slackBotTokenTip` / `slackChannelsTip` strings to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_slack_connector_unit.py`: unit tests covering credential loading (`load_credentials` raises, `set_credentials_provider` initializes clients, missing credentials raises) and document generation (standalone message + flattened thread, blob/extension/size_bytes/metadata, and the incremental poll time window). All 5 pass; `ruff check` is clean. Required Slack scopes: `channels:read`, `channels:history`, `users:read`. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 15:46:07 +08:00
web-dev0521	c4c4e228e3	feat: implement SharePoint data source connector (#15190 ) ### What problem does this PR solve? Closes #15189. RAGFlow shipped a SharePoint connector stub (`common/data_source/sharepoint_connector.py`) whose document-loading methods all returned `[]`, `SharePoint._generate()` was a `pass`, and SharePoint was commented out of the data-source settings UI. As a result there was no way to index files stored in SharePoint document libraries. This PR implements the connector end to end on top of Microsoft Graph (Office365-REST-Python-Client). Backend - `common/data_source/sharepoint_connector.py` - `load_credentials()` now builds the Graph client using an MSAL client-credentials token callback — the form `GraphClient` actually expects. (The previous stub passed a raw access-token string to `GraphClient(...)`, which is not how that client is driven.) Token acquisition is lazy, so credential loading does no network call. - `validate_connector_settings()` resolves the configured site via Graph. - `load_from_checkpoint()` is now a generator that enumerates every document library under the site, walks folders depth-first, downloads each file, and yields blob-based `Document` objects (`extension` / `blob` / `size_bytes` / `doc_updated_at`). Incremental syncs are bounded by file `lastModifiedDateTime`. Per-file errors are surfaced as `ConnectorFailure` rather than aborting the run. - `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument` batches (no downloads) and the checkpoint helpers return proper checkpoints. - ACL → `ExternalAccess` mapping is intentionally left best-effort (`load_from_checkpoint_with_perm_sync` delegates to the standard load) because the sync pipeline does not currently persist `ExternalAccess`; this can be extended once that plumbing exists. 
- `rag/svr/sync_data_source.py` - Implemented `SharePoint._generate()` using the existing `CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google Drive), supporting full reindex and incremental polling from `poll_range_start`. - `SharePointConnector` is already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `SHAREPOINT` data-source enum and added its form fields (`site_url`, `tenant_id`, `client_id`, `client_secret`), default values, display metadata, and a SharePoint icon. - Added `sharepointDescription` / `sharepointSiteUrlTip` to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_sharepoint_connector_unit.py`: mock-based unit tests covering credential loading (incomplete creds raise, happy path sets the Graph client, fetch-without-creds raises), drive traversal + file download, incremental `lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass; `ruff check` is clean. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 13:26:08 +08:00
Jack	f0cb7a544b	Refactor: Task Executor (#15154 ) ### What problem does this PR solve? 1. Break the huge function into smaller pieces 2. Add unit tests for the smaller functions 3. Layered design a. infra layer - task_context.py, recording_context.py, write_operation_interceptor.py, ... b. service layer - *_service.py c. business layer - task_handler.py 4. Default behavior: use the refactored version; the original version can be restored by changing an environment variable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement --------- Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-05-27 21:54:17 +08:00
Wang Qi	619b971785	Fix: empty file with better message (#15232 ) Fix: empty file with better message	2026-05-26 12:28:53 +08:00
wdeveloper16	4b36801b53	fix: resolve asyncio correctness issues (fire-and-forget tasks, event loop nesting) (#14761 ) ## Summary Fixes the confirmed asyncio anti-patterns from #14755. Only the three verified bugs are addressed; patterns already correctly using `asyncio.new_event_loop()` in a fresh thread are left untouched. ### Changes `api/apps/restful_apis/tenant_api.py` — fire-and-forget `send_invite_email` `asyncio.create_task()` was called without storing the `Task` reference. CPython's GC can collect an unfinished task, silently cancelling it and swallowing exceptions. Fixed by storing the task in a module-level `_background_tasks: set[Task]` with a `done_callback` to discard it on completion — the standard Python idiom for safe background tasks. `api/apps/restful_apis/agent_api.py` — fire-and-forget `background_run` Same root cause in the webhook "Immediately" execution path. Same fix applied. `rag/llm/chat_model.py` (`LocalLLM._stream_response`) — `asyncio.get_event_loop()` on running loop `asyncio.get_event_loop()` returns Quart's running event loop when called from an async context. Calling `loop.run_until_complete()` on it raises `RuntimeError`. Replaced with `asyncio.new_event_loop()` so the generator uses a dedicated fresh loop, closed in a `finally` block. ## What was NOT changed - `llm_service._sync_from_async_stream` and `evaluation_service._sync_from_async_gen`: both already correctly use `asyncio.new_event_loop()` inside a fresh thread. - `llm_service._run_coroutine_sync`: only caller is `rag/app/resume.py` (sync context), so `thread.join()` is correct there. - `requests` in agent tools: sync methods dispatched through thread pools; httpx migration is a separate, larger refactor. ## Test plan - [ ] Invite a team member and confirm the email is sent with no task warnings in logs. - [ ] Trigger a webhook agent in "Immediately" mode; confirm canvas state is persisted after background run. 
- [ ] Verify `LocalLLM` (Jina backend) chat and streaming work end-to-end. Closes #14755 --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-05-25 22:45:40 +08:00
Wang Qi	7e6844118b	Fix search vector_similarity_weight (#15108 ) ### What problem does this PR solve? Fix search vector_similarity_weight ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-22 16:05:13 +08:00
Wang Qi	a9ec78cb9c	Refactor: enhance retry and timeout (#14983 ) ### What problem does this PR solve? 1. Enhance retry and timeout, and adjust the default timeout 2. NER: spacy do not batch chunks 3. extract _has_cancel_and_exit 4. enhance log messages ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-05-22 13:16:39 +08:00
buua436	04bdb41909	Fix: guard missing task language (#15136 ) ### What problem does this PR solve? guard missing task language ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-22 11:46:38 +08:00
Wang Qi	c5a46fda44	Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a different event loop (#15100 ) Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a different event loop	2026-05-21 19:23:41 +08:00
Jonathan Hill	111cdc77b5	fix: guard LLM response against empty choices (fixes #14711 ) (#14988 ) ## Summary Fixes 10 unguarded `response.choices[0]` accesses that cause `IndexError` or `AttributeError` when the LLM returns an empty `choices` list — the scenario described in #14711. - `rag/llm/cv_model.py` - `rag/llm/chat_model.py` Each access site is now guarded with: ```python if not response.choices: raise ValueError("LLM returned empty response") ``` ## Verification Detected and verified by [pact](https://github.com/qizwiz/pact) — a sheaf-cohomological LLM contract checker using Z3 as a local theory solver. pact sheaf-cohomological proof status after fix: \| File \| Ȟ¹ (after) \| Z3 \| \|------\|-----------\|-----\| \| `rag/llm/cv_model.py` \| 0 \| UNSAT ✓ \| \| `rag/llm/chat_model.py` \| 0 \| UNSAT ✓ \| All access sites proven safe (Z3 UNSAT certificate). The checker was also used to verify the autogen streaming-None fix in [microsoft/autogen#7711](https://github.com/microsoft/autogen/pull/7711). ## Test plan - [ ] Existing test suite passes - [ ] Manually test with a provider that returns empty `choices` under load (e.g. Vertex AI) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: Jonathan Hill <jonathan.f.hill@gmail.com>	2026-05-21 15:37:19 +08:00
Prateek Jain	bf4864e614	fix(infinity): declare `extra` field + serialize dict on write to unbreak RAPTOR (#14998 ) ### What problem does this PR solve? Fixes #14997. RAPTOR builds on the Infinity backend have been broken since v0.25.2 introduced the `extra` field in code (`rag/svr/task_executor.py:1011`) without declaring it in `conf/infinity_mapping.json`. Every RAPTOR job fails with: ``` infinity.common.InfinityException: (3013, 'Fail to bind the expression: extra@src/planner/expression_binder_impl.cpp:99') ``` The auto-migration in `common/doc_store/infinity_conn_base.py:_migrate_db()` adds any columns it finds in the mapping JSON to existing tables — so the only thing standing between users and a working RAPTOR build is that one missing declaration. OceanBase, ES, and OpenSearch were unaffected because they store `extra` as a native JSON type; only Infinity (which has a strict `varchar`/`integer`/`float` schema) needed the addition. ### The fix Two-part change: 1. `conf/infinity_mapping.json`: declare `"extra": {"type": "varchar", "default": ""}`. On next startup, `_migrate_db()` adds the column to all existing chunk tables — no manual DDL needed for upgrading installations. 2. `rag/utils/infinity_conn.py` `insert()`: serialize the `extra` dict to a JSON string at write time, since Infinity's `varchar` can't store a Python dict directly. Modelled on the existing `chunk_data` handling a few lines above. The read path (`rag/utils/raptor_utils.py:_as_extra_dict`) already normalises both dict and JSON-string inputs, so no read-side change is needed. Other backends are untouched — `task_executor.py` still writes the dict, and the OceanBase/ES/OpenSearch insert paths handle dicts natively. ### Verification Tested on a v0.25.4 deployment with the Infinity backend by applying the same two changes via mounted-volume override: - Confirmed `_migrate_db()` adds the `extra` column to all pre-existing chunk tables on startup (column visible via Infinity's `show_columns()`). 
- Triggered RAPTOR builds on four datasets (~21k chunks total) via `POST /api/v1/datasets/<id>/index?type=raptor`. - All four progressed past the previously-failing `get_raptor_chunk_methods()` call into actual entity-extraction and clustering work without the (3013) error. - GraphRAG builds (which can trigger the same path indirectly via `task_executor.py:857`) also progressed cleanly. ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2026-05-21 15:36:15 +08:00
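
The write-side and read-side handling of `extra` described in #14998 can be sketched as below. This is an illustration under assumptions: `serialize_extra` and `as_extra_dict` are hypothetical names, and the real logic lives in `rag/utils/infinity_conn.py` and `rag/utils/raptor_utils.py`, whose exact behavior is not shown in this log.

```python
import json
from typing import Any


def serialize_extra(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `row` whose `extra` field fits a varchar column."""
    out = dict(row)
    extra = out.get("extra")
    if isinstance(extra, dict):
        # A varchar column cannot hold a Python dict, so store JSON text.
        out["extra"] = json.dumps(extra, ensure_ascii=False)
    elif extra is None:
        # Assumed to mirror the "" default declared in the mapping file.
        out["extra"] = ""
    return out


def as_extra_dict(value: Any) -> dict[str, Any]:
    """Accept either a dict or a JSON string and always return a dict."""
    if isinstance(value, dict):
        return value
    if isinstance(value, str) and value.strip():
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}
```

Because the reader tolerates both shapes, backends that store `extra` natively and the varchar-based one can share the same consumer code.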
Kevin Hu	e7544562cc	Feat: @tool decorator for chat-model tool registration (#15047 ) ## Summary - Adds a lightweight `@tool` decorator and `FunctionToolSession` adapter in `rag/llm/tool_decorator.py` that let callers register plain Python functions as LLM tools without hand-writing OpenAI function schemas or building an MCP-style session. - Refactors `Base.bind_tools` and `LiteLLMBase.bind_tools` in `rag/llm/chat_model.py` to accept either the new decorator form `bind_tools(tools=[fn1, fn2])` or the existing `(toolcall_session, tools_schemas)` form, so existing agent/dialog call-sites in `agent/component/agent_with_tools.py`, `api/db/services/llm_service.py`, and `api/db/services/dialog_service.py` are unaffected. - Adds 8 unit tests in `test/unit_test/rag/llm/test_tool_decorator.py` covering schema shape, required/optional inference, sync + async dispatch, and bad-input rejection. ## Usage ```python from rag.llm.tool_decorator import tool @tool def get_weather(city: str) -> str: """Get current weather for a city. :param city: City name to look up. """ return f"{city}: 21 C, partly cloudy" chat_mdl.bind_tools(tools=[get_weather]) ans, tk = await chat_mdl.async_chat_with_tools(system, history) ``` The decorator introspects `inspect.signature` + type hints + the docstring (`:param name:` style) and attaches an OpenAI-format `openai_schema` to the callable. `FunctionToolSession` duck-types the existing `ToolCallSession` protocol, dispatching async callables directly and sync ones through `thread_pool_exec` so the event loop is never blocked. ## Design notes - `tool_decorator.py` deliberately does not live inside `rag/llm/__init__.py` to avoid forcing every consumer through the heavy provider auto-discovery loop and to sidestep a circular import (`__init__.py` imports `chat_model`, which would otherwise need symbols from `__init__.py`). 
- `FunctionToolSession` is duck-typed against `common.mcp_tool_call_conn.ToolCallSession` rather than explicitly inheriting from it, so importing the decorator doesn't pull the MCP client SDK into the import graph. - Docstring parsing is intentionally minimal (`:param name:` only) to keep this dependency-free; Google/NumPy styles can be added later via `docstring_parser` if needed. ## Test plan - [x] `python -m pytest test/unit_test/rag/llm/test_tool_decorator.py -v` — 8 passed - [x] `python -m pytest test/unit_test/rag/llm/ --ignore=test/unit_test/rag/llm/test_perplexity_embed.py` — 11 passed (the ignored test has a pre-existing `numpy` import that's unrelated) - [ ] Reviewer: smoke-test the new path end-to-end with a live model via `chat_mdl.bind_tools(tools=[my_fn])` to confirm the OpenAI-format schemas pass through unchanged 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 15:32:17 +08:00
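
The introspection step that #15047 describes (signature, type hints, and `:param name:` docstring lines feeding an OpenAI-format schema) might look roughly like this. It is a sketch, not the contents of `rag/llm/tool_decorator.py`: `sketch_tool` and `_JSON_TYPES` are hypothetical, and the type mapping is an assumption limited to a few primitives.

```python
import inspect
import re
from typing import Any, Callable, get_type_hints

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Attach an OpenAI-style function schema to `fn` as `fn.openai_schema`."""
    hints = get_type_hints(fn)
    doc = inspect.getdoc(fn) or ""
    # Only the `:param name: text` docstring style is recognised.
    param_docs = dict(re.findall(r":param\s+(\w+):\s*(.+)", doc))
    summary = doc.split(":param", 1)[0].strip()

    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {
            "type": _JSON_TYPES.get(hints.get(name), "string"),
            "description": param_docs.get(name, ""),
        }
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(name)

    fn.openai_schema = {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": summary,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }
    return fn
```

Keeping the parser this small matches the stated design goal of staying dependency-free; richer docstring styles would need a real parser.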
qinling0210	dbef3e361f	Update chunk/metadata cli (#15055 ) ### What problem does this PR solve? Update chunk/metadata cli ### Type of change - [ ] Refactoring	2026-05-20 20:32:06 +08:00
Magicbook1108	b28e134944	Feat: add local & ssh provider in admin panel (#15039 ) ### What problem does this PR solve? Feat: add local & ssh provider in admin panel ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-20 16:56:20 +08:00
Rene Arredondo	f58e0b3eca	Feat: VLM image descriptions in MinerU parser (#14869 ) (#14946 ) ## Summary Closes #14869. Adds VLM-based semantic descriptions to image chunks produced by the MinerU parser, closing a long-standing parity gap with the deepdoc parser's `VisionFigureParser`. A maintainer flagged this in #13342 ("We may add the VLM enhancement to MinerU parser as well") and an earlier proposal exists in #13824; this PR lands the change end-to-end inside the existing parser plumbing. ## Why Today the MinerU parser returns image chunks containing only the native `image_caption` and `image_footnote` strings from MinerU's JSON. When neither is present (or when both are sparse), the chunk carries effectively no searchable content for the figure and retrieval misses it entirely. Users who configured a local VLM (reporter's case: Gemma-4-31B) had to post-process MinerU's `tmp/.json` themselves. The deepdoc parser already solves this via [`VisionFigureParser`](deepdoc/parser/figure_parser.py): when the tenant has an `IMAGE2TEXT` model configured, each figure gets a semantic description merged into its chunk. This PR brings the same behavior to MinerU. ## What changed ### `deepdoc/parser/mineru_parser.py` - New method `_enhance_images_with_vlm(outputs, vision_model, callback=None)`* — collects every `IMAGE` block with a readable `img_path`, runs `rag.app.picture.vision_llm_chunk` in a 10-worker `ThreadPoolExecutor` using the existing `vision_llm_figure_describe_prompt`, and writes the result back as `vlm_description`. Per-image failures are logged and skipped — they never abort the run. - `_transfer_to_sections` (IMAGE branch) — folds `vlm_description` into the section text alongside caption + footnote, so the description becomes part of the chunk and is searchable / retrievable. - `parse_pdf` — after `_read_output`, calls `_enhance_images_with_vlm(outputs, vision_model, callback=callback)` when a `vision_model` kwarg is supplied. 
Wrapped in `try / except` so a VLM outage cannot break parsing. ### `rag/app/naive.py` (`by_mineru`) After successfully resolving the MinerU OCR parser, also resolves the tenant's default `LLMType.IMAGE2TEXT` model via `get_tenant_default_model_by_type`, wraps it in an `LLMBundle`, and injects it as `kwargs["vision_model"]` before delegating to `parse_pdf`. ## Behavior \| Tenant config \| Behavior \| \|---\|---\| \| `IMAGE2TEXT` model configured \| MinerU image chunks contain `caption + footnote + VLM description`. Retrieval against figures now actually works. \| \| No `IMAGE2TEXT` model configured \| Exact same output as today (caption + footnote only). Lookup fails silently with an info log; no error, no regression. \| \| VLM call fails for a single image \| That image silently falls back to caption + footnote; other images proceed. \| \| Caller already passes `vision_model` in kwargs \| We don't override it — `if "vision_model" not in kwargs` guards the lookup. \| ## Files - `deepdoc/parser/mineru_parser.py` (+56) - `rag/app/naive.py` (+13)	2026-05-19 16:08:10 +08:00
plind	f169ab4b39	feat(tts): cache synthesized speech in Redis to avoid redundant calls (#14851 ) ## What problem does this PR solve? Closes #12017. TTS output is deterministic for a given `(model, text)` pair, so re-running the same text through the same TTS model produces the same bytes — yet `Canvas.tts` and `dialog_service.tts` re-synthesized on every request. That's slow and wastes provider quota whenever the same assistant response is replayed, shared across users, or repeated within a session. ### Change New helper `rag/utils/tts_cache.py` with `synthesize_with_cache(tts_mdl, cleaned_text)`: - Key: `tts:cache:{model_id}:{sha256(text)}` — separate namespace per model, identical cleaned text reuses a single entry across both call sites. - Value: the hex-encoded audio blob both call sites already returned. No format change for downstream consumers. - TTL: 7 days by default, configurable via `RAGFLOW_TTS_CACHE_TTL_SECONDS`. - Failure modes: a Redis hiccup falls back to direct synthesis; a failed synthesis still returns `None` (existing contract preserved). [`Canvas.tts`](https://github.com/infiniflow/ragflow/blob/main/agent/canvas.py#L683-L724) and [`dialog_service.tts`](https://github.com/infiniflow/ragflow/blob/main/api/db/services/dialog_service.py#L1367-L1380) now route through the helper; the per-file bytes-accumulation/hex-encode loop has been removed in favor of one shared implementation. ## Type of change - [x] New Feature (non-breaking change which adds functionality) ## Test plan - [ ] Cache hit, chat path: Configure a dialog with TTS enabled, ask the same question twice with `stream=false`. Verify the second response returns the same `audio_binary` and that the second invocation doesn't hit the TTS provider (e.g., observe provider-side logs / usage counters; check no `LLMBundle.tts can't update token usage` log line on the second run). 
- [ ] Cache hit, agent path: Same exercise via a Conversational Agent that includes a Message component playing back the answer. - [ ] Cache isolation per model: Switch tenant's `tts_id` between two models, run the same text against each — confirm the second model's first synthesis still happens (no cross-model hits). - [ ] TTL override: Set `RAGFLOW_TTS_CACHE_TTL_SECONDS=120`, confirm the entry expires after 2 minutes. - [ ] Redis unavailable: Stop Redis (or break the connection). Verify the TTS endpoint still works — synthesis falls back to direct calls, with a `TTS cache lookup failed` / `TTS cache store failed` warning logged. - [ ] Failure path: Configure a TTS model with an invalid API key, ensure the response still returns successfully with `audio_binary=None` (no regression vs. current behavior).	2026-05-19 14:20:40 +08:00
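
The key scheme and fallback behavior listed for #14851 can be illustrated as below. This is a sketch under assumptions: `InMemoryCache` is a hypothetical stand-in for the Redis client, `cached_synthesis` is not the real `synthesize_with_cache` signature, and the `synthesize` callable replaces the TTS model call.

```python
import hashlib
import os
from typing import Callable, Optional

# 7 days by default, overridable through the environment variable named above.
DEFAULT_TTL = int(os.environ.get("RAGFLOW_TTS_CACHE_TTL_SECONDS", 7 * 24 * 3600))


def tts_cache_key(model_id: str, cleaned_text: str) -> str:
    """Build `tts:cache:{model_id}:{sha256(text)}` for one (model, text) pair."""
    digest = hashlib.sha256(cleaned_text.encode("utf-8")).hexdigest()
    return f"tts:cache:{model_id}:{digest}"


class InMemoryCache:
    """Hypothetical stand-in for a Redis client; the TTL is accepted but ignored."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str, ttl_seconds: int) -> None:
        self._data[key] = value


def cached_synthesis(
    cache,
    model_id: str,
    cleaned_text: str,
    synthesize: Callable[[str], Optional[str]],
    ttl_seconds: int = DEFAULT_TTL,
) -> Optional[str]:
    key = tts_cache_key(model_id, cleaned_text)
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except Exception:
        pass  # cache lookup failed: fall back to direct synthesis
    audio_hex = synthesize(cleaned_text)
    if audio_hex is None:
        return None  # failed synthesis is never cached
    try:
        cache.set(key, audio_hex, ttl_seconds)
    except Exception:
        pass  # cache store failed: still return the fresh audio
    return audio_hex
```

Hashing the text keeps keys short and uniform regardless of input length, and the model id prefix is what prevents cross-model hits.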
Magicbook1108	b69a6a5d80	Feat: full optimization on connector dashboard (#14979 ) ### What problem does this PR solve? This PR improves the connector dashboard task management experience and adds better visibility into connector execution logs. ### Overview: #### Before <img width="700" alt="image" src="https://github.com/user-attachments/assets/e4a8ed6f-2e18-4f0f-8528-41a514550052" /> #### Now: <img width="700" alt="Screenshot from 2026-05-18 16-31-30" src="https://github.com/user-attachments/assets/d4ca193b-847a-49ae-9e4f-5fbca60ea627" /> ### 1. Add a new logging page to the connector dashboard A new logging page has been added so users can view connector task execution logs directly from the connector dashboard. ### 2. Merge the Resume button into Confirm The separate Resume button has been removed. The Confirm button now represents different actions depending on the current task state: - Save: Save form changes and reschedule tasks. - Stop: Cancel currently scheduled or running tasks. - Resume: Create new scheduled tasks after the previous tasks have been stopped. - Start: Start tasks when no task has been started yet. ### 3. Separate syncing and pruning tasks Connector tasks are now separated into syncing and pruning. Pruning is controlled by the Sync deleted files option: - When Sync deleted files is disabled, only syncing tasks are shown. - When Sync deleted files is enabled, both syncing and pruning tasks are shown. Now: Sync deleted files disabled <img width="700" alt="Sync deleted files disabled" src="https://github.com/user-attachments/assets/dbd9232e-614a-407f-a0b1-c109e5fa567d" /> Now: Sync deleted files enabled <img width="700" alt="Sync deleted files enabled" src="https://github.com/user-attachments/assets/1f527f48-ccb3-4ee8-97ca-086891489296" /> ### 4. Update logs in backend <img width="700" alt="image" src="https://github.com/user-attachments/assets/10a95a3f-98c1-4e67-8afa-ddf6cda5b0b2" /> ### 5. 
Remove connector resume API - Removed: `POST /v1/connectors/<connector_id>/resume` - Replaced by: `PATCH /v1/connectors/<connector_id>` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-19 10:07:11 +08:00
Wang Qi	13b422037f	Refactor: enhance graphrag - part 2 (#14972 ) ### What problem does this PR solve? 1. expose batch_chunk_token_size for configuration 2. retrieve chunks when building the subgraph for each doc, instead of retrieving all docs' chunks at the beginning 3. get all chunks for a document; the limit used to be hard-coded to 10000 4. delete the unused method run_graphrag ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring Follow on: #14617	2026-05-18 16:10:21 +08:00
qinling0210	f1d2383572	Push metadata filters down to Infinity (#14974 ) ### What problem does this PR solve? Push metadata filters down to Infinity ### Type of change - [x] Refactoring	2026-05-18 14:22:04 +08:00
Kevin Hu	7cdc74bbe5	Refactor: Drop the vector fetch for ES (#14970 ) ## Summary - Stop pulling chunk vectors (`q__vec`) back from Elasticsearch in the main retrieval path. ES already knows them; shipping them was pure bandwidth/memory overhead. - Recover the per-chunk cosine similarity via a second KNN-only ES call filtered by the candidate chunk ids. The new `_score` is merged with locally computed term similarity using the user-configured `vector_similarity_weight`. - Lazily fetch the chunk embedding only for the chunks `insert_citations` actually needs. ## Details `rag/nlp/search.py` - `Dealer.search`: no longer appends `q__vec` to the ES select list. OceanBase still gets it (its rerank path is unchanged). - New `Dealer._knn_scores(sres, idx_names, kb_ids)`: a `MatchDenseExpr` over the cached query vector filtered by `id IN sres.ids`, returning `{chunk_id: cosine_score}` via ES `_score`. - New `Dealer.rerank_with_knn(...)`: term similarity from `qryr.token_similarity` plus the ES-supplied KNN score, combined with `tkweight`/`vtweight` and the existing rank-feature bonus. - New `Dealer.fetch_chunk_vectors(chunk_ids, tenant_ids, kb_ids, dim)`: on-demand vector fetch for citation use. - `Dealer.retrieval` routes Infinity → unchanged, OceanBase → existing local `rerank`, ES → new KNN-score path. `common/doc_store/es_conn_base.py` - New `get_scores(res)` helper returning `{_id: _score}` directly from hit headers (ES doesn't surface `_score` through `get_fields`). `api/db/services/dialog_service.py` - New top-level `_hydrate_chunk_vectors(...)` helper. On ES it back-fills `ck["vector"]` from `fetch_chunk_vectors` right before `insert_citations`. No-op on Infinity / OB (their chunks already carry vectors). - Both `decorate_answer` closures became `async` and are `await`-ed at all call sites in `async_chat` and `async_ask`. 
## Backend behavior \| Backend \| Returns chunk vec in main search \| Sim source \| Vectors for citations \| \|---\|---\|---\|---\| \| ES \| No \| second KNN call (`_score`) merged with term sim \| fetched on demand \| \| Infinity \| No (unchanged) \| normalized `_score` \| already on chunks \| \| OceanBase \| Yes (kept) \| local hybrid rerank \| already on chunks \| ## Test plan	2026-05-18 14:21:56 +08:00
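
The score merge that #14970 describes for the ES path can be sketched as a weighted sum. This is an assumption-laden illustration: `merge_scores` is a hypothetical name, and the choice `tkweight = 1 - vector_similarity_weight` is assumed here rather than stated in the commit message.

```python
from typing import Optional


def merge_scores(
    term_sim: dict[str, float],
    knn_scores: dict[str, float],
    vector_similarity_weight: float = 0.3,
    rank_bonus: Optional[dict[str, float]] = None,
) -> dict[str, float]:
    """Combine per-chunk term similarity with KNN scores keyed by chunk id."""
    vtweight = vector_similarity_weight
    tkweight = 1.0 - vtweight  # assumed complement of the vector weight
    rank_bonus = rank_bonus or {}
    return {
        chunk_id: tkweight * ts
        + vtweight * knn_scores.get(chunk_id, 0.0)  # chunk missing from KNN result
        + rank_bonus.get(chunk_id, 0.0)
        for chunk_id, ts in term_sim.items()
    }
```

A chunk absent from the KNN response contributes a vector score of zero in this sketch; the real code may handle that case differently.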
07heco	e194027b01	refactor: optimize BaseTitleChunker to improve RAG document chunk quality (#14247 ) ## RAG Optimization Description Optimize the core `BaseTitleChunker` in `rag/flow/chunker/title_chunker/common.py` to improve RAG document chunking quality and retrieval accuracy. ## Key Changes 1. Format-branched text processing: Preserve original whitespace & indentation for Markdown/HTML payloads to maintain document semantics and chunk fidelity; only perform full whitespace cleaning on plain text content. 2. Empty chunk filtering: Thoroughly filter invalid pure-blank lines to reduce noisy data in vector database. 3. Code deduplication: Unified markdown/text/html payload extraction logic, removed redundant repeated code blocks. 4. None serialization fix: Avoid converting `None` value into literal `"None"` string in chunk text fields. 5. Production logging: Added input/output line count logging for filter logic, observable in online environment. 6. 100% backward compatible: No changes to chunking hierarchy rules, output format and all existing workflows. ## RAG Business Value - Preserves document format fidelity for structured Markdown/HTML files - Reduces invalid noisy chunks → improves RAG retrieval precision - Cleans plain text data → optimizes vector embedding quality - Improves code maintainability with no breaking changes - Provides observable logging for chunk filtering behavior ## Compatibility - ✅ No API changes - ✅ No chunk logic modifications - ✅ All document parsing/chunking workflows unaffected - ✅ All pre-checks passed, no code conflicts ### Type of change - [x] Refactoring - [x] Performance Improvement	2026-05-18 10:00:18 +08:00
wdeveloper16	14c0985182	feat: bump Python minimum from 3.12 to 3.13, drop strenum backport (#14767 ) Closes #14753 ## What changed \| File \| Change \| \|---\|---\| \| `pyproject.toml` \| `requires-python` → `>=3.13,<3.15`; remove `strenum==0.4.15` \| \| `Dockerfile` \| `uv python install 3.13`, `uv sync --python 3.13` \| \| `.github/workflows/tests.yml` \| `uv sync --python 3.13` on both matrix legs \| \| `CLAUDE.md` \| dev setup command + requirements note updated \| \| `deepdoc/parser/mineru_parser.py` \| `from strenum import StrEnum` → `from enum import StrEnum` \| \| `agent/tools/code_exec.py` \| same \| `StrEnum` has been in the stdlib since Python 3.11 — the `strenum` backport package is no longer needed once the floor is 3.13. ## Why uv.lock is not regenerated `uv lock --python 3.13` fails because: 1. The infiniflow/graspologic fork pins `numpy>=1.26.4,<2.0.0` 2. `tensorflow-cpu>=2.20.0` (the first release with cp313 wheels) depends on `ml-dtypes>=0.5.1`, which requires `numpy>=2.1.0` 3. These two constraints are irreconcilable on Python 3.13 The lockfile regeneration requires loosening the `numpy` upper bound in the `infiniflow/graspologic` fork. Once that fork commit is updated and the SHA in `pyproject.toml:49` is bumped, `uv lock --python 3.13` will succeed. ## RFC corrections Two claims in the original RFC (#14753) did not hold up under code review: - "graspologic hard-blocks 3.13" — the infiniflow fork at the pinned commit has no `<3.13` Python constraint. The blocker is the transitive `numpy<2.0.0` conflict with tensorflow-cpu's test dependency, not a direct Python version cap. - "free-threading throughput gains for I/O-bound workload" — Python 3.13 free-threading requires a special `--disable-gil` build and provides no benefit for async I/O code (the GIL is already released during I/O). The real motivation is forward compatibility and improved error messages.	2026-05-15 14:40:53 +08:00
Ricardo-M-L	cb606e1c38	fix: correct attribute name typo model_speciess to model_species (#13929 ) ## Summary - Rename misspelled attribute `model_speciess` to `model_species` across 4 files - The extra `s` is a typo — `species` is already plural ## Test plan - [ ] Verify PDF parsing with laws/manual/paper parser types still works correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuj <yuj@ztjzsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-15 14:19:41 +08:00
Octopus	eaa5d9921b	fix: enable GitHub connector to sync PRs and issues by default (#14062 ) Fixes #13975 ## Problem The GitHub data source connector had both `include_pull_requests` and `include_issues` defaulting to `false` in both the frontend form and the backend sync code. This meant that with the default configuration, no content was synced at all from a GitHub repository — silently producing zero results. Additionally, the form field labels contained a typo: "Inlcude" instead of "Include". ## Solution - Changed `include_pull_requests` default from `false` to `true` in the frontend form fields and default values - Changed `include_issues` default from `false` to `true` in the frontend form fields and default values - Changed both backend defaults in `sync_data_source.py` from `False` to `True` - Fixed label typos: "Inlcude Pull Requests" → "Include Pull Requests" and "Inlcude Issues" → "Include Issues" This makes the GitHub connector consistent with the GitLab connector, which already defaults `include_mrs`, `include_issues`, and `include_code_files` all to `true`. ## Testing - The connector now syncs both pull requests and issues by default when a new GitHub data source is created - Users who want to exclude PRs or issues can uncheck the corresponding checkboxes in the form Co-authored-by: octo-patch <octo-patch@github.com>	2026-05-15 13:26:31 +08:00
sham-sr	ef2969a462	fix(llm): Tongyi-Qianwen embeddings use correct DashScope native API for intl URLs (#14784 ) ## Summary - Fixes Tongyi-Qianwen (`QWenEmbed`) text embeddings when the configured `base_url` points at DashScope international (`dashscope-intl.aliyuncs.com`) or China (`dashscope.aliyuncs.com`) hosts, including values copied from Model Studio that use the OpenAI-compatible path (`.../compatible-mode/v1`). - The `dashscope` Python SDK (`TextEmbedding.call`) expects the native HTTP root (`https://<host>/api/v1`), not the OpenAI-compatible base URL. Without mapping, international accounts could hit the wrong host or path. ## Implementation - Added `_dashscope_native_http_api_url()` to normalize known DashScope hosts to `.../api/v1`, and wired `QWenEmbed` to set `dashscope.base_http_api_url` before each embedding call (document and query). ## Notes - In-code comments document the Tongyi-Qianwen / DashScope intl vs CN behavior for future maintainers. --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-15 10:07:48 +08:00
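
The URL normalization described in #14784 can be sketched as below. This is a hypothetical reimplementation: `native_api_url` is not the real `_dashscope_native_http_api_url`, and returning `None` for unrecognised hosts is an assumption about how unknown values are treated.

```python
from typing import Optional
from urllib.parse import urlsplit

# The two hosts named in the commit message.
_KNOWN_HOSTS = {"dashscope.aliyuncs.com", "dashscope-intl.aliyuncs.com"}


def native_api_url(base_url: str) -> Optional[str]:
    """Map a configured base URL to the native `https://<host>/api/v1` root."""
    host = urlsplit(base_url.strip()).hostname
    if host in _KNOWN_HOSTS:
        # Any path, including `/compatible-mode/v1`, is replaced.
        return f"https://{host}/api/v1"
    return None  # unknown host: leave the SDK default alone (assumed)
```

Matching on the parsed hostname rather than on a substring avoids rewriting look-alike URLs that merely contain a DashScope host name in their path.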
07heco	8dc5b1b42d	fix: optimize reranking module robustness and bug fixes (#14264 ) ## Description This PR fixes critical bugs and improves the robustness of the RAG reranking module while maintaining 100% backward compatibility with all existing functionality and providers. ## Key Changes 1. Network Stability: Added 30s timeout to all API requests to prevent service blocking 2. Boundary Protection: Added empty query/text validation for all rerank models 3. Response Fault Tolerance: Replaced hardcoded key access with `.get()` to avoid KeyError crashes 4. Bug Fixes: - Fixed `Ai302Rerank` (completely non-functional before) - Fixed `GPUStackRerank` incorrect exception catching - Fixed `_normalize_rank` empty array crash 5. Code Specification: Added type annotations, standardized unimplemented class prompts ## Compatibility - ✅ No changes to any class/method names - ✅ All rerank providers (Jina/Cohere/NVIDIA/HuggingFace etc.) work as before - ✅ No breaking changes, zero impact on existing workflows ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-14 11:56:09 +08:00
Ahmad Intisar	e994051eb9	Feature/generic api connector (#13545 ) # feat: Add Generic REST API Connector ## What problem does this PR solve? RAGFlow supports many specific data source connectors (MySQL, Slack, Google Drive, etc.), but there was no way to connect an arbitrary REST API as a data source. Users with custom or third-party APIs had to write a new connector class for each one. This PR adds a generic, configuration-driven REST API connector that lets users connect any REST API as a data source entirely through the UI — no code changes needed per API. --- ## Features ### Core Connector (`common/data_source/rest_api_connector.py`) - Implements `LoadConnector` and `PollConnector` interfaces for full and incremental sync - Configurable authentication: None, API Key (custom header), Bearer Token, Basic Auth - Pluggable pagination: Page-based, Offset-based, Cursor-based, or None - Smart page-size inference from user's query parameters to avoid duplicate/conflicting params - Configurable request delay between pages to prevent API rate limiting - Auto-detection of the items array in JSON responses (`items`, `results`, `data`, `records`, or first list found) - Advanced field mapping with dot-notation (`country.name`), array wildcards (`newsType[].name`), type hints, and default values - Optional content template rendering (`"Title: {title}\nBody: {body}"`) - HTML stripping for content fields - Stable document IDs via `hash128` from a configurable ID field or auto-generated from item content - Pydantic configuration schema with automatic coercion of UI string inputs to dicts/lists ### Backend Registration (`rag/svr/sync_data_source.py`, `common/constants.py`, `common/data_source/config.py`) - `REST_API` sync class wired into RAGFlow's `func_factory` - Full sync (`load_from_state`) and incremental polling (`poll_source`) support - Credentials and config passed from task to connector following existing patterns (MySQL, SeaFile, etc.) 
### Test Connection Endpoint (`api/apps/connector_app.py`) - `POST /v1/connector/<id>/test` validates config schema, authentication, and API connectivity without triggering a sync - Clear error messages for auth failures vs. config issues ### Frontend UI (`web/src/pages/user-setting/data-source/constant/`) - Postman-style configuration: Base URL, Query Parameters (key=value per line), Auth, Content Fields, Metadata Fields, Pagination Type - Auth-type-aware form: fields for API key header/value, Bearer token, or Basic username/password appear only when relevant - Advanced Settings toggle for: Custom Headers, Max Pages, Request Delay, Poll Timestamp Field, Request Body (POST) - Connector icon (SVG) and i18n strings (English) - "Test Connection" button to validate before syncing --- ## Controls & Safety - Configurable max pages safety cap (default: 1000, adjustable in UI) - Configurable request delay between pages (default: 0.5s, adjustable in UI) - Auth errors (401/403) fail immediately without retries; transient errors retry with exponential backoff - Diagnostic logging: auth setup confirmation, request details on failure, content field extraction status --- ## Type of change - [x] New Feature (non-breaking change which adds functionality) ## Visual Screenshots of Features <img width="482" height="510" alt="Screenshot 2026-03-11 at 5 19 52 PM" src="https://github.com/user-attachments/assets/dcb7ab4a-1622-44f3-bb02-d6f0527314c4" /> (Connector can be configured within the external data sources tab) Configuration Parameters: <img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 46 PM" src="https://github.com/user-attachments/assets/5e154e71-4ab5-4872-bfb2-04f02b73c18a" /> <img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 54 PM" src="https://github.com/user-attachments/assets/00cb14b7-0bcf-4b94-9d71-34e93369ecb2" /> Connection can be tested before attaching to dataset: <img width="981" height="681" alt="Screenshot 2026-03-11 at 5 21 40 PM" 
src="https://github.com/user-attachments/assets/aaa6eeeb-89a7-4349-bc34-2423bf8be9ee" /> Ingestion tested with API connector (works perfectly fine): <img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 22 30 PM" src="https://github.com/user-attachments/assets/afcd0d58-cadd-4152-badc-d2f14d96fbec" /> Search & Retrieval works as well with metadata flow: <img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 23 05 PM" src="https://github.com/user-attachments/assets/d41ee935-dcf7-4456-b317-22a76ca032c0" /> --------- Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-13 20:35:01 +08:00
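The dot-notation and array-wildcard field mapping mentioned in the connector commit above (`country.name`, `newsType[].name`) could work roughly as sketched below. The function name `extract` and its exact semantics (a list result once a `[]` segment is crossed, a default on a missing key) are assumptions for illustration, not the connector's actual implementation.

```python
from typing import Any


def extract(item: Any, path: str, default: Any = None) -> Any:
    """Hypothetical resolver for paths like 'country.name' or 'newsType[].name'."""
    current = [item]        # nodes matched so far
    crossed_list = False    # True once a '[]' wildcard segment has been seen
    for part in path.split("."):
        wildcard = part.endswith("[]")
        key = part[:-2] if wildcard else part
        if wildcard:
            crossed_list = True
        matched = []
        for node in current:
            if not isinstance(node, dict) or key not in node:
                continue  # missing key: this branch contributes nothing
            value = node[key]
            if wildcard:
                # Fan out over every element of the array.
                if isinstance(value, list):
                    matched.extend(value)
            else:
                matched.append(value)
        current = matched
    if not current:
        return default
    # Wildcard paths yield a list of values; plain paths yield a single value.
    return current if crossed_list else current[0]
```

For an item such as `{"country": {"name": "FR"}, "newsType": [{"name": "a"}, {"name": "b"}]}`, the path `country.name` resolves to a single string and `newsType[].name` to a list of strings.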
Idriss Sbaaoui	09e1fd290a	Chore: migrate tests to restful api (#14871 ) ### What problem does this PR solve? Add a new test suite for the new RESTful API endpoints, meant to replace the HTTP and web API tests. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test	2026-05-13 15:07:23 +08:00
shawnxiao105-afk	8b6dd6a5c2	fix: guard whitespace-only chunks before embedding (#13938 ) ## Problem When parsing DOCX files with many tables, DeepDOC generates chunks containing only empty HTML table tags, such as: ```html <table><tr><td></td></tr><tr><td></td></tr><tr><td></td></tr><tr><td></td></tr></table> ``` After the regex cleanup at `task_executor.py:584`, this becomes `" "` (whitespace only). The guard at line 585 (`if not c`) only catches empty strings `""`, but whitespace strings are truthy in Python and pass through. When sent to the Zhipu `embedding-3` API, they are rejected with error 1213: `未正常接收到prompt参数` ("the prompt parameter was not received correctly"). ## Root Cause ```python c = re.sub(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>", " ", c) if not c: # ← only catches "", not " " / "\n" / "\t" c = "None" ``` Verified with Zhipu `embedding-3`: | Input | Result | |---|---| | `""` | error 1213 | | `" "` | error 1213 | | `"\n"` | error 1213 | | `"None"` | OK | ## Fix ```diff - if not c: + if not c.strip(): c = "None" ``` ## Testing Reproduced with a 678KB DOCX file (166 tables, 270 chunks). Chunk #89 is the empty table above. After the fix, `"None"` is sent instead and embedding succeeds. --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-13 11:47:50 +08:00
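The whitespace guard from the commit above can be reproduced in isolation. The sketch below wraps the regex cleanup and the `strip()` check in a hypothetical function, `prepare_for_embedding`. The pattern is the one quoted in the commit message (with its pipes unescaped), and the `"None"` placeholder follows the fix shown there. The wrapper and its name are assumptions, not code from `task_executor.py`.

```python
import re

# Pattern quoted in the commit message: matches opening/closing table-related tags.
TABLE_TAG = re.compile(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>")


def prepare_for_embedding(chunk: str) -> str:
    """Hypothetical wrapper: strip table tags, then guard against blank input."""
    cleaned = TABLE_TAG.sub(" ", chunk)
    # `not cleaned` would miss " ", "\n" and "\t", which are truthy in Python;
    # stripping first catches whitespace-only chunks as well as empty ones.
    if not cleaned.strip():
        return "None"
    return cleaned
```

An empty table such as `<table><tr><td></td></tr></table>` reduces to spaces after the substitution, so it is replaced by the `"None"` placeholder, while a chunk with real cell text passes through with its tags replaced by spaces.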
Paul Yao	c34c81e8e6	fix: remove duplicate .wav and .aac in audio supported extensions list (#14791 ) What problem does this PR solve? In rag/app/audio.py, the supported audio extensions list contains duplicate entries: .wav appears twice (positions 3 and 5) and .aac appears twice (positions 6 and 14). While this does not affect runtime behavior, it is redundant and makes the code harder to maintain. This PR removes the duplicate entries to keep the list clean and consistent. Type of change - [X] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 09:42:31 +08:00
Wang Qi	4374e07a29	Speed up start time (#14833 ) ### What problem does this PR solve? Speed up start time ### Type of change - [x] Refactoring	2026-05-12 17:00:45 +08:00
CaptainTimon	2717ee283f	feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679 ) ### What problem does this PR solve? Closes #14674. This PR improves RAPTOR configuration and tree construction while preserving the existing RAPTOR behavior as the default. RAPTOR currently builds summary layers with the original UMAP + GMM clustering path. This PR keeps that default path, and adds: - A hidden backend tree-builder option: - `tree_builder="raptor"`: default, existing RAPTOR behavior. - `tree_builder="psi"`: rank-aware Psi-style tree builder using original embedding-space cosine ranking. - A user-facing clustering method option for the default RAPTOR builder: - `clustering_method="gmm"`: existing default. - `clustering_method="ahc"`: agglomerative hierarchical clustering path. - A RAPTOR UI setting for `Clustering method` and `Max cluster`. ### What changed #### Backend - Added `tree_builder` support for RAPTOR/Psi. - Added `clustering_method` support for GMM/AHC. - Kept existing RAPTOR + GMM as the default. - Added Psi tree building from original-space cosine similarity. - Added bucketed Psi building controls for large inputs: - `raptor.ext.psi_exact_max_leaves` - `raptor.ext.psi_bucket_size` - Added method-aware RAPTOR summary metadata using existing `extra.raptor_method`. - Avoided adding a dedicated DB schema field for experimental method tracking. - Added cleanup/migration logic to avoid mixing stale RAPTOR summary trees. - Added defensive checks for Psi tree construction and summary failures. #### Frontend/UI - Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`. - Added/kept `Max cluster` in RAPTOR settings. - Enlarged max cluster UI limit to `1024`, matching backend validation. - Kept AHC editable even when a RAPTOR task has already finished. 
- Fixed the UI save payload so `clustering_method` and `tree_builder` are serialized through `parser_config.raptor.ext`, avoiding backend validation errors for extra top-level RAPTOR fields. Example saved RAPTOR config: ```json { "raptor": { "max_cluster": 317, "ext": { "clustering_method": "ahc", "tree_builder": "raptor" } } } ``` Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>	2026-05-12 09:42:31 +08:00

1494 Commits