ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-02 00:35:46 +08:00

Author	SHA1	Message	Date
Yufeng He	5db1b296fb	fix: fall back from empty Docling native chunks (#15601 ) ## Summary - keep the native Docling chunking path when it returns usable chunks - fall back to the standard Docling response parser when a chunked request gets HTTP 200 but returns no usable chunks - add a regression test for older Docling servers that accept the chunking request but return a standard conversion payload ## Why Older external Docling servers can accept a request containing `do_chunking: true` and still return the standard conversion response shape. The current code treats any HTTP 200 from the chunked request as a native chunk response, finds no chunk entries, and returns zero sections without trying the standard response parser. Fixes #15569. ## Validation - `python -m pytest test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py -q` - `python -m py_compile deepdoc\\parser\\docling_parser.py test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py` - `python -m ruff check deepdoc\\parser\\docling_parser.py test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py` - `git diff --check`	2026-06-04 13:42:58 +08:00
bitloi	01a5598aa5	Fix: markdown fenced code block extraction (#15630 ) ### What problem does this PR solve? Markdown extraction currently applies custom delimiters before respecting fenced code blocks. When a delimiter such as a newline is configured, fenced code can be split into separate chunks, and longer outer fences can be closed incorrectly by shorter nested fences. This PR keeps the fix intentionally narrow for the Markdown chunking discussion in #15482: - preserve fenced code blocks when delimiter-based extraction is used - support both backtick and tilde fences - respect fence length so longer outer fences can contain shorter inner fences - keep delimiter splitting unchanged outside fenced blocks Refs #15482 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Testing - `ruff check deepdoc/parser/markdown_parser.py test/unit_test/deepdoc/parser/test_markdown_parser.py` - `python3 run_tests.py -t test/unit_test/deepdoc/parser/test_markdown_parser.py`	2026-06-04 13:33:46 +08:00
buua436	c70f19e138	Fix: remove duplicate document preview access check (#15625 ) ### What problem does this PR solve? remove duplicate document preview access check ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 13:05:15 +08:00
Lynn	597ac1e900	Fix: search bot and verify model instance (#15588 ) ### What problem does this PR solve? Fix: - Verify provider with empty llm list in llm_factories.json - Set search bot's chat_llm_name, use tenant default chat model as default ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 11:59:55 +08:00
dripsmvcp	2196f2260a	fix(api): restore DocumentService.accessible check on /preview (#15508 ) ## Summary Restore the `DocumentService.accessible(doc_id, current_user.id)` check that PR #15146 dropped from the REST document preview handler. Any authenticated caller could download any tenant's document bytes by guessing/knowing the `doc_id`. ## Root cause `api/apps/restful_apis/document_api.py` — the `GET /documents/<doc_id>/preview` handler called `DocumentService.get_by_id` and went straight to `File2DocumentService.get_storage_address` + `STORAGE_IMPL.get`, with no tenant check between the lookup and the read. The handler's docstring even promises "user must belong to the tenant that owns the document's knowledge base" — the code didn't enforce it. ## Fix - Add `current_user` to the existing `api.apps` import. - Immediately after `get_by_id`, call `DocumentService.accessible(doc_id, current_user.id)`; on denial, return the same `get_data_error_result(message="Document not found!")` shape used for the missing-doc branch. That makes a cross-tenant probe indistinguishable from a missing-doc probe, preventing ID enumeration (the issue body calls this out explicitly). - Emit `logging.warning` with caller user + doc_id for audit. - Restores symmetry with peer routes that already call `accessible(doc_id, user_id)` (e.g. `_run_sync` at `document_api.py:1380`). ## Test plan Adds `test/unit_test/api/apps/restful_apis/test_document_preview_accessible.py`: - `test_cross_tenant_preview_is_denied` — owner tenant ≠ caller tenant; asserts the response shape is `Document not found!` and the storage backend (`thread_pool_exec(STORAGE_IMPL.get, ...)`) is never invoked. - `test_missing_doc_returns_not_found` — missing-doc behaviour unchanged. Stub-loader pattern mirrors `test/unit_test/api/apps/sdk/test_dify_retrieval.py` (added in #15028, passing in CI). ## Provenance — how this fix was produced This PR was authored against a small cited knowledge base committed in the working tree as a `.vouch/` (see [vouchdev/vouch](https://github.com/vouchdev/vouch)). The loop used here: 1. Grounding first. Before reading the handler, queried the KB for prior context: `vouch context "tenant scoped accessible authorization"` → retrieved a cited claim distilled from PR #15028 (which restored the same `accessible()` check on `/dify/retrieval`). The retrieved rule: > ragflow REST endpoints that load by tenant-scoped id must call `<Service>.accessible(id, tenant_id)` after `get_by_id` and before storage/DB read; deny with code 109 'No authorization.' and log a warning. Established by PR #15028. 2. Applied the pattern with a domain refinement. For an API/JSON endpoint, `No authorization.` is the right denial shape. For a byte-streaming, browser-facing endpoint like `/preview`, leaking existence itself enables enumeration — so per the issue's expected behaviour, this PR denies with `Document not found!` (indistinguishable from missing) instead. Same auth check, narrower response. 3. Recorded the refinement back into the KB as a new cited claim, so the next IDOR-class issue starts already grounded in both the general pattern and the byte-route nuance. Net effect of the workflow: the fix replicates a known-good pattern instead of reinventing it, and the place where the pattern was nuanced is now retrievable for the next pass. Mechanism is fully independent of this PR — it's not a runtime dependency, just process discipline. Closes #15501	2026-06-04 09:58:26 +08:00
Wang Qi	b946df8ba2	Fix: consolidate beta auth (#15581 ) Fix: consolidate beta auth	2026-06-03 19:58:06 +08:00
bohdansolovie	ae316b3415	fix(api): guard document rename when linked file row is missing (#15536 ) ## Summary Fixes #15534 — `update_document_name_only()` crashes with `AttributeError` when `File2Document` exists but the linked `File` row was deleted. `update_document_name_only()` in `document_api_service.py` called `FileService.get_by_id()` when a `File2Document` row existed, then accessed `file.id` without checking the lookup result. An orphan `File2Document` link (file deleted, mapping left behind) caused document rename via `PATCH /api/v1/datasets/{dataset_id}/documents/{document_id}` to return HTTP 500. This PR mirrors guards used in `file2document_api.py` and `file_api_service.py`: skip the optional file rename when the file is missing, and still update the document record and search index. ## Changes - `api/apps/services/document_api_service.py` — check `e and file` before `FileService.update_by_id` - `test/unit_test/api/apps/services/test_update_document_name_only.py` — regression tests (orphan link + happy path) ## Test plan - [x] `pytest test/unit_test/api/apps/services/test_update_document_name_only.py -v` - [ ] Manual: PATCH document `name` when `File2Document` points to a non-existent `file_id` → 200, document/index renamed, no 500	2026-06-03 17:57:19 +08:00
Jack	b363146997	refactor: overhaul task executor with layered architecture and comprehensive test suite (#15471 ) ## Summary Decomposes the monolithic `task_executor.py` (1945 lines) into a 6-layer architecture with clear separation of concerns. The refactored code is functionally equivalent to the original, verified through 400 passing tests and a production-vs-dry-run comparison framework. ## Architecture ``` entry (task_manager) └─ orchestration (task_handler) ├─ services (chunk_service, embedding_service, dataflow_service, raptor_service, post_processor) │ └─ utilities (chunk_builder, chunk_post_processor, embedding_utils) └─ infrastructure (task_context, recording_context, interceptor) ``` Key design decisions: - TaskContext — typed facade over raw task dict, injects rate limiters + callbacks via composition - RecordingContext + Comparator — enables side-by-side production vs dry-run execution for safe migration - NullRecordingContext — zero-allocation no-op for production, uses `__slots__` - WriteOperationInterceptor — FIFO replay of previous runs function returns for comparison mode ## Migration Strategy The original `handle_task()` in `task_executor.py` uses a 3-way switch via `TE_RUN_MODE`: - `TE_RUN_MODE=0` (default) → runs refactored code - `TE_RUN_MODE=1` → runs both original + refactored, compares all intermediate results - `TE_RUN_MODE=2` → runs original code (fallback) The comparison mode (`TE_RUN_MODE=1`) records ~40 intermediate values (chunks, vectors, token counts, func return values) from the production run and replays them during dry-run, then uses `ContextComparator` to report mismatches. ## Functional Equivalence Fixes All divergences between original and refactored code were identified and fixed: - Timeout decorators (handle/build_chunks/raptor/embedding) - NullRecordingContext leak in finally block causing RuntimeError - MinIO None-binary check with proper FileNotFoundError - Dataflow dispatch after embedding binding + init_kb - Memory task missing return after processing - RAPTOR checkpoint progress reporting - Tag cache (get_tags_from_cache/set_tags_to_cache) restoration - dataflow_id correction in _load_dsl - Language default Chinese, dead code guard removal - embed_chunks made async with proper thread_pool_exec - Full GraphRAG default configuration (10 parameters) - Hardcoded q_768_vec fallback removal in RAPTOR ## Test Changes - 20 new tests covering table parser manual mode, tag cache, embedding edge cases, RAPTOR checkpoint, dataflow_id correction, storage binary None, cancel cleanup, metadata=None boundary - Unified `make_task_context`/`make_task_dict` factories eliminated 10+ duplicated helpers - DataflowService tests migrated from internal method mocks to IO boundary mocks (real orchestration code executes) - Parametrized duplicate build_chunks post-processor tests - 7 raptor tests modernized to @pytest.mark.asyncio - Mock count per test reduced through boundary-level mocking strategy Test count: 400 passing, 0 warnings, 0 skips ## Files Changed \| File \| Change \| \|------\|--------\| \| `rag/svr/task_executor.py` \| +1 line (NullRecordingContext fix) \| \| `rag/svr/task_executor_refactor/task_handler.py` \| Orchestration layer, 8 logic fixes \| \| `rag/svr/task_executor_refactor/chunk_service.py` \| +timeout + None-check \| \| `rag/svr/task_executor_refactor/embedding_service.py` \| sync→async rewrite \| \| `rag/svr/task_executor_refactor/dataflow_service.py` \| dataflow_id fix + timeout \| \| `rag/svr/task_executor_refactor/raptor_service.py` \| checkpoint fix + assert \| \| `rag/svr/task_executor_refactor/chunk_post_processor.py` \| tag cache restore \| \| `rag/svr/task_executor_refactor/task_context.py` \| language default fix \| \| `test/.../conftest.py` \| +294 lines shared helpers \| \| `test/.../*.py` \| 15 test files refactored, 20 new tests \| --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 17:18:31 +08:00
Wang Qi	d6fc50a469	Fix: no more @token_required (#15562 ) Fix: no more @token_required	2026-06-03 16:24:08 +08:00
Jin Hai	e1f19f6679	Go: fix gitee balance api (#15554 ) ``` RAGFlow(user)> create provider 'gitee' instance 'intl' key 'api-token' url 'https://ai.gitee.com/v1' region 'intl'; SUCCESS ``` --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-06-03 13:23:20 +08:00
bitloi	a75ea7ba7c	Fix: Chat completion generation parameter overrides (#15389 ) ### What problem does this PR solve? Closes #15388. Chat completion routes did not reliably honor per-request generation settings: - `/api/v1/chat/completions` copied generation settings with a truthiness check, so valid zero values such as `temperature: 0`, `top_p: 0`, `frequency_penalty: 0`, `presence_penalty: 0`, and `max_tokens: 0` were dropped. - `/api/v1/openai/{chat_id}/chat/completions` did not forward standard generation settings into the request-specific dialog LLM settings before calling `async_chat`. This PR preserves explicitly supplied generation parameters, including zero values, and merges request-level overrides into existing dialog settings where appropriate. The supported generation parameter keys and merge behavior live in a shared REST API helper to keep both completion routes aligned. Validation: - `git diff --check` - `python3 -m py_compile api/apps/restful_apis/_generation_params.py api/apps/restful_apis/chat_api.py api/apps/restful_apis/openai_api.py test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py` - `uv run ruff check api/apps/restful_apis/_generation_params.py api/apps/restful_apis/chat_api.py api/apps/restful_apis/openai_api.py test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py` - `ZHIPU_AI_API_KEY=dummy uv run pytest test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py -q -k generation_params` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-03 11:46:10 +08:00
kpdev	76968af0ba	Guard missing storage blobs on preview and image endpoints (#15366 ) Fixes [#15365](https://github.com/infiniflow/ragflow/issues/15365) — `get_document_image()` and document preview call `make_response(None)` when storage returns no bytes, causing HTTP 500.	2026-06-03 11:33:03 +08:00
nickmopen	5b02fe4841	fix(api): stop duplicating answer in openai-compatible chat completions stream (#15286 ) (#15443 ) ### What problem does this PR solve? Fixes #15286. When calling `/api/v1/openai/<chat_id>/chat/completions` with `"stream": true`, the response contains the answer twice — the final message repeats everything that was already streamed. #### Root cause RAGFlow's `async_chat` streams the body as incremental `delta.content` chunks, then emits a terminating `final` event whose `answer` is the complete (decorated) message. The handler re-emitted that full answer as one more `delta.content` chunk: ```python if ans.get("final"): if ans.get("answer"): full_content = ans["answer"] response["choices"][0]["delta"]["content"] = full_content # <-- whole answer again yield ... ``` So a client accumulating `delta.content` ends up with the message duplicated. #### Fix Drop the re-emission. The complete answer from the `final` event is now surfaced only through the trailing chunk's `final_content` and `reference` fields, which matches OpenAI streaming semantics: deltas are incremental, and the final chunk carries only `finish_reason` / `usage` (plus RAGFlow's `reference` / `final_content` extensions). This matches the expected behavior described in the issue: "The stream should only yield content chunks once, and the final message should only contain reference, usage, and finish_reason." #### Testability refactor The streaming SSE assembly was a closure inside the request handler, so it could only be exercised against a live server + real LLM. I extracted it into a module-level `_stream_chat_completion_sse` async generator (behavior-preserving) so it can be unit-tested with a fake event stream. #### Tests Adds `test/unit_test/api/apps/restful_apis/test_openai_stream_no_duplicate.py` (same import-stub pattern as the existing `test_get_agent_session.py`): - body is streamed exactly once (the regression); - the complete answer is never re-emitted as a content chunk; - the terminating chunk has `finish_reason="stop"`, `content=None`, and correct `usage`; - `final_content` / `reference` are present on the trailing chunk; - reasoning (`think`) deltas stream separately and are not duplicated. > Note: this is unrelated to #15442, which only changes the `stream` default — it does not touch the duplication logic. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Added test cases --------- Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-02 13:20:40 +08:00
kpdev	0f6f7b3c3c	fix(api): document image_id parsing for hyphenated thumbnail keys (#15115 ) (#15116 ) ### What problem does this PR solve? Fixes #15115. `GET /api/v1/documents/images/<image_id>` returned Image not found when the thumbnail storage object key contained hyphens (e.g. `page-1.png`). Document APIs build URLs as `{dataset_id}-{thumbnail}`, but `get_document_image()` used `image_id.split("-")` and required exactly two segments, so keys like `<kb_id>-page-1.png` were rejected even though the blob existed. This PR splits only on the first hyphen (`split("-", 1)`) and sets `Content-Type` from the object key extension via `CONTENT_TYPE_MAP` instead of hardcoding `image/JPEG`.	2026-06-02 10:54:14 +08:00
kpdev	a4bc066f74	fix(rag): id2image parsing for hyphenated storage object keys (#15117 ) (#15118 ) ### What problem does this PR solve? Fixes #15117. Chunk images are stored with `img_id = f"{bucket}-{objname}"` in `image2id()` (`rag/utils/base64_image.py`). When loading via `id2image()`, the code used `image_id.split("-")` and required exactly two segments. Object keys that contain hyphens (e.g. `page-1.jpg`) produce more than two segments, so `id2image` returns `None` and chunk image previews fail even though the blob exists. This is the same parsing issue as #15115 (HTTP thumbnail route); this PR fixes the indexing/retrieval path. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Test plan - [x] `pytest test/unit_test/rag/utils/test_base64_image.py` - [ ] Manual: index a chunk with an `objname` containing hyphens and confirm `img_id` resolves to an image in retrieval Fixes #15117.	2026-06-02 10:52:51 +08:00
Hernandez Avelino	09d0a17453	fix(api): handle array message content on OpenAI chat completions (#15359 ) ### Related issues Closes #15358 <!-- After filing upstream, replace XXXX with your issue number. --> --- ### What problem does this PR solve? `POST /api/v1/openai/<chat_id>/chat/completions` forwards `messages` to `async_chat` without normalizing `content`. Downstream, `dialog_service` assumes string content: ```python re.sub(r"##\d+\$\$", "", m["content"]) ``` OpenAI-compatible clients may send `content` as an array of parts (text, `image_url`, etc.), including text-only arrays. That causes `TypeError` and HTTP 500 instead of a valid response or a clear 400. `openai_api.py` also reads `messages[-1]["content"]` directly for `prompt` without handling list-shaped content. This PR normalizes array `content` to a string (concatenating `type: text` parts) before calling `async_chat`, matching a minimal OpenAI-compat path. Image parts can be documented as unsupported or handled in a follow-up if vision integration is required.	2026-06-02 10:27:03 +08:00
Rene Arredondo	e1403171f1	fix(chat): sanitize NaN/Inf scores before serializing chat completions (#15245 ) (#15266 ) ## Summary Fixes #15245 — `POST /api/v1/chat/completions` with `stream=true` intermittently returns 500: ``` data:{"code": 500, "message": "failed to encode response: json: unsupported value: NaN (status code: 500)", "data": {...}} ``` …even though "the same question" works on retry. ## Root cause The streaming path serialized the answer with bare `json.dumps(...)` (`api/apps/restful_apis/chat_api.py:1221`). `json.dumps` defaults to `allow_nan=True` and emits the literal token `NaN` for NaN / Infinity float values. That is valid Python-flavored JSON but invalid per RFC 8259, so downstream consumers reject it. The reporter's gateway is Go-based and the error wording (`failed to encode response: json: unsupported value: NaN`) is straight from Go's `encoding/json`. How NaN gets into the payload: retrieval scoring in `rag/nlp/search.py` runs `np.mean(...)` over aggregations that can be empty, and similarity denominators can be zero. Reference chunk fields like `similarity`, `vector_similarity`, `term_similarity` can therefore be NaN depending on which chunks a given query retrieves — which is exactly why the failure is intermittent for the same question. The non-streaming branch (`get_json_result(data=answer)`, `chat_api.py:1243`) has the same vulnerability — Quart's `jsonify` also defaults to `allow_nan=True` and the same retrieval pipeline feeds both branches. `agent/tools/exesql.py:88-102` already has the same NaN/Inf guard for SQL results. This PR brings the chat completions path up to parity. ## Fix Add a small `_sanitize_json_floats(obj)` helper near the top of `api/apps/restful_apis/chat_api.py`. It walks `dict` / `list` / `tuple` and replaces any `float` that is `NaN` or `±Infinity` with `None`. Apply it at the two serialization boundaries: - Streaming branch (`stream()`): sanitize the SSE payload before `json.dumps`. - Non-streaming branch: sanitize the `answer` dict before `get_json_result(data=...)`. The terminal `data:True` frame and the `code:500` error frame carry no scores and are left untouched. Added `import math` to the existing alphabetical import block. No change to retrieval logic — replacing NaN with `null` at the serialization boundary is conservative: clients still parse the JSON, a missing-score chunk is a strictly better failure mode than a 500 that kills the whole reply. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-02 10:08:34 +08:00
nickmopen	bebf6ed244	fix(llm): strip non-generation keys from gen_conf for LiteLLM providers (#15427 ) (#15432 ) ### What problem does this PR solve? Fixes #15427. All LiteLLM-routed chats fail with: - Anthropic: `litellm.BadRequestError: AnthropicException - {"type":"invalid_request_error","message":"model_type: Extra inputs are not permitted"}` - OpenAI: `litellm.BadRequestError: OpenAIException - Unknown parameter: 'model_type'` This is a regression from v0.25.4. #### Root cause A chat assistant's `llm_setting` is forwarded to the model as `gen_conf`. `llm_setting` can legitimately carry RAGFlow-internal metadata such as `model_type` (the chat REST APIs in `api/apps/restful_apis/` read it back out of `llm_setting`), so that key ends up inside `gen_conf`. `Base._clean_conf` (OpenAI-compatible providers) already whitelists the keys it forwards, so direct-OpenAI providers were unaffected. `LiteLLMBase._clean_conf` only dropped `max_tokens` and passed everything else straight through to `litellm.acompletion`, which forwarded `model_type` to the upstream provider — and Anthropic / OpenAI reject it. Because both Claude and GPT route through LiteLLM, every chat broke. #### Fix - Extract the allowed-key set into a shared `ALLOWED_GEN_CONF_KEYS` constant and reuse it in `Base._clean_conf`. - Apply the same whitelist in `LiteLLMBase._clean_conf`, plus the LiteLLM-specific reasoning params (`thinking`, `reasoning_effort`, `extra_body`) that the model-family policies inject for reasoning models. This covers all four LiteLLM completion paths (`async_chat`, `async_chat_streamly`, `async_chat_with_tools`, `async_chat_streamly_with_tools`), since they all route through `_clean_conf`. #### Tests Adds `test/unit_test/rag/llm/test_clean_conf_whitelist.py` covering both backends: `model_type` (and other stray keys) are dropped, genuine generation params and `thinking` survive, `max_tokens` is removed, and the whitelist invariants hold. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Added test cases	2026-06-02 10:04:11 +08:00
monsterDavid	d398d617ca	fix(mineru): skip page chrome blocks to prevent duplicate chunks (#15387 ) ## Summary - Skip MinerU `header`, `footer`, and `page_number` blocks when converting `content_list.json` into sections. - Ignore unsupported block types explicitly so future MinerU output types cannot re-emit the previous text block. Fixes duplicate text in General/naive chunks when parsing PDFs via MinerU (reported with repeated page headers and body text in slices). Closes #15335 ## Test plan - [x] `pytest test/unit_test/deepdoc/parser/test_mineru_parser.py -v` (4/4 passed)	2026-06-01 20:15:04 +08:00
Wang Qi	1a6df01b53	Bug fix: Enhance embeding model to give better error message (#15346 ) To resolve https://github.com/infiniflow/ragflow/issues/15343 enhance the model embedding message to give extact failure message to customer. # QWen ## Retrieval <img width="3321" height="1033" alt="image" src="https://github.com/user-attachments/assets/6b82921a-a3a7-4a33-a383-1cf316398ee2" /> ## Chat <img width="2241" height="311" alt="image" src="https://github.com/user-attachments/assets/ec311365-62d5-407a-8915-5c8d72be9716" /> # SiliconFlow ## Retrieval <img width="3321" height="1033" alt="image" src="https://github.com/user-attachments/assets/ee2cd191-a27d-4729-b53d-2fbdb4e352cd" /> ## Chat <img width="1562" height="210" alt="image" src="https://github.com/user-attachments/assets/10376a8e-a3f4-422f-bc2e-96f2a8a96448" /> # Baichuan ## Retrieval <img width="3321" height="1107" alt="image" src="https://github.com/user-attachments/assets/dcb5409d-f7fc-4804-b186-5e1ee11e09c4" /> ## Chat <img width="2241" height="311" alt="image" src="https://github.com/user-attachments/assets/ec311365-62d5-407a-8915-5c8d72be9716" /> # Zhipu zhipu is good.	2026-06-01 19:18:16 +08:00
kpdev	252cc19f93	Infer Content-Type for document image endpoint (#15368 ) ## Summary Fixes [#15367](https://github.com/infiniflow/ragflow/issues/15367) — `GET /api/v1/documents/images/<image_id>` always returned `Content-Type: image/JPEG` even for PNG/WebP chunk images and extensioned thumbnails. ## Related Issue Fixes #15367 ## Change Type - [x] Bug fix - [x] Regression tests - [ ] New feature - [ ] Refactor ## What Changed - Added `_detect_image_content_type_from_bytes()` — PNG/JPEG/GIF/WebP/BMP magic-byte detection - Added `_content_type_for_document_image()` — object-key extension via `CONTENT_TYPE_MAP`, then magic bytes, else `application/octet-stream` - `get_document_image()` — set inferred `Content-Type` instead of hardcoded `image/JPEG` - Also guards missing storage blob (`Image not found.`) to avoid `make_response(None)` (same handler; complements #15365) ## Files Changed \| File \| Change \| \|------\|--------\| \| `api/apps/restful_apis/document_api.py` \| MIME inference helpers + handler update \| \| `test/testcases/test_web_api/test_document_app/test_document_metadata.py` \| 3 unit tests \| ## Validation ```bash cd /root/gittensor/ragflow pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_content_type_from_object_extension_unit -v pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_content_type_from_magic_bytes_unit -v pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_missing_blob_unit -v ``` ## Test Plan - [x] `.png` object key → `image/png` - [x] Extensionless chunk key + PNG bytes → `image/png` (magic bytes) - [x] Missing blob → 4xx `"Image not found."` - [ ] CI green	2026-06-01 19:08:32 +08:00
kpdev	b35266e9a5	Return 4xx when file download storage blob is missing (#15371 ) ## Summary Fixes [#15369](https://github.com/infiniflow/ragflow/issues/15369) — `GET /api/v1/files/<file_id>` calls `make_response(None)` when both primary and fallback storage lookups return empty, causing HTTP 500. ## Related Issue Fixes #15369 ## Change Type - [x] Bug fix - [x] Regression tests ## What Changed - `file_api.download()` — after fallback `STORAGE_IMPL.get`, return `get_error_data_result(message="This file is empty.")` when `not blob`, matching document REST download semantics. ## Files Changed \| File \| Change \| \|------\|--------\| \| `api/apps/restful_apis/file_api.py` \| Empty-blob guard before `make_response()` \| \| `test/testcases/test_web_api/test_file_app/test_file_routes_unit.py` \| Regression test \| ## Validation ```bash cd /root/gittensor/ragflow pytest test/testcases/test_web_api/test_file_app/test_file_routes_unit.py::test_download_missing_blob_returns_error -v pytest test/testcases/test_web_api/test_file_app/test_file_routes_unit.py::test_download_falls_back_to_document_storage -v ``` ## Test Plan - [x] Both storage paths empty → `"This file is empty."` (no `make_response(None)`) - [x] Existing fallback success test still passes - [ ] CI green	2026-06-01 19:08:06 +08:00
Idriss Sbaaoui	da1ed6f0e7	Feat: add new tests and tescases for restful api suite (#15347 ) ### What problem does this PR solve? extend restful api suite ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test	2026-06-01 11:02:40 +08:00
web-dev0521	cd18cfab79	feat(connector): implement Outlook data source connector (issue #15332 ) (#15333 ) ### What problem does this PR solve? Closes #15332. RAGFlow can index Gmail and generic IMAP mailboxes but had no native connector for Outlook / Microsoft 365 mail. Organisations on Microsoft 365 had no way to bring mailbox content into a knowledge base through Microsoft Graph. This PR adds a net-new Outlook data source that: - Authenticates against Microsoft Graph with the same MSAL client-credentials flow already used by the SharePoint and Teams connectors (no new auth primitives). - Pages over `/users/{id}/mailFolders/{folder}/messages/delta` per mailbox and persists `@odata.deltaLink` values in `OutlookCheckpoint.delta_links`, so incremental syncs only fetch changed messages. - Supports two scoping modes: - Tenant-wide (default): enumerates every user in the tenant via `/users` and syncs each mailbox. Requires `User.Read.All`. - Targeted: when `user_ids` is provided (comma-separated UPNs or object IDs), only those mailboxes are synced. `User.Read.All` is not needed in this mode. - Lets the caller pick the mail folder (`inbox`, `sentitems`, `archive`, ...). Defaults to `inbox`. - Maps each message to a `Document` shaped after the Gmail connector: one `TextSection` carrying `From/To/Cc/Subject` headers + body, with HTML bodies stripped to text inline (no extra dependency). - Surfaces typed errors on the validation probe: 401 → `ConnectorMissingCredentialError`, 403 → `InsufficientPermissionsError` (with `Mail.Read` / `User.Read.All` hint), 404 on a configured mailbox → `ConnectorValidationError`, 5xx → `UnexpectedValidationError`. - Skips messages flagged `@removed` by the delta semantics and messages whose `receivedDateTime` is older than `poll_range_start`. #### Files \| File \| Change \| \|------\|--------\| \| `common/data_source/outlook_connector.py` \| New — `OutlookConnector` (`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`) + `OutlookCheckpoint` + tiny `_strip_html` helper. \| \| `common/data_source/config.py` \| `DocumentSource.OUTLOOK = "outlook"`. \| \| `common/constants.py` \| `FileSource.OUTLOOK = "outlook"`. \| \| `common/data_source/__init__.py` \| Export `OutlookConnector`. \| \| `rag/svr/sync_data_source.py` \| `Outlook(SyncBase)` with `batch_size` normalisation, CSV/list parsing of `user_ids`; registered in `func_factory`. \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `DataSourceKey.OUTLOOK`, visibility map (`syncDeletedFiles: true`), info entry, form fields (tenant_id, client_id, client_secret, folder, user_ids, batch_size), default values. \| \| `web/src/locales/en.ts`, `web/src/locales/zh.ts` \| `outlookDescription` + 5 tooltip keys (EN + ZH). \| \| `test/unit_test/data_source/test_outlook_connector_unit.py` \| New — 19 unit tests (`p1`/`p2`/`p3`) covering auth, validation (tenant-wide vs specific user vs error paths), checkpoint helpers, user enumeration pagination, message filtering, HTML body stripping. \| #### Required Azure AD permissions - `Mail.Read` (Application, admin-granted) — always. - `User.Read.All` (Application, admin-granted) — only when `user_ids` is left blank so the connector can enumerate mailboxes. #### Out of scope - Attachment indexing. The current connector emits message body + headers; binary attachments are flagged via `metadata.has_attachments` but not pulled. Adding attachment hydration is straightforward but scoped out per the issue's "decide whether attachments are indexed in the first version" note. - Delegated (per-user) OAuth. The connector uses app-only credentials, consistent with the SharePoint / Teams precedent in this codebase. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-29 21:52:29 +08:00
galuis116	d1f6594618	Fix: JWT algorithm-confusion in OIDC ID token verification (#15181 ) ### What problem does this PR solve? Closes #15180. `OIDCClient.parse_id_token` in `api/apps/auth/oidc.py` read the JWT signing algorithm from the unverified JWT header and passed it through to `jwt.decode(..., algorithms=[alg], ...)` as the trust anchor. This is the textbook JWT algorithm-confusion vulnerability (CWE-345 / CWE-347). Any unauthenticated client capable of reaching the OIDC callback could take over an arbitrary account on any RAGFlow deployment with OIDC login enabled: 1. `alg: "none"` — present a JWT with `{"alg": "none"}` and no signature segment → `jwt.decode(..., algorithms=["none"])` → PyJWT's `NoneAlgorithm` accepts the token without verification → login as any user. 2. RSA / HMAC confusion — fetch the public RSA key from the provider's JWKS (it's public), forge a JWT with `{"alg": "HS256"}` HMAC-signed using the public-key bytes as the secret → `jwt.decode(..., algorithms=["HS256"], key=public_key)` → verifier accepts → login as any user. (Modern PyJWT independently refuses to use a PEM-formatted key as an HMAC secret, which mitigates this leg for PEM key formats; the fix here is the only mitigation for raw / DER / JWK octet keys and for older PyJWT versions.) ### What changed `api/apps/auth/oidc.py`: - New module constants `_ALLOWED_OIDC_SIGNING_ALGS` (asymmetric-only: `RS`, `ES`, `PS`, `EdDSA` — explicitly excludes `none` and `HS`) and `_DEFAULT_OIDC_SIGNING_ALGS = ("RS256",)` (the OIDC Core 1.0 §2 spec default). - New helper `_resolve_id_token_signing_algs(metadata)` — intersects the provider's advertised `id_token_signing_alg_values_supported` from `/.well-known/openid-configuration` with the safe allowlist; falls back to RS256 when the field is missing or contains only unsafe values. - `OIDCClient.__init__` now stores the resolved allowlist on `self.id_token_signing_algs` — pinned once, from a trusted source, at construction time. - `parse_id_token` no longer calls `jwt.get_unverified_header` and no longer reads `alg` from the JWT header. It passes `self.id_token_signing_algs` to `jwt.decode(..., algorithms=...)`. `PyJWKClient.get_signing_key_from_jwt` still reads the `kid` from the header internally for JWKS lookup — that's fine, `kid` is not a security decision; the signature still proves which key was actually used. `test/testcases/test_web_api/test_auth_app/test_oidc_client_unit.py`: - Existing `test_parse_id_token_success_and_error` drops its `jwt.get_unverified_header` mock (no longer called by `parse_id_token`). - `_metadata` and `_make_client` helpers grew an optional `signing_algs` parameter so tests can configure what the discovery document advertises. - New `TestSSRFValidation` / algorithm-confusion regression block (7 tests): - `test_id_token_signing_algs_default_to_rs256_when_metadata_missing` - `test_id_token_signing_algs_intersect_metadata_with_safe_allowlist` - `test_id_token_signing_algs_fall_back_when_only_unsafe_advertised` - `test_id_token_signing_algs_ignores_non_string_entries` - `test_id_token_signing_algs_handles_non_list_metadata_field` - `test_parse_id_token_passes_pinned_algorithms_to_jwt_decode` — sabotages `jwt.get_unverified_header` to raise on call, proving the verification path never consults the unverified header. - `test_parse_id_token_rejects_alg_none` — uses real PyJWT to encode an `alg: "none"` token; `parse_id_token` raises `ValueError("Error parsing ID Token: …")` instead of accepting it. - `test_parse_id_token_rejects_hs256_when_allowlist_is_asymmetric` — uses real PyJWT to forge an `alg: "HS256"` token with a non-PEM shared secret (so PyJWT's incidental PEM-as-HMAC refusal isn't what blocks it); `parse_id_token` raises because `HS256` is not in the pinned allowlist. Sanity-checked end-to-end with real PyJWT outside the project test runner: - `alg=none` forged token + `algorithms=["RS256"]` → `InvalidAlgorithmError` ✓ - `alg=HS256` forged token + `algorithms=["RS256"]` → `InvalidAlgorithmError` ✓ - Same `alg=HS256` token + `algorithms=["HS256"]` → accepted ({'sub': 'admin'}) — confirming the attack path was real before the fix. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: galuis116 <contact@duerrimports.com>	2026-05-29 19:37:01 +08:00
kpdev	cb1ea5a47f	Validate chunk image_base64 before doc-store write (#15364 ) ## Summary Fixes [#15363](https://github.com/infiniflow/ragflow/issues/15363) — `add_chunk` / `update_chunk` indexed chunks with `image_id` before validating or storing `image_base64`, leaving orphan chunks on invalid input. ## Related Issue Fixes #15363 ## Change Type - [x] Bug fix - [x] Regression tests ## What Changed - Added `_decode_chunk_image_base64()` — strict base64 decode with structured 4xx errors - Added `_store_chunk_image_or_error()` — catches `store_chunk_image` failures - `add_chunk` / `update_chunk`: decode + store image before `docStoreConn.insert` / `update`; only set `img_id` after successful storage ## Files Changed \| File \| Change \| \|------\|--------\| \| `api/apps/restful_apis/chunk_api.py` \| Helpers + reorder image handling \| \| `test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py` \| 3 regression tests \| ## Validation ```bash cd /root/gittensor/ragflow pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_add_chunk_invalid_image_base64_does_not_index_chunk -v pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_update_chunk_invalid_image_base64_does_not_update_chunk -v pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_add_chunk_valid_image_base64_stores_before_insert -v pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py -v ``` ## Test Plan - [x] Invalid `image_base64` on add → 4xx, no doc-store insert - [x] Invalid `image_base64` on update → 4xx, no doc-store update - [x] Valid PNG base64 on add → image stored, chunk indexed with `img_id` - [ ] CI green	2026-05-29 19:36:46 +08:00
monsterDavid	53bb2bd9e8	fix(metadata): preserve empty AND results across filter conditions (#15386 ) ## Summary - Fix `meta_filter()` AND logic so an empty result from an early condition is not overwritten when a later condition matches. - Add regression tests for empty-first AND, successful AND intersection, and OR behavior after an empty first condition. Fixes incorrect `/retrieval` metadata filtering when multiple AND conditions are used and the first condition matches no documents. Closes #15360 ## Test plan - [x] `pytest test/unit_test/common/test_metadata_filter_operators.py -v` (19/19 passed)	2026-05-29 19:33:26 +08:00
web-dev0521	bda2117a25	feat(connector): implement OneDrive data source connector (issue #15330 ) (#15331 ) ### What problem does this PR solve? Closes #15330. RAGFlow had no connector for OneDrive / OneDrive for Business. Users who store working documents in OneDrive could not index them into a knowledge base without manually downloading and re-uploading files. This PR adds a net-new OneDrive data source that: - Authenticates against Microsoft Graph with the same MSAL client-credentials flow already used by the SharePoint and Teams connectors (no new auth primitives). - Enumerates every drive visible to the service principal and pages through `/drives/{id}/root/delta`, persisting `@odata.deltaLink` values per drive so subsequent syncs only fetch changed items. - Optionally narrows ingestion to a sub-folder (`folder_path`) without needing a separate code path. - Surfaces typed errors on the validation probe (`GET /drives?$top=1`): 401 → `ConnectorMissingCredentialError`, 403 → `InsufficientPermissionsError` (with a `Files.Read.All` hint), 5xx → `UnexpectedValidationError`. - Filters folders, soft-deleted items, and unsupported extensions (`.pdf .docx .doc .xlsx .xls .pptx .ppt .txt .md .csv`). #### Files \| File \| Change \| \|------\|--------\| \| `common/data_source/onedrive_connector.py` \| New — `OneDriveConnector` + `OneDriveCheckpoint`. \| \| `common/data_source/config.py` \| `DocumentSource.ONEDRIVE = "onedrive"`. \| \| `common/constants.py` \| `FileSource.ONEDRIVE = "onedrive"`. \| \| `common/data_source/__init__.py` \| Export `OneDriveConnector`. \| \| `rag/svr/sync_data_source.py` \| `OneDrive(SyncBase)` with `batch_size` normalisation; registered in `func_factory`. \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `DataSourceKey.ONEDRIVE`, visibility map (`syncDeletedFiles: true`), info entry, form fields (tenant_id, client_id, client_secret, folder_path, batch_size), default values. \| \| `web/src/locales/en.ts`, `web/src/locales/zh.ts` \| `onedriveDescription` + 4 tooltip keys (EN + ZH). \| \| `test/unit_test/data_source/test_onedrive_connector_unit.py` \| New — 13 unit tests (`p1`/`p2`) covering auth, validation, checkpoint helpers, and document filtering. \| #### Required Azure AD permission `Files.Read.All` (Application, admin-granted). #### Out of scope - Interactive end-user OAuth (delegated permissions) — the connector uses app-only credentials, consistent with the SharePoint / Teams precedent. - Binary download of file contents — the sync layer emits `Document`s carrying `webUrl` + metadata; bytes are hydrated downstream by the parse pipeline. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-29 19:26:06 +08:00
buua436	bd6251f462	Fix: default OpenAI chat completions to non-stream (#15394 ) ### What problem does this PR solve? default OpenAI chat completions to non-stream when `stream` is omitted https://github.com/infiniflow/ragflow/issues/15356 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-29 17:47:47 +08:00
Lynn	dc4b82523b	Feat: tenant llm provider (#14595 ) ### What problem does this PR solve? Python implementation of the Go-based model_provider API suite. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: bill <yibie_jingnian@163.com>	2026-05-29 17:39:41 +08:00
web-dev0521	98bc9ca6ac	feat: implement Microsoft Teams data source connector (#15193 ) ### What problem does this PR solve? Closes #15191. RAGFlow shipped a Microsoft Teams connector stub (`common/data_source/teams_connector.py`) whose document-loading methods all returned `[]`, `Teams._generate()` was a `pass`, and Teams was commented out of the data-source settings UI. As a result there was no way to index Teams channel conversations into a knowledge base. This PR implements the connector end to end on top of Microsoft Graph (Office365-REST-Python-Client). It shares the MSAL client-credentials auth shape with the SharePoint connector. Backend - `common/data_source/teams_connector.py` - `load_credentials()` now builds the Graph client using an MSAL client-credentials token callback — the form `GraphClient` actually expects. (The previous stub passed a raw access-token string to `GraphClient(...)`, which is not how that client is driven.) Token acquisition is lazy, so credential loading performs no network call. - `validate_connector_settings()` lists teams via Graph. - `load_from_checkpoint()` is now a generator that pages teams → channels → messages, flattens each top-level post together with its replies into one blob-based `Document` (`extension` `.txt`/`.html`, `blob`, `size_bytes`, `doc_updated_at`). Incremental syncs are bounded by message `lastModifiedDateTime` (falling back to `createdDateTime`). Per-message errors surface as `ConnectorFailure` instead of aborting the run. - `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument` batches and the checkpoint helpers return proper `TeamsCheckpoint`s. - ACL → `ExternalAccess` mapping is intentionally left best-effort (`load_from_checkpoint_with_perm_sync` delegates to the standard load) because the sync pipeline does not currently persist `ExternalAccess`. - `rag/svr/sync_data_source.py` - Implemented `Teams._generate()` using the existing `CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google Drive), supporting full reindex and incremental polling from `poll_range_start`. - `TeamsConnector` is already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `TEAMS` data-source enum and added its form fields (`tenant_id`, `client_id`, `client_secret`), default values, display metadata, and a Teams icon. - Added `teamsDescription` / `teamsTenantIdTip` to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_teams_connector_unit.py`: mock-based unit tests covering credential loading (incomplete creds raise, happy path sets the Graph client, fetch-without-creds raises), post/reply flattening (incl. the HTML vs text extension), incremental `lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass; `ruff check` is clean. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 17:10:38 +08:00
web-dev0521	5de021ebb4	feat: implement Slack data source connector (#15188 ) ### What problem does this PR solve? Closes #15187. RAGFlow shipped a Slack connector (`common/data_source/slack_connector.py`) but it was never usable: `Slack._generate()` in the sync worker was a `pass` stub, the connector's document-generating code was incompatible with the current data model, and Slack was commented out of the data-source settings UI. As a result, teams had no way to index Slack channels/threads into a knowledge base. This PR completes the connector end to end. Backend - `common/data_source/slack_connector.py` - Rewrote `thread_to_doc` to produce a blob-based `Document` (`extension`/`blob`/`size_bytes`). The previous implementation built the doc with a `sections=[...]` argument and omitted the now-required `blob`/`extension`/ `size_bytes` fields, so it raised a validation error against the current `Document` model. Thread messages are now cleaned and flattened into a single UTF-8 text blob. - Added `load_from_state()` / `poll_source(start, end)` generators. The connector's checkpoint interface is a no-op stub, so both full and incremental syncs run through a single channel-iterating generator built on the existing module helpers (`get_channels`, `filter_channels`, `get_channel_messages`, `_process_message`), with per-channel thread de-duplication. - `rag/svr/sync_data_source.py` - Implemented `Slack._generate()`. Credentials are loaded via `StaticCredentialsProvider` (the connector requires `slack_bot_token` and does not support `load_credentials`). Supports full reindex and incremental polling from `poll_range_start`, plus the optional channel filter. Modeled on the Confluence/Dropbox wrappers. - `SlackConnector` was already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `SLACK` data-source enum and added its form fields (Slack bot token + optional channel filter), default values, display metadata, and a Slack icon. - Added `slackDescription` / `slackBotTokenTip` / `slackChannelsTip` strings to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_slack_connector_unit.py`: unit tests covering credential loading (`load_credentials` raises, `set_credentials_provider` initializes clients, missing credentials raises) and document generation (standalone message + flattened thread, blob/extension/size_bytes/metadata, and the incremental poll time window). All 5 pass; `ruff check` is clean. Required Slack scopes: `channels:read`, `channels:history`, `users:read`. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 15:46:07 +08:00
web-dev0521	c4c4e228e3	feat: implement SharePoint data source connector (#15190 ) ### What problem does this PR solve? Closes #15189. RAGFlow shipped a SharePoint connector stub (`common/data_source/sharepoint_connector.py`) whose document-loading methods all returned `[]`, `SharePoint._generate()` was a `pass`, and SharePoint was commented out of the data-source settings UI. As a result there was no way to index files stored in SharePoint document libraries. This PR implements the connector end to end on top of Microsoft Graph (Office365-REST-Python-Client). Backend - `common/data_source/sharepoint_connector.py` - `load_credentials()` now builds the Graph client using an MSAL client-credentials token callback — the form `GraphClient` actually expects. (The previous stub passed a raw access-token string to `GraphClient(...)`, which is not how that client is driven.) Token acquisition is lazy, so credential loading does no network call. - `validate_connector_settings()` resolves the configured site via Graph. - `load_from_checkpoint()` is now a generator that enumerates every document library under the site, walks folders depth-first, downloads each file, and yields blob-based `Document` objects (`extension` / `blob` / `size_bytes` / `doc_updated_at`). Incremental syncs are bounded by file `lastModifiedDateTime`. Per-file errors are surfaced as `ConnectorFailure` rather than aborting the run. - `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument` batches (no downloads) and the checkpoint helpers return proper checkpoints. - ACL → `ExternalAccess` mapping is intentionally left best-effort (`load_from_checkpoint_with_perm_sync` delegates to the standard load) because the sync pipeline does not currently persist `ExternalAccess`; this can be extended once that plumbing exists. - `rag/svr/sync_data_source.py` - Implemented `SharePoint._generate()` using the existing `CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google Drive), supporting full reindex and incremental polling from `poll_range_start`. - `SharePointConnector` is already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `SHAREPOINT` data-source enum and added its form fields `site_url`, `tenant_id`, `client_id`, `client_secret`), default values, display metadata, and a SharePoint icon. - Added `sharepointDescription` / `sharepointSiteUrlTip` to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_sharepoint_connector_unit.py`: mock-based unit tests covering credential loading (incomplete creds raise, happy path sets the Graph client, fetch-without-creds raises), drive traversal + file download, incremental `lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass; `ruff check` is clean. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 13:26:08 +08:00
Wang Qi	0aff6a3f32	Feature: Allow page_size max value 100 (#15292 ) Feature: Allow page_size max value 100	2026-05-28 11:13:01 +08:00
Idriss Sbaaoui	0940f1a135	Feat: add new tests and tescases for restful api suite (#15299 ) ### What problem does this PR solve? extend restful api suite ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test	2026-05-28 11:03:12 +08:00
Jack	f0cb7a544b	Refactor: Task Executor (#15154 ) ### What problem does this PR solve? 1. Break huge function into smaller pieces 2. Add unit test for the smaller pieces function 3. Layer-ed design a. infra layer - task_context.py, recording_context.py, write_operation_interceptor.py, ... b. service layer - *_service.py c. business layer - task_handler.py 4. Default behavior: use "refactor-ed version" - can switch to original version by change env variable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement --------- Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-05-27 21:54:17 +08:00
Idriss Sbaaoui	1f34a18242	Feat: add new tests and tescases for restful api suite (#15277 ) ### What problem does this PR solve? extend restful api suite ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test	2026-05-27 13:07:49 +08:00
Liu An	0639dba89a	Docs: Update version references to v0.25.6 in READMEs and docs (#15248 ) ### What problem does this PR solve? - Update version tags in README files (including translations) from v0.25.5 to v0.25.6 - Modify Docker image references and documentation to reflect new version - Update version badges and image descriptions - Maintain consistency across all language variants of README files ### Type of change - [x] Documentation Update	2026-05-26 19:45:43 +08:00
Idriss Sbaaoui	036ed5b236	Feat: add new tests and tescases for restful api suite (#15230 ) ### What problem does this PR solve? extend restful api suite ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test	2026-05-26 13:24:22 +08:00
天海蒼灆	0d2a17254c	fix(api): allow canvas_type in agent create and update APIs (#15201 ) ### What problem does this PR solve? Creating or updating an agent via `POST /api/v1/agents` and `PUT /api/v1/agents/{agent_id}` did not persist `canvas_type` because the handler `req` dict never assigned the field before `UserCanvasService.save` / `update_by_id`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-26 11:31:46 +08:00
Idriss Sbaaoui	c3b38d397f	Feat: add new tests and tescases for restful api suite (#15223 ) ### What problem does this PR solve? extend restful api suite ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test	2026-05-26 10:08:45 +08:00
Idriss Sbaaoui	7d200d5bd7	Feat: add new tests and tescases for restful api suite (#15208 ) ### What problem does this PR solve? extend restful api suite ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): test	2026-05-25 19:03:56 +08:00
Wang Qi	f4d36f7082	Fix #15170 cannot filter document status (#15216 ) Fix #15170 cannot filter document status ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-25 18:58:37 +08:00
Wang Qi	4776bfa8a2	Fix: Correct the API path (#15204 ) Follow on PR #15146 to reslove the backwad compatability issue. 1. /agents/<attachment_id>/download -> /agents/attachments/<attachment_id>/download ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-25 17:11:24 +08:00
Jonathan Chang	9d1006e4ec	fix: The output of the parser in the ingestion pipeline contains HTML tags (#14920 ) ## Summary This change fixes ingestion quality issues where MinerU parser output may contain HTML fragments (for example, table-related tags like `<tr>`, `<td>`, `<br>`), which were previously passed directly into chunking/tokenization and degraded chunk quality. The fix adds a sanitization step in the MinerU parser path so parsed sections are normalized to clean text before chunking. ## Change Type (select all) - [x] Bug fix - [x] Ingestion pipeline improvement - [x] Parser/chunking quality fix ## Related Issue - https://github.com/infiniflow/ragflow/issues/14831	2026-05-25 16:06:36 +08:00
Wang Qi	5069561abc	Fix /chat/completions to allow send only the latest message (#15197 ) ### What problem does this PR solve? 1. Fix /chat/completions to send only the latest message 2. Allo chat stream=False ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-25 14:23:33 +08:00
Wang Qi	bb148edf4c	Revert "Fix: /openai/<chat_id>/chat/completions not aware of session_id" (#15205 ) Reverts infiniflow/ragflow#15155 because this is never supported, keep it as it is.	2026-05-25 14:23:10 +08:00
Wang Qi	e6dd397531	Fix: /openai/<chat_id>/chat/completions not aware of session_id (#15155 ) ### What problem does this PR solve? Fix: /openai/<chat_id>/chat/completions not aware of session_id ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-22 20:38:56 +08:00
Wang Qi	87918650ff	Refactor: Move API files (#15151 ) Refactor: Move API files	2026-05-22 17:44:05 +08:00
kpdev	faf77a5a8a	feat(evaluation): track token usage in evaluation results (#13487 ) ## Summary Implements the TODO in `evaluation_service.py`: Track token usage in evaluation results. ## Changes - Import `num_tokens_from_string` from `common.token_utils` - Prompt tokens: Use the full prompt returned by `async_chat` when available (includes system prompt + knowledge base + query), otherwise fall back to the question token count - Completion tokens: Count tokens in the generated answer - Storage: Store `token_usage` as `{prompt_tokens, completion_tokens, total_tokens}` in each `EvaluationResult` instead of `None` ## Why The evaluation pipeline previously saved `token_usage: None` for every result. This change allows downstream consumers (e.g. evaluation dashboards, cost tracking) to see approximate token usage per test case using the same tokenizer (tiktoken cl100k_base) used elsewhere in RAGFlow. ## Testing - No new tests added; existing evaluation flow unchanged - Token counting uses existing `num_tokens_from_string` utility --------- Co-authored-by: kiannidev <kiannidev@users.noreply.github.com>	2026-05-22 15:19:53 +08:00

1 2 3 4 5 ...

353 Commits