ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
Wang Qi	214ee319f8	Revert "fix(api): authorize owner_ids for list chats and search apps (#14775 )" (#15698 ) This reverts PR #14775 commit `5a5e766386`.	2026-06-05 17:26:02 +08:00
Wang Qi	4cbe597d7e	Refactor: consolidate to use @login_required (#15652 ) Refactor: consolidate to use @login_required	2026-06-05 11:35:00 +08:00
kpdev	76968af0ba	Guard missing storage blobs on preview and image endpoints (#15366 ) Fixes [#15365](https://github.com/infiniflow/ragflow/issues/15365) — `get_document_image()` and document preview call `make_response(None)` when storage returns no bytes, causing HTTP 500.	2026-06-03 11:33:03 +08:00
kpdev	0f6f7b3c3c	fix(api): document image_id parsing for hyphenated thumbnail keys (#15115 ) (#15116 ) ### What problem does this PR solve? Fixes #15115. `GET /api/v1/documents/images/<image_id>` returned Image not found when the thumbnail storage object key contained hyphens (e.g. `page-1.png`). Document APIs build URLs as `{dataset_id}-{thumbnail}`, but `get_document_image()` used `image_id.split("-")` and required exactly two segments, so keys like `<kb_id>-page-1.png` were rejected even though the blob existed. This PR splits only on the first hyphen (`split("-", 1)`) and sets `Content-Type` from the object key extension via `CONTENT_TYPE_MAP` instead of hardcoding `image/JPEG`.	2026-06-02 10:54:14 +08:00
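The first-hyphen split described in the entry above can be sketched as follows. This is only an illustration under stated assumptions: `split_image_id` is a hypothetical helper name, not the project's handler, and the error handling is invented for the example.

```python
def split_image_id(image_id: str) -> tuple:
    """Split '<dataset_id>-<object_key>' on the FIRST hyphen only.

    Hypothetical helper for illustration; the real handler is
    get_document_image() and may differ in detail.
    """
    parts = image_id.split("-", 1)  # at most one split, so the key keeps its hyphens
    if len(parts) != 2 or not all(parts):
        raise ValueError("image_id must look like '<dataset_id>-<object_key>'")
    dataset_id, object_key = parts
    return dataset_id, object_key


if __name__ == "__main__":
    # An unbounded split yields three segments for a hyphenated key,
    # which is what the old two-segment check rejected.
    print("kb123-page-1.png".split("-"))
    print(split_image_id("kb123-page-1.png"))
```

Bounding the split keeps the dataset ID as the first segment and leaves every later hyphen inside the object key.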
kpdev	252cc19f93	Infer Content-Type for document image endpoint (#15368 ) ## Summary Fixes [#15367](https://github.com/infiniflow/ragflow/issues/15367) — `GET /api/v1/documents/images/<image_id>` always returned `Content-Type: image/JPEG` even for PNG/WebP chunk images and extensioned thumbnails. ## Related Issue Fixes #15367 ## Change Type - [x] Bug fix - [x] Regression tests - [ ] New feature - [ ] Refactor ## What Changed - Added `_detect_image_content_type_from_bytes()` — PNG/JPEG/GIF/WebP/BMP magic-byte detection - Added `_content_type_for_document_image()` — object-key extension via `CONTENT_TYPE_MAP`, then magic bytes, else `application/octet-stream` - `get_document_image()` — set inferred `Content-Type` instead of hardcoded `image/JPEG` - Also guards missing storage blob (`Image not found.`) to avoid `make_response(None)` (same handler; complements #15365) ## Files Changed \| File \| Change \| \|------\|--------\| \| `api/apps/restful_apis/document_api.py` \| MIME inference helpers + handler update \| \| `test/testcases/test_web_api/test_document_app/test_document_metadata.py` \| 3 unit tests \| ## Validation ```bash cd /root/gittensor/ragflow pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_content_type_from_object_extension_unit -v pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_content_type_from_magic_bytes_unit -v pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_missing_blob_unit -v ``` ## Test Plan - [x] `.png` object key → `image/png` - [x] Extensionless chunk key + PNG bytes → `image/png` (magic bytes) - [x] Missing blob → 4xx `"Image not found."` - [ ] CI green	2026-06-01 19:08:32 +08:00
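The lookup order described above (object-key extension, then magic bytes, then `application/octet-stream`) might look roughly like this. The names `EXTENSION_TYPES`, `sniff_image_type`, and `content_type_for` are hypothetical, and the extension table is an assumption standing in for the project's `CONTENT_TYPE_MAP`, whose contents are not shown in the log.

```python
import os
from typing import Optional

# Assumed extension table for illustration only.
EXTENSION_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".bmp": "image/bmp",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Guess an image MIME type from well-known file signatures."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container: 'RIFF', a 4-byte size, then 'WEBP'.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    if data.startswith(b"BM"):
        return "image/bmp"
    return None


def content_type_for(object_key: str, data: bytes) -> str:
    """Extension first, then magic bytes, else a generic binary type."""
    ext = os.path.splitext(object_key)[1].lower()
    return EXTENSION_TYPES.get(ext) or sniff_image_type(data) or "application/octet-stream"
```

Checking the extension first avoids reading the payload in the common case; the byte signatures only matter for extensionless chunk keys.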
kpdev	b35266e9a5	Return 4xx when file download storage blob is missing (#15371 ) ## Summary Fixes [#15369](https://github.com/infiniflow/ragflow/issues/15369) — `GET /api/v1/files/<file_id>` calls `make_response(None)` when both primary and fallback storage lookups return empty, causing HTTP 500. ## Related Issue Fixes #15369 ## Change Type - [x] Bug fix - [x] Regression tests ## What Changed - `file_api.download()` — after fallback `STORAGE_IMPL.get`, return `get_error_data_result(message="This file is empty.")` when `not blob`, matching document REST download semantics. ## Files Changed \| File \| Change \| \|------\|--------\| \| `api/apps/restful_apis/file_api.py` \| Empty-blob guard before `make_response()` \| \| `test/testcases/test_web_api/test_file_app/test_file_routes_unit.py` \| Regression test \| ## Validation ```bash cd /root/gittensor/ragflow pytest test/testcases/test_web_api/test_file_app/test_file_routes_unit.py::test_download_missing_blob_returns_error -v pytest test/testcases/test_web_api/test_file_app/test_file_routes_unit.py::test_download_falls_back_to_document_storage -v ``` ## Test Plan - [x] Both storage paths empty → `"This file is empty."` (no `make_response(None)`) - [x] Existing fallback success test still passes - [ ] CI green	2026-06-01 19:08:06 +08:00
galuis116	d1f6594618	Fix: JWT algorithm-confusion in OIDC ID token verification (#15181 ) ### What problem does this PR solve? Closes #15180. `OIDCClient.parse_id_token` in `api/apps/auth/oidc.py` read the JWT signing algorithm from the unverified JWT header and passed it through to `jwt.decode(..., algorithms=[alg], ...)` as the trust anchor. This is the textbook JWT algorithm-confusion vulnerability (CWE-345 / CWE-347). Any unauthenticated client capable of reaching the OIDC callback could take over an arbitrary account on any RAGFlow deployment with OIDC login enabled: 1. `alg: "none"` — present a JWT with `{"alg": "none"}` and no signature segment → `jwt.decode(..., algorithms=["none"])` → PyJWT's `NoneAlgorithm` accepts the token without verification → login as any user. 2. RSA / HMAC confusion — fetch the public RSA key from the provider's JWKS (it's public), forge a JWT with `{"alg": "HS256"}` HMAC-signed using the public-key bytes as the secret → `jwt.decode(..., algorithms=["HS256"], key=public_key)` → verifier accepts → login as any user. (Modern PyJWT independently refuses to use a PEM-formatted key as an HMAC secret, which mitigates this leg for PEM key formats; the fix here is the only mitigation for raw / DER / JWK octet keys and for older PyJWT versions.) ### What changed `api/apps/auth/oidc.py`: - New module constants `_ALLOWED_OIDC_SIGNING_ALGS` (asymmetric-only: `RS`, `ES`, `PS`, `EdDSA` — explicitly excludes `none` and `HS`) and `_DEFAULT_OIDC_SIGNING_ALGS = ("RS256",)` (the OIDC Core 1.0 §2 spec default). - New helper `_resolve_id_token_signing_algs(metadata)` — intersects the provider's advertised `id_token_signing_alg_values_supported` from `/.well-known/openid-configuration` with the safe allowlist; falls back to RS256 when the field is missing or contains only unsafe values. - `OIDCClient.__init__` now stores the resolved allowlist on `self.id_token_signing_algs` — pinned once, from a trusted source, at construction time. 
- `parse_id_token` no longer calls `jwt.get_unverified_header` and no longer reads `alg` from the JWT header. It passes `self.id_token_signing_algs` to `jwt.decode(..., algorithms=...)`. `PyJWKClient.get_signing_key_from_jwt` still reads the `kid` from the header internally for JWKS lookup — that's fine, `kid` is not a security decision; the signature still proves which key was actually used. `test/testcases/test_web_api/test_auth_app/test_oidc_client_unit.py`: - Existing `test_parse_id_token_success_and_error` drops its `jwt.get_unverified_header` mock (no longer called by `parse_id_token`). - `_metadata` and `_make_client` helpers grew an optional `signing_algs` parameter so tests can configure what the discovery document advertises. - New `TestSSRFValidation` / algorithm-confusion regression block (7 tests): - `test_id_token_signing_algs_default_to_rs256_when_metadata_missing` - `test_id_token_signing_algs_intersect_metadata_with_safe_allowlist` - `test_id_token_signing_algs_fall_back_when_only_unsafe_advertised` - `test_id_token_signing_algs_ignores_non_string_entries` - `test_id_token_signing_algs_handles_non_list_metadata_field` - `test_parse_id_token_passes_pinned_algorithms_to_jwt_decode` — sabotages `jwt.get_unverified_header` to raise on call, proving the verification path never consults the unverified header. - `test_parse_id_token_rejects_alg_none` — uses real PyJWT to encode an `alg: "none"` token; `parse_id_token` raises `ValueError("Error parsing ID Token: …")` instead of accepting it. - `test_parse_id_token_rejects_hs256_when_allowlist_is_asymmetric` — uses real PyJWT to forge an `alg: "HS256"` token with a non-PEM shared secret (so PyJWT's incidental PEM-as-HMAC refusal isn't what blocks it); `parse_id_token` raises because `HS256` is not in the pinned allowlist. 
Sanity-checked end-to-end with real PyJWT outside the project test runner: - `alg=none` forged token + `algorithms=["RS256"]` → `InvalidAlgorithmError` ✓ - `alg=HS256` forged token + `algorithms=["RS256"]` → `InvalidAlgorithmError` ✓ - Same `alg=HS256` token + `algorithms=["HS256"]` → accepted ({'sub': 'admin'}) — confirming the attack path was real before the fix. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: galuis116 <contact@duerrimports.com>	2026-05-29 19:37:01 +08:00
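The allowlist intersection described in the entry above can be sketched in plain Python. This is not the project's `_resolve_id_token_signing_algs`; the function name `resolve_signing_algs` and the exact members of `ALLOWED_ALGS` are assumptions chosen to match the description (asymmetric only, no `none`, no `HS*`, default `RS256`).

```python
# Assumed allowlist: asymmetric JWS algorithms only.
ALLOWED_ALGS = frozenset({
    "RS256", "RS384", "RS512",
    "ES256", "ES384", "ES512",
    "PS256", "PS384", "PS512",
    "EdDSA",
})
DEFAULT_ALGS = ("RS256",)


def resolve_signing_algs(metadata: dict) -> tuple:
    """Intersect the provider's advertised algorithms with the safe allowlist.

    Falls back to RS256 when the field is missing, is not a list,
    or contains only unsafe or non-string values.
    """
    advertised = metadata.get("id_token_signing_alg_values_supported")
    if not isinstance(advertised, (list, tuple)):
        return DEFAULT_ALGS
    safe = tuple(a for a in advertised if isinstance(a, str) and a in ALLOWED_ALGS)
    return safe or DEFAULT_ALGS


if __name__ == "__main__":
    doc = {"id_token_signing_alg_values_supported": ["none", "HS256", "ES256"]}
    print(resolve_signing_algs(doc))
```

The resulting tuple is what a verifier would pass as the `algorithms` argument to `jwt.decode`, so the token's own header never decides which algorithm is trusted.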
kpdev	cb1ea5a47f	Validate chunk image_base64 before doc-store write (#15364 ) ## Summary Fixes [#15363](https://github.com/infiniflow/ragflow/issues/15363) — `add_chunk` / `update_chunk` indexed chunks with `image_id` before validating or storing `image_base64`, leaving orphan chunks on invalid input. ## Related Issue Fixes #15363 ## Change Type - [x] Bug fix - [x] Regression tests ## What Changed - Added `_decode_chunk_image_base64()` — strict base64 decode with structured 4xx errors - Added `_store_chunk_image_or_error()` — catches `store_chunk_image` failures - `add_chunk` / `update_chunk`: decode + store image before `docStoreConn.insert` / `update`; only set `img_id` after successful storage ## Files Changed \| File \| Change \| \|------\|--------\| \| `api/apps/restful_apis/chunk_api.py` \| Helpers + reorder image handling \| \| `test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py` \| 3 regression tests \| ## Validation ```bash cd /root/gittensor/ragflow pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_add_chunk_invalid_image_base64_does_not_index_chunk -v pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_update_chunk_invalid_image_base64_does_not_update_chunk -v pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_add_chunk_valid_image_base64_stores_before_insert -v pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py -v ``` ## Test Plan - [x] Invalid `image_base64` on add → 4xx, no doc-store insert - [x] Invalid `image_base64` on update → 4xx, no doc-store update - [x] Valid PNG base64 on add → image stored, chunk indexed with `img_id` - [ ] CI green	2026-05-29 19:36:46 +08:00
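The ordering fix above (validate and store the image first, write to the doc store last) can be illustrated with a small sketch. `decode_image_base64` and `add_chunk` here are hypothetical stand-ins, and `store_image` / `insert_chunk` are injected callables assumed for the example rather than the project's storage or doc-store APIs.

```python
import base64
import binascii


def decode_image_base64(value) -> bytes:
    """Strictly decode base64; raise ValueError on any malformed input."""
    if not isinstance(value, str) or not value.strip():
        raise ValueError("image_base64 must be a non-empty string")
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        data = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("image_base64 is not valid base64") from exc
    if not data:
        raise ValueError("image_base64 decoded to empty bytes")
    return data


def add_chunk(chunk: dict, image_base64, store_image, insert_chunk) -> dict:
    """Decode and store the image BEFORE indexing the chunk."""
    if image_base64 is not None:
        data = decode_image_base64(image_base64)  # fails before any write
        chunk = {**chunk, "img_id": store_image(data)}  # set only after storage succeeds
    insert_chunk(chunk)
    return chunk
```

Because decoding raises before `insert_chunk` runs, invalid input cannot leave an indexed chunk that points at an image that was never stored.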
Lynn	dc4b82523b	Feat: tenant llm provider (#14595 ) ### What problem does this PR solve? Python implementation of the Go-based model_provider API suite. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: bill <yibie_jingnian@163.com>	2026-05-29 17:39:41 +08:00
Wang Qi	0aff6a3f32	Feature: Allow page_size max value 100 (#15292 ) Feature: Allow page_size max value 100	2026-05-28 11:13:01 +08:00
Wang Qi	f4d36f7082	Fix #15170 cannot filter document status (#15216 ) Fix #15170 cannot filter document status ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-25 18:58:37 +08:00
Wang Qi	4776bfa8a2	Fix: Correct the API path (#15204 ) Follow-up to PR #15146 to resolve the backward compatibility issue. 1. /agents/<attachment_id>/download -> /agents/attachments/<attachment_id>/download ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-25 17:11:24 +08:00
Jonathan Chang	9d1006e4ec	fix: The output of the parser in the ingestion pipeline contains HTML tags (#14920 ) ## Summary This change fixes ingestion quality issues where MinerU parser output may contain HTML fragments (for example, table-related tags like `<tr>`, `<td>`, `<br>`), which were previously passed directly into chunking/tokenization and degraded chunk quality. The fix adds a sanitization step in the MinerU parser path so parsed sections are normalized to clean text before chunking. ## Change Type (select all) - [x] Bug fix - [x] Ingestion pipeline improvement - [x] Parser/chunking quality fix ## Related Issue - https://github.com/infiniflow/ragflow/issues/14831	2026-05-25 16:06:36 +08:00
buua436	ea1764a7dc	Revert "fix(api): infer /documents/{id}/download Content-Type from filename when ext is omitted (#15052 )" (#15138 ) Reverts infiniflow/ragflow#15053	2026-05-22 11:46:01 +08:00
kpdev	6932615852	fix(api): infer /documents/{id}/download Content-Type from filename when ext is omitted (#15052 ) (#15053 ) ## Summary - Align GET `/api/v1/documents/<doc_id>/download` with `/preview`: resolve extension and MIME type from the stored document name when the `ext` query parameter is omitted, instead of defaulting to `markdown`. - When `?ext=` is present, behavior stays the same as before (explicit extension / `Content-Type` mapping). - Enforce the same access + document lookup pattern as preview (`accessible` + `get_by_id`). - Extend unit tests for the no-`ext` PDF filename case. ## Test plan - [x] `uv run pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_download_attachment_success_and_exception_unit` - [x] Optional: `curl -sSI` against `/api/v1/documents/<pdf_doc_id>/download` without `ext` and confirm `Content-Type: application/pdf` Fixes #15052.	2026-05-21 15:31:36 +08:00
jony376	198f3c4b9a	Fix: validate memory tenant model IDs on update and enforce tenant scope in memory pipeline (#14923 ) ### Related issues Closes #14922 ### What problem does this PR solve? `POST /memories` already resolves `tenant_llm_id` and `tenant_embd_id` through `ensure_tenant_model_id_for_params`, but `PUT /memories/<memory_id>` accepted client-supplied `tenant_llm_id` / `tenant_embd_id` without checking that those `tenant_llm` rows belong to the memory owner’s tenant. A caller could persist another tenant’s row IDs and later trigger extraction or embedding that loaded foreign model credentials via `get_model_config_by_id(tenant_model_id)` with no tenant allow-list. This change aligns the update path with create: updates that change models must go through `llm_id` / `embd_id` and `ensure_tenant_model_id_for_params` scoped to the memory’s `tenant_id` (not only the current user, so team-access cases stay correct). Direct `tenant_*` fields in the body without `llm_id` / `embd_id` are rejected. As defense in depth, `memory_message_service` passes `allowed_tenant_ids` / `requester_tenant_id` into `get_model_config_by_id` for LLM and embedding resolution so mismatched IDs cannot be used even if bad data existed. A regression test rejects payloads that set only `tenant_llm_id` / `tenant_embd_id`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: jony376 <jony376@gmail.com>	2026-05-19 10:11:46 +08:00
Magicbook1108	b69a6a5d80	Feat: full optimization on connector dashboard (#14979 ) ### What problem does this PR solve? This PR improves the connector dashboard task management experience and adds better visibility into connector execution logs. ### Overview: #### Before <img width="700" alt="image" src="https://github.com/user-attachments/assets/e4a8ed6f-2e18-4f0f-8528-41a514550052" /> #### Now: <img width="700" alt="Screenshot from 2026-05-18 16-31-30" src="https://github.com/user-attachments/assets/d4ca193b-847a-49ae-9e4f-5fbca60ea627" /> ### 1. Add a new logging page to the connector dashboard A new logging page has been added so users can view connector task execution logs directly from the connector dashboard. ### 2. Merge the Resume button into Confirm The separate Resume button has been removed. The Confirm button now represents different actions depending on the current task state: - Save: Save form changes and reschedule tasks. - Stop: Cancel currently scheduled or running tasks. - Resume: Create new scheduled tasks after the previous tasks have been stopped. - Start: Start tasks when no task has been started yet. ### 3. Separate syncing and pruning tasks Connector tasks are now separated into syncing and pruning. Pruning is controlled by the Sync deleted files option: - When Sync deleted files is disabled, only syncing tasks are shown. - When Sync deleted files is enabled, both syncing and pruning tasks are shown. Now: Sync deleted files disabled <img width="700" alt="Sync deleted files disabled" src="https://github.com/user-attachments/assets/dbd9232e-614a-407f-a0b1-c109e5fa567d" /> Now: Sync deleted files enabled <img width="700" alt="Sync deleted files enabled" src="https://github.com/user-attachments/assets/1f527f48-ccb3-4ee8-97ca-086891489296" /> ### 4. Update logs in backend <img width="700" alt="image" src="https://github.com/user-attachments/assets/10a95a3f-98c1-4e67-8afa-ddf6cda5b0b2" /> ### 5. 
Remove connector resume API - Removed: `POST /v1/connectors/<connector_id>/resume` - Replaced by: `PATCH /v1/connectors/<connector_id>` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-19 10:07:11 +08:00
dev	b12eaee38b	fix(api): enforce tenant access for connector routes (#14747 ) ### What problem does this PR solve? Fixes #14746. Adds tenant access checks for connector-by-id REST routes before reading connector details, mutating connector config/status, deleting connectors, rebuilding, or listing sync logs. Unauthorized callers now receive `RetCode.AUTHENTICATION_ERROR` with `No authorization.` without reaching the connector/log mutation paths. Validation: - `python3 -m pytest --confcutdir=test/testcases/test_web_api/test_connector_app test/testcases/test_web_api/test_connector_app/test_connector_routes_unit.py` - `uvx ruff check api/apps/restful_apis/connector_api.py api/db/services/connector_service.py test/testcases/test_web_api/test_connector_app/test_connector_routes_unit.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: dev111-actor <dev111-actor@users.noreply.github.com>	2026-05-18 16:09:26 +08:00
sirj0k3r	b2b63600f1	Adds gpt-5.4-mini and gpt-5.4-nano (#14908 ) ### What problem does this PR solve? Adds gpt-5.4-mini and gpt-5.4-nano to the OpenAI model list ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-14 10:16:24 +08:00
plind	dd76653dc1	feat: add tag management for Agents with filtering and sorting (#14774 ) (#14799 ) ## Summary Closes #14774. Adds free-form tags on agents (UserCanvas) with full UI + API: - Stored as comma-separated `tags` column on `UserCanvas` with online migration. - New endpoints: `GET /v1/agents/tags` (aggregate counts) and `PUT /v1/agent/<id>/tags` (write). `GET /v1/agents` accepts a `tags=` query. - "Edit tags" item in agent dropdown opens a chip-style editor dialog; tags render as badges on each agent card. - New "Tags" facet in the agents filter bar, with counts. ## Implementation notes - Tag matching is exact-token: the SQL filter wraps stored tags as `,…,` and matches `,ml,` so `ml` doesn't match `ml-ops`. - Server-side normalization in `UserCanvasService.update_tags`: dedup (case-insensitive), per-tag cap of 64 chars, total length capped at 512 chars to fit the column, commas inside tag values are replaced with spaces. - Tenant authorization: `PUT /v1/agent/<id>/tags` gates on `UserCanvasService.accessible(canvas_id, tenant_id)`. - Tag listing scope: `UserCanvasService.list_tags` follows the same own + team-shared rule as `get_by_tenant_ids`. - i18n: keys added to `en.ts` and `zh.ts` only (per project convention; other locales fall back). - `HomeCard` gets a non-breaking `extra?: ReactNode` slot for the chip row; no `src/components/ui/` files modified. ## Test plan - [ ] Backend boot runs `migrate_db` → confirm `user_canvas.tags` column exists (`DESCRIBE user_canvas`). - [ ] Agents page renders cards normally (no console error from missing field). - [ ] `⋯ → Edit tags` opens a dialog that stays open (regression: dialog was unmounting with the dropdown). - [ ] Typing a tag without pressing Enter and clicking Save persists it (regression: last typed tag was being dropped). - [ ] Chip input supports Enter/comma to commit, Backspace on empty to remove, `×` to remove individual chip. 
- [ ] Tag containing a comma sent via API is stored with the comma replaced by a space. - [ ] 20 long tags sent via API does not error (length cap silently truncates). - [ ] "Tags" filter in the filter bar shows counts and narrows the list. - [ ] Filtering by `ml` does not return agents tagged `ml-ops`. - [ ] UI in Chinese shows 编辑标签 / 添加标签以整理和筛选你的智能体 etc. - [ ] `PUT /v1/agent/<other-tenant-id>/tags` returns `Agent not found or no permission.`	2026-05-13 21:41:32 +08:00
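The tag normalization and exact-token matching described in the entry above can be mimicked in plain Python. This is a sketch under assumptions: `normalize_tags` and `has_tag` are hypothetical names, the real logic lives in `UserCanvasService.update_tags` and a SQL filter, and dropping whole tags once the 512-character total is reached is only one possible reading of "length cap silently truncates".

```python
def normalize_tags(tags, max_tag_len: int = 64, max_total_len: int = 512) -> str:
    """Return a comma-separated tag string, deduplicated case-insensitively."""
    seen = set()
    kept = []
    for raw in tags:
        # Commas are the separator, so they cannot appear inside a tag.
        tag = raw.replace(",", " ").strip()[:max_tag_len]
        if not tag or tag.lower() in seen:
            continue
        if len(",".join(kept + [tag])) > max_total_len:
            break  # assumption: stop adding tags once the column would overflow
        seen.add(tag.lower())
        kept.append(tag)
    return ",".join(kept)


def has_tag(stored: str, tag: str) -> bool:
    """Exact-token match: wrap both sides in commas, as the SQL filter does."""
    return f",{tag}," in f",{stored},"


if __name__ == "__main__":
    stored = normalize_tags(["ml-ops", "NLP", "nlp", "a,b"])
    print(stored)
    print(has_tag(stored, "ml"), has_tag(stored, "ml-ops"))
```

Wrapping the stored string in commas turns a substring search into a whole-token search, which is why `ml` does not match `ml-ops`. Note that this sketch matches case-sensitively.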
dale053	5a5e766386	fix(api): authorize owner_ids for list chats and search apps (#14775 ) Closes #14768 ### What problem does this PR solve? The `list_chats` and `list_searches` REST API endpoints did not enforce authorization on the `owner_ids` query parameter. Any authenticated user could pass arbitrary tenant IDs to `owner_ids` and retrieve chats or search apps belonging to other tenants they are not a member of. This PR resolves the issue by: 1. Looking up the current user's authorized tenants via `TenantService.get_joined_tenants_by_user_id` and rejecting any `owner_ids` that fall outside that set. 2. When no `owner_ids` are provided, scoping the query to only the user's authorized tenants instead of returning an unfiltered result. 3. Adding unit tests that verify unauthorized `owner_ids` are rejected with `OPERATING_ERROR`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-13 09:43:44 +08:00
0xτensor	127aeac4aa	fix: expose gpt-5.5 and gpt-5.4 in OpenAI model list (#14828 ) ### What problem does this PR solve? OpenAI model catalogs used in provider selection flows were missing the latest GPT models (`gpt-5.5` and `gpt-5.4`). Because model availability is driven by seeded catalog data (`conf/llm_factories.json` → DB seed → API response), these models were not selectable in the UI or `/llm/list` responses. This PR updates and synchronizes the OpenAI catalog definitions across configuration sources and ensures the new models are correctly exposed through the API layer and validated in tests. --- ### Type of change * [x] New Feature (non-breaking change which adds functionality) --- ### Changes Made * Added `gpt-5.5` and `gpt-5.4` to OpenAI catalog definitions in: * `conf/llm_factories.json` * `conf/models/openai.json` (chat + vision support) * Ensured consistency between DB-seeded factory config and provider model configuration * Updated test coverage in: * `test_llm_list_unit.py` * seeded OpenAI catalog entries * added response-level assertion validating `/llm/list` includes both new model IDs under OpenAI grouping --- ### Root Cause OpenAI model listings in selection flows are generated from catalog data seeded via `conf/llm_factories.json`. The catalog had not been updated to include the latest GPT models, resulting in missing availability in UI and API responses. --- ### Testing * Created isolated test environment: * `python -m venv .venv-review` * installed `pytest` * Ran targeted and full test suite: * `test_list_app_grouping_availability_and_merge`: ✅ passed * Full `test_llm_list_unit.py`: ✅ 10 passed --- ### Risks / Limitations * Adding models to the catalog does not guarantee upstream provider availability or account entitlement. * Environments with pre-seeded DB catalogs may require reseed or refresh to reflect updated configuration. --- ### Notes * Changes are minimal and scoped strictly to catalog configuration and related test coverage. 
* Ensures `/llm/list` API remains aligned with expected latest OpenAI model availability. * Closes #14827	2026-05-12 18:03:47 +08:00
hyl64	02c2587ca4	fix(agent): support iteration item aliases in child nodes (#14146 ) ## Summary This PR fixes the iteration variable mismatch reported in #14142. Changes: - restore compatibility for `IterationItem@result` by exposing `result` alongside `item` - support bare iteration aliases like `{item}`, `{index}`, and `{result}` inside iteration child-node inputs - add focused unit/runtime tests covering both alias styles and multi-item iteration execution ## Validation ```bash pytest -q --noconftest \ test/testcases/test_web_api/test_canvas_app/test_iterationitem_unit.py \ test/testcases/test_web_api/test_canvas_app/test_iteration_runtime_unit.py \ test/testcases/test_web_api/test_canvas_app/test_invoke_component_unit.py ``` Result: `12 passed` Closes #14142	2026-05-12 13:05:21 +08:00
buua436	daf8a58c4b	Fix: add codeexec attachments output (#14787 ) ### What problem does this PR solve? add codeexec attachments output ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 19:16:33 +08:00
box4wangjing	292b0b8bce	chore: fix some comments to improve readability (#14756 ) ### What problem does this PR solve? fix some comments to improve readability ### Type of change - [x] Documentation Update --------- Signed-off-by: box4wangjing <box4wangjing@outlook.com>	2026-05-11 16:48:48 +08:00
buua436	a03b95f8c4	Fix: shared dataset chunk index lookup (#14764 ) ### What problem does this PR solve? shared dataset chunk index lookup ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 13:50:08 +08:00
Mehmet Karakose	7ec87f7cb7	fix(auth): fall back to session-based auth in _load_user (#14569 ) ## Summary Closes #13663. OAuth / OIDC callbacks call `login_user(user)` which writes `_user_id` into the session cookie, but `_load_user()` in `api/apps/__init__.py` only ever looked at the `Authorization` header. The SPA's response interceptor wipes the Authorization value from `localStorage` on the first 401 it sees — meaning that during the post-redirect window after an OAuth login, a single transient 401 sends every subsequent request back to the login page even though `login_user()` had already established a perfectly good server-side session. The reporter's analysis traces this all the way through the redirect → `navigate('/')` → first request → empty header → 401 → `removeAll()` → infinite-redirect-to-login chain. ## What changed - New `_load_user_from_session()` helper that reads `session["_user_id"]`, looks up the user in `UserService` (with the same `StatusEnum.VALID` and `access_token` checks already used elsewhere), and assigns `g.user`. - Every `return None` path in `_load_user()` now routes through that helper before giving up: - missing `Authorization` header - malformed `bearer ` prefix - empty / too-short JWT payload - JWT signature failure - JWT-resolved user not found / has no `access_token` - `APIToken.query()` fallback exhausted The JWT and API-token paths still take precedence — the session is only consulted when those can't authenticate the request. So existing local-login and SDK callers see no behaviour change; only OAuth / OIDC users that hit the original race now stay logged in. The Bearer-prefix issue called out in #13663 (lines 103-110) is already handled in the current code, so this PR only addresses the second half of the report. 
## Test plan - [ ] Configure OIDC under `oauth` in `service_conf.yaml` - [ ] Click the OIDC login button, complete auth at the IdP - [ ] Confirm that navigating between pages no longer bounces back to `/login` - [ ] Confirm local email/password login still issues + accepts JWTs - [ ] Confirm SDK/API key callers still authenticate via `Authorization: Bearer <api-token>` --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-11 09:59:52 +08:00
euvre	f4b8f53b6d	Fix: restore embedding model switching for datasets with existing chunks (#14732 ) ### What problem does this PR solve? ## Problem During the REST API refactoring (#13690), the `/api/v2/kb/check_embedding` endpoint was removed and never migrated to the new RESTful structure. The frontend was pointed to the `/api/v1/datasets/{id}/embedding` endpoint (which is `run_embedding` — a completely different function). Additionally, a hard guard was introduced that rejects any `embd_id` change when `chunk_num > 0`, making it impossible to switch embedding models on datasets with existing chunks. ## Root Cause 1. Missing endpoint: The old `check_embedding` logic (sample random chunks, re-embed with the new model, compare cosine similarity) was not carried over to the new REST API service layer. 2. Wrong frontend URL: `checkEmbedding` in `api.ts` pointed to `/datasets/{id}/embedding` (`run_embedding`) instead of a dedicated check endpoint. 3. Overly restrictive guard: `dataset_api_service.py` line 310 blocked all `embd_id` updates when `chunk_num > 0`. This check did not exist in the pre-refactor code — it was incorrectly introduced during the refactor. 
## Changes ### Backend - `api/apps/services/dataset_api_service.py` - Remove the `chunk_num > 0` hard guard on `embd_id` updates - Add `check_embedding()` service function: samples random chunks, re-embeds them with the candidate model, computes cosine similarity, returns compatibility result (avg ≥ 0.9 = compatible) - Add `import re` for the `_clean()` helper - `api/apps/restful_apis/dataset_api.py` - Add `POST /datasets/<dataset_id>/embedding/check` endpoint following the new REST API conventions - Clean up unused top-level imports (`random`, `re`, `numpy`) ### Frontend - `web/src/utils/api.ts` - Fix `checkEmbedding` URL from `/datasets/${datasetId}/embedding` → `/datasets/${datasetId}/embedding/check` ### Tests - `test/testcases/test_http_api/test_dataset_management/test_update_dataset.py` - Update `test_embedding_model_with_existing_chunks` to assert success (`code == 0`) instead of expecting the old `102` error - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py` - Update `test_update_route_branch_matrix_unit` to assert `RetCode.SUCCESS` when updating `embd_id` on a chunked dataset, replacing the old `chunk_num` error assertion ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-05-09 18:48:57 +08:00
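The compatibility rule described above (average cosine similarity of at least 0.9 between stored and re-embedded vectors) can be sketched without any project code. `cosine` and `embeddings_compatible` are hypothetical names; the real `check_embedding()` also samples chunks and calls the candidate model, which is omitted here.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def embeddings_compatible(stored, candidate, threshold: float = 0.9):
    """Compare stored vectors with re-embedded ones, pair by pair.

    Returns (compatible, average_similarity).
    """
    if not stored or len(stored) != len(candidate):
        raise ValueError("need the same non-zero number of vectors on both sides")
    sims = [cosine(s, c) for s, c in zip(stored, candidate)]
    avg = sum(sims) / len(sims)
    return avg >= threshold, avg
```

A model with a different output dimension raises in this sketch, since a pairwise cosine is undefined in that case; how the real service reports that situation is not stated in the log.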
akie	c11650bb4c	Fix IDOR: Add permission checks to file ancestry endpoints (#14725 ) Close #14292 ## Issue File ancestry endpoints return folder metadata without validating tenant permissions, allowing any authenticated user to query arbitrary `file_id` values across tenant boundaries. ## Affected Endpoints - `GET /v1/file/parent_folder?file_id={file_id}` - `GET /v1/file/all_parent_folder?file_id={file_id}` - `GET /api/v1/files/{id}/ancestors` ## Root Cause These endpoints skip the permission check that other file operations (Delete, Download, Move) perform. ## Expected Permission Check All file operations should follow this 3-step validation: - Check file.tenant_id - Check if user_id belongs to this tenant (via user_tenant join table) - Check KB permission type (team permission) Code reference: This is implemented in `checkFileTeamPermission()` and used by Delete/Download/Move, but missing from GetParentFolder/GetAllParentFolders. ## Reproduction ```bash # User B (tenant: BBB) accessing User A's file (tenant: AAA) curl -H "Authorization: Bearer USER_B_TOKEN" \ "http://localhost:9384/v1/file/parent_folder?file_id=AAA_FILE_123" # Result: Returns User A's folder metadata ❌ # Expected: "No authorization." ✅ ``` ## Fix Pass userID from handler to service and call `checkFileTeamPermission()`, the same as the Download/Delete/Move handlers. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 16:03:23 +08:00
web-dev0521	d51fb88573	Fix: enforce tenant authorization on document download endpoint (#14618 ) (#14625 ) ### What problem does this PR solve? Closes #14618. The `GET /v1/document/get/<doc_id>` endpoint in `api/apps/document_app.py` was protected only by `@login_required` and called `DocumentService.get_by_id(doc_id)` without verifying that the document's knowledge base belonged to the requesting user's tenant. Any authenticated user who knew (or guessed) a document ID could download files belonging to any other tenant — a cross-tenant IDOR. This PR adds a `DocumentService.accessible(doc_id, current_user.id)` check before serving the file. The helper already exists and joins `Document` → `Knowledgebase` → `UserTenant` to verify the requesting user belongs to the tenant that owns the document's KB. The same pattern is already used by `api/apps/restful_apis/document_api.py` and mirrors the tenant scoping in the SDK route at `api/apps/sdk/doc.py`. The check returns the existing `"Document not found!"` error for both non-existent and inaccessible documents, so attackers cannot use the response to enumerate valid doc IDs across tenants. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Other (please describe): Security fix (cross-tenant IDOR / authorization bypass)	2026-05-08 14:24:03 +08:00
jony376	6547751936	Fix: missing authorization checks in `/files/link-to-datasets` (#14649 ) ### Related issues Closes #14648 ### What problem does this PR solve? This PR fixes an authorization flaw in `POST /files/link-to-datasets`. Before this change, the endpoint only checked whether the supplied `file_ids` and `kb_ids` existed. It did not verify whether the authenticated user was actually allowed to access those files or target datasets. As a result, an authenticated user who knew valid IDs could relink another user's files to arbitrary datasets. This was especially risky because the relinking flow is state-changing: the background worker removes existing file-document mappings and then recreates documents under the attacker-supplied dataset IDs. This change makes the route enforce the same permission model already used by nearby file and document operations: - each resolved file must pass `check_file_team_permission(...)` - each target dataset must pass `check_kb_team_permission(...)` - authorization is enforced before scheduling background relinking work ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Testing - Added regression coverage in `test/testcases/test_web_api/test_file_app/test_file2document_routes_unit.py` - Covered: - unauthorized file access is rejected - unauthorized dataset access is rejected - existing success path still returns immediately after scheduling background work - Attempted to run: - `python -m pytest test\\testcases\\test_web_api\\test_file_app\\test_file2document_routes_unit.py -q` - Local execution in this workspace is currently blocked by missing test dependencies during bootstrap, including `ragflow_sdk` --------- Co-authored-by: jony376 <jony376@gmail.com>	2026-05-08 13:49:23 +08:00
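The ordering this commit insists on, authorizing every file and dataset before any background work is scheduled, might be sketched like this. The function name, the `can_access_file` / `can_access_kb` callables (standing in for `check_file_team_permission(...)` and `check_kb_team_permission(...)`) and the `schedule` callback are hypothetical.

```python
def link_files_to_datasets(user_id, file_ids, kb_ids,
                           can_access_file, can_access_kb, schedule):
    """Reject the whole request unless every file and dataset is authorized."""
    # Check every file first; one denied file rejects the request.
    if not all(can_access_file(file_id, user_id) for file_id in file_ids):
        return False
    # Then check every target dataset.
    if not all(can_access_kb(kb_id, user_id) for kb_id in kb_ids):
        return False
    # Only now hand the state-changing relink to the background worker.
    schedule(file_ids, kb_ids)
    return True
```

Keeping the checks ahead of `schedule` matters here because, per the description above, the worker removes existing file-document mappings before recreating them, so a late check would run after state had already changed.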
buua436	f703169117	Refa: migrate document preview/download to RESTful API (#14633 ) ### What problem does this PR solve? migrate document preview/download to RESTful API ### Type of change - [x] Refactoring	2026-05-08 13:26:13 +08:00
Jin Hai	94324afee9	Go: fix auth issue in hybrid mode (#14611 ) ### What problem does this PR solve? Since the secret key get and set logic was updated, the Go server also needs to be updated. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 17:14:22 +08:00
Wang Qi	c50028b1f3	Fix team member cannot edit agent (#14612 ) ### What problem does this PR solve? Follow-up to PR https://github.com/infiniflow/ragflow/pull/14602, fixing the case where a team member cannot edit an agent. New behavior: team members are allowed every operation except delete. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-07 15:09:13 +08:00
Jin Hai	1d0519d025	Fix secret key inconsistency across the RAGFlow servers (#14591 ) ### What problem does this PR solve? Consider two API servers, A and B, and a Redis server. If A and Redis restart, B still holds the obsolete secret key, which leads to errors. TODO: `app.config['SECRET_KEY']` and `app.secret_key` still hold the obsolete secret key. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 10:10:02 +08:00
Preston Percival	e8f19aa338	feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238 ) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. 
When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false\|0\|no\|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done\|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. 
## 5. Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. 
## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-06 15:01:01 +08:00
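The bounded insert pipeline from section 5 of the commit above can be illustrated with a small `asyncio` sketch. It is not the project's `insert_chunks_bounded()`: the function name `insert_bounded_sketch`, its signature and the `insert_batch` callback are assumptions, and the retry and timeout handling mentioned in the message is left out. Only the defaults (64 docs per batch, 4 batches in flight) come from the commit text.

```python
import asyncio


async def insert_bounded_sketch(chunks, insert_batch, bulk_size=64, concurrency=4):
    """Insert chunks in batches, with at most `concurrency` batches in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def _insert_one(batch):
        # The semaphore caps how many insert_batch calls run at once.
        async with sem:
            await insert_batch(batch)

    # Slice the input into consecutive batches of at most bulk_size items.
    batches = [chunks[i:i + bulk_size] for i in range(0, len(chunks), bulk_size)]
    await asyncio.gather(*(_insert_one(batch) for batch in batches))
    return len(batches)
```

With the defaults, no more than `64 × 4 = 256` documents are in flight at any moment, which is the bound the commit message quotes. A caller would run it with something like `asyncio.run(insert_bounded_sketch(chunks, my_insert))`, where `my_insert` is whatever coroutine writes one batch to the doc store.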
buua436	05ee7f8bb6	Fix: remove delete_documents uuid validation (#14533 ) ### What problem does this PR solve? remove delete_documents uuid validation ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 18:56:33 +08:00
buua436	06c6da5d94	Fix: add document delete permission check (#14472 ) ### What problem does this PR solve? add document delete permission check ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 11:01:09 +08:00
buua436	47129fdd08	Fix: optimize file batch delete (#14473 ) ### What problem does this PR solve? optimize file batch delete ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 11:00:39 +08:00
Wang Qi	b684c89950	Add backward compat APIs (#14427 ) ### What problem does this PR solve? Add backward compat APIs: ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 15:15:49 +08:00
euvre	35f6d81b73	Refactor: migrate chunk retrieval_test and knowledge_graph to REST API endpoints (#14402 ) ### What problem does this PR solve? ## Summary Migrate two web API endpoints to REST-style HTTP API endpoints, following the pattern established in #14222: \| Old Endpoint \| New Endpoint \| \|---\|---\| \| `POST /v1/chunk/retrieval_test` \| `POST /api/v1/datasets/<dataset_id>/search` \| \| `GET /v1/chunk/knowledge_graph` \| `GET /api/v1/datasets/<dataset_id>/graph` \|	2026-04-28 20:00:26 +08:00
Magicbook1108	85575259ac	Fix: google authentication - gmail && google-drive (#14422 ) ### What problem does this PR solve? Fix: google authentication - gmail && google-drive ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 18:09:02 +08:00
Jack	c81081f8ef	Refactor: Doc change parser (#14327 ) ### What problem does this PR solve? Before migration Web API: POST /v1/document/change_parser HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents After consolidation, Restful API PATCH /api/v1/datasets/<dataset_id>/documents ### Type of change - [x] Refactoring	2026-04-27 23:42:57 +08:00
Jack	c5116b90e5	Refactor: migrate document thumbnails API (#14344 ) ### What problem does this PR solve? Before migration: GET /v1/document/thumbnails After migration: GET /api/v1/thumbnails ### Type of change - [x] Refactoring	2026-04-27 21:29:09 +08:00
Jack	49912a156e	Refactor: migrate document run api (#14351 ) ### What problem does this PR solve? Before migration: POST /v1/document/run After migration: POST /api/v1/documents/ingest/ ### Type of change - [x] Refactoring	2026-04-27 21:25:58 +08:00
Jack	343bda1119	Refactor: deco document upload_and_parse API (#14366 ) ### What problem does this PR solve? remove unused "POST /v1/document/upload_and_parse" ### Type of change - [x] Refactoring	2026-04-27 20:35:00 +08:00
Jack	a536980e22	Refactor: Doc batch change status (#14337 ) ### What problem does this PR solve? Before migration Web API: POST /v1/document/change_status After consolidation, Restful API POST /api/v1/datasets/<dataset_id>/documents/batch-update-status ### Type of change - [x] Refactoring	2026-04-27 20:00:23 +08:00
buua436	82313020c7	Refa: align list operations and strict mode (#14387 ) ### What problem does this PR solve? align list operations and strict mode ### Type of change - [x] Refactoring	2026-04-27 19:13:00 +08:00
Jack	c1941fd503	Refactor: deco doc-parse API that is not used any more (#14367 ) ### What problem does this PR solve? Delete un-used API "POST /v1/document/parse" ### Type of change - [x] Refactoring	2026-04-27 18:54:49 +08:00
Jack	61a24a2c14	Refactor: migrate doc upload info used in chat (#14359 ) ### What problem does this PR solve? Before migration: POST /v1/document/upload_info/ After migration: POST /api/v1/documentss/upload/ ### Type of change - [x] Refactoring	2026-04-27 16:58:42 +08:00

138 Commits