146 Commits

Author SHA1 Message Date
kpdev
212429bf9d fix(api): gate sandbox artifact download on agent session ownership (#16169)
Fixes #16168

## Summary
- Add session-scoped authorization for `GET
/api/v1/documents/artifact/<filename>`
- Allow download only when the artifact filename appears in the caller's
`api_4_conversation` message and
`UserCanvasService.accessible(dialog_id, user_id)` passes
- Deny with generic `"Artifact not found."` before storage access (no
cross-user enumeration)
- Return 4xx when the blob is missing (existing behavior preserved)

## Approach
Sandbox artifacts are runtime CodeExec outputs, not KB documents — this
uses the same session gate pattern as `agent_chat_completion`, not
`DocumentService.accessible`.

## Test plan
- [x] Unit: denied when filename not referenced in user sessions
- [x] Unit: denied when agent canvas is not accessible
- [x] Unit: authorized user receives bytes; missing blob returns
`"Artifact not found."`
- [ ] `pytest
test/testcases/test_web_api/test_document_app/test_document_metadata.py
-k get_artifact`

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-29 09:45:16 +08:00
Hernandez Avelino
660970b253 fix(agent): add SSRF guard to Invoke HTTP component (#15426)
## Summary

Closes #15425. The agent **Invoke** (HTTP Request) component now calls
`assert_url_is_safe` and `pin_dns` before `requests.*`, matching Crawler
and SearXNG.

## Changes

- `agent/component/invoke.py`: SSRF guard + DNS pinning on outbound
requests.
- `test_invoke_component_unit.py`: unit test blocks loopback URL without
calling `requests.get`.

## Test plan

- [x] `pytest
test/testcases/test_web_api/test_canvas_app/test_invoke_component_unit.py::test_invoke_blocks_loopback_url_with_ssrf_guard`
(requires project test env / `ZHIPU_AI_API_KEY` in CI)

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-29 09:45:16 +08:00
Renzo
6079ded70b fix: require explicit anonymous webhook access (#14890)
### What problem does this PR solve?

Fixes #14882

Agent webhook execution currently fails open when the saved webhook
`security` block is missing/empty, or when `auth_type` is set to `none`.
This allows unauthenticated webhook invocation without an explicit
operator opt-in.

This PR makes anonymous webhook access explicit:
- Rejects missing or empty webhook security config.
- Requires `allow_anonymous: true` when `auth_type` is `none`.
- Preserves explicit anonymous webhooks by having the frontend serialize
`allow_anonymous: true` when the user selects `None` auth.
- Updates webhook unit tests to cover both denied implicit-anonymous
configs and allowed explicit-anonymous configs.

### Type of change

- [x] Bug Fix
- [x] Security hardening
- [x] Test

### Tests

- [x] `ZHIPU_AI_API_KEY=dummy uv run python -m pytest
--confcutdir=test/testcases/test_web_api/test_agent_app
test/testcases/test_web_api/test_agent_app/test_agents_webhook_unit.py`
- [x] `uv run ruff check api/apps/restful_apis/agent_api.py
test/testcases/test_web_api/test_agent_app/test_agents_webhook_unit.py`
- [x] `npm exec eslint src/pages/agent/utils.ts
src/pages/agent/form/begin-form/schema.ts`

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-29 09:45:16 +08:00
Zhichang Yu
faef22c18a Harden closed-advisory fixes (#16409)
## Summary
- harden reopened advisory fixes across REST connector, invoke, document
downloads, and markdown rendering
- add targeted regression coverage for redirect-safe SSRF handling,
invoke SSRF checks, document access control, and markdown sanitization
- verify each referenced GHSA against the original GitHub advisory text
and align the closed-advisory plan with the implemented remediation

## What changed
- add tenant access checks to document download endpoints to avoid
cross-tenant document disclosure
- add per-hop SSRF validation, DNS pinning, redirect handling, and
redirect limits to the REST API connector
- ensure invoke requests validate and pin the resolved host and never
follow redirects implicitly
- keep the generic rate-limited request path wrapped, not just GET and
POST helpers
- sanitize markdown HTML before rendering in the highlight markdown
component

## Validation
- `cd web && npm test -- --runInBand
src/components/highlight-markdown/__tests__/index.test.tsx`
- `.venv/bin/python -m pytest -q
test/unit_test/data_source/test_rest_api_connector.py`
- targeted `test/testcases/test_web_api/...` unit additions were
reviewed, but the suite cannot be executed end-to-end in this
environment because parent `test/testcases/conftest.py` requires a local
service on `127.0.0.1:9380`

## Notes
- all GHSA entries referenced by the plan were checked against the
original GitHub advisory text, not sampled
- the closed-advisory plan document was updated locally during review,
but is intentionally not included in this PR
2026-06-29 09:45:16 +08:00
Zhichang Yu
f58fae5fb7 feat(go-agent): Ported retrieval node, added Keenable web search tool (#16396)
Ported retrieval node, added Keenable web search tool
- [x] New Feature (non-breaking change which adds functionality)
2026-06-29 09:45:16 +08:00
Muhammad Furqan
3747a6bfeb fix(agent/tools): PubMed tool always returns "Unknown Authors" (#16330)
### What problem does this PR solve?

Fixes the PubMed tool always emitting `Authors: Unknown Authors`. The
`safe_find` closure in `_format_pubmed_content` was hardcoded to search
from the article root, so the per-author `LastName`/`ForeName` lookups
never matched.

`safe_find` now accepts an optional `base` node (defaults to `child`,
preserving the existing field lookups), and the author loop passes the
current `<Author>` element.

Closes #16328

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Add test cases

### Testing

Added `test/testcases/test_web_api/test_canvas_app/test_pubmed_unit.py`
covering per-author parsing, intact title/journal/DOI fields, and the
no-authors fallback.

Before: `Authors: Unknown Authors`
After:  `Authors: Furqan Khan, Jane Smith`
2026-06-25 14:34:37 +08:00
Carl Harris
a2de880b6d fix(profile): enforce profile name validation and input constraints (#15694)
### What problem does this PR solve?

The Profile **Name** field currently lacks application-level validation
and allows users to save excessively long names and unsupported special
characters.

While the database enforces a maximum length of 100 characters, neither
the frontend nor backend validates nickname format before persistence.
This can result in inconsistent user data, poor user experience, and UI
layout issues when long names wrap across multiple lines.

This PR introduces consistent frontend and backend validation for
profile names, enforces length and character constraints, provides clear
validation feedback, and prevents invalid values from being saved.

Fixes #15693

### Type of change

* [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-12 11:13:18 +08:00
monsterDavid
a851228ded fix(preview): authenticate markdown document preview requests (#15589)
## Summary

Fixes [#15585](https://github.com/infiniflow/ragflow/issues/15585).

- Route markdown preview through the shared `request` client (same as
txt/image previewers) so `Authorization` headers and interceptors are
applied consistently.
- Add a unit test covering `AUTH_BETA` token loading for embedded search
auth.

## Root cause

Search result preview for `.md`/`.mdx` used raw `fetch`, which did not
apply the same auth path as other preview types. That led to `401` on
`GET /api/v1/documents/{id}/preview` even when the user was logged in or
using an embedded search `auth` query param.

## Test plan

- [ ] Log in, run a search, open a markdown citation link — preview
loads (no 401).
- [ ] Open an embedded shared search URL with `auth` query param,
preview a markdown file — preview loads.
- [ ] Confirm PDF/txt preview still works in the same search UI.

---------

Co-authored-by: MkDev11 <89318445+bitloi@users.noreply.github.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-06-11 15:46:20 +08:00
Wang Qi
214ee319f8 Revert "fix(api): authorize owner_ids for list chats and search apps (#14775) (#15698)
This reverts PR #14775  commit 5a5e766386.
2026-06-05 17:26:02 +08:00
Wang Qi
4cbe597d7e Refactor: consolidate to use @login_required (#15652)
Refactor: consolidate to use @login_required
2026-06-05 11:35:00 +08:00
kpdev
76968af0ba Guard missing storage blobs on preview and image endpoints (#15366)
Fixes [#15365](https://github.com/infiniflow/ragflow/issues/15365) —
`get_document_image()` and document preview call `make_response(None)`
when storage returns no bytes, causing HTTP 500.
2026-06-03 11:33:03 +08:00
kpdev
0f6f7b3c3c fix(api): document image_id parsing for hyphenated thumbnail keys (#15115) (#15116)
### What problem does this PR solve?

Fixes #15115.

`GET /api/v1/documents/images/<image_id>` returned **Image not found**
when the thumbnail storage object key contained hyphens (e.g.
`page-1.png`). Document APIs build URLs as `{dataset_id}-{thumbnail}`,
but `get_document_image()` used `image_id.split("-")` and required
exactly two segments, so keys like `<kb_id>-page-1.png` were rejected
even though the blob existed.

This PR splits only on the first hyphen (`split("-", 1)`) and sets
`Content-Type` from the object key extension via `CONTENT_TYPE_MAP`
instead of hardcoding `image/JPEG`.
2026-06-02 10:54:14 +08:00
kpdev
252cc19f93 Infer Content-Type for document image endpoint (#15368)
## Summary

Fixes [#15367](https://github.com/infiniflow/ragflow/issues/15367) —
`GET /api/v1/documents/images/<image_id>` always returned `Content-Type:
image/JPEG` even for PNG/WebP chunk images and extensioned thumbnails.

## Related Issue

Fixes #15367

## Change Type

- [x] Bug fix
- [x] Regression tests
- [ ] New feature
- [ ] Refactor

## What Changed

- Added `_detect_image_content_type_from_bytes()` —
PNG/JPEG/GIF/WebP/BMP magic-byte detection
- Added `_content_type_for_document_image()` — object-key extension via
`CONTENT_TYPE_MAP`, then magic bytes, else `application/octet-stream`
- **`get_document_image()`** — set inferred `Content-Type` instead of
hardcoded `image/JPEG`
- Also guards missing storage blob (`Image not found.`) to avoid
`make_response(None)` (same handler; complements #15365)

## Files Changed

| File | Change |
|------|--------|
| `api/apps/restful_apis/document_api.py` | MIME inference helpers +
handler update |
|
`test/testcases/test_web_api/test_document_app/test_document_metadata.py`
| 3 unit tests |

## Validation

```bash
cd /root/gittensor/ragflow
pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_content_type_from_object_extension_unit -v
pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_content_type_from_magic_bytes_unit -v
pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_missing_blob_unit -v
```

## Test Plan

- [x] `.png` object key → `image/png`
- [x] Extensionless chunk key + PNG bytes → `image/png` (magic bytes)
- [x] Missing blob → 4xx `"Image not found."`
- [ ] CI green
2026-06-01 19:08:32 +08:00
kpdev
b35266e9a5 Return 4xx when file download storage blob is missing (#15371)
## Summary

Fixes [#15369](https://github.com/infiniflow/ragflow/issues/15369) —
`GET /api/v1/files/<file_id>` calls `make_response(None)` when both
primary and fallback storage lookups return empty, causing HTTP 500.

## Related Issue

Fixes #15369

## Change Type

- [x] Bug fix
- [x] Regression tests

## What Changed

- **`file_api.download()`** — after fallback `STORAGE_IMPL.get`, return
`get_error_data_result(message="This file is empty.")` when `not blob`,
matching document REST download semantics.

## Files Changed

| File | Change |
|------|--------|
| `api/apps/restful_apis/file_api.py` | Empty-blob guard before
`make_response()` |
| `test/testcases/test_web_api/test_file_app/test_file_routes_unit.py` |
Regression test |

## Validation

```bash
cd /root/gittensor/ragflow
pytest test/testcases/test_web_api/test_file_app/test_file_routes_unit.py::test_download_missing_blob_returns_error -v
pytest test/testcases/test_web_api/test_file_app/test_file_routes_unit.py::test_download_falls_back_to_document_storage -v
```

## Test Plan

- [x] Both storage paths empty → `"This file is empty."` (no
`make_response(None)`)
- [x] Existing fallback success test still passes
- [ ] CI green
2026-06-01 19:08:06 +08:00
galuis116
d1f6594618 Fix: JWT algorithm-confusion in OIDC ID token verification (#15181)
### What problem does this PR solve?

Closes #15180.

`OIDCClient.parse_id_token` in `api/apps/auth/oidc.py` read the JWT
signing
algorithm from the **unverified** JWT header and passed it through to
`jwt.decode(..., algorithms=[alg], ...)` as the trust anchor. This is
the
textbook JWT algorithm-confusion vulnerability (CWE-345 / CWE-347). Any
unauthenticated client capable of reaching the OIDC callback could take
over
an arbitrary account on any RAGFlow deployment with OIDC login enabled:

1. **`alg: "none"`** — present a JWT with `{"alg": "none"}` and no
   signature segment → `jwt.decode(..., algorithms=["none"])` → PyJWT's
   `NoneAlgorithm` accepts the token without verification → login as any
   user.
2. **RSA / HMAC confusion** — fetch the public RSA key from the
provider's
   JWKS (it's public), forge a JWT with `{"alg": "HS256"}` HMAC-signed
   using the public-key bytes as the secret → `jwt.decode(...,
   algorithms=["HS256"], key=public_key)` → verifier accepts → login as
   any user. (Modern PyJWT independently refuses to use a PEM-formatted
   key as an HMAC secret, which mitigates this leg for PEM key formats;
the fix here is the only mitigation for raw / DER / JWK octet keys and
   for older PyJWT versions.)

### What changed

**`api/apps/auth/oidc.py`:**

- New module constants `_ALLOWED_OIDC_SIGNING_ALGS` (asymmetric-only:
  `RS*`, `ES*`, `PS*`, `EdDSA` — explicitly excludes `none` and `HS*`)
  and `_DEFAULT_OIDC_SIGNING_ALGS = ("RS256",)` (the OIDC Core 1.0 §2
  spec default).
- New helper `_resolve_id_token_signing_algs(metadata)` — intersects the
  provider's advertised `id_token_signing_alg_values_supported` from
`/.well-known/openid-configuration` with the safe allowlist; falls back
  to RS256 when the field is missing or contains only unsafe values.
- `OIDCClient.__init__` now stores the resolved allowlist on
  `self.id_token_signing_algs` — pinned once, from a trusted source, at
  construction time.
- `parse_id_token` no longer calls `jwt.get_unverified_header` and no
  longer reads `alg` from the JWT header. It passes
  `self.id_token_signing_algs` to `jwt.decode(..., algorithms=...)`.
  `PyJWKClient.get_signing_key_from_jwt` still reads the `kid` from the
  header internally for JWKS lookup — that's fine, `kid` is not a
  security decision; the signature still proves which key was actually
  used.


**`test/testcases/test_web_api/test_auth_app/test_oidc_client_unit.py`:**

- Existing `test_parse_id_token_success_and_error` drops its
`jwt.get_unverified_header` mock (no longer called by `parse_id_token`).
- `_metadata` and `_make_client` helpers grew an optional `signing_algs`
parameter so tests can configure what the discovery document advertises.
- New `TestSSRFValidation` / algorithm-confusion regression block (7
  tests):
  - `test_id_token_signing_algs_default_to_rs256_when_metadata_missing`
  - `test_id_token_signing_algs_intersect_metadata_with_safe_allowlist`
  - `test_id_token_signing_algs_fall_back_when_only_unsafe_advertised`
  - `test_id_token_signing_algs_ignores_non_string_entries`
  - `test_id_token_signing_algs_handles_non_list_metadata_field`
  - `test_parse_id_token_passes_pinned_algorithms_to_jwt_decode` —
    sabotages `jwt.get_unverified_header` to raise on call, proving the
    verification path never consults the unverified header.
- `test_parse_id_token_rejects_alg_none` — uses real PyJWT to encode an
    `alg: "none"` token; `parse_id_token` raises `ValueError("Error
    parsing ID Token: …")` instead of accepting it.
  - `test_parse_id_token_rejects_hs256_when_allowlist_is_asymmetric` —
    uses real PyJWT to forge an `alg: "HS256"` token with a non-PEM
    shared secret (so PyJWT's incidental PEM-as-HMAC refusal isn't what
    blocks it); `parse_id_token` raises because `HS256` is not in the
    pinned allowlist.

Sanity-checked end-to-end with real PyJWT outside the project test
runner:

- `alg=none` forged token + `algorithms=["RS256"]` →
`InvalidAlgorithmError` ✓
- `alg=HS256` forged token + `algorithms=["RS256"]` →
`InvalidAlgorithmError` ✓
- Same `alg=HS256` token + `algorithms=["HS256"]` → **accepted**
({'sub': 'admin'})
  — confirming the attack path was real before the fix.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: galuis116 <contact@duerrimports.com>
2026-05-29 19:37:01 +08:00
kpdev
cb1ea5a47f Validate chunk image_base64 before doc-store write (#15364)
## Summary

Fixes [#15363](https://github.com/infiniflow/ragflow/issues/15363) —
`add_chunk` / `update_chunk` indexed chunks with `image_id` before
validating or storing `image_base64`, leaving orphan chunks on invalid
input.

## Related Issue

Fixes #15363

## Change Type

- [x] Bug fix
- [x] Regression tests

## What Changed

- Added `_decode_chunk_image_base64()` — strict base64 decode with
structured 4xx errors
- Added `_store_chunk_image_or_error()` — catches `store_chunk_image`
failures
- **`add_chunk` / `update_chunk`**: decode + store image **before**
`docStoreConn.insert` / `update`; only set `img_id` after successful
storage

## Files Changed

| File | Change |
|------|--------|
| `api/apps/restful_apis/chunk_api.py` | Helpers + reorder image
handling |
| `test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py`
| 3 regression tests |

## Validation

```bash
cd /root/gittensor/ragflow
pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_add_chunk_invalid_image_base64_does_not_index_chunk -v
pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_update_chunk_invalid_image_base64_does_not_update_chunk -v
pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_add_chunk_valid_image_base64_stores_before_insert -v
pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py -v
```

## Test Plan

- [x] Invalid `image_base64` on add → 4xx, no doc-store insert
- [x] Invalid `image_base64` on update → 4xx, no doc-store update
- [x] Valid PNG base64 on add → image stored, chunk indexed with
`img_id`
- [ ] CI green
2026-05-29 19:36:46 +08:00
Lynn
dc4b82523b Feat: tenant llm provider (#14595)
### What problem does this PR solve?

Python implementation of the Go-based model_provider API suite.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: bill <yibie_jingnian@163.com>
2026-05-29 17:39:41 +08:00
Wang Qi
0aff6a3f32 Feature: Allow page_size max value 100 (#15292)
Feature: Allow page_size max value 100
2026-05-28 11:13:01 +08:00
Wang Qi
f4d36f7082 Fix #15170 cannot filter document status (#15216)
Fix #15170 cannot filter document status

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-25 18:58:37 +08:00
Wang Qi
4776bfa8a2 Fix: Correct the API path (#15204)
Follow on PR #15146 to reslove the backwad compatability issue.

1. /agents/<attachment_id>/download ->
/agents/attachments/<attachment_id>/download

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-25 17:11:24 +08:00
Jonathan Chang
9d1006e4ec fix: The output of the parser in the ingestion pipeline contains HTML tags (#14920)
## Summary
This change fixes ingestion quality issues where MinerU parser output
may contain HTML fragments (for example, table-related tags like `<tr>`,
`<td>`, `<br>`), which were previously passed directly into
chunking/tokenization and degraded chunk quality.

The fix adds a sanitization step in the MinerU parser path so parsed
sections are normalized to clean text before chunking.

## Change Type (select all)
- [x] Bug fix
- [x] Ingestion pipeline improvement
- [x] Parser/chunking quality fix

## Related Issue
- https://github.com/infiniflow/ragflow/issues/14831
2026-05-25 16:06:36 +08:00
buua436
ea1764a7dc Revert "fix(api): infer /documents/{id}/download Content-Type from filename when ext is omitted (#15052)" (#15138)
Reverts infiniflow/ragflow#15053
2026-05-22 11:46:01 +08:00
kpdev
6932615852 fix(api): infer /documents/{id}/download Content-Type from filename when ext is omitted (#15052) (#15053)
## Summary

- Align **GET `/api/v1/documents/<doc_id>/download`** with
**`/preview`**: resolve extension and MIME type from the stored document
name when the **`ext` query parameter is omitted**, instead of
defaulting to `markdown`.
- When **`?ext=`** is present, behavior stays the same as before
(explicit extension / `Content-Type` mapping).
- Enforce the same access + document lookup pattern as preview
(**`accessible`** + **`get_by_id`**).
- Extend unit tests for the no-`ext` PDF filename case.

## Test plan

- [x] `uv run pytest
test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_download_attachment_success_and_exception_unit`
- [x] Optional: `curl -sSI` against
`/api/v1/documents/<pdf_doc_id>/download` without `ext` and confirm
`Content-Type: application/pdf`

Fixes #15052.
2026-05-21 15:31:36 +08:00
jony376
198f3c4b9a Fix: validate memory tenant model IDs on update and enforce tenant scope in memory pipeline (#14923)
### Related issues

Closes #14922

### What problem does this PR solve?

`POST /memories` already resolves `tenant_llm_id` and `tenant_embd_id`
through `ensure_tenant_model_id_for_params`, but `PUT
/memories/<memory_id>` accepted client-supplied `tenant_llm_id` /
`tenant_embd_id` without checking that those `tenant_llm` rows belong to
the memory owner’s tenant. A caller could persist another tenant’s row
IDs and later trigger extraction or embedding that loaded foreign model
credentials via `get_model_config_by_id(tenant_model_id)` with no tenant
allow-list.

This change aligns the update path with create: updates that change
models must go through `llm_id` / `embd_id` and
`ensure_tenant_model_id_for_params` scoped to the **memory’s**
`tenant_id` (not only the current user, so team-access cases stay
correct). Direct `tenant_*` fields in the body without `llm_id` /
`embd_id` are rejected. As defense in depth, `memory_message_service`
passes `allowed_tenant_ids` / `requester_tenant_id` into
`get_model_config_by_id` for LLM and embedding resolution so mismatched
IDs cannot be used even if bad data existed. A regression test rejects
payloads that set only `tenant_llm_id` / `tenant_embd_id`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-19 10:11:46 +08:00
Magicbook1108
b69a6a5d80 Feat: full optimization on connector dashboard (#14979)
### What problem does this PR solve?

This PR improves the connector dashboard task management experience and
adds better visibility into connector execution logs.

### Overview:

#### Before
<img width="700" alt="image"
src="https://github.com/user-attachments/assets/e4a8ed6f-2e18-4f0f-8528-41a514550052"
/>

#### Now:
<img width="700" alt="Screenshot from 2026-05-18 16-31-30"
src="https://github.com/user-attachments/assets/d4ca193b-847a-49ae-9e4f-5fbca60ea627"
/>

### 1. Add a new logging page to the connector dashboard

A new logging page has been added so users can view connector task
execution logs directly from the connector dashboard.

### 2. Merge the Resume button into Confirm

The separate **Resume** button has been removed. The **Confirm** button
now represents different actions depending on the current task state:

- **Save**: Save form changes and reschedule tasks.
- **Stop**: Cancel currently scheduled or running tasks.
- **Resume**: Create new scheduled tasks after the previous tasks have
been stopped.
- **Start**: Start tasks when no task has been started yet.

### 3. Separate syncing and pruning tasks

Connector tasks are now separated into **syncing** and **pruning**.

Pruning is controlled by the **Sync deleted files** option:

- When **Sync deleted files** is disabled, only syncing tasks are shown.
- When **Sync deleted files** is enabled, both syncing and pruning tasks
are shown.

**Now: Sync deleted files disabled**

<img width="700" alt="Sync deleted files disabled"
src="https://github.com/user-attachments/assets/dbd9232e-614a-407f-a0b1-c109e5fa567d"
/>

**Now: Sync deleted files enabled**

<img width="700" alt="Sync deleted files enabled"
src="https://github.com/user-attachments/assets/1f527f48-ccb3-4ee8-97ca-086891489296"
/>

### 4. Update logs in backend

<img width="700" alt="image"
src="https://github.com/user-attachments/assets/10a95a3f-98c1-4e67-8afa-ddf6cda5b0b2"
/>

### 5. Remove connector resume API

- Removed: `POST /v1/connectors/<connector_id>/resume`
- Replaced by: `PATCH /v1/connectors/<connector_id>`


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-19 10:07:11 +08:00
dev
b12eaee38b fix(api): enforce tenant access for connector routes (#14747)
### What problem does this PR solve?

Fixes #14746.

Adds tenant access checks for connector-by-id REST routes before reading
connector details, mutating connector config/status, deleting
connectors, rebuilding, or listing sync logs. Unauthorized callers now
receive `RetCode.AUTHENTICATION_ERROR` with `No authorization.` without
reaching the connector/log mutation paths.

Validation:
- `python3 -m pytest
--confcutdir=test/testcases/test_web_api/test_connector_app
test/testcases/test_web_api/test_connector_app/test_connector_routes_unit.py`
- `uvx ruff check api/apps/restful_apis/connector_api.py
api/db/services/connector_service.py
test/testcases/test_web_api/test_connector_app/test_connector_routes_unit.py`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: dev111-actor <dev111-actor@users.noreply.github.com>
2026-05-18 16:09:26 +08:00
sirj0k3r
b2b63600f1 Adds gpt-5.4-mini and gpt-5.4-nano (#14908)
### What problem does this PR solve?
Includes gpt-5.4-mini and gpt-5.4-nano to the OpenAI model list

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-14 10:16:24 +08:00
plind
dd76653dc1 feat: add tag management for Agents with filtering and sorting (#14774) (#14799)
## Summary
Closes #14774.

Adds free-form tags on agents (UserCanvas) with full UI + API:
- Stored as comma-separated `tags` column on `UserCanvas` with online
migration.
- New endpoints: `GET /v1/agents/tags` (aggregate counts) and `PUT
/v1/agent/<id>/tags` (write). `GET /v1/agents` accepts a `tags=` query.
- "Edit tags" item in agent dropdown opens a chip-style editor dialog;
tags render as badges on each agent card.
- New "Tags" facet in the agents filter bar, with counts.

## Implementation notes
- **Tag matching is exact-token**: the SQL filter wraps stored tags as
`,…,` and matches `,ml,` so `ml` doesn't match `ml-ops`.
- **Server-side normalization** in `UserCanvasService.update_tags`:
dedup (case-insensitive), per-tag cap of 64 chars, total length capped
at 512 chars to fit the column, commas inside tag values are replaced
with spaces.
- **Tenant authorization**: `PUT /v1/agent/<id>/tags` gates on
`UserCanvasService.accessible(canvas_id, tenant_id)`.
- **Tag listing scope**: `UserCanvasService.list_tags` follows the same
own + team-shared rule as `get_by_tenant_ids`.
- **i18n**: keys added to `en.ts` and `zh.ts` only (per project
convention; other locales fall back).
- **`HomeCard`** gets a non-breaking `extra?: ReactNode` slot for the
chip row; no `src/components/ui/` files modified.

## Test plan
- [ ] Backend boot runs `migrate_db` → confirm `user_canvas.tags` column
exists (`DESCRIBE user_canvas`).
- [ ] Agents page renders cards normally (no console error from missing
field).
- [ ] `⋯ → Edit tags` opens a dialog that stays open (regression: dialog
was unmounting with the dropdown).
- [ ] Typing a tag without pressing Enter and clicking Save persists it
(regression: last typed tag was being dropped).
- [ ] Chip input supports Enter/comma to commit, Backspace on empty to
remove, `×` to remove individual chip.
- [ ] Tag containing a comma sent via API is stored with the comma
replaced by a space.
- [ ] 20 long tags sent via API does not error (length cap silently
truncates).
- [ ] "Tags" filter in the filter bar shows counts and narrows the list.
- [ ] Filtering by `ml` does **not** return agents tagged `ml-ops`.
- [ ] UI in Chinese shows 编辑标签 / 添加标签以整理和筛选你的智能体 etc.
- [ ] `PUT /v1/agent/<other-tenant-id>/tags` returns `Agent not found or
no permission.`
2026-05-13 21:41:32 +08:00
dale053
5a5e766386 fix(api): authorize owner_ids for list chats and search apps (#14775)
Closes #14768
### What problem does this PR solve?
The `list_chats` and `list_searches` REST API endpoints did not enforce
authorization on the `owner_ids` query parameter. Any authenticated user
could pass arbitrary tenant IDs to `owner_ids` and retrieve chats or
search apps belonging to other tenants they are not a member of.

This PR resolves the issue by:
1. Looking up the current user's authorized tenants via
`TenantService.get_joined_tenants_by_user_id` and rejecting any
`owner_ids` that fall outside that set.
2. When no `owner_ids` are provided, scoping the query to only the
user's authorized tenants instead of returning an unfiltered result.
3. Adding unit tests that verify unauthorized `owner_ids` are rejected
with `OPERATING_ERROR`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-13 09:43:44 +08:00
0xτensor
127aeac4aa fix: expose gpt-5.5 and gpt-5.4 in OpenAI model list (#14828)
### What problem does this PR solve?

OpenAI model catalogs used in provider selection flows were missing the
latest GPT models (`gpt-5.5` and `gpt-5.4`).
Because model availability is driven by seeded catalog data
(`conf/llm_factories.json` → DB seed → API response), these models were
not selectable in the UI or `/llm/list` responses.

This PR updates and synchronizes the OpenAI catalog definitions across
configuration sources and ensures the new models are correctly exposed
through the API layer and validated in tests.

---

### Type of change

* [x] New Feature (non-breaking change which adds functionality)

---

### Changes Made

* Added `gpt-5.5` and `gpt-5.4` to OpenAI catalog definitions in:

  * `conf/llm_factories.json`
  * `conf/models/openai.json` (chat + vision support)
* Ensured consistency between DB-seeded factory config and provider
model configuration
* Updated test coverage in:

  * `test_llm_list_unit.py`

    * seeded OpenAI catalog entries
* added response-level assertion validating `/llm/list` includes both
new model IDs under OpenAI grouping

---

### Root Cause

OpenAI model listings in selection flows are generated from catalog data
seeded via `conf/llm_factories.json`.
The catalog had not been updated to include the latest GPT models,
resulting in missing availability in UI and API responses.

---

### Testing

* Created isolated test environment:

  * `python -m venv .venv-review`
  * installed `pytest`
* Ran targeted and full test suite:

  * `test_list_app_grouping_availability_and_merge`:  passed
  * Full `test_llm_list_unit.py`:  10 passed

---

### Risks / Limitations

* Adding models to the catalog does not guarantee upstream provider
availability or account entitlement.
* Environments with pre-seeded DB catalogs may require reseed or refresh
to reflect updated configuration.

---

### Notes

* Changes are minimal and scoped strictly to catalog configuration and
related test coverage.
* Ensures `/llm/list` API remains aligned with expected latest OpenAI
model availability.
* Closes #14827
2026-05-12 18:03:47 +08:00
hyl64
02c2587ca4 fix(agent): support iteration item aliases in child nodes (#14146)
## Summary
This PR fixes the iteration variable mismatch reported in #14142.

Changes:
- restore compatibility for `IterationItem@result` by exposing `result`
alongside `item`
- support bare iteration aliases like `{item}`, `{index}`, and
`{result}` inside iteration child-node inputs
- add focused unit/runtime tests covering both alias styles and
multi-item iteration execution

## Validation
```bash
pytest -q --noconftest \
  test/testcases/test_web_api/test_canvas_app/test_iterationitem_unit.py \
  test/testcases/test_web_api/test_canvas_app/test_iteration_runtime_unit.py \
  test/testcases/test_web_api/test_canvas_app/test_invoke_component_unit.py
```

Result: `12 passed`

Closes #14142
2026-05-12 13:05:21 +08:00
buua436
daf8a58c4b Fix: add codeexec attachments output (#14787)
### What problem does this PR solve?

add codeexec attachments output

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-11 19:16:33 +08:00
box4wangjing
292b0b8bce chore: fix some comments to improve readability (#14756)
### What problem does this PR solve?

fix some comments to improve readability

### Type of change

- [x] Documentation Update

---------

Signed-off-by: box4wangjing <box4wangjing@outlook.com>
2026-05-11 16:48:48 +08:00
buua436
a03b95f8c4 Fix: shared dataset chunk index lookup (#14764)
### What problem does this PR solve?

shared dataset chunk index lookup

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-11 13:50:08 +08:00
Mehmet Karakose
7ec87f7cb7 fix(auth): fall back to session-based auth in _load_user (#14569)
## Summary

Closes #13663.

OAuth / OIDC callbacks call `login_user(user)` which writes `_user_id`
into the session cookie, but `_load_user()` in `api/apps/__init__.py`
only ever looked at the `Authorization` header. The SPA's response
interceptor wipes the Authorization value from `localStorage` on the
first 401 it sees — meaning that during the post-redirect window after
an OAuth login, a single transient 401 sends every subsequent request
back to the login page even though `login_user()` had already
established a perfectly good server-side session.

The reporter's analysis traces this all the way through the redirect →
`navigate('/')` → first request → empty header → 401 → `removeAll()` →
infinite-redirect-to-login chain.

## What changed

- New `_load_user_from_session()` helper that reads
`session["_user_id"]`, looks up the user in `UserService` (with the same
`StatusEnum.VALID` and `access_token` checks already used elsewhere),
and assigns `g.user`.
- Every `return None` path in `_load_user()` now routes through that
helper before giving up:
  - missing `Authorization` header
  - malformed `bearer ` prefix
  - empty / too-short JWT payload
  - JWT signature failure
  - JWT-resolved user not found / has no `access_token`
  - `APIToken.query()` fallback exhausted

The JWT and API-token paths still take precedence — the session is only
consulted when those can't authenticate the request. So existing
local-login and SDK callers see no behaviour change; only OAuth / OIDC
users that hit the original race now stay logged in.

The Bearer-prefix issue called out in #13663 (lines 103-110) is already
handled in the current code, so this PR only addresses the second half
of the report.

## Test plan

- [ ] Configure OIDC under `oauth` in `service_conf.yaml`
- [ ] Click the OIDC login button, complete auth at the IdP
- [ ] Confirm that navigating between pages no longer bounces back to
`/login`
- [ ] Confirm local email/password login still issues + accepts JWTs
- [ ] Confirm SDK/API key callers still authenticate via `Authorization:
Bearer <api-token>`

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-05-11 09:59:52 +08:00
euvre
f4b8f53b6d Fix: restore embedding model switching for datasets with existing chunks (#14732)
### What problem does this PR solve?

## Problem

During the REST API refactoring (#13690), the
`/api/v2/kb/check_embedding` endpoint was removed and never migrated to
the new RESTful structure. The frontend was pointed to the
`/api/v1/datasets/{id}/embedding` endpoint (which is `run_embedding` — a
completely different function). Additionally, a hard guard was
introduced that rejects any `embd_id` change when `chunk_num > 0`,
making it impossible to switch embedding models on datasets with
existing chunks.

## Root Cause

1. **Missing endpoint**: The old `check_embedding` logic (sample random
chunks, re-embed with the new model, compare cosine similarity) was not
carried over to the new REST API service layer.
2. **Wrong frontend URL**: `checkEmbedding` in `api.ts` pointed to
`/datasets/{id}/embedding` (`run_embedding`) instead of a dedicated
check endpoint.
3. **Overly restrictive guard**: `dataset_api_service.py` line 310
blocked all `embd_id` updates when `chunk_num > 0`. This check did not
exist in the pre-refactor code — it was incorrectly introduced during
the refactor.

## Changes

### Backend

- **`api/apps/services/dataset_api_service.py`**
  - Remove the `chunk_num > 0` hard guard on `embd_id` updates
- Add `check_embedding()` service function: samples random chunks,
re-embeds them with the candidate model, computes cosine similarity,
returns compatibility result (avg ≥ 0.9 = compatible)
  - Add `import re` for the `_clean()` helper

- **`api/apps/restful_apis/dataset_api.py`**
- Add `POST /datasets/<dataset_id>/embedding/check` endpoint following
the new REST API conventions
  - Clean up unused top-level imports (`random`, `re`, `numpy`)

### Frontend

- **`web/src/utils/api.ts`**
- Fix `checkEmbedding` URL from `/datasets/${datasetId}/embedding` →
`/datasets/${datasetId}/embedding/check`

### Tests

-
**`test/testcases/test_http_api/test_dataset_management/test_update_dataset.py`**
- Update `test_embedding_model_with_existing_chunks` to assert success
(`code == 0`) instead of expecting the old `102` error

-
**`test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`**
- Update `test_update_route_branch_matrix_unit` to assert
`RetCode.SUCCESS` when updating `embd_id` on a chunked dataset,
replacing the old `chunk_num` error assertion

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-05-09 18:48:57 +08:00
akie
c11650bb4c Fix IDOR: Add permission checks to file ancestry endpoints (#14725)
Close #14292

## Issue

File ancestry endpoints return folder metadata without validating tenant
permissions, allowing any authenticated user to query arbitrary
`file_id` values across tenant boundaries.

## Affected Endpoints
- `GET /v1/file/parent_folder?file_id={file_id}`
- `GET /v1/file/all_parent_folder?file_id={file_id}`  
- `GET /api/v1/files/{id}/ancestors`

## Root Cause

These endpoints **skip the permission check** that other file operations
(Delete, Download, Move) perform.

## Expected Permission Check

All file operations should follow this 3-step validation:

- Check file.tenant_id
- Check if user_id belongs to this tenant (via user_tenant join table)
- Check KB permission type (team permission)


**Code reference:** This is implemented in `checkFileTeamPermission()`
and used by Delete/Download/Move, but **missing** from
GetParentFolder/GetAllParentFolders.

## Reproduction

```bash
# User B (tenant: BBB) accessing User A's file (tenant: AAA)
curl -H "Authorization: Bearer USER_B_TOKEN" \
  "http://localhost:9384/v1/file/parent_folder?file_id=AAA_FILE_123"

# Result: Returns User A's folder metadata 
# Expected: "No authorization." 
Fix
Pass userID from handler to service and call checkFileTeamPermission() — same as Download/Delete/Move handlers.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 16:03:23 +08:00
web-dev0521
d51fb88573 Fix: enforce tenant authorization on document download endpoint (#14618) (#14625)
### What problem does this PR solve?

Closes #14618.

The `GET /v1/document/get/<doc_id>` endpoint in
`api/apps/document_app.py` was protected only by `@login_required` and
called `DocumentService.get_by_id(doc_id)` without verifying that the
document's knowledge base belonged to the requesting user's tenant. Any
authenticated user who knew (or guessed) a document ID could download
files belonging to any other tenant — a cross-tenant IDOR.

This PR adds a `DocumentService.accessible(doc_id, current_user.id)`
check before serving the file. The helper already exists and joins
`Document` → `Knowledgebase` → `UserTenant` to verify the requesting
user belongs to the tenant that owns the document's KB. The same pattern
is already used by `api/apps/restful_apis/document_api.py` and mirrors
the tenant scoping in the SDK route at `api/apps/sdk/doc.py`.

The check returns the existing `"Document not found!"` error for both
non-existent and inaccessible documents, so attackers cannot use the
response to enumerate valid doc IDs across tenants.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Other (please describe): Security fix (cross-tenant IDOR /
authorization bypass)
2026-05-08 14:24:03 +08:00
jony376
6547751936 Fix: missing authorization checks in /files/link-to-datasets (#14649)
### Related issues
Closes #14648

### What problem does this PR solve?

This PR fixes an authorization flaw in `POST /files/link-to-datasets`.

Before this change, the endpoint only checked whether the supplied
`file_ids` and `kb_ids` existed. It did not verify whether the
authenticated user was actually allowed to access those files or target
datasets. As a result, an authenticated user who knew valid IDs could
relink another user's files to arbitrary datasets.

This was especially risky because the relinking flow is state-changing:
the background worker removes existing file-document mappings and then
recreates documents under the attacker-supplied dataset IDs.

This change makes the route enforce the same permission model already
used by nearby file and document operations:

- each resolved file must pass `check_file_team_permission(...)`
- each target dataset must pass `check_kb_team_permission(...)`
- authorization is enforced before scheduling background relinking work

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Testing

- Added regression coverage in
`test/testcases/test_web_api/test_file_app/test_file2document_routes_unit.py`
- Covered:
  - unauthorized file access is rejected
  - unauthorized dataset access is rejected
- existing success path still returns immediately after scheduling
background work
- Attempted to run:
- `python -m pytest
test\\testcases\\test_web_api\\test_file_app\\test_file2document_routes_unit.py
-q`
- Local execution in this workspace is currently blocked by missing test
dependencies during bootstrap, including `ragflow_sdk`

---------

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-08 13:49:23 +08:00
buua436
f703169117 Refa: migrate document preview/download to RESTful API (#14633)
### What problem does this PR solve?

migrate document preview/download to RESTful API

### Type of change
- [x] Refactoring
2026-05-08 13:26:13 +08:00
Jin Hai
94324afee9 Go: fix auth issue in hybrid mode (#14611)
### What problem does this PR solve?

Since secret key get and set logic is updated, the go server also need
to update.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-05-07 17:14:22 +08:00
Wang Qi
c50028b1f3 Fix team member cannot edit agent (#14612)
### What problem does this PR solve?

Follow on PR: https://github.com/infiniflow/ragflow/pull/14602
to fix: team member cannot edit agent.
new behavior: beside delete, everything is allowed for team member.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-07 15:09:13 +08:00
Jin Hai
1d0519d025 Fix secret key inconsistency cross the RAGFlow servers (#14591)
### What problem does this PR solve?

A and B, two API servers and a REDIS server.
If A and REDIS restart, B will hold the obsolete secret key and will
lead to error.

TODO:
app.config['SECRET_KEY'] and app.secret_key still hold obsolete secret
key.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2026-05-07 10:10:02 +08:00
Preston Percival
e8f19aa338 feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238)
This PR addresses three related GraphRAG reliability issues that
together allow long-running GraphRAG tasks (10+ hours of LLM extraction)
to be resumed after a crash or pause without re-doing completed work. It
builds on #14096 (per-doc subgraph cache) and extends the same idea to
the resolution and community-detection phases.

Fixes #14236.

## 1. Fix concurrent merge crash

Long GraphRAG runs would crash near the end of entity resolution with:
```
RuntimeError: dictionary keys changed during iteration
```
in `Extractor._merge_graph_nodes`. Two changes:

- `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)`
via `list(...)` before iterating, so concurrent `add_edge` /
`remove_node` mutations on the shared `nx.Graph` cannot invalidate the
iterator. Also tracks each redirected neighbour in `node0_neighbors` so
a later merged node sharing the same external neighbour takes the
edge-merge branch instead of overwriting via `add_edge`.
- `rag/graphrag/entity_resolution.py`: serialize the merge step with a
dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and
concurrent merges on overlapping neighbourhoods can produce incorrect
results even with the snapshot fix.

## 2. Don't wipe partial graph on pause

Previously the pause / cancel UI path called
`settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`,
destroying every subgraph, entity, relation, and graph row.
Re-triggering then started GraphRAG from scratch even though #14096 had
already added `load_subgraph_from_store`.

After main was merged in (which deleted `api/apps/kb_app.py` per
#14394), the pause path now lives on the new REST surface `DELETE
/v1/datasets/<id>/<index_type>`:

- `api/apps/services/dataset_api_service.py`: `delete_index` accepts a
`wipe: bool = True` parameter. When `False` the doc-store rows and
GraphRAG phase markers are left intact and only the running task is
cancelled. Default preserves historical behaviour.
- `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false|0|no|off`
from the query string and forwards it.
- `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`:
`unbindPipelineTask` appends `?wipe=false` when explicitly false.
- The GraphRAG pause action in
`web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe:
false` for `KnowledgeGraph`; raptor is unchanged.

**UX impact:** the pause icon next to a running GraphRAG task no longer
wipes graph data. The only path that still wipes is the explicit Delete
action in `GenerateLogButton` (trash icon behind a confirmation modal).

## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`)

A small Redis-backed marker layer at
`graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL).
`run_graphrag_for_kb` consults the markers on entry and skips phases
that already completed in a prior run. Markers are cleared automatically
when:
- new docs are merged into the graph (which invalidates prior resolution
and community results),
- `delete_index` wipes the graph, or
- `delete_knowledge_graph` is called.

Redis failures never block a run -- markers are an optimization, not a
gate.

## 4. Idempotent community detection

`extract_community` previously did `delete-then-insert` on
`community_report` rows; a crash mid-insert left the dataset with no
reports. Now report IDs are derived deterministically from `(kb_id,
community.title)`, the existing report IDs are snapshotted before
insert, new rows are written, then only stale rows are pruned. A failure
at any step leaves either the prior or the new report set intact --
never a partial mix.

## 5. Tunable doc-store insert pipeline

The GraphRAG insert loop in `rag/graphrag/utils.py` and the
`community_report` insert in `rag/graphrag/general/index.py` were both
hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real
KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure
round-trip overhead.

- New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py`
batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout
semantics as the prior loop.
- Defaults: 64 docs per batch, 4 batches in flight (matches the regular
ingest pipeline in `document_service.py`). Tunable per-deployment via
`GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`.
- Both `set_graph` and `extract_community` now use the helper.

This dropped the same 1077-chunk insert from minutes to seconds in local
testing without measurable extra pressure on Infinity (total in-flight
docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default).

## Tests

- `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests):
dense neighbourhood merge, neighbour-snapshot regression, concurrent
serialized merges.
- `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has
round-trip, kb-scoped clear, no-op on empty input, graceful Redis
failure.
-
`test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`:
new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both
GraphRAG and raptor on the new REST route, and confirms the default
still wipes and clears phase markers.

## Compatibility

- Backward compatible: tasks queued before this change behave
identically (default `wipe=true`, no markers expected).
- No schema/migration changes; all new state lives in Redis.
- New optional REST query param `wipe` on `DELETE
/v1/datasets/<id>/<index_type>`.
- New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and
`GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour.

## Example of resume

Screenshot below shows a test resuming knowledge graph generation after
applying the concurrency fix and re-deploying.

<img width="521" height="677" alt="image"
src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a"
/>

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2026-05-06 15:01:01 +08:00
buua436
05ee7f8bb6 Fix: remove delete_documents uuid validation (#14533)
### What problem does this PR solve?

remove delete_documents uuid validation

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 18:56:33 +08:00
buua436
06c6da5d94 Fix: add document delete permission check (#14472)
### What problem does this PR solve?

add document delete permission check

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 11:01:09 +08:00
buua436
47129fdd08 Fix: optimize file batch delete (#14473)
### What problem does this PR solve?

optimize file batch delete

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 11:00:39 +08:00
Wang Qi
b684c89950 Add backward compat APIs (#14427)
### What problem does this PR solve?

Add backward compat APIs:

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-29 15:15:49 +08:00
euvre
35f6d81b73 Refactor: migrate chunk retrieval_test and knowledge_graph to REST API endpoints (#14402)
### What problem does this PR solve?

## Summary

Migrate two web API endpoints to REST-style HTTP API endpoints,
following the pattern established in #14222:

| Old Endpoint | New Endpoint |
|---|---|
| `POST /v1/chunk/retrieval_test` | `POST
/api/v1/datasets/<dataset_id>/search` |
| `GET /v1/chunk/knowledge_graph` | `GET
/api/v1/datasets/<dataset_id>/graph` |
2026-04-28 20:00:26 +08:00
Magicbook1108
85575259ac Fix: google authentication - gmail && google-drive (#14422)
### What problem does this PR solve?

Fix: google authentication - gmail && google-drive

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 18:09:02 +08:00