77 Commits

Author SHA1 Message Date
euvre
9bd53ce675 fix: return full record in get_ingestion_log (#16120)
### What problem does this PR solve?

The `get_ingestion_log` endpoint (both Python
`dataset_api_service.get_ingestion_log` and Go
`DatasetService.GetIngestionLog`) was returning only the
**dataset-level** field set, which omits critical fields such as `dsl`,
`document_id`, `parser_id`, `document_name`, `pipeline_id`, etc.

This caused the front-end **dataflow-result page** to be unable to
render the pipeline timeline and chunks when viewing a single ingestion
log, regardless of whether the log was a dataset-level operation
(graph/raptor/mindmap) or a per-file parse.

### Background

`PipelineOperationLogService` provides two field sets:

| Method | Fields |
|---|---|
| `get_dataset_logs_fields` | Minimal set (progress, status, timestamps,
etc.) |
| `get_file_logs_fields` | Superset — includes `document_id`, `dsl`,
`parser_id`, `document_name`, `pipeline_id`, … |

When listing logs, the API correctly distinguishes dataset-level vs
file-level logs and uses the appropriate converter. However, when
**fetching a single log by ID**, both the Python and Go implementations
were hardcoded to the dataset-level set, dropping the extra fields that
the front-end needs.
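
A minimal sketch of why the dataset-level field set breaks the single-log view. The field lists and the `project` helper are hypothetical stand-ins; the real lists come from `get_dataset_logs_fields` and `get_file_logs_fields`.

```python
# Hypothetical field sets; the real ones live in PipelineOperationLogService.
DATASET_LOG_FIELDS = ["id", "progress", "status", "create_time"]
FILE_LOG_FIELDS = DATASET_LOG_FIELDS + [
    "document_id", "dsl", "parser_id", "document_name", "pipeline_id",
]


def project(record, fields):
    """Keep only the requested fields, as a field-set converter would."""
    return {name: record.get(name) for name in fields}


record = {
    "id": "log-1", "progress": 1.0, "status": "done", "create_time": 0,
    "document_id": "doc-1", "dsl": {"nodes": []}, "parser_id": "naive",
    "document_name": "a.pdf", "pipeline_id": "p-1",
}

# Projection onto the minimal set drops what the dataflow-result page needs.
dataset_view = project(record, DATASET_LOG_FIELDS)
# Projection onto the superset keeps the full record.
full_view = project(record, FILE_LOG_FIELDS)
```

Because the file-level set is a superset, using it for single-log fetches should not remove anything the dataset-level view returned.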
2026-06-17 13:03:51 +08:00
Lynn
b4a161b50e Fix: filter unsupported model_type (#16062)
### What problem does this PR solve?

As title.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-16 13:15:42 +08:00
Lynn
47495c1f6a Feat: model provider (#16028)
### What problem does this PR solve?

Feat:
- Allow upserting `model_type` for an instance model

Fix:
- Allow creating an instance with a duplicate `api_key`

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-06-15 19:10:33 +08:00
balibabu
70ae25fc7b Fix: Remove the pagination from the search and retrieval pages. (#15942)
### What problem does this PR solve?

Fix: Remove the pagination from the search and retrieval pages.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-11 16:36:05 +08:00
bohdansolovie
47fb462e46 fix(api): guard dataset delete when File2Document row is missing (#15533)
## Summary
Fixes #15532 — `delete_datasets()` crashes with `IndexError` when a
document has no `File2Document` row.
`delete_datasets()` in `dataset_api_service.py` called
`File2DocumentService.get_by_document_id()` and immediately accessed
`f2d[0].file_id` without checking whether the lookup returned any rows.
Documents created via API ingestion or connector sync may exist without
a linked file record, causing dataset deletion to abort with HTTP 500.
This PR mirrors the existing guard already used in `file_service.py` and
`document_api_service.py`.
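
The guard can be sketched roughly as follows. `File2DocumentRow` and `linked_file_id` are hypothetical names for illustration; only the "check before indexing `f2d[0]`" pattern is taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class File2DocumentRow:
    """Hypothetical stand-in for a File2Document row."""
    file_id: str


def linked_file_id(f2d: List[File2DocumentRow]) -> Optional[str]:
    """Return the linked file id, or None when the document has no file row."""
    if not f2d:
        # Documents from API ingestion or connector sync may have no link.
        return None
    return f2d[0].file_id
```

A caller would then skip the file cleanup step when `None` comes back and continue deleting the dataset.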
2026-06-11 15:18:08 +08:00
Idriss Sbaaoui
9871a7e0b6 fix: replicate model provider (#15933)
### What problem does this PR solve?

Fix the Replicate model provider failing with a valid API key.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-06-11 15:08:33 +08:00
Wang Qi
238a01d9e3 Fix multiple tags (#15931)
Fix multiple tags
2026-06-11 10:55:28 +08:00
Lynn
32559d2dfc Fix: model list (#15914)
### What problem does this PR solve?

Display OCR tag for model providers.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-11 09:40:45 +08:00
Wang Qi
acaeb416ca Fix cannot add fish audio (#15913)
Fix cannot add fish audio
2026-06-10 20:27:43 +08:00
balibabu
aafe6c5534 Fix: The dataset retrieval test returned an incorrect total number. (#15901)
### What problem does this PR solve?

Fix: The dataset retrieval test returned an incorrect total number.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: balibabu <assassin_cike@163.com>
2026-06-10 19:11:31 +08:00
Wang Qi
3091d91cf7 Fix no need to put inactive models to bottom (#15903)
Fix no need to put inactive models to bottom
2026-06-10 16:55:02 +08:00
buua436
dcf623d60d feat: support multi-type factory models (#15893)
### What problem does this PR solve?
Support factory models with multiple model types, so visual chat models
can be exposed as both image2text and chat while preserving the database
model-type-per-record design.

This also updates the SILICONFLOW model list and adds a helper script to
refresh SiliconFlow models from the provider API.

### Type of change
- [x] New Feature (non-breaking change which adds functionality)
2026-06-10 15:35:21 +08:00
Lynn
478c9846a1 Fix: model list (#15860)
### What problem does this PR solve?

Remove tenant_llm call in rag.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-10 14:59:57 +08:00
Wang Qi
9aa81e7cad Fix paddle ocr / minerU cannot add (#15858)
Fix paddle ocr / minerU cannot add
2026-06-10 13:04:13 +08:00
Wang Qi
7ed1f1c865 Fix VLLM cannot add without /v1 (#15851)
Fix VLLM cannot add without /v1
2026-06-09 19:11:15 +08:00
Wang Qi
2773208159 Fix: MinerU cannot be added (#15841)
Fix: MinerU cannot be added
2026-06-09 19:06:51 +08:00
euvre
f97d6396b4 fix: BaiduYiyan API key validation fails in set_api_key (#15828)
### What problem does this PR solve?

When setting the API key for the BaiduYiyan provider, all model
validations fail with the error "Fail to access model using this api
key. No valid response received".

**Root cause:**

1. `BaiduYiyanChat` in `rag/llm/chat_model.py` does not override
`async_chat_streamly()`. The `verify_api_key()` function uses
`mdl.async_chat_streamly()` to validate, but `BaiduYiyanChat` inherits
`Base.async_chat_streamly()` which uses the OpenAI client, not the Baidu
Qianfan SDK (qianfan). Since BaiduYiyan has no OpenAI-compatible
base_url, validation always fails.

2. `verify_api_key()` in `provider_api_service.py` does not format the
raw API key string into the JSON format (`{"yiyan_ak": "...",
"yiyan_sk": "..."}`) that `BaiduYiyanChat.__init__()` expects via
`json.loads(key)`.

**Fix:**

1. Add `async_chat_streamly()` method to `BaiduYiyanChat` using the
qianfan SDK, consistent with the existing `chat_streamly()` method.
2. Add BaiduYiyan API key formatting in `provider_api_service.py`
`verify_api_key()` to match the format expected by
`BaiduYiyanChat.__init__()`.
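
A small sketch of the key format only. `format_yiyan_key` is a hypothetical helper, and how the raw key string is split into the two parts is not stated here, so the sketch takes them as separate arguments.

```python
import json


def format_yiyan_key(access_key: str, secret_key: str) -> str:
    """Build the JSON string that a json.loads(key) call can parse."""
    return json.dumps({"yiyan_ak": access_key, "yiyan_sk": secret_key})


# Round trip: what the constructor would see after json.loads(key).
parsed = json.loads(format_yiyan_key("ak-demo", "sk-demo"))
```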

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2026-06-09 19:05:58 +08:00
buua436
c1496ffd43 fix: propagate memory tenant id in task collect (#15837)
### What problem does this PR solve?
Propagate `tenant_id` from memory task messages into task collection so
refactored task execution can build a valid context.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-09 17:47:48 +08:00
Lynn
1ab51a27bf Fix: list intl Tongyi-Qianwen base_url (#15831)
### What problem does this PR solve?

Display intl `base_url` for Tongyi-Qianwen

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-09 13:19:39 +08:00
Lynn
b9f06e6095 Feat: model list (#15774)
### What problem does this PR solve?

Support model list for VolcEngine.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-08 20:18:00 +08:00
buua436
0c5245e454 fix: await lmstudio embedding verification (#15772)
### What problem does this PR solve?

Fix LM-Studio provider connection verification so embedding checks await
the async wrapper correctly and log the full traceback on failures.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-08 19:17:47 +08:00
buua436
6bf7056422 feat: add placeholder model metas (#15753)
### What problem does this PR solve?

add placeholder model metas

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-08 14:54:59 +08:00
qinling0210
c960dc2a4c Refine handling of POST /api/v1/datasets/search in GO (#15583)
### What problem does this PR solve?

Refine handling of POST /api/v1/datasets/search in GO

### Type of change

- [x] Refactoring
2026-06-08 11:49:37 +08:00
Lynn
b05d5a5228 Feat: get model list from remote (#15711)
### What problem does this PR solve?

Feat:
- Get model list from remote provider. 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-08 11:02:40 +08:00
Lynn
794c1f4b25 Fix: volc engine and other json key factories (#15653)
### What problem does this PR solve?

Fix:
- VolcEngine adapt to new api_key format
- Save dict api_key as json

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-05 09:45:44 +08:00
Lynn
b65b18ba4c Fix: model provider (#15634)
### What problem does this PR solve?

Do not display `success` when the check has not passed.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-04 16:05:00 +08:00
Lynn
597ac1e900 Fix: search bot and verify model instance (#15588)
### What problem does this PR solve?

Fix:
- Verify provider with empty llm list in llm_factories.json
- Set search bot's chat_llm_name, use tenant default chat model as
default

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-04 11:59:55 +08:00
euvre
9a9d3ddf5f fix: show default embedding model when provider is not yet registered (#15511)
### What problem does this PR solve?

### Problem

On the Model Providers page, the Embedding Model dropdown in System
Model Settings shows empty (no default selected), even though a default
embedding model is configured in `service_conf.yaml`.

### Root Cause

Two issues were identified:

1. **Backend: `_get_model_info` fails for unregistered providers**
The tenant's `embd_id` is set to `bge-m3@xxxx` during initialization
(from the placeholder config `factory: 'xxxx'`). The `_get_model_info`
function requires the provider to exist in `tenant_model_provider`
table, but `xxxx` is never a real provider. Even after the user adds a
real provider (e.g., ZHIPU-AI), the stale `embd_id` still references the
non-existent one, causing the function to return `None`.

2. **Frontend: default models cache not invalidated after adding
provider**
`useAddProviderInstance` only invalidates `addedProviders` and
`allModels` caches after adding a provider instance, but does **not**
invalidate the `defaultModels` cache. This means the default model list
is not re-fetched until the user manually refreshes the page.

### Fix

**`api/apps/services/models_api_service.py`**

- Added `_resolve_model_from_tenant_providers()` helper: when the
default model's provider doesn't exist (e.g., placeholder `xxxx`), it
searches through the tenant's actually registered providers for a model
of the same type and returns the first match.
- When an instance name doesn't match (e.g., `"default"` vs actual name
`"1"`), the function now auto-resolves to the first real instance under
that provider.
- Falls back to `FACTORY_LLM_INFOS` validation when neither provider nor
instance exists.
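
The missing-provider branch could look roughly like this. The data layout and `resolve_default` are assumptions made for illustration; only the `model@instance@provider` ID shape and the "first model of the same type" rule come from this description. The instance-name mismatch and `FACTORY_LLM_INFOS` branches are not covered.

```python
from typing import Optional

# Hypothetical snapshot of a tenant's registered providers:
# provider -> instance name -> model type -> model names.
REGISTERED = {
    "ZHIPU-AI": {"1": {"embedding": ["embedding-2"], "chat": ["chat-demo"]}},
}


def resolve_default(model_id: str, model_type: str, registered: dict) -> Optional[str]:
    """Keep model_id if its provider is registered, else pick the first
    registered model of the same type."""
    provider = model_id.rsplit("@", 1)[-1]
    if provider in registered:
        return model_id
    for provider_name, instances in registered.items():
        for instance_name, by_type in instances.items():
            names = by_type.get(model_type, [])
            if names:
                return f"{names[0]}@{instance_name}@{provider_name}"
    return None
```

With the placeholder ID `bge-m3@xxxx` and the snapshot above, the sketch returns `embedding-2@1@ZHIPU-AI`, matching the case listed under Testing.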

**`web/src/hooks/use-llm-request.tsx`**

- Added `queryClient.invalidateQueries({ queryKey:
LlmKeys.defaultModels() })` to `useAddProviderInstance` so that the
default model list is re-fetched immediately after a provider instance
is added, eliminating the need for a manual page refresh.

### Testing

- Verified with a tenant whose `embd_id=bge-m3@xxxx` and only provider
is ZHIPU-AI (instance `1`): `_resolve_model_from_tenant_providers`
correctly resolves to `embedding-2@1@ZHIPU-AI`.
- After adding a provider via the UI, the embedding model dropdown now
immediately shows the resolved default without requiring a page refresh.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-06-04 09:55:49 +08:00
bohdansolovie
ae316b3415 fix(api): guard document rename when linked file row is missing (#15536)
## Summary
Fixes #15534 — `update_document_name_only()` crashes with
`AttributeError` when `File2Document` exists but the linked `File` row
was deleted.

`update_document_name_only()` in `document_api_service.py` called
`FileService.get_by_id()` when a `File2Document` row existed, then
accessed `file.id` without checking the lookup result. An orphan
`File2Document` link (file deleted, mapping left behind) caused document
rename via `PATCH /api/v1/datasets/{dataset_id}/documents/{document_id}`
to return HTTP 500.

This PR mirrors guards used in `file2document_api.py` and
`file_api_service.py`: skip the optional file rename when the file is
missing, and still update the document record and search index.
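
A rough sketch of the guarded rename, using plain dicts instead of the real models. `rename_document` is a hypothetical function; `found` and `file` mirror the `(e, file)` pair checked by `e and file`.

```python
from typing import Optional


def rename_document(doc: dict, new_name: str, found: bool, file: Optional[dict]) -> bool:
    """Rename the document; rename the linked file only when it still exists.

    Returns True when the file was renamed as well.
    """
    doc["name"] = new_name  # the document record is always updated
    if found and file:
        file["name"] = new_name
        return True
    # Orphan File2Document link: skip the optional file rename.
    return False
```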

## Changes
- `api/apps/services/document_api_service.py` — check `e and file`
before `FileService.update_by_id`
- `test/unit_test/api/apps/services/test_update_document_name_only.py` —
regression tests (orphan link + happy path)

## Test plan
- [x] `pytest
test/unit_test/api/apps/services/test_update_document_name_only.py -v`
- [ ] Manual: PATCH document `name` when `File2Document` points to a
non-existent `file_id` → 200, document/index renamed, no 500
2026-06-03 17:57:19 +08:00
Lynn
ac3964b6bc Feat: display intl url for siliconflow and verify model provider without llms in json (#15550)
### What problem does this PR solve?

As title.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-03 14:43:08 +08:00
Wang Qi
583daf47d5 Fix: model provider orders (#15524)
Fix: model provider orders
2026-06-03 10:17:12 +08:00
Lynn
36357a6afd Fix: model provider (#15517)
### What problem does this PR solve?

Fix:
- Handle siliconflow and siliconflow_intl api_key

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-02 19:04:20 +08:00
Lynn
3bc5ed282e Fix: model-provider bugs (#15460)
### What problem does this PR solve?

Fix:
- Use `@` to avoid splitting on `_` in model_name.
- Verify api_key when adding an instance.
- Pop api_key from the list-instances response.
- Remove a useless index.
- Sort providers, instances and models by name.
- Get `is_tools` from llm_factories.json.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-02 13:24:53 +08:00
Lynn
dc4b82523b Feat: tenant llm provider (#14595)
### What problem does this PR solve?

Python implementation of the Go-based model_provider API suite.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: bill <yibie_jingnian@163.com>
2026-05-29 17:39:41 +08:00
Wang Qi
7e6844118b Fix search vector_similarity_weight (#15108)
### What problem does this PR solve?

Fix search vector_similarity_weight

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-22 16:05:13 +08:00
jony376
198f3c4b9a Fix: validate memory tenant model IDs on update and enforce tenant scope in memory pipeline (#14923)
### Related issues

Closes #14922

### What problem does this PR solve?

`POST /memories` already resolves `tenant_llm_id` and `tenant_embd_id`
through `ensure_tenant_model_id_for_params`, but `PUT
/memories/<memory_id>` accepted client-supplied `tenant_llm_id` /
`tenant_embd_id` without checking that those `tenant_llm` rows belong to
the memory owner’s tenant. A caller could persist another tenant’s row
IDs and later trigger extraction or embedding that loaded foreign model
credentials via `get_model_config_by_id(tenant_model_id)` with no tenant
allow-list.

This change aligns the update path with create: updates that change
models must go through `llm_id` / `embd_id` and
`ensure_tenant_model_id_for_params` scoped to the **memory’s**
`tenant_id` (not only the current user, so team-access cases stay
correct). Direct `tenant_*` fields in the body without `llm_id` /
`embd_id` are rejected. As defense in depth, `memory_message_service`
passes `allowed_tenant_ids` / `requester_tenant_id` into
`get_model_config_by_id` for LLM and embedding resolution so mismatched
IDs cannot be used even if bad data existed. A regression test rejects
payloads that set only `tenant_llm_id` / `tenant_embd_id`.
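
The body-level rule can be sketched as below. `validate_memory_update` and its error strings are hypothetical; only the rule that a direct `tenant_*` field without its `llm_id` / `embd_id` counterpart is rejected comes from this description.

```python
# Each direct tenant_* field is only acceptable next to its name-based field.
PAIRS = {"tenant_llm_id": "llm_id", "tenant_embd_id": "embd_id"}


def validate_memory_update(payload: dict) -> list:
    """Return a list of error messages; an empty list means the body is accepted."""
    errors = []
    for direct_field, name_field in PAIRS.items():
        if direct_field in payload and name_field not in payload:
            errors.append(f"{direct_field} requires {name_field}")
    return errors
```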

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-19 10:11:46 +08:00
Wang Qi
732e4741c4 Bugfix: fix tag show (#14980)
### What problem does this PR solve?

Bugfix: fix tag show

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-18 18:55:01 +08:00
qinling0210
f1d2383572 Push metadata filters down to Infinity (#14974)
### What problem does this PR solve?

Push metadata filters down to Infinity

### Type of change

- [x] Refactoring
2026-05-18 14:22:04 +08:00
dale053
bd99a22661 fix: atomic chunk/token counter updates for documents and knowledge b… (#14867)
### What problem does this PR solve?

Fixes #14866.

Previously, `DocumentService.increment_chunk_num` and
`decrement_chunk_num` updated the `Document` row and its parent
`Knowledgebase` row in two separate, non-transactional statements. If
the second update failed (DB error, connection drop, etc.) after the
first one succeeded, the document and knowledge base chunk/token
counters would drift apart and stay inconsistent.

There was also a behavioral asymmetry between the two methods:

- `increment_chunk_num` only logged a warning when the document row was
missing and returned a value that callers usually treated as success.
- `decrement_chunk_num` raised `LookupError` in the same situation.

This PR makes the counter updates atomic and aligns the missing-document
behavior between the two methods:

- Wrap the `Document` and `Knowledgebase` updates in
`increment_chunk_num` / `decrement_chunk_num` inside a `DB.atomic()`
block so both succeed or both roll back together.
- Raise `LookupError` from `increment_chunk_num` when the target
document no longer exists, matching `decrement_chunk_num`.
- Update `reset_document_for_reparse` in `document_api_service.py` to
catch the new `LookupError` and return a proper "Document not found!"
API error instead of propagating the exception.

No schema changes, no API contract changes for the success path; only
the failure mode for a missing document during reparse is now a clean
error response instead of an uncaught exception.
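
To illustrate the both-or-neither behaviour without the project's ORM, here is a sketch that swaps in the standard-library `sqlite3` transaction context in place of `DB.atomic()`. Table layout and the function body are assumptions; raising `LookupError` for a missing knowledge base row is added here only to simulate a failing second update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id TEXT PRIMARY KEY, chunk_num INTEGER NOT NULL);
    CREATE TABLE knowledgebase (id TEXT PRIMARY KEY, chunk_num INTEGER NOT NULL);
    INSERT INTO document VALUES ('d1', 5);
    INSERT INTO knowledgebase VALUES ('kb1', 5);
""")


def increment_chunk_num(conn, doc_id, kb_id, delta):
    """Update both counters in one transaction, or neither."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE document SET chunk_num = chunk_num + ? WHERE id = ?",
            (delta, doc_id),
        )
        if cur.rowcount == 0:
            raise LookupError(f"Document {doc_id} not found")
        cur = conn.execute(
            "UPDATE knowledgebase SET chunk_num = chunk_num + ? WHERE id = ?",
            (delta, kb_id),
        )
        if cur.rowcount == 0:
            # Simulated failure of the second statement.
            raise LookupError(f"Knowledge base {kb_id} not found")


def chunk_num(conn, table, row_id):
    """Read one counter; table is a trusted literal in this sketch."""
    row = conn.execute(f"SELECT chunk_num FROM {table} WHERE id = ?", (row_id,)).fetchone()
    return row[0]
```

If the second update fails, the first one is rolled back, so the two counters cannot drift apart.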

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-14 14:48:52 +08:00
jony376
7f699d1202 Fix: enforce tenant authorization for tenant_rerank_id in retrieval flows (#14782)
### Related issues

Closes #14781 

### What problem does this PR solve?

Some retrieval endpoints accepted caller-supplied `tenant_rerank_id` and
resolved it through `get_model_config_by_id(...)`. That helper loaded
`TenantLLM` rows by global database id and returned decoded model
configuration without checking whether the model belonged to the
authenticated tenant or the dataset owner tenant.

This meant dataset access was validated, but rerank-model selection was
not. A caller who knew or could guess another tenant's
`tenant_rerank_id` could attempt retrieval with a foreign rerank model
config, creating a cross-tenant authorization gap for model usage.

This PR closes that gap by making `tenant_rerank_id` resolution
tenant-aware across the retrieval paths that accept it.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Solution

- Extend `get_model_config_by_id(...)` to accept an optional
`allowed_tenant_ids` set and reject `TenantLLM` rows whose `tenant_id`
is outside that set.
- Pass the allowed tenant scope from retrieval endpoints that accept
`tenant_rerank_id`:
  - `api/apps/sdk/doc.py`
  - `api/apps/sdk/session.py`
  - `api/apps/services/dataset_api_service.py`
- Use the authenticated tenant plus dataset-owner tenant ids already
derived by each retrieval flow as the authorization boundary for rerank
model selection.
- Add focused unit coverage to assert unauthorized `tenant_rerank_id`
values are rejected and that the allowed tenant set is propagated
correctly.
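
A toy version of the tenant-aware lookup, assuming an in-memory dict in place of the `TenantLLM` table. `get_model_config_sketch` and the choice of `PermissionError` are illustrative, not the project's actual signature or error type.

```python
from typing import Optional, Set

# Hypothetical rows standing in for the TenantLLM table: id -> row.
TENANT_LLM = {
    1: {"tenant_id": "tenant-a", "llm_name": "rerank-a"},
    2: {"tenant_id": "tenant-b", "llm_name": "rerank-b"},
}


def get_model_config_sketch(model_id: int, allowed_tenant_ids: Optional[Set[str]] = None) -> dict:
    """Load a row by global id, rejecting rows outside the allowed tenant set."""
    row = TENANT_LLM.get(model_id)
    if row is None:
        raise LookupError(f"model {model_id} not found")
    if allowed_tenant_ids is not None and row["tenant_id"] not in allowed_tenant_ids:
        raise PermissionError(f"model {model_id} is outside the allowed tenant scope")
    return row
```

Keeping `allowed_tenant_ids` optional leaves existing callers unchanged, while retrieval endpoints pass the authenticated tenant plus the dataset-owner tenants.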

### Testing

- `python -m py_compile` on:
  - `api/db/joint_services/tenant_model_service.py`
  - `api/apps/services/dataset_api_service.py`
  - `api/apps/sdk/doc.py`
  - `api/apps/sdk/session.py`
- Added unit tests in:
  - `test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py`
  - `test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py`

### Notes for reviewers

- This change is intentionally narrow: it affects only the
`tenant_rerank_id` path, not the normal `rerank_id` name-based
resolution path.
- Local lint/syntax checks passed.
- Full pytest execution could not be completed in this environment
because the local test runtime is missing `strenum`, so the route-test
files fail during collection before exercising the updated cases.

---------

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-13 19:53:08 +08:00
Wang Qi
45d676bc05 Fix delete graphrag not take effect in UI (#14879)
### What problem does this PR solve?

Fix: deleting GraphRAG in the UI did not take effect.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-13 13:49:16 +08:00
euvre
f4b8f53b6d Fix: restore embedding model switching for datasets with existing chunks (#14732)
### What problem does this PR solve?

## Problem

During the REST API refactoring (#13690), the
`/api/v2/kb/check_embedding` endpoint was removed and never migrated to
the new RESTful structure. The frontend was pointed to the
`/api/v1/datasets/{id}/embedding` endpoint (which is `run_embedding` — a
completely different function). Additionally, a hard guard was
introduced that rejects any `embd_id` change when `chunk_num > 0`,
making it impossible to switch embedding models on datasets with
existing chunks.

## Root Cause

1. **Missing endpoint**: The old `check_embedding` logic (sample random
chunks, re-embed with the new model, compare cosine similarity) was not
carried over to the new REST API service layer.
2. **Wrong frontend URL**: `checkEmbedding` in `api.ts` pointed to
`/datasets/{id}/embedding` (`run_embedding`) instead of a dedicated
check endpoint.
3. **Overly restrictive guard**: `dataset_api_service.py` line 310
blocked all `embd_id` updates when `chunk_num > 0`. This check did not
exist in the pre-refactor code — it was incorrectly introduced during
the refactor.

## Changes

### Backend

- **`api/apps/services/dataset_api_service.py`**
  - Remove the `chunk_num > 0` hard guard on `embd_id` updates
  - Add `check_embedding()` service function: samples random chunks,
    re-embeds them with the candidate model, computes cosine similarity,
    returns compatibility result (avg ≥ 0.9 = compatible)
  - Add `import re` for the `_clean()` helper

- **`api/apps/restful_apis/dataset_api.py`**
  - Add `POST /datasets/<dataset_id>/embedding/check` endpoint following
    the new REST API conventions
  - Clean up unused top-level imports (`random`, `re`, `numpy`)
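
The compatibility check described for `check_embedding()` can be sketched in pure Python as follows. The function signature, the `(text, vector)` input shape, and the sample size are assumptions; only the sample, re-embed, cosine similarity, and average ≥ 0.9 steps come from this description.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_embedding_sketch(stored, embed, sample_size=5, threshold=0.9, seed=0):
    """stored: list of (text, stored_vector); embed: candidate model callable."""
    rng = random.Random(seed)
    sample = rng.sample(stored, min(sample_size, len(stored)))
    sims = [cosine(vector, embed(text)) for text, vector in sample]
    avg = sum(sims) / len(sims) if sims else 0.0
    return {"avg_similarity": avg, "compatible": avg >= threshold}
```

A candidate model that reproduces the stored vectors averages 1.0 and passes; one that maps the same text to orthogonal vectors averages 0.0 and is reported as incompatible.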

### Frontend

- **`web/src/utils/api.ts`**
  - Fix `checkEmbedding` URL from `/datasets/${datasetId}/embedding` →
    `/datasets/${datasetId}/embedding/check`

### Tests

- **`test/testcases/test_http_api/test_dataset_management/test_update_dataset.py`**
  - Update `test_embedding_model_with_existing_chunks` to assert success
    (`code == 0`) instead of expecting the old `102` error

- **`test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`**
  - Update `test_update_route_branch_matrix_unit` to assert
    `RetCode.SUCCESS` when updating `embd_id` on a chunked dataset,
    replacing the old `chunk_num` error assertion

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: noob <yixiao121314@outlook.com>
2026-05-09 18:48:57 +08:00
akie
c11650bb4c Fix IDOR: Add permission checks to file ancestry endpoints (#14725)
Close #14292

## Issue

File ancestry endpoints return folder metadata without validating tenant
permissions, allowing any authenticated user to query arbitrary
`file_id` values across tenant boundaries.

## Affected Endpoints
- `GET /v1/file/parent_folder?file_id={file_id}`
- `GET /v1/file/all_parent_folder?file_id={file_id}`  
- `GET /api/v1/files/{id}/ancestors`

## Root Cause

These endpoints **skip the permission check** that other file operations
(Delete, Download, Move) perform.

## Expected Permission Check

All file operations should follow this 3-step validation:

- Check file.tenant_id
- Check if user_id belongs to this tenant (via user_tenant join table)
- Check KB permission type (team permission)


**Code reference:** This is implemented in `checkFileTeamPermission()`
and used by Delete/Download/Move, but **missing** from
GetParentFolder/GetAllParentFolders.
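
The three checks can be sketched as below. The real `checkFileTeamPermission()` is Go code; this Python version, the `FileRecord` fields (`created_by`, `kb_permission`), the `"team"` value, and the creator shortcut are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical file row; field names are not taken from the schema."""
    tenant_id: str
    created_by: str
    kb_permission: str  # assumed values: "me" or "team"


def can_read_ancestry(file: FileRecord, user_id: str, user_tenants: dict) -> bool:
    """user_tenants maps user id -> set of tenant ids (the join table)."""
    # Step 1: the file must belong to a tenant.
    if not file.tenant_id:
        return False
    # Assumed shortcut: the creator can always read their own file.
    if file.created_by == user_id:
        return True
    # Step 2: the user must be a member of the file's tenant.
    if file.tenant_id not in user_tenants.get(user_id, set()):
        return False
    # Step 3: the knowledge base must be shared with the team.
    return file.kb_permission == "team"
```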

## Reproduction

```bash
# User B (tenant: BBB) accessing User A's file (tenant: AAA)
curl -H "Authorization: Bearer USER_B_TOKEN" \
  "http://localhost:9384/v1/file/parent_folder?file_id=AAA_FILE_123"

# Result: Returns User A's folder metadata
# Expected: "No authorization."
```

## Fix

Pass `userID` from the handler to the service and call `checkFileTeamPermission()`, the same as the Download/Delete/Move handlers.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 16:03:23 +08:00
Wang Qi
7d35e40c7b Refactor : Allow search multiple datasets (#14685)
### What problem does this PR solve?

Refactor : Allow search multiple datasets
1. support /datasets/search
2. get rid of /graph/search, use /graph

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
2026-05-08 19:01:35 +08:00
Lynn
ada6d47880 Fix: move file check (#14681)
### What problem does this PR solve?

Restrict file move operations: prevent moving a folder to itself or to
one of its own subfolders.
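
One way to express the restriction is to walk up from the destination and see whether the folder being moved appears among its ancestors. `is_invalid_move` and the `parent_of` map are hypothetical; the actual check may be implemented differently.

```python
def is_invalid_move(folder_id, dest_id, parent_of):
    """True when dest_id is folder_id itself or lies inside its subtree.

    parent_of maps each folder id to its parent id (None for the root).
    """
    seen = set()
    current = dest_id
    while current is not None and current not in seen:
        if current == folder_id:
            return True
        seen.add(current)  # guards against a corrupted, cyclic parent chain
        current = parent_of.get(current)
    return False
```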

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-08 17:58:37 +08:00
sxxtony
59c35100c5 Perf: push metadata filters down to Elasticsearch (#14576)
### What problem does this PR solve?

Fixes #14412.

`common.metadata_utils.meta_filter` evaluates user-defined metadata
conditions in Python after `DocMetadataService.get_flatted_meta_by_kbs`
loads the entire `meta_fields` table into memory. Past a few thousand
documents per knowledge base this becomes a memory bottleneck and a
wasted ES round-trip — every filter request currently fetches up to
10000 metadata rows even when the resulting `doc_ids` list is tiny.

This PR adds an ES push-down path that translates the same filter
language into a `bool` query and returns just the matching document IDs.

**Changes**

- `common/metadata_es_filter.py` *(new)*: pure-Python translator from
the RAGflow filter list to ES DSL. Covers every operator the in-memory
path supports (`=`, `≠`, `>`, `<`, `≥`, `≤`, `in`, `not in`, `contains`,
`not contains`, `start with`, `end with`, `empty`, `not empty`) with
`case_insensitive: true` on `prefix` and `wildcard` for parity with the
existing lower-cased Python comparisons. User wildcard metacharacters
are escaped before being injected into `wildcard` patterns. Negative
operators (`≠`, `not in`, `not contains`, ranges) are wrapped with an
`exists` guard so they do not accidentally match documents missing the
key, matching the legacy `if k not in metas` behaviour.
- `api/db/services/doc_metadata_service.py`: new
`DocMetadataService.filter_doc_ids_by_meta_pushdown(kb_ids, filters,
logic)` that returns the doc IDs ES matched, or `None` to signal the
caller should fall back to the in-memory path. Returns `None` when the
active doc store is Infinity (`meta_fields` is a JSON column, not a
dotted-object mapping), when any filter cannot be expressed in DSL
(`UnsupportedMetaFilter`), or when the ES request or metadata index
lookup errors.
- `common/metadata_utils.py`: `apply_meta_data_filter` accepts an
optional `kb_ids` argument. When supplied, conditions go through
push-down first via a new `_try_meta_pushdown` helper; on `None` the
function falls back to the original `meta_filter` call. Default
behaviour is unchanged for callers that don't pass `kb_ids`.
- Updated all four callers (`agent/tools/retrieval.py`,
`api/db/services/dialog_service.py` ×2,
`api/apps/services/dataset_api_service.py`, `api/apps/sdk/session.py`)
to forward `kb_ids` so the push-down path is exercised in production.
- `test/unit_test/common/test_metadata_es_filter.py` *(new)*: 35 unit
tests covering every operator's DSL shape, value coercion
(`ast.literal_eval`, lowercasing, ISO-date pass-through), wildcard
escaping, OR-logic wrapping that protects negative clauses, and the
doc-ID extractor.
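
To make the DSL shapes concrete, here is a reduced sketch covering four operators. The `meta_fields.` field prefix, the function names, and the local `UnsupportedMetaFilter` class are assumptions; the `exists` guard, wildcard escaping, and `case_insensitive` flag follow the description above.

```python
class UnsupportedMetaFilter(Exception):
    """Local stand-in: the filter cannot be expressed in DSL."""


def escape_wildcard(value: str) -> str:
    """Escape characters that have special meaning in a wildcard pattern."""
    out = []
    for ch in value:
        if ch in "\\*?":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def translate(key: str, op: str, value: str, prefix: str = "meta_fields.") -> dict:
    field = prefix + key
    exists = {"exists": {"field": field}}
    if op == "=":
        return {"term": {field: value}}
    if op == "≠":
        # The exists guard keeps documents without the key out of the result.
        return {"bool": {"must": [exists], "must_not": [{"term": {field: value}}]}}
    if op == "contains":
        pattern = f"*{escape_wildcard(value)}*"
        return {"wildcard": {field: {"value": pattern, "case_insensitive": True}}}
    if op == "start with":
        return {"prefix": {field: {"value": value, "case_insensitive": True}}}
    raise UnsupportedMetaFilter(op)
```

Raising on unknown operators gives the caller a clear signal to fall back to the in-memory path.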

**Behaviour preserved**

- The in-memory `meta_filter` is untouched and still services every
fallback case (Infinity backend, unknown operators, ES outages).
- The eligibility / credibility / issue-multiplier semantics described
in the LLM-driven `auto` and `semi_auto` modes still hand the LLM the
full in-memory `metas` dict to choose conditions from. Only the
*evaluation* of those generated conditions is pushed down.
- Existing tests in
`test/unit_test/common/test_metadata_filter_operators.py` continue to
pass (14/14).

**Test plan**

- `pytest test/unit_test/common/test_metadata_es_filter.py` — 35 passed.
- `pytest test/unit_test/common/test_metadata_filter_operators.py` — 14
passed.
- `ruff check` clean on every modified file.
- Reviewer please validate the ES query shapes against a live cluster —
particularly `case_insensitive` on `wildcard` and `prefix` (requires ES
7.10+) and the `exists` + `must_not` pairing for `≠`.

**Notes**

- The first cut caps each push-down request at 10000 results, matching
the existing `get_flatted_meta_by_kbs` limit, and logs a warning when
the cap is hit. A `search_after` follow-up would let us drop the cap
entirely once the push-down path is validated.
- Operator parity with the in-memory path is exact for the canonical
unicode operators (`≥`, `≤`, `≠`) used internally; the ASCII aliases
(`>=`, `<=`, `!=`) are normalised by `convert_conditions` before they
reach the translator.

### Type of change

- [x] Performance Improvement

---------

Co-authored-by: sxxtony <sxxtony@users.noreply.github.com>
2026-05-07 21:23:43 +08:00
Wang Qi
f32034e83e Refactor: completion -> completions (#14584)
### What problem does this PR solve?

Keep only `/completions`; `/completion` is deprecated.

### Type of change

- [x] Refactoring
2026-05-06 17:19:22 +08:00
Preston Percival
e8f19aa338 feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238)
This PR addresses three related GraphRAG reliability issues that
together allow long-running GraphRAG tasks (10+ hours of LLM extraction)
to be resumed after a crash or pause without re-doing completed work. It
builds on #14096 (per-doc subgraph cache) and extends the same idea to
the resolution and community-detection phases.

Fixes #14236.

## 1. Fix concurrent merge crash

Long GraphRAG runs would crash near the end of entity resolution with:
```
RuntimeError: dictionary keys changed during iteration
```
in `Extractor._merge_graph_nodes`. Two changes:

- `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)`
via `list(...)` before iterating, so concurrent `add_edge` /
`remove_node` mutations on the shared `nx.Graph` cannot invalidate the
iterator. Also tracks each redirected neighbour in `node0_neighbors` so
a later merged node sharing the same external neighbour takes the
edge-merge branch instead of overwriting via `add_edge`.
- `rag/graphrag/entity_resolution.py`: serialize the merge step with a
dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and
concurrent merges on overlapping neighbourhoods can produce incorrect
results even with the snapshot fix.
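
Both ideas can be shown on a plain dict-of-sets graph, used here as a stand-in for `nx.Graph`. `merge_nodes` and `merge_all` are hypothetical and much simpler than `_merge_graph_nodes`: the neighbour set is snapshotted with `list(...)` before it is mutated, and merges are serialized with `asyncio.Semaphore(1)`.

```python
import asyncio


def merge_nodes(adj, keep, drop):
    """Redirect every edge of `drop` onto `keep`, then remove `drop`."""
    for neighbor in list(adj[drop]):  # snapshot: adj[drop] is mutated below
        adj[drop].discard(neighbor)
        adj[neighbor].discard(drop)
        if neighbor != keep:
            adj[keep].add(neighbor)
            adj[neighbor].add(keep)
    del adj[drop]


async def merge_all(adj, pairs):
    """Run merges as concurrent tasks, but let only one touch the graph at a time."""
    merge_lock = asyncio.Semaphore(1)

    async def merge_one(keep, drop):
        async with merge_lock:
            if keep in adj and drop in adj:
                merge_nodes(adj, keep, drop)

    await asyncio.gather(*(merge_one(k, d) for k, d in pairs))


adj = {
    "a": {"b", "x"}, "b": {"a", "c", "y"}, "c": {"b", "z"},
    "x": {"a"}, "y": {"b"}, "z": {"c"},
}
asyncio.run(merge_all(adj, [("a", "b"), ("a", "c")]))
```

Without the `list(...)` snapshot, the loop would modify the set it is iterating over and raise a `RuntimeError`.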

## 2. Don't wipe partial graph on pause

Previously the pause / cancel UI path called
`settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`,
destroying every subgraph, entity, relation, and graph row.
Re-triggering then started GraphRAG from scratch even though #14096 had
already added `load_subgraph_from_store`.

After main was merged in (which deleted `api/apps/kb_app.py` per
#14394), the pause path now lives on the new REST surface `DELETE
/v1/datasets/<id>/<index_type>`:

- `api/apps/services/dataset_api_service.py`: `delete_index` accepts a
`wipe: bool = True` parameter. When `False` the doc-store rows and
GraphRAG phase markers are left intact and only the running task is
cancelled. Default preserves historical behaviour.
- `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false|0|no|off`
from the query string and forwards it.
- `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`:
`unbindPipelineTask` appends `?wipe=false` when explicitly false.
- The GraphRAG pause action in
`web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe:
false` for `KnowledgeGraph`; raptor is unchanged.

**UX impact:** the pause icon next to a running GraphRAG task no longer
wipes graph data. The only path that still wipes is the explicit Delete
action in `GenerateLogButton` (trash icon behind a confirmation modal).
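
A minimal sketch of how the `wipe` query parameter might be interpreted, assuming the four falsy spellings listed above and a default of `True` when the parameter is absent. The helper name `parse_wipe_flag` is hypothetical, and the case-insensitive matching is an assumption rather than confirmed behaviour of `dataset_api.py`.

```python
# Spellings taken from the PR description: ?wipe=false|0|no|off
FALSY_VALUES = {"false", "0", "no", "off"}


def parse_wipe_flag(raw):
    """Return False only for an explicit falsy value; default to wiping."""
    if raw is None:
        return True  # parameter absent: keep the historical behaviour
    # Assumption: surrounding whitespace and letter case are ignored.
    return raw.strip().lower() not in FALSY_VALUES
```

Defaulting to `True` keeps existing callers, which never send the parameter, on the old wipe-everything path.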

## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`)

A small Redis-backed marker layer at
`graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL).
`run_graphrag_for_kb` consults the markers on entry and skips phases
that already completed in a prior run. Markers are cleared automatically
when:
- new docs are merged into the graph (which invalidates prior resolution
and community results),
- `delete_index` wipes the graph, or
- `delete_knowledge_graph` is called.

Redis failures never block a run -- markers are an optimization, not a
gate.
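
The marker layer can be sketched as follows, assuming a client object that exposes `set(key, value, ex=...)`, `exists(key)`, and `delete(*keys)` in the style of redis-py. The function names and the `InMemoryClient` stand-in are hypothetical and exist only so the example runs without a Redis server; the key format and 7-day TTL come from the description above.

```python
PHASES = ("resolution_done", "community_done")
TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL


def marker_key(kb_id, phase):
    return f"graphrag:phase:{kb_id}:{phase}"


def mark_phase_done(client, kb_id, phase):
    """Record a completed phase; report failure instead of raising."""
    try:
        client.set(marker_key(kb_id, phase), "1", ex=TTL_SECONDS)
        return True
    except Exception:
        return False  # markers are an optimization, never a gate


def has_phase(client, kb_id, phase):
    """Return True only when the marker is readable and present."""
    try:
        return bool(client.exists(marker_key(kb_id, phase)))
    except Exception:
        return False  # on failure the phase simply runs again


def clear_phases(client, kb_id):
    """Drop all markers for one knowledge base, ignoring store failures."""
    try:
        client.delete(*(marker_key(kb_id, p) for p in PHASES))
    except Exception:
        pass


class InMemoryClient:
    """Hypothetical stand-in for a Redis client, for local experiments only."""

    def __init__(self):
        self.data = {}

    def set(self, key, value, ex=None):
        self.data[key] = (value, ex)

    def exists(self, key):
        return int(key in self.data)

    def delete(self, *keys):
        return sum(1 for k in keys if self.data.pop(k, None) is not None)
```

Treating every store failure as "marker absent" means the worst case is redoing a phase, never skipping one that did not finish.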

## 4. Idempotent community detection

`extract_community` previously did `delete-then-insert` on
`community_report` rows; a crash mid-insert left the dataset with no
reports. Now report IDs are derived deterministically from `(kb_id,
community.title)`, the existing report IDs are snapshotted before
insert, new rows are written, then only stale rows are pruned. A failure
at any step leaves either the prior or the new report set intact --
never a partial mix.
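
The write-then-prune ordering can be illustrated with a dict standing in for the doc store. The names `report_id` and `replace_reports` are hypothetical, and the SHA-256 derivation is an assumption: the PR only states that IDs are derived deterministically from `(kb_id, community.title)`, not which hash is used.

```python
import hashlib


def report_id(kb_id, title):
    """Derive a stable ID from (kb_id, title); the hash choice is an assumption."""
    digest = hashlib.sha256(f"{kb_id}\x00{title}".encode("utf-8"))
    return digest.hexdigest()[:32]


def replace_reports(store, kb_id, reports):
    """Write new reports first, then prune only the rows that became stale."""
    # 1. Snapshot the IDs that exist for this knowledge base before writing.
    existing = {rid for rid, row in store.items() if row["kb_id"] == kb_id}
    # 2. Upsert new rows; an unchanged title maps to the same ID and overwrites.
    new_ids = set()
    for report in reports:
        rid = report_id(kb_id, report["title"])
        store[rid] = {"kb_id": kb_id, **report}
        new_ids.add(rid)
    # 3. Only now remove rows that no longer correspond to a community.
    for rid in existing - new_ids:
        del store[rid]
    return new_ids
```

Because pruning happens last, a crash during the write step leaves the prior rows in place rather than an empty dataset.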

## 5. Tunable doc-store insert pipeline

The GraphRAG insert loop in `rag/graphrag/utils.py` and the
`community_report` insert in `rag/graphrag/general/index.py` were both
hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real
KB with 1077 chunks, a 100-chunk slice took ~21 minutes, which is pure
round-trip overhead.

- New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py`
batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout
semantics as the prior loop.
- Defaults: 64 docs per batch, 4 batches in flight (matches the regular
ingest pipeline in `document_service.py`). Tunable per-deployment via
`GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`.
- Both `set_graph` and `extract_community` now use the helper.

This dropped the same 1077-chunk insert from minutes to seconds in local
testing without measurable extra pressure on Infinity (total in-flight
docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default).
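
A minimal sketch of bounded batch insertion, assuming an async `insert_batch` callable supplied by the caller. The function name `insert_chunks_bounded_sketch` and its signature are hypothetical and are not the helper's actual interface; retry and timeout handling are omitted. The environment variable names and the defaults of 64 and 4 come from the description above.

```python
import asyncio
import os

BULK_SIZE = int(os.environ.get("GRAPHRAG_INSERT_BULK_SIZE", "64"))
CONCURRENCY = int(os.environ.get("GRAPHRAG_INSERT_CONCURRENCY", "4"))


async def insert_chunks_bounded_sketch(chunks, insert_batch,
                                       bulk_size=BULK_SIZE,
                                       concurrency=CONCURRENCY):
    """Insert chunks in batches, with at most `concurrency` batches in flight."""
    gate = asyncio.Semaphore(concurrency)

    async def run(batch):
        async with gate:  # blocks while `concurrency` batches are already running
            await insert_batch(batch)

    batches = [chunks[i:i + bulk_size] for i in range(0, len(chunks), bulk_size)]
    await asyncio.gather(*(run(batch) for batch in batches))
    return len(batches)
```

With the defaults, 1077 chunks split into 17 batches, and at most 4 × 64 = 256 documents are in flight at once.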

## Tests

- `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests):
dense neighbourhood merge, neighbour-snapshot regression, concurrent
serialized merges.
- `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has
round-trip, kb-scoped clear, no-op on empty input, graceful Redis
failure.
- `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`:
new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both
GraphRAG and raptor on the new REST route, and confirms the default
still wipes and clears phase markers.

## Compatibility

- Backward compatible: tasks queued before this change behave
identically (default `wipe=true`, no markers expected).
- No schema/migration changes; all new state lives in Redis.
- New optional REST query param `wipe` on `DELETE
/v1/datasets/<id>/<index_type>`.
- New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and
`GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour.

## Example of resume

Screenshot below shows a test resuming knowledge graph generation after
applying the concurrency fix and re-deploying.

<img width="521" height="677" alt="image"
src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a"
/>

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2026-05-06 15:01:01 +08:00
jony376
94f8779a00 Memory API: enforce tenant permissions on memory and message endpoints (#14535)
### What problem does this PR solve?

This PR fixes missing authorization checks in the Memory API.
Previously, several authenticated endpoints accepted caller-supplied
`tenant_id`, `owner_ids`, or `memory_id` values and used them directly
to list, read, update, delete, or search Memory data.

That could allow an authenticated user to access or mutate another
tenant's Memory records if they knew a tenant ID or memory ID. The fix
centralizes Memory access checks and applies them consistently across
Memory and Memory-message operations.

The change:

- Adds helper logic to parse list filters and compute tenant IDs
accessible to `current_user`.
- Requires direct `memory_id` operations to pass Memory access checks
before reading, updating, deleting, or changing message state.
- Filters list/search/recent-message requests to accessible memories
only.
- Applies Memory visibility filtering before count and pagination in
`MemoryService.get_by_filter`.
- Accepts `owner_ids` in the Memory list route, matching the frontend
owner filter while still intersecting values with the caller's
accessible tenants.
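
The owner-filter intersection described above can be sketched as a small pure function. The name `resolve_owner_filter` and its arguments are hypothetical; how the PR actually computes the caller's accessible tenants is not shown here and is assumed to be supplied as a list of IDs.

```python
def resolve_owner_filter(requested_owner_ids, accessible_tenant_ids):
    """Restrict a caller-supplied owner filter to tenants the caller can access."""
    accessible = set(accessible_tenant_ids)
    if not requested_owner_ids:
        # No explicit filter: fall back to everything the caller may see.
        return sorted(accessible)
    # Silently drop any requested tenant the caller has no access to.
    return sorted(set(requested_owner_ids) & accessible)
```

An empty result means the request names only inaccessible tenants, so the list endpoint should return no memories rather than fall back to an unfiltered query.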

### Related issues
Closes #14534 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: jony376 <jony376@gmail.com>
2026-05-06 14:10:47 +08:00
Yingfeng
4ee0702aed Feat: add skills space to context engine (#13908)
### What problem does this PR solve?

issue #13714

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-30 12:36:03 +08:00