### What problem does this PR solve?
extend restful api suite
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test
### What problem does this PR solve?
Closes#15332.
RAGFlow can index Gmail and generic IMAP mailboxes but had no native
connector for Outlook / Microsoft 365 mail. Organisations on Microsoft
365 had no way to bring mailbox content into a knowledge base through
Microsoft Graph.
This PR adds a net-new Outlook data source that:
- Authenticates against Microsoft Graph with the same MSAL
client-credentials flow already used by the SharePoint and Teams
connectors (no new auth primitives).
- Pages over `/users/{id}/mailFolders/{folder}/messages/delta` per
mailbox and persists `@odata.deltaLink` values in
`OutlookCheckpoint.delta_links`, so incremental syncs only fetch changed
messages.
- Supports two scoping modes:
- **Tenant-wide** (default): enumerates every user in the tenant via
`/users` and syncs each mailbox. Requires `User.Read.All`.
- **Targeted**: when `user_ids` is provided (comma-separated UPNs or
object IDs), only those mailboxes are synced. `User.Read.All` is not
needed in this mode.
- Lets the caller pick the mail folder (`inbox`, `sentitems`, `archive`,
...). Defaults to `inbox`.
- Maps each message to a `Document` shaped after the Gmail connector:
one `TextSection` carrying `From/To/Cc/Subject` headers + body, with
HTML bodies stripped to text inline (no extra dependency).
- Surfaces typed errors on the validation probe:
401 → `ConnectorMissingCredentialError`, 403 →
`InsufficientPermissionsError` (with `Mail.Read` / `User.Read.All`
hint), 404 on a configured mailbox → `ConnectorValidationError`, 5xx →
`UnexpectedValidationError`.
- Skips messages flagged `@removed` by the delta semantics and messages
whose `receivedDateTime` is older than `poll_range_start`.
#### Files
| File | Change |
|------|--------|
| `common/data_source/outlook_connector.py` | **New** —
`OutlookConnector` (`CheckpointedConnectorWithPermSync` +
`SlimConnectorWithPermSync`) + `OutlookCheckpoint` + tiny `_strip_html`
helper. |
| `common/data_source/config.py` | `DocumentSource.OUTLOOK = "outlook"`.
|
| `common/constants.py` | `FileSource.OUTLOOK = "outlook"`. |
| `common/data_source/__init__.py` | Export `OutlookConnector`. |
| `rag/svr/sync_data_source.py` | `Outlook(SyncBase)` with `batch_size`
normalisation, CSV/list parsing of `user_ids`; registered in
`func_factory`. |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`DataSourceKey.OUTLOOK`, visibility map (`syncDeletedFiles: true`), info
entry, form fields (tenant_id, client_id, client_secret, folder,
user_ids, batch_size), default values. |
| `web/src/locales/en.ts`, `web/src/locales/zh.ts` |
`outlookDescription` + 5 tooltip keys (EN + ZH). |
| `test/unit_test/data_source/test_outlook_connector_unit.py` | **New**
— 19 unit tests (`p1`/`p2`/`p3`) covering auth, validation (tenant-wide
vs specific user vs error paths), checkpoint helpers, user enumeration
pagination, message filtering, HTML body stripping. |
#### Required Azure AD permissions
- `Mail.Read` (Application, admin-granted) — always.
- `User.Read.All` (Application, admin-granted) — only when `user_ids` is
left blank so the connector can enumerate mailboxes.
#### Out of scope
- **Attachment indexing.** The current connector emits message body +
headers; binary attachments are flagged via `metadata.has_attachments`
but not pulled. Adding attachment hydration is straightforward but
scoped out per the issue's "decide whether attachments are indexed in
the first version" note.
- **Delegated (per-user) OAuth.** The connector uses app-only
credentials, consistent with the SharePoint / Teams precedent in this
codebase.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Closes#15180.
`OIDCClient.parse_id_token` in `api/apps/auth/oidc.py` read the JWT
signing
algorithm from the **unverified** JWT header and passed it through to
`jwt.decode(..., algorithms=[alg], ...)` as the trust anchor. This is
the
textbook JWT algorithm-confusion vulnerability (CWE-345 / CWE-347). Any
unauthenticated client capable of reaching the OIDC callback could take
over
an arbitrary account on any RAGFlow deployment with OIDC login enabled:
1. **`alg: "none"`** — present a JWT with `{"alg": "none"}` and no
signature segment → `jwt.decode(..., algorithms=["none"])` → PyJWT's
`NoneAlgorithm` accepts the token without verification → login as any
user.
2. **RSA / HMAC confusion** — fetch the public RSA key from the
provider's
JWKS (it's public), forge a JWT with `{"alg": "HS256"}` HMAC-signed
using the public-key bytes as the secret → `jwt.decode(...,
algorithms=["HS256"], key=public_key)` → verifier accepts → login as
any user. (Modern PyJWT independently refuses to use a PEM-formatted
key as an HMAC secret, which mitigates this leg for PEM key formats;
the fix here is the only mitigation for raw / DER / JWK octet keys and
for older PyJWT versions.)
### What changed
**`api/apps/auth/oidc.py`:**
- New module constants `_ALLOWED_OIDC_SIGNING_ALGS` (asymmetric-only:
`RS*`, `ES*`, `PS*`, `EdDSA` — explicitly excludes `none` and `HS*`)
and `_DEFAULT_OIDC_SIGNING_ALGS = ("RS256",)` (the OIDC Core 1.0 §2
spec default).
- New helper `_resolve_id_token_signing_algs(metadata)` — intersects the
provider's advertised `id_token_signing_alg_values_supported` from
`/.well-known/openid-configuration` with the safe allowlist; falls back
to RS256 when the field is missing or contains only unsafe values.
- `OIDCClient.__init__` now stores the resolved allowlist on
`self.id_token_signing_algs` — pinned once, from a trusted source, at
construction time.
- `parse_id_token` no longer calls `jwt.get_unverified_header` and no
longer reads `alg` from the JWT header. It passes
`self.id_token_signing_algs` to `jwt.decode(..., algorithms=...)`.
`PyJWKClient.get_signing_key_from_jwt` still reads the `kid` from the
header internally for JWKS lookup — that's fine, `kid` is not a
security decision; the signature still proves which key was actually
used.
**`test/testcases/test_web_api/test_auth_app/test_oidc_client_unit.py`:**
- Existing `test_parse_id_token_success_and_error` drops its
`jwt.get_unverified_header` mock (no longer called by `parse_id_token`).
- `_metadata` and `_make_client` helpers grew an optional `signing_algs`
parameter so tests can configure what the discovery document advertises.
- New `TestSSRFValidation` / algorithm-confusion regression block (7
tests):
- `test_id_token_signing_algs_default_to_rs256_when_metadata_missing`
- `test_id_token_signing_algs_intersect_metadata_with_safe_allowlist`
- `test_id_token_signing_algs_fall_back_when_only_unsafe_advertised`
- `test_id_token_signing_algs_ignores_non_string_entries`
- `test_id_token_signing_algs_handles_non_list_metadata_field`
- `test_parse_id_token_passes_pinned_algorithms_to_jwt_decode` —
sabotages `jwt.get_unverified_header` to raise on call, proving the
verification path never consults the unverified header.
- `test_parse_id_token_rejects_alg_none` — uses real PyJWT to encode an
`alg: "none"` token; `parse_id_token` raises `ValueError("Error
parsing ID Token: …")` instead of accepting it.
- `test_parse_id_token_rejects_hs256_when_allowlist_is_asymmetric` —
uses real PyJWT to forge an `alg: "HS256"` token with a non-PEM
shared secret (so PyJWT's incidental PEM-as-HMAC refusal isn't what
blocks it); `parse_id_token` raises because `HS256` is not in the
pinned allowlist.
Sanity-checked end-to-end with real PyJWT outside the project test
runner:
- `alg=none` forged token + `algorithms=["RS256"]` →
`InvalidAlgorithmError` ✓
- `alg=HS256` forged token + `algorithms=["RS256"]` →
`InvalidAlgorithmError` ✓
- Same `alg=HS256` token + `algorithms=["HS256"]` → **accepted**
({'sub': 'admin'})
— confirming the attack path was real before the fix.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: galuis116 <contact@duerrimports.com>
## Summary
- Fix `meta_filter()` AND logic so an empty result from an early
condition is not overwritten when a later condition matches.
- Add regression tests for empty-first AND, successful AND intersection,
and OR behavior after an empty first condition.
Fixes incorrect `/retrieval` metadata filtering when multiple AND
conditions are used and the first condition matches no documents.
Closes#15360
## Test plan
- [x] `pytest test/unit_test/common/test_metadata_filter_operators.py
-v` (19/19 passed)
### What problem does this PR solve?
Closes#15330.
RAGFlow had no connector for OneDrive / OneDrive for Business. Users who
store working documents in OneDrive could not index them into a
knowledge base without manually downloading and re-uploading files.
This PR adds a net-new OneDrive data source that:
- Authenticates against Microsoft Graph with the same MSAL
client-credentials flow already used by the SharePoint and Teams
connectors (no new auth primitives).
- Enumerates every drive visible to the service principal and pages
through `/drives/{id}/root/delta`, persisting `@odata.deltaLink` values
per drive so subsequent syncs only fetch changed items.
- Optionally narrows ingestion to a sub-folder (`folder_path`) without
needing a separate code path.
- Surfaces typed errors on the validation probe (`GET /drives?$top=1`):
401 → `ConnectorMissingCredentialError`, 403 →
`InsufficientPermissionsError` (with a `Files.Read.All` hint), 5xx →
`UnexpectedValidationError`.
- Filters folders, soft-deleted items, and unsupported extensions (`.pdf
.docx .doc .xlsx .xls .pptx .ppt .txt .md .csv`).
#### Files
| File | Change |
|------|--------|
| `common/data_source/onedrive_connector.py` | **New** —
`OneDriveConnector` + `OneDriveCheckpoint`. |
| `common/data_source/config.py` | `DocumentSource.ONEDRIVE =
"onedrive"`. |
| `common/constants.py` | `FileSource.ONEDRIVE = "onedrive"`. |
| `common/data_source/__init__.py` | Export `OneDriveConnector`. |
| `rag/svr/sync_data_source.py` | `OneDrive(SyncBase)` with `batch_size`
normalisation; registered in `func_factory`. |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`DataSourceKey.ONEDRIVE`, visibility map (`syncDeletedFiles: true`),
info entry, form fields (tenant_id, client_id, client_secret,
folder_path, batch_size), default values. |
| `web/src/locales/en.ts`, `web/src/locales/zh.ts` |
`onedriveDescription` + 4 tooltip keys (EN + ZH). |
| `test/unit_test/data_source/test_onedrive_connector_unit.py` | **New**
— 13 unit tests (`p1`/`p2`) covering auth, validation, checkpoint
helpers, and document filtering. |
#### Required Azure AD permission
`Files.Read.All` (Application, admin-granted).
#### Out of scope
- Interactive end-user OAuth (delegated permissions) — the connector
uses app-only credentials, consistent with the SharePoint / Teams
precedent.
- Binary download of file contents — the sync layer emits `Document`s
carrying `webUrl` + metadata; bytes are hydrated downstream by the parse
pipeline.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
default OpenAI chat completions to non-stream when `stream` is omitted
https://github.com/infiniflow/ragflow/issues/15356
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Python implementation of the Go-based model_provider API suite.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: bill <yibie_jingnian@163.com>
### What problem does this PR solve?
Closes#15191.
RAGFlow shipped a Microsoft Teams connector stub
(`common/data_source/teams_connector.py`) whose document-loading methods
all returned `[]`, `Teams._generate()` was a `pass`, and Teams was
commented out of the data-source settings UI. As a result there was no
way to index Teams channel conversations into a knowledge base.
This PR implements the connector end to end on top of Microsoft Graph
(Office365-REST-Python-Client). It shares the MSAL client-credentials
auth shape with the SharePoint connector.
**Backend**
- `common/data_source/teams_connector.py`
- `load_credentials()` now builds the Graph client using an MSAL
client-credentials **token callback** — the form `GraphClient` actually
expects. (The previous stub passed a raw access-token string to
`GraphClient(...)`, which is not how that client is driven.) Token
acquisition is lazy, so credential loading performs no network call.
- `validate_connector_settings()` lists teams via Graph.
- `load_from_checkpoint()` is now a generator that pages teams →
channels → messages, flattens each top-level post together with its
replies into one blob-based `Document` (`extension` `.txt`/`.html`,
`blob`, `size_bytes`, `doc_updated_at`). Incremental syncs are bounded
by message `lastModifiedDateTime` (falling back to `createdDateTime`).
Per-message errors surface as `ConnectorFailure` instead of aborting the
run.
- `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument`
batches and the checkpoint helpers return proper `TeamsCheckpoint`s.
- ACL → `ExternalAccess` mapping is intentionally left best-effort
(`load_from_checkpoint_with_perm_sync` delegates to the standard load)
because the sync pipeline does not currently persist `ExternalAccess`.
- `rag/svr/sync_data_source.py`
- Implemented `Teams._generate()` using the existing
`CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google
Drive), supporting full reindex and incremental polling from
`poll_range_start`.
- `TeamsConnector` is already exported from
`common/data_source/__init__.py`.
**Frontend (`web/`)**
- Enabled the `TEAMS` data-source enum and added its form fields
(`tenant_id`, `client_id`, `client_secret`), default values, display
metadata, and a Teams icon.
- Added `teamsDescription` / `teamsTenantIdTip` to `en.ts` and `zh.ts`.
**Tests**
- `test/unit_test/data_source/test_teams_connector_unit.py`: mock-based
unit tests covering credential loading (incomplete creds raise, happy
path sets the Graph client, fetch-without-creds raises), post/reply
flattening (incl. the HTML vs text extension), incremental
`lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass;
`ruff check` is clean.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Closes#15187.
RAGFlow shipped a Slack connector
(`common/data_source/slack_connector.py`) but it was never usable:
`Slack._generate()` in the sync worker was a `pass` stub, the
connector's document-generating code was incompatible with the current
data model,
and Slack was commented out of the data-source settings UI. As a result,
teams had no way to index Slack channels/threads into a knowledge base.
This PR completes the connector end to end.
**Backend**
- `common/data_source/slack_connector.py`
- Rewrote `thread_to_doc` to produce a blob-based `Document`
(`extension`/`blob`/`size_bytes`). The previous implementation built the
doc with a `sections=[...]` argument and omitted the now-required
`blob`/`extension`/ `size_bytes` fields, so it raised a validation error
against the current `Document` model. Thread messages are now cleaned
and flattened into a single UTF-8 text blob.
- Added `load_from_state()` / `poll_source(start, end)` generators. The
connector's checkpoint interface is a no-op stub, so both full and
incremental syncs run through a single channel-iterating generator built
on the existing module helpers (`get_channels`, `filter_channels`,
`get_channel_messages`, `_process_message`), with per-channel thread
de-duplication.
- `rag/svr/sync_data_source.py`
- Implemented `Slack._generate()`. Credentials are loaded via
`StaticCredentialsProvider` (the connector requires `slack_bot_token`
and does not support `load_credentials`). Supports full reindex and
incremental polling from `poll_range_start`, plus the optional channel
filter. Modeled on the Confluence/Dropbox wrappers.
- `SlackConnector` was already exported from
`common/data_source/__init__.py`.
**Frontend (`web/`)**
- Enabled the `SLACK` data-source enum and added its form fields (Slack
bot token + optional channel filter), default values, display metadata,
and a Slack icon.
- Added `slackDescription` / `slackBotTokenTip` / `slackChannelsTip`
strings to `en.ts` and `zh.ts`.
**Tests**
- `test/unit_test/data_source/test_slack_connector_unit.py`: unit tests
covering credential loading (`load_credentials` raises,
`set_credentials_provider` initializes clients, missing credentials
raises) and document generation (standalone message + flattened thread,
blob/extension/size_bytes/metadata, and the incremental poll time
window). All 5 pass; `ruff check` is clean.
Required Slack scopes: `channels:read`, `channels:history`,
`users:read`.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Closes#15189.
RAGFlow shipped a SharePoint connector stub
(`common/data_source/sharepoint_connector.py`) whose document-loading
methods all returned `[]`, `SharePoint._generate()` was a `pass`, and
SharePoint was commented out of the data-source settings UI. As a result
there was no way to index files stored in SharePoint document libraries.
This PR implements the connector end to end on top of Microsoft Graph
(Office365-REST-Python-Client).
**Backend**
- `common/data_source/sharepoint_connector.py`
- `load_credentials()` now builds the Graph client using an MSAL
client-credentials **token callback** — the form `GraphClient` actually
expects. (The previous stub passed a raw access-token string to
`GraphClient(...)`, which is not how that client is driven.) Token
acquisition is lazy, so credential loading does no network call.
- `validate_connector_settings()` resolves the configured site via
Graph.
- `load_from_checkpoint()` is now a generator that enumerates every
document library under the site, walks folders depth-first, downloads
each file, and yields blob-based `Document` objects (`extension` /
`blob` / `size_bytes` / `doc_updated_at`). Incremental syncs are bounded
by file `lastModifiedDateTime`. Per-file errors are surfaced as
`ConnectorFailure` rather than aborting the run.
- `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument`
batches (no downloads) and the checkpoint helpers return proper
checkpoints.
- ACL → `ExternalAccess` mapping is intentionally left best-effort
(`load_from_checkpoint_with_perm_sync` delegates to the standard load)
because the sync pipeline does not currently persist `ExternalAccess`;
this can be extended once that plumbing exists.
- `rag/svr/sync_data_source.py`
- Implemented `SharePoint._generate()` using the existing
`CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google
Drive), supporting full reindex and incremental polling from
`poll_range_start`.
- `SharePointConnector` is already exported from
`common/data_source/__init__.py`.
**Frontend (`web/`)**
- Enabled the `SHAREPOINT` data-source enum and added its form fields
`site_url`, `tenant_id`, `client_id`, `client_secret`), default values,
display metadata, and a SharePoint icon.
- Added `sharepointDescription` / `sharepointSiteUrlTip` to `en.ts` and
`zh.ts`.
**Tests**
- `test/unit_test/data_source/test_sharepoint_connector_unit.py`:
mock-based unit tests covering credential loading (incomplete creds
raise, happy path sets the Graph client, fetch-without-creds raises),
drive traversal + file download, incremental `lastModifiedDateTime`
filtering, and slim-doc listing. All 6 pass; `ruff check` is clean.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
extend restful api suite
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test
### What problem does this PR solve?
1. Break huge function into smaller pieces
2. Add unit test for the smaller pieces function
3. Layer-ed design
a. infra layer - task_context.py, recording_context.py,
write_operation_interceptor.py, ...
b. service layer - *_service.py
c. business layer - task_handler.py
4. Default behavior: use "refactor-ed version" - can switch to original
version by change env variable
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Performance Improvement
---------
Co-authored-by: Liu An <asiro@qq.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
### What problem does this PR solve?
extend restful api suite
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test
### What problem does this PR solve?
- Update version tags in README files (including translations) from
v0.25.5 to v0.25.6
- Modify Docker image references and documentation to reflect new
version
- Update version badges and image descriptions
- Maintain consistency across all language variants of README files
### Type of change
- [x] Documentation Update
### What problem does this PR solve?
extend restful api suite
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test
### What problem does this PR solve?
Creating or updating an agent via `POST /api/v1/agents` and `PUT
/api/v1/agents/{agent_id}` did not persist `canvas_type` because the
handler `req` dict never assigned the field before
`UserCanvasService.save` / `update_by_id`.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Co-authored-by: Cursor <cursoragent@cursor.com>
### What problem does this PR solve?
extend restful api suite
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test
### What problem does this PR solve?
extend restful api suite
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test
Follow on PR #15146 to reslove the backwad compatability issue.
1. /agents/<attachment_id>/download ->
/agents/attachments/<attachment_id>/download
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
This change fixes ingestion quality issues where MinerU parser output
may contain HTML fragments (for example, table-related tags like `<tr>`,
`<td>`, `<br>`), which were previously passed directly into
chunking/tokenization and degraded chunk quality.
The fix adds a sanitization step in the MinerU parser path so parsed
sections are normalized to clean text before chunking.
## Change Type (select all)
- [x] Bug fix
- [x] Ingestion pipeline improvement
- [x] Parser/chunking quality fix
## Related Issue
- https://github.com/infiniflow/ragflow/issues/14831
### What problem does this PR solve?
1. Fix /chat/completions to send only the latest message
2. Allo chat stream=False
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix: /openai/<chat_id>/chat/completions not aware of session_id
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
Implements the TODO in `evaluation_service.py`: **Track token usage** in
evaluation results.
## Changes
- **Import** `num_tokens_from_string` from `common.token_utils`
- **Prompt tokens**: Use the full prompt returned by `async_chat` when
available (includes system prompt + knowledge base + query), otherwise
fall back to the question token count
- **Completion tokens**: Count tokens in the generated answer
- **Storage**: Store `token_usage` as `{prompt_tokens,
completion_tokens, total_tokens}` in each `EvaluationResult` instead of
`None`
## Why
The evaluation pipeline previously saved `token_usage: None` for every
result. This change allows downstream consumers (e.g. evaluation
dashboards, cost tracking) to see approximate token usage per test case
using the same tokenizer (tiktoken cl100k_base) used elsewhere in
RAGFlow.
## Testing
- No new tests added; existing evaluation flow unchanged
- Token counting uses existing `num_tokens_from_string` utility
---------
Co-authored-by: kiannidev <kiannidev@users.noreply.github.com>
### What problem does this PR solve?
The agent API currently does not pass chat_template_kwargs to the
underlying LLM call path, so clients cannot control template-level model
behavior (such as thinking-mode toggles) when invoking
/agents/chat/completion. This PR adds passthrough support for
chat_template_kwargs across agent execution flows (session and
non-session, streaming and non-streaming) by propagating it through
canvas runtime state and into LLM invocation kwargs. This addresses the
feature gap raised in [Issue
#14182](https://github.com/infiniflow/ragflow/issues/14182).
Closes#14182
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
Closes#14789
### What problem does this PR solve?
User API endpoints (`login`, `user_profile`, `user_add`,
`forget_reset_password`) were returning full user objects via
`to_json()` / `to_dict()`, which included sensitive fields like
`password` and `access_token` in the response body. This leaks
credentials to the client.
This PR adds a `to_safe_dict()` method on the `User` model that strips
sensitive fields (`password`, `access_token`) and replaces all affected
call sites to use it.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. Enhance retry and timeout, and adjust the default timeout
2. NER: spacy do not batch chunks
3. extract _has_cancel_and_exit
4. enhance log messages
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
### What problem does this PR solve?
Closes#14865
`download_img` in `common/misc_utils.py` is used for OAuth avatar URLs.
The previous implementation called `async_request` from
`common.http_client`, which followed redirects without re-validating
each hop and did not apply the same SSRF protections as this path needs.
That made it possible to reach non-public or disallowed targets (for
example via redirects or unsafe URLs) when fetching avatars.
This change replaces that flow with an explicit, bounded fetch: each URL
(including every redirect target) is checked with
`common.ssrf_guard.assert_url_is_safe`, DNS is pinned with
`pin_dns_global`, `httpx` streams the body with `follow_redirects=False`
and a manual redirect loop (capped by
`RAGFLOW_OAUTH_AVATAR_MAX_REDIRECTS`), and total response size is capped
(`RAGFLOW_OAUTH_AVATAR_MAX_BYTES`). Timeouts, proxy, and user agent
align with `HTTP_CLIENT_*` env vars without importing `http_client`, so
lightweight tests stay simple.
Unit tests cover empty/None URLs, loopback, cloud metadata-style
addresses, and disallowed schemes so SSRF regressions are caught early.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
### What problem does this PR solve?
1. update python version to 3.13
2. upgrade ormsgpack to 1.6.0
### Type of change
- [x] Refactoring
---------
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
implement ASR and TTS for Xinference
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
`GET /agents/<agent_id>/sessions/<session_id>` crashed with
`AttributeError: 'NoneType' object has no attribute 'to_dict'` when the
session lookup failed: `_, conv =
API4ConversationService.get_by_id(...)` returned `(False, None)`, then
`conv.to_dict()` was called unconditionally.
This is reachable in multi-instance deployments: the session row may not
yet be visible on the node servicing the immediate follow-up GET after a
session is created on a different node.
Add the same `if not exists` guard already used by every other call site
of `API4ConversationService.get_by_id` (see agent_api.py:1147,
sdk/session.py:179, conversation_service.py:248, canvas_service.py:323).
Closes#14989
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
Replace the RuntimeError with a warning + first-address fallback so a
single email whose From header contains multiple addresses no longer
crashes the entire IMAP sync task. Also add regression tests covering:
- #14963: RFC 5322 quoted display names with commas (e.g. "Schlüter,
Sabine" <s@x>) parsed as one address, not two.
- #14964: multi-address headers warn instead of raising.
Closes#14964
Refs #14963
### What problem does this PR solve?
This PR adds a new `Browser` operator to Agent workflows, enabling
prompt-driven browser automation in RAGFlow.Technically based
‘Browser-Use’
It includes:
- Backend browser component execution with tenant LLM integration
- Upload source support (file IDs, URLs, variables, CSV/JSON array)
- Downloaded file persistence to RAGFlow storage
- Frontend node/operator integration, form config, icon, and i18n
updates
- Unit tests for upload/download and ID parsing logic
- Dependency and Docker updates for browser-use runtime support
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
## Summary
- Adds a lightweight `@tool` decorator and `FunctionToolSession` adapter
in `rag/llm/tool_decorator.py` that let callers register plain Python
functions as LLM tools without hand-writing OpenAI function schemas or
building an MCP-style session.
- Refactors `Base.bind_tools` and `LiteLLMBase.bind_tools` in
`rag/llm/chat_model.py` to accept either the new decorator form
`bind_tools(tools=[fn1, fn2])` or the existing `(toolcall_session,
tools_schemas)` form, so existing agent/dialog call-sites in
`agent/component/agent_with_tools.py`, `api/db/services/llm_service.py`,
and `api/db/services/dialog_service.py` are unaffected.
- Adds 8 unit tests in `test/unit_test/rag/llm/test_tool_decorator.py`
covering schema shape, required/optional inference, sync + async
dispatch, and bad-input rejection.
## Usage
```python
from rag.llm.tool_decorator import tool
@tool
def get_weather(city: str) -> str:
"""Get current weather for a city.
:param city: City name to look up.
"""
return f"{city}: 21 C, partly cloudy"
chat_mdl.bind_tools(tools=[get_weather])
ans, tk = await chat_mdl.async_chat_with_tools(system, history)
```
The decorator introspects `inspect.signature` + type hints + the
docstring (`:param name:` style) and attaches an OpenAI-format
`openai_schema` to the callable. `FunctionToolSession` duck-types the
existing `ToolCallSession` protocol, dispatching async callables
directly and sync ones through `thread_pool_exec` so the event loop is
never blocked.
## Design notes
- `tool_decorator.py` deliberately does **not** live inside
`rag/llm/__init__.py` to avoid forcing every consumer through the heavy
provider auto-discovery loop and to sidestep a circular import
(`__init__.py` imports `chat_model`, which would otherwise need symbols
from `__init__.py`).
- `FunctionToolSession` is duck-typed against
`common.mcp_tool_call_conn.ToolCallSession` rather than explicitly
inheriting from it, so importing the decorator doesn't pull the MCP
client SDK into the import graph.
- Docstring parsing is intentionally minimal (`:param name:` only) to
keep this dependency-free; Google/NumPy styles can be added later via
`docstring_parser` if needed.
## Test plan
- [x] `python -m pytest test/unit_test/rag/llm/test_tool_decorator.py
-v` — 8 passed
- [x] `python -m pytest test/unit_test/rag/llm/
--ignore=test/unit_test/rag/llm/test_perplexity_embed.py` — 11 passed
(the ignored test has a pre-existing `numpy` import that's unrelated)
- [ ] Reviewer: smoke-test the new path end-to-end with a live model via
`chat_mdl.bind_tools(tools=[my_fn])` to confirm the OpenAI-format
schemas pass through unchanged
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
### What problem does this PR solve?
Closes#15048.
Several SDK session routes in `api/apps/sdk/session.py` called
`.split()` directly on `request.headers.get("Authorization")`. When
clients omitted the header, the handlers raised `AttributeError` before
returning the existing `Authorization is not valid!` response.
This PR centralizes SDK Authorization parsing in a small helper and
keeps the existing error response for missing, empty, or malformed
headers.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Tests
- `ZHIPU_AI_API_KEY=dummy uv run --python 3.13 --group test pytest
test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py::test_sdk_session_routes_missing_authorization_unit
-q`
- `uv run --python 3.13 --group test ruff check api/apps/sdk/session.py
test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py`
- `python3 -m py_compile api/apps/sdk/session.py
test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py`
- `git diff --check`
## Summary
- Align **GET `/api/v1/documents/<doc_id>/download`** with
**`/preview`**: resolve extension and MIME type from the stored document
name when the **`ext` query parameter is omitted**, instead of
defaulting to `markdown`.
- When **`?ext=`** is present, behavior stays the same as before
(explicit extension / `Content-Type` mapping).
- Enforce the same access + document lookup pattern as preview
(**`accessible`** + **`get_by_id`**).
- Extend unit tests for the no-`ext` PDF filename case.
## Test plan
- [x] `uv run pytest
test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_download_attachment_success_and_exception_unit`
- [x] Optional: `curl -sSI` against
`/api/v1/documents/<pdf_doc_id>/download` without `ext` and confirm
`Content-Type: application/pdf`
Fixes#15052.
POST /api/v1/dify/retrieval resolved the caller via @apikey_required
(injecting tenant_id) but then fetched the requested knowledge_id with
no tenant filter and ran the full retrieval pipeline against
kb.tenant_id (the owner). Any valid Dify-compatible API key could
retrieve chunks from any tenant whose KB UUID was known. Adds the
missing ownership check.
## Root Cause
api/apps/sdk/dify_retrieval.py line 253:
KnowledgebaseService.get_by_id(kb_id) fetched the KB by id alone, then
the handler used kb.tenant_id (the OWNER) to build the embedding model
and call the retriever. The caller tenant_id was only used downstream at
line 278 for retrieval_by_children, well after cross-tenant data was
already retrieved.
grep confirmed there was no KnowledgebaseService.accessible call
anywhere in the handler.
## Fix
Two-line guard immediately after the existing get_by_id lookup,
mirroring the pattern PR #14749 lands for the sibling sdk/doc.py routes
(download, parse, stop_parsing, retrieval_test):
e, kb = KnowledgebaseService.get_by_id(kb_id)
if not e:
return build_error_result(message="Knowledgebase not found!",
code=RetCode.NOT_FOUND)
+ if not KnowledgebaseService.accessible(kb_id, tenant_id):
+ return build_error_result(message="No authorization.",
code=RetCode.AUTHENTICATION_ERROR)
if kb.tenant_embd_id:
...
KnowledgebaseService.accessible already handles solo-tenant ownership,
team membership via TenantService.get_joined_tenants_by_user_id, and the
permission=ME distinction. No behavior change for legitimate callers;
cross-tenant callers now receive RetCode.AUTHENTICATION_ERROR (109).
## Test Plan
- [x] Regression test added:
test/unit_test/api/apps/sdk/test_dify_retrieval.py
- test_cross_tenant_request_is_rejected -- attacker tenant calling owner
tenant KB gets 109; retriever is not invoked
- test_same_tenant_request_succeeds -- owner tenant gets the records
back
- test_missing_knowledge_base_returns_not_found -- missing KB returns
404 BEFORE the access check fires (legit callers see the clearer
message)
- [x] All 3 tests pass after the fix
- [x] Cross-tenant test FAILS on pre-fix main (KeyError on result[code]
because handler leaks records dict instead of returning auth error)
- [x] ruff check clean on both changed files
- [x] No drive-by reformatting in dify_retrieval.py -- only the 2 added
lines
### Post-fix output
test_cross_tenant_request_is_rejected PASSED [ 33%]
test_same_tenant_request_succeeds PASSED [ 66%]
test_missing_knowledge_base_returns_not_found PASSED [100%]
============================== 3 passed in 0.04s
===============================
Closes#15027
### What problem does this PR solve?
Closes#15076
Two endpoints in `api/apps/restful_apis/chat_api.py` accepted a
`user_id` field from the request body and used it directly when creating
a session:
```python
# before (vulnerable)
"user_id": req.get("user_id", current_user.id) # create_session
conv = await _create_session_for_completion(chat_id, dia, req.get("user_id", current_user.id)) # session_completion
```
Any authenticated caller could supply an arbitrary `user_id` and have
the new session attributed to a different user — effectively spoofing
session ownership. Both call sites are now fixed to always use
`current_user.id`, which is set by the authentication middleware and
cannot be tampered with via the request payload.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Changes
| File | Change |
|------|--------|
| `api/apps/restful_apis/chat_api.py` | Remove `req.get("user_id", ...)`
fallback in `create_session` and `session_completion`; always use
`current_user.id` |
|
`test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py`
| Add `test_create_session_user_id_not_spoofable` and
`test_session_completion_user_id_not_spoofable` (both `@pytest.mark.p2`)
|
### Testing
Two new unit tests assert that a `user_id` value supplied in the request
body is silently ignored and the session is always owned by the
authenticated user:
```
test_create_session_user_id_not_spoofable
test_session_completion_user_id_not_spoofable
```
Run with:
```bash
uv run pytest test/testcases/test_http_api/test_session_management/test_session_sdk_routes_unit.py -k "spoofable" -v
```
### What problem does this PR solve?
Feat: add local & ssh provider in admin panel
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Closes#15025
Langfuse-enabled `dialog_service.async_chat()` regressed to
`langfuse_tracer.start_generation(...)` after the earlier Langfuse v4
migration. Langfuse v4 uses `start_observation(as_type="generation")`,
so the remaining `start_generation` call can fail when chat tracing is
enabled.
This restores the migrated `start_observation(as_type="generation")`
call for chat observations while preserving the existing trace context,
model, input payload, and update/end flow. It also adds a regression
test with a fake Langfuse v4-style client that exposes
`start_observation()` but not `start_generation()`.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Tests
- `.venv/bin/pytest
test/unit_test/api/db/services/test_dialog_service_final_answer.py -q`
- `.venv/bin/ruff check api/db/services/dialog_service.py
test/unit_test/api/db/services/test_dialog_service_final_answer.py`
### What problem does this PR solve?
extend restful api suite
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test
### What problem does this PR solve?
extend restful api suite
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test