Commit Graph

370 Commits

Author SHA1 Message Date
web-dev0521
1d7e45115b feat(connectors): add Salesforce CRM data source connector (#15462)
### What problem does this PR solve?

Closes #15461.

RAGFlow had no way to ingest Salesforce CRM data, so support / sales
teams couldn't ground responses on live Accounts, Contacts,
Opportunities, Cases, or Knowledge articles. This adds a first-class
Salesforce data source connector that authenticates against a Connected
App via OAuth 2.0 client-credentials, queries selected SObjects via
SOQL, and turns each record into an indexable document with incremental
sync.

**Highlights**
- `common/data_source/salesforce_connector.py`: new
`SalesforceConnector` (`CheckpointedConnectorWithPermSync` +
`SlimConnectorWithPermSync`).
- OAuth 2.0 client-credentials flow; canonical `instance_url` from the
token response so multi-pod orgs route correctly.
- Per-object `SystemModstamp` cursor stored in
`SalesforceCheckpoint.cursors` — a failure mid-object doesn't rewind
sibling objects, and re-syncs only fetch changed rows.
- Deterministic record-to-text formatter (sorted keys) so SOQL field
reordering on the server doesn't mark every row "changed" on each poll.
- `_get_json` raises on non-2xx so 429 / 5xx never silently advance the
checkpoint past missing data.
- `Knowledge__kav` is in the default object set but is skipped silently
when the org doesn't have Salesforce Knowledge enabled (404 on
describe).
- Slim-doc IDs are scoped as `<Object>/<Id>` so prune deletes can't
collide across object types.
- `common/constants.py`, `common/data_source/config.py`,
`common/data_source/__init__.py`: register `salesforce` in `FileSource`
/ `DocumentSource` and export `SalesforceConnector`.
- `rag/svr/sync_data_source.py`: new `Salesforce(SyncBase)` class routed
through `load_from_checkpoint` (poll_source would re-walk every object
each run) and added to `func_factory`.
- Frontend:
- `web/src/pages/user-setting/data-source/constant/index.tsx`: new
`DataSourceKey.SALESFORCE`, form fields (instance URL, client ID/secret,
objects, api_version, batch size), `syncDeletedFiles` capability,
default form values, and tile entry with the new icon.
  - `web/src/locales/{en,zh}.ts`: description + per-field tooltips.
- `web/src/assets/svg/data-source/salesforce.svg`: 48x48 brand-style
icon to match the other Microsoft / cloud tiles.

**Verification**
- `npm run build` (vite + esbuild) passes (1m 26s).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-05 13:24:36 +08:00
web-dev0521
98f2a2e60b feat(connectors): add Azure Blob Storage data source connector (#15466)
### What problem does this PR solve?

Closes #15465.

RAGFlow supports S3, Google Cloud Storage, R2, and OCI as data sources
but not Azure Blob Storage, leaving Azure users without a way to index
container objects into a knowledge base. This adds a first-class Azure
Blob Storage data-source connector — distinct from RAGFlow's existing
Azure storage *backends* (`rag/utils/azure_sas_conn.py`,
`rag/utils/azure_spn_conn.py`) which store RAGFlow's own files.

**Highlights**
- `common/data_source/azure_blob_connector.py`: new `AzureBlobConnector`
(`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`).
- Uses the existing `azure-storage-blob` dependency (already in
`pyproject.toml`).
  - Three auth modes, tried in order of precedence:
1. **Account key** — `account_name` + `account_key` + `container_name`.
    2. **Connection string** — `connection_string` + `container_name`.
3. **SAS token** — `container_url` + `sas_token` (same shape as
`RAGFlowAzureSasBlob`).
- ETag fingerprint stored per blob in `AzureBlobCheckpoint.etags` —
unchanged blobs (same ETag as last run) are skipped without a download.
Only new/modified blobs are fetched.
  - Optional `prefix` scopes indexing to a virtual folder.
- `validate_connector_settings()` probes `get_container_properties()`
and maps `AuthenticationFailed / 403 / ContainerNotFound` to typed
connector exceptions.
  - Slim-doc IDs are blob names so prune reconciles correctly.
- `common/constants.py`, `common/data_source/config.py`,
`common/data_source/__init__.py`: register `azure_blob` in `FileSource`
/ `DocumentSource` and export `AzureBlobConnector`.
- `rag/svr/sync_data_source.py`: new `AzureBlob(SyncBase)` class routed
through `load_from_checkpoint` (ETag fingerprint owns change-detection)
and added to `func_factory`.
- Frontend:
- `web/src/pages/user-setting/data-source/constant/index.tsx`: new
`DataSourceKey.AZURE_BLOB`, auth-mode selector (account key / connection
string / SAS token), all credential fields, prefix + batch-size,
`syncDeletedFiles` capability, default form values, tile entry with
icon.
- `web/src/locales/{en,zh}.ts`: description + per-field tooltips for all
9 new keys.
- `web/src/assets/svg/data-source/azure-blob.svg`: Azure-branded
stacked-cylinders icon.

**Verification**
- `npm run build` (vite + esbuild) passes (37 s).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-04 21:06:01 +08:00
Jack
b363146997 refactor: overhaul task executor with layered architecture and comprehensive test suite (#15471)
## Summary

Decomposes the monolithic `task_executor.py` (1945 lines) into a 6-layer
architecture with clear separation of concerns. The refactored code is
functionally equivalent to the original, verified through 400 passing
tests and a production-vs-dry-run comparison framework.

## Architecture

```
entry (task_manager)
  └─ orchestration (task_handler)
       ├─ services (chunk_service, embedding_service, dataflow_service, raptor_service, post_processor)
       │    └─ utilities (chunk_builder, chunk_post_processor, embedding_utils)
       └─ infrastructure (task_context, recording_context, interceptor)
```

Key design decisions:
- **TaskContext** — typed facade over raw task dict, injects rate
limiters + callbacks via composition
- **RecordingContext + Comparator** — enables side-by-side production vs
dry-run execution for safe migration
- **NullRecordingContext** — zero-allocation no-op for production, uses
`__slots__`
- **WriteOperationInterceptor** — FIFO replay of previous runs function
returns for comparison mode

## Migration Strategy

The original `handle_task()` in `task_executor.py` uses a 3-way switch
via `TE_RUN_MODE`:
- `TE_RUN_MODE=0` (default) → runs refactored code
- `TE_RUN_MODE=1` → runs both original + refactored, compares all
intermediate results
- `TE_RUN_MODE=2` → runs original code (fallback)

The comparison mode (`TE_RUN_MODE=1`) records ~40 intermediate values
(chunks, vectors, token counts, func return values) from the production
run and replays them during dry-run, then uses `ContextComparator` to
report mismatches.

## Functional Equivalence Fixes

All divergences between original and refactored code were identified and
fixed:
- Timeout decorators (handle/build_chunks/raptor/embedding)
- NullRecordingContext leak in finally block causing RuntimeError
- MinIO None-binary check with proper FileNotFoundError
- Dataflow dispatch after embedding binding + init_kb
- Memory task missing return after processing
- RAPTOR checkpoint progress reporting
- Tag cache (get_tags_from_cache/set_tags_to_cache) restoration
- dataflow_id correction in _load_dsl
- Language default Chinese, dead code guard removal
- embed_chunks made async with proper thread_pool_exec
- Full GraphRAG default configuration (10 parameters)
- Hardcoded q_768_vec fallback removal in RAPTOR

## Test Changes

- 20 new tests covering table parser manual mode, tag cache, embedding
edge cases, RAPTOR checkpoint, dataflow_id correction, storage binary
None, cancel cleanup, metadata=None boundary
- Unified `make_task_context`/`make_task_dict` factories eliminated 10+
duplicated helpers
- DataflowService tests migrated from internal method mocks to IO
boundary mocks (real orchestration code executes)
- Parametrized duplicate build_chunks post-processor tests
- 7 raptor tests modernized to @pytest.mark.asyncio
- Mock count per test reduced through boundary-level mocking strategy

**Test count: 400 passing, 0 warnings, 0 skips**

## Files Changed

| File | Change |
|------|--------|
| `rag/svr/task_executor.py` | +1 line (NullRecordingContext fix) |
| `rag/svr/task_executor_refactor/task_handler.py` | Orchestration
layer, 8 logic fixes |
| `rag/svr/task_executor_refactor/chunk_service.py` | +timeout +
None-check |
| `rag/svr/task_executor_refactor/embedding_service.py` | sync→async
rewrite |
| `rag/svr/task_executor_refactor/dataflow_service.py` | dataflow_id fix
+ timeout |
| `rag/svr/task_executor_refactor/raptor_service.py` | checkpoint fix +
assert |
| `rag/svr/task_executor_refactor/chunk_post_processor.py` | tag cache
restore |
| `rag/svr/task_executor_refactor/task_context.py` | language default
fix |
| `test/.../conftest.py` | +294 lines shared helpers |
| `test/.../*.py` | 15 test files refactored, 20 new tests |

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 17:18:31 +08:00
web-dev0521
cd18cfab79 feat(connector): implement Outlook data source connector (issue #15332) (#15333)
### What problem does this PR solve?

Closes #15332.

RAGFlow can index Gmail and generic IMAP mailboxes but had no native
connector for Outlook / Microsoft 365 mail. Organisations on Microsoft
365 had no way to bring mailbox content into a knowledge base through
Microsoft Graph.

This PR adds a net-new Outlook data source that:

- Authenticates against Microsoft Graph with the same MSAL
client-credentials flow already used by the SharePoint and Teams
  connectors (no new auth primitives).
- Pages over `/users/{id}/mailFolders/{folder}/messages/delta` per
mailbox and persists `@odata.deltaLink` values in
`OutlookCheckpoint.delta_links`, so incremental syncs only fetch changed
messages.
- Supports two scoping modes:
- **Tenant-wide** (default): enumerates every user in the tenant via
`/users` and syncs each mailbox. Requires `User.Read.All`.
- **Targeted**: when `user_ids` is provided (comma-separated UPNs or
object IDs), only those mailboxes are synced. `User.Read.All` is not
needed in this mode.
- Lets the caller pick the mail folder (`inbox`, `sentitems`, `archive`,
...). Defaults to `inbox`.
- Maps each message to a `Document` shaped after the Gmail connector:
one `TextSection` carrying `From/To/Cc/Subject` headers + body, with
HTML bodies stripped to text inline (no extra dependency).
- Surfaces typed errors on the validation probe:
401 → `ConnectorMissingCredentialError`, 403 →
`InsufficientPermissionsError` (with `Mail.Read` / `User.Read.All`
hint), 404 on a configured mailbox → `ConnectorValidationError`, 5xx →
`UnexpectedValidationError`.
- Skips messages flagged `@removed` by the delta semantics and messages
whose `receivedDateTime` is older than `poll_range_start`.

#### Files

| File | Change |
|------|--------|
| `common/data_source/outlook_connector.py` | **New** —
`OutlookConnector` (`CheckpointedConnectorWithPermSync` +
`SlimConnectorWithPermSync`) + `OutlookCheckpoint` + tiny `_strip_html`
helper. |
| `common/data_source/config.py` | `DocumentSource.OUTLOOK = "outlook"`.
|
| `common/constants.py` | `FileSource.OUTLOOK = "outlook"`. |
| `common/data_source/__init__.py` | Export `OutlookConnector`. |
| `rag/svr/sync_data_source.py` | `Outlook(SyncBase)` with `batch_size`
normalisation, CSV/list parsing of `user_ids`; registered in
`func_factory`. |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`DataSourceKey.OUTLOOK`, visibility map (`syncDeletedFiles: true`), info
entry, form fields (tenant_id, client_id, client_secret, folder,
user_ids, batch_size), default values. |
| `web/src/locales/en.ts`, `web/src/locales/zh.ts` |
`outlookDescription` + 5 tooltip keys (EN + ZH). |
| `test/unit_test/data_source/test_outlook_connector_unit.py` | **New**
— 19 unit tests (`p1`/`p2`/`p3`) covering auth, validation (tenant-wide
vs specific user vs error paths), checkpoint helpers, user enumeration
pagination, message filtering, HTML body stripping. |

#### Required Azure AD permissions

- `Mail.Read` (Application, admin-granted) — always.
- `User.Read.All` (Application, admin-granted) — only when `user_ids` is
left blank so the connector can enumerate mailboxes.

#### Out of scope

- **Attachment indexing.** The current connector emits message body +
headers; binary attachments are flagged via `metadata.has_attachments`
but not pulled. Adding attachment hydration is straightforward but
scoped out per the issue's "decide whether attachments are indexed in
the first version" note.
- **Delegated (per-user) OAuth.** The connector uses app-only
credentials, consistent with the SharePoint / Teams precedent in this
codebase.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-29 21:52:29 +08:00
web-dev0521
bda2117a25 feat(connector): implement OneDrive data source connector (issue #15330) (#15331)
### What problem does this PR solve?

Closes #15330.

RAGFlow had no connector for OneDrive / OneDrive for Business. Users who
store working documents in OneDrive could not index them into a
knowledge base without manually downloading and re-uploading files.

This PR adds a net-new OneDrive data source that:

- Authenticates against Microsoft Graph with the same MSAL
client-credentials flow already used by the SharePoint and Teams
connectors (no new auth primitives).
- Enumerates every drive visible to the service principal and pages
through `/drives/{id}/root/delta`, persisting `@odata.deltaLink` values
per drive so subsequent syncs only fetch changed items.
- Optionally narrows ingestion to a sub-folder (`folder_path`) without
needing a separate code path.
- Surfaces typed errors on the validation probe (`GET /drives?$top=1`):
401 → `ConnectorMissingCredentialError`, 403 →
`InsufficientPermissionsError` (with a `Files.Read.All` hint), 5xx →
`UnexpectedValidationError`.
- Filters folders, soft-deleted items, and unsupported extensions (`.pdf
.docx .doc .xlsx .xls .pptx .ppt .txt .md .csv`).

#### Files

| File | Change |
|------|--------|
| `common/data_source/onedrive_connector.py` | **New** —
`OneDriveConnector` + `OneDriveCheckpoint`. |
| `common/data_source/config.py` | `DocumentSource.ONEDRIVE =
"onedrive"`. |
| `common/constants.py` | `FileSource.ONEDRIVE = "onedrive"`. |
| `common/data_source/__init__.py` | Export `OneDriveConnector`. |
| `rag/svr/sync_data_source.py` | `OneDrive(SyncBase)` with `batch_size`
normalisation; registered in `func_factory`. |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`DataSourceKey.ONEDRIVE`, visibility map (`syncDeletedFiles: true`),
info entry, form fields (tenant_id, client_id, client_secret,
folder_path, batch_size), default values. |
| `web/src/locales/en.ts`, `web/src/locales/zh.ts` |
`onedriveDescription` + 4 tooltip keys (EN + ZH). |
| `test/unit_test/data_source/test_onedrive_connector_unit.py` | **New**
— 13 unit tests (`p1`/`p2`) covering auth, validation, checkpoint
helpers, and document filtering. |

#### Required Azure AD permission

`Files.Read.All` (Application, admin-granted).

#### Out of scope

- Interactive end-user OAuth (delegated permissions) — the connector
uses app-only credentials, consistent with the SharePoint / Teams
precedent.
- Binary download of file contents — the sync layer emits `Document`s
carrying `webUrl` + metadata; bytes are hydrated downstream by the parse
pipeline.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-29 19:26:06 +08:00
Lynn
dc4b82523b Feat: tenant llm provider (#14595)
### What problem does this PR solve?

Python implementation of the Go-based model_provider API suite.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: bill <yibie_jingnian@163.com>
2026-05-29 17:39:41 +08:00
web-dev0521
98bc9ca6ac feat: implement Microsoft Teams data source connector (#15193)
### What problem does this PR solve?

Closes #15191.

RAGFlow shipped a Microsoft Teams connector stub
(`common/data_source/teams_connector.py`) whose document-loading methods
all returned `[]`, `Teams._generate()` was a `pass`, and Teams was
commented out of the data-source settings UI. As a result there was no
way to index Teams channel conversations into a knowledge base.

This PR implements the connector end to end on top of Microsoft Graph
(Office365-REST-Python-Client). It shares the MSAL client-credentials
auth shape with the SharePoint connector.

**Backend**

- `common/data_source/teams_connector.py`
- `load_credentials()` now builds the Graph client using an MSAL
client-credentials **token callback** — the form `GraphClient` actually
expects. (The previous stub passed a raw access-token string to
`GraphClient(...)`, which is not how that client is driven.) Token
acquisition is lazy, so credential loading performs no network call.
  - `validate_connector_settings()` lists teams via Graph.
- `load_from_checkpoint()` is now a generator that pages teams →
channels → messages, flattens each top-level post together with its
replies into one blob-based `Document` (`extension` `.txt`/`.html`,
`blob`, `size_bytes`, `doc_updated_at`). Incremental syncs are bounded
by message `lastModifiedDateTime` (falling back to `createdDateTime`).
Per-message errors surface as `ConnectorFailure` instead of aborting the
run.
- `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument`
batches and the checkpoint helpers return proper `TeamsCheckpoint`s.
- ACL → `ExternalAccess` mapping is intentionally left best-effort
(`load_from_checkpoint_with_perm_sync` delegates to the standard load)
because the sync pipeline does not currently persist `ExternalAccess`.
- `rag/svr/sync_data_source.py`
- Implemented `Teams._generate()` using the existing
`CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google
Drive), supporting full reindex and incremental polling from
`poll_range_start`.
- `TeamsConnector` is already exported from
`common/data_source/__init__.py`.

**Frontend (`web/`)**

- Enabled the `TEAMS` data-source enum and added its form fields
(`tenant_id`, `client_id`, `client_secret`), default values, display
metadata, and a Teams icon.
- Added `teamsDescription` / `teamsTenantIdTip` to `en.ts` and `zh.ts`.

**Tests**

- `test/unit_test/data_source/test_teams_connector_unit.py`: mock-based
unit tests covering credential loading (incomplete creds raise, happy
path sets the Graph client, fetch-without-creds raises), post/reply
flattening (incl. the HTML vs text extension), incremental
`lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass;
`ruff check` is clean.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 17:10:38 +08:00
web-dev0521
5de021ebb4 feat: implement Slack data source connector (#15188)
### What problem does this PR solve?

Closes #15187.

RAGFlow shipped a Slack connector
(`common/data_source/slack_connector.py`) but it was never usable:
`Slack._generate()` in the sync worker was a `pass` stub, the
connector's document-generating code was incompatible with the current
data model,
and Slack was commented out of the data-source settings UI. As a result,
teams had no way to index Slack channels/threads into a knowledge base.

This PR completes the connector end to end.

**Backend**

- `common/data_source/slack_connector.py`
- Rewrote `thread_to_doc` to produce a blob-based `Document`
(`extension`/`blob`/`size_bytes`). The previous implementation built the
doc with a `sections=[...]` argument and omitted the now-required
`blob`/`extension`/ `size_bytes` fields, so it raised a validation error
against the current `Document` model. Thread messages are now cleaned
and flattened into a single UTF-8 text blob.
- Added `load_from_state()` / `poll_source(start, end)` generators. The
connector's checkpoint interface is a no-op stub, so both full and
incremental syncs run through a single channel-iterating generator built
on the existing module helpers (`get_channels`, `filter_channels`,
`get_channel_messages`, `_process_message`), with per-channel thread
de-duplication.
- `rag/svr/sync_data_source.py`
- Implemented `Slack._generate()`. Credentials are loaded via
`StaticCredentialsProvider` (the connector requires `slack_bot_token`
and does not support `load_credentials`). Supports full reindex and
incremental polling from `poll_range_start`, plus the optional channel
filter. Modeled on the Confluence/Dropbox wrappers.
- `SlackConnector` was already exported from
`common/data_source/__init__.py`.

**Frontend (`web/`)**

- Enabled the `SLACK` data-source enum and added its form fields (Slack
bot token + optional channel filter), default values, display metadata,
and a Slack icon.
- Added `slackDescription` / `slackBotTokenTip` / `slackChannelsTip`
strings to `en.ts` and `zh.ts`.

**Tests**

- `test/unit_test/data_source/test_slack_connector_unit.py`: unit tests
covering credential loading (`load_credentials` raises,
`set_credentials_provider` initializes clients, missing credentials
raises) and document generation (standalone message + flattened thread,
blob/extension/size_bytes/metadata, and the incremental poll time
window). All 5 pass; `ruff check` is clean.

Required Slack scopes: `channels:read`, `channels:history`,
`users:read`.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 15:46:07 +08:00
web-dev0521
c4c4e228e3 feat: implement SharePoint data source connector (#15190)
### What problem does this PR solve?

Closes #15189.

RAGFlow shipped a SharePoint connector stub
(`common/data_source/sharepoint_connector.py`) whose document-loading
methods all returned `[]`, `SharePoint._generate()` was a `pass`, and
SharePoint was commented out of the data-source settings UI. As a result
there was no way to index files stored in SharePoint document libraries.

This PR implements the connector end to end on top of Microsoft Graph
(Office365-REST-Python-Client).

**Backend**

- `common/data_source/sharepoint_connector.py`
- `load_credentials()` now builds the Graph client using an MSAL
client-credentials **token callback** — the form `GraphClient` actually
expects. (The previous stub passed a raw access-token string to
`GraphClient(...)`, which is not how that client is driven.) Token
acquisition is lazy, so credential loading does no network call.
- `validate_connector_settings()` resolves the configured site via
Graph.
- `load_from_checkpoint()` is now a generator that enumerates every
document library under the site, walks folders depth-first, downloads
each file, and yields blob-based `Document` objects (`extension` /
`blob` / `size_bytes` / `doc_updated_at`). Incremental syncs are bounded
by file `lastModifiedDateTime`. Per-file errors are surfaced as
`ConnectorFailure` rather than aborting the run.
- `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument`
batches (no downloads) and the checkpoint helpers return proper
checkpoints.
- ACL → `ExternalAccess` mapping is intentionally left best-effort
(`load_from_checkpoint_with_perm_sync` delegates to the standard load)
because the sync pipeline does not currently persist `ExternalAccess`;
this can be extended once that plumbing exists.
- `rag/svr/sync_data_source.py`
- Implemented `SharePoint._generate()` using the existing
`CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google
Drive), supporting full reindex and incremental polling from
`poll_range_start`.
- `SharePointConnector` is already exported from
`common/data_source/__init__.py`.

**Frontend (`web/`)**

- Enabled the `SHAREPOINT` data-source enum and added its form fields
`site_url`, `tenant_id`, `client_id`, `client_secret`), default values,
display metadata, and a SharePoint icon.
- Added `sharepointDescription` / `sharepointSiteUrlTip` to `en.ts` and
`zh.ts`.

**Tests**

- `test/unit_test/data_source/test_sharepoint_connector_unit.py`:
mock-based unit tests covering credential loading (incomplete creds
raise, happy path sets the Graph client, fetch-without-creds raises),
drive traversal + file download, incremental `lastModifiedDateTime`
filtering, and slim-doc listing. All 6 pass; `ruff check` is clean.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 13:26:08 +08:00
Jack
f0cb7a544b Refactor: Task Executor (#15154)
### What problem does this PR solve?

1. Break huge function into smaller pieces
2. Add unit test for the smaller pieces function
3. Layer-ed design
a. infra layer - task_context.py, recording_context.py,
write_operation_interceptor.py, ...
    b. service layer - *_service.py
    c. business layer - task_handler.py
4. Default behavior: use "refactor-ed version" - can switch to original
version by change env variable

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Performance Improvement

---------

Co-authored-by: Liu An <asiro@qq.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-05-27 21:54:17 +08:00
Wang Qi
619b971785 Fix: empty file with better message (#15232)
Fix: empty file with better message
2026-05-26 12:28:53 +08:00
Wang Qi
a9ec78cb9c Refactor: enahnce retry and timeout (#14983)
### What problem does this PR solve?

1. Enhance retry and timeout, and adjust the default timeout
2. NER: spacy do not batch chunks
3. extract _has_cancel_and_exit
4. enhance log messages

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2026-05-22 13:16:39 +08:00
buua436
04bdb41909 Fix: guard missing task language (#15136)
### What problem does this PR solve?

guard missing task language

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-22 11:46:38 +08:00
Wang Qi
c5a46fda44 Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a different event loop (#15100)
Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a
different event loop
2026-05-21 19:23:41 +08:00
Magicbook1108
b69a6a5d80 Feat: full optimization on connector dashboard (#14979)
### What problem does this PR solve?

This PR improves the connector dashboard task management experience and
adds better visibility into connector execution logs.

### Overview:

#### Before
<img width="700" alt="image"
src="https://github.com/user-attachments/assets/e4a8ed6f-2e18-4f0f-8528-41a514550052"
/>

#### Now:
<img width="700" alt="Screenshot from 2026-05-18 16-31-30"
src="https://github.com/user-attachments/assets/d4ca193b-847a-49ae-9e4f-5fbca60ea627"
/>

### 1. Add a new logging page to the connector dashboard

A new logging page has been added so users can view connector task
execution logs directly from the connector dashboard.

### 2. Merge the Resume button into Confirm

The separate **Resume** button has been removed. The **Confirm** button
now represents different actions depending on the current task state:

- **Save**: Save form changes and reschedule tasks.
- **Stop**: Cancel currently scheduled or running tasks.
- **Resume**: Create new scheduled tasks after the previous tasks have
been stopped.
- **Start**: Start tasks when no task has been started yet.

### 3. Separate syncing and pruning tasks

Connector tasks are now separated into **syncing** and **pruning**.

Pruning is controlled by the **Sync deleted files** option:

- When **Sync deleted files** is disabled, only syncing tasks are shown.
- When **Sync deleted files** is enabled, both syncing and pruning tasks
are shown.

**Now: Sync deleted files disabled**

<img width="700" alt="Sync deleted files disabled"
src="https://github.com/user-attachments/assets/dbd9232e-614a-407f-a0b1-c109e5fa567d"
/>

**Now: Sync deleted files enabled**

<img width="700" alt="Sync deleted files enabled"
src="https://github.com/user-attachments/assets/1f527f48-ccb3-4ee8-97ca-086891489296"
/>

### 4. Update logs in backend

<img width="700" alt="image"
src="https://github.com/user-attachments/assets/10a95a3f-98c1-4e67-8afa-ddf6cda5b0b2"
/>

### 5. Remove connector resume API

- Removed: `POST /v1/connectors/<connector_id>/resume`
- Replaced by: `PATCH /v1/connectors/<connector_id>`


### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-19 10:07:11 +08:00
Wang Qi
13b422037f Refactor: enhance graphrag - part 2 (#14972)
### What problem does this PR solve?
1. expose batch_chunk_token_size for configuration
2. retrieve chunks when build subgraph for the doc, not retreive all
docs chunks at the begining
3. get all chunks for a document, used to be hard coded 10000
4. delete not used method run_graphrag

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring

Follow on: #14617
2026-05-18 16:10:21 +08:00
Octopus
eaa5d9921b fix: enable GitHub connector to sync PRs and issues by default (#14062)
Fixes #13975

## Problem
The GitHub data source connector had both `include_pull_requests` and
`include_issues` defaulting to `false` in both the frontend form and the
backend sync code. This meant that with the default configuration, **no
content was synced at all** from a GitHub repository — silently
producing zero results.

Additionally, the form field labels contained a typo: "Inlcude" instead
of "Include".

## Solution
- Changed `include_pull_requests` default from `false` to `true` in the
frontend form fields and default values
- Changed `include_issues` default from `false` to `true` in the
frontend form fields and default values
- Changed both backend defaults in `sync_data_source.py` from `False` to
`True`
- Fixed label typos: "Inlcude Pull Requests" → "Include Pull Requests"
and "Inlcude Issues" → "Include Issues"

This makes the GitHub connector consistent with the GitLab connector,
which already defaults `include_mrs`, `include_issues`, and
`include_code_files` all to `true`.

## Testing
- The connector now syncs both pull requests and issues by default when
a new GitHub data source is created
- Users who want to exclude PRs or issues can uncheck the corresponding
checkboxes in the form

Co-authored-by: octo-patch <octo-patch@github.com>
2026-05-15 13:26:31 +08:00
Ahmad Intisar
e994051eb9 Feature/generic api connector (#13545)
# feat: Add Generic REST API Connector

## What problem does this PR solve?

RAGFlow supports many specific data source connectors (MySQL, Slack,
Google Drive, etc.), but there was no way to connect an arbitrary REST
API as a data source. Users with custom or third-party APIs had to write
a new connector class for each one.

This PR adds a **generic, configuration-driven REST API connector** that
lets users connect any REST API as a data source entirely through the UI
— no code changes needed per API.

---

## Features

### Core Connector (`common/data_source/rest_api_connector.py`)

- Implements `LoadConnector` and `PollConnector` interfaces for full and
incremental sync
- **Configurable authentication:** None, API Key (custom header), Bearer
Token, Basic Auth
- **Pluggable pagination:** Page-based, Offset-based, Cursor-based, or
None
- Smart page-size inference from user's query parameters to avoid
duplicate/conflicting params
- Configurable request delay between pages to prevent API rate limiting
- Auto-detection of the items array in JSON responses (`items`,
`results`, `data`, `records`, or first list found)
- **Advanced field mapping** with dot-notation (`country.name`), array
wildcards (`newsType[*].name`), type hints, and default values
- Optional content template rendering (`"Title: {title}\nBody: {body}"`)
- HTML stripping for content fields
- Stable document IDs via `hash128` from a configurable ID field or
auto-generated from item content
- Pydantic configuration schema with automatic coercion of UI string
inputs to dicts/lists

### Backend Registration (`rag/svr/sync_data_source.py`,
`common/constants.py`, `common/data_source/config.py`)

- `REST_API` sync class wired into RAGFlow's `func_factory`
- Full sync (`load_from_state`) and incremental polling (`poll_source`)
support
- Credentials and config passed from task to connector following
existing patterns (MySQL, SeaFile, etc.)

### Test Connection Endpoint (`api/apps/connector_app.py`)

- `POST /v1/connector/<id>/test` validates config schema,
authentication, and API connectivity without triggering a sync
- Clear error messages for auth failures vs. config issues

### Frontend UI (`web/src/pages/user-setting/data-source/constant/`)

- **Postman-style configuration:** Base URL, Query Parameters (key=value
per line), Auth, Content Fields, Metadata Fields, Pagination Type
- Auth-type-aware form: fields for API key header/value, Bearer token,
or Basic username/password appear only when relevant
- **Advanced Settings** toggle for: Custom Headers, Max Pages, Request
Delay, Poll Timestamp Field, Request Body (POST)
- Connector icon (SVG) and i18n strings (English)
- **"Test Connection"** button to validate before syncing

---

## Controls & Safety

- Configurable max pages safety cap (default: 1000, adjustable in UI)
- Configurable request delay between pages (default: 0.5s, adjustable in
UI)
- Auth errors (401/403) fail immediately without retries; transient
errors retry with exponential backoff
- Diagnostic logging: auth setup confirmation, request details on
failure, content field extraction status

---

## Type of change

- [x] New Feature (non-breaking change which adds functionality)


##Visual Screenshots of Features
<img width="482" height="510" alt="Screenshot 2026-03-11 at 5 19 52 PM"
src="https://github.com/user-attachments/assets/dcb7ab4a-1622-44f3-bb02-d6f0527314c4"
/>
(Connector can be configured within the external data sources tab)

Configuration Parameters:
<img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 46 PM"
src="https://github.com/user-attachments/assets/5e154e71-4ab5-4872-bfb2-04f02b73c18a"
/>
<img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 54 PM"
src="https://github.com/user-attachments/assets/00cb14b7-0bcf-4b94-9d71-34e93369ecb2"
/>

Connection can be tested before attaching to dataset:
<img width="981" height="681" alt="Screenshot 2026-03-11 at 5 21 40 PM"
src="https://github.com/user-attachments/assets/aaa6eeeb-89a7-4349-bc34-2423bf8be9ee"
/>

Ingestion tested with API connector (works perfectly fine):
<img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 22 30 PM"
src="https://github.com/user-attachments/assets/afcd0d58-cadd-4152-badc-d2f14d96fbec"
/>

Search & Retrieval works as well with metadata flow:
<img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 23 05 PM"
src="https://github.com/user-attachments/assets/d41ee935-dcf7-4456-b317-22a76ca032c0"
/>

---------

Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-13 20:35:01 +08:00
shawnxiao105-afk
8b6dd6a5c2 fix: guard whitespace-only chunks before embedding (#13938)
## Problem

When parsing DOCX files with many tables, DeepDOC generates chunks
containing only empty HTML table tags, such as:

```html
<table><tr><td></td></tr><tr><td></td></tr><tr><td></td></tr><tr><td></td></tr></table>
```

After the regex cleanup at `task_executor.py:584`, this becomes `" "`
(whitespace only).

The guard at line 585 (`if not c`) only catches empty strings `""`, but
whitespace strings are truthy in Python and pass through. When sent to
Zhipu `embedding-3` API, it rejects them with error 1213:
`未正常接收到prompt参数`.

## Root Cause

```python
c = re.sub(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>", " ", c)
if not c:       # ← only catches "", not "   " / "\n" / "\t"
    c = "None"
```

Verified with Zhipu `embedding-3`:
| Input | Result |
|---|---|
| `""` | error 1213 |
| `" "` | error 1213 |
| `"\n"` | error 1213 |
| `"None"` | OK |

## Fix

```diff
- if not c:
+ if not c.strip():
      c = "None"
```

## Testing

Reproduced with a 678KB DOCX file (166 tables, 270 chunks). Chunk #89 is
the empty table above. After fix, `"None"` is sent instead and embedding
succeeds.

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2026-05-13 11:47:50 +08:00
Wang Qi
4374e07a29 Speed up start time (#14833)
### What problem does this PR solve?

Speed up start time

### Type of change
- [x] Refactoring
2026-05-12 17:00:45 +08:00
CaptainTimon
2717ee283f feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679)
### What problem does this PR solve?

Closes #14674.

This PR improves RAPTOR configuration and tree construction while
preserving the existing RAPTOR behavior as the default.

RAPTOR currently builds summary layers with the original UMAP + GMM
clustering path. This PR keeps that default path, and adds:

- A hidden backend tree-builder option:
  - `tree_builder="raptor"`: default, existing RAPTOR behavior.
- `tree_builder="psi"`: rank-aware Psi-style tree builder using original
embedding-space cosine ranking.
- A user-facing clustering method option for the default RAPTOR builder:
  - `clustering_method="gmm"`: existing default.
- `clustering_method="ahc"`: agglomerative hierarchical clustering path.
- A RAPTOR UI setting for `Clustering method` and `Max cluster`.

### What changed

#### Backend

- Added `tree_builder` support for RAPTOR/Psi.
- Added `clustering_method` support for GMM/AHC.
- Kept existing RAPTOR + GMM as the default.
- Added Psi tree building from original-space cosine similarity.
- Added bucketed Psi building controls for large inputs:
  - `raptor.ext.psi_exact_max_leaves`
  - `raptor.ext.psi_bucket_size`
- Added method-aware RAPTOR summary metadata using existing
`extra.raptor_method`.
- Avoided adding a dedicated DB schema field for experimental method
tracking.
- Added cleanup/migration logic to avoid mixing stale RAPTOR summary
trees.
- Added defensive checks for Psi tree construction and summary failures.

#### Frontend/UI

- Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`.
- Added/kept `Max cluster` in RAPTOR settings.
- Enlarged max cluster UI limit to `1024`, matching backend validation.
- Kept AHC editable even when a RAPTOR task has already finished.
- Fixed the UI save payload so `clustering_method` and `tree_builder`
are serialized through `parser_config.raptor.ext`, avoiding backend
validation errors for extra top-level RAPTOR fields.

Example saved RAPTOR config:

```json
{
  "raptor": {
    "max_cluster": 317,
    "ext": {
      "clustering_method": "ahc",
      "tree_builder": "raptor"
    }
  }
}

Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>
2026-05-12 09:42:31 +08:00
web-dev0521
cc207b5b05 Refactor: tidy up ThreadPoolExecutor lifecycle in file_service and task executor (#14668)
## Summary
- Wrap the `ThreadPoolExecutor` instances in `FileService.parse_docs`
and `FileService.get_files` with `with ... as exe:` blocks for
deterministic cleanup
- Replace the `concurrent.futures.ThreadPoolExecutor` in
`do_handle_task` with `asyncio.create_task(asyncio.to_thread(build_TOC,
...))`, preserving the existing parallelism with chunk insertion while
leveraging the surrounding async context
- Drop the now-unused `import concurrent` and the
`executor.shutdown(wait=False)` call in the `finally` block

Closes #14622.

No behavioral change, no public API change. Net diff: ~19 insertions /
25 deletions across two files.

## Test plan
- [ ] `uv run ruff check api/db/services/file_service.py
rag/svr/task_executor.py` passes
- [ ] Upload a multi-file batch through the chat/file endpoint and
confirm `FileService.parse_docs` still returns combined parsed text
- [ ] Trigger `FileService.get_files` via the chat reference flow with a
mix of image and non-image files; verify both `raw=True` and `raw=False`
paths return correctly
- [ ] Run a `naive`-parser document task with `toc_extraction: true` and
confirm the TOC chunk is generated and inserted exactly as before
- [ ] Run a `naive`-parser document task with `toc_extraction: false`
and confirm the path with `toc_thread = None` is unaffected
- [ ] Cancel a running task to exercise the `finally` block and confirm
cleanup still works without the executor shutdown call

---------

Co-authored-by: web-dev0521 <jasonpette1783@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-05-11 12:59:00 +08:00
Qinsanz
d6660cf156 fix(keyword_extraction): accept Chinese commas/semicolons/newlines as keyword delimiters (#14540)
## What
Widen the keyword delimiter in `rag/svr/task_executor.py`:
both `build_chunks` (LLM `keyword_extraction` cache parsing) and
`run_dataflow` (chunk-level `keywords` ingestion) now split on
`, , ; ; 、 \r \n` instead of only ASCII comma.

## Why
`rag/prompts/keyword_prompt.md` instructs the LLM:

> The keywords are delimited by ENGLISH COMMA.

In practice, Chinese-leaning models (Qwen / Tongyi-Qianwen, GLM,
etc.) frequently ignore this instruction when the source content is
Chinese and emit Chinese commas (`,`) instead. Result:
`cached.split(",")` sees the full LLM output as a *single* keyword.

Repro: `auto_keywords>=4` + Chinese docs + `qwen-plus@Tongyi-Qianwen`.
We observed entries in `important_kwd` like
`"功能介绍,配置说明,参数详解,问题排查"` — one bucket instead of four.

## Impact
- Silent data-quality bug; no exception thrown.
- BM25 `important_kwd^30` boost effectively stops firing — the
  indexed term is the whole list, never matches user query tokens.
- Any downstream aggregating `important_kwd` (tagging, analytics,
  candidate-keyword review UIs) sees garbage.

## Compatibility
- Pure widening of the splitter; ASCII-comma-only outputs continue
  to work identically.
- No schema / API change.

## Test plan
Manually verified against `qwen-plus@Tongyi-Qianwen` with
`auto_keywords=10` on Chinese .txt files:

- Before: `important_kwd` contains one element per chunk that is the
  full LLM string with `,`-separated phrases inside.
- After: `important_kwd` contains N elements, one per phrase, as the
  LLM intended.
2026-05-11 12:05:24 +08:00
Ahmad Intisar
3c4d1da98f Feature/table parser column roles (#13710)
### What problem does this PR solve?

The table file parser (CSV/Excel) currently treats all columns
identically — every column is both vectorized (embedded in chunk text)
and stored as filterable metadata. There's no way for users to control
which columns should be searchable by semantic meaning versus which
should only be filterable attributes.

For example, when ingesting a news articles CSV with columns like title,
content, country, category, source, etc., the embedding includes
metadata fields like country: Brazil and source: Reuters in the chunk
text, which dilutes the semantic quality of the embedding without adding
retrieval value.

The RDBMS connector (MySQL/PostgreSQL) already supports content_columns
/ metadata_columns, but this capability was missing for file-based table
ingestion.

This PR adds column-level control (vectorize / metadata / both) for the
table file parser, following RAGFlow's existing patterns.

Backward compatible: Datasets without table_column_roles or with
table_column_mode: auto behave exactly as before (all columns = both).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-11 10:06:04 +08:00
Hunnyboy1217
782084780e feat(connectors): ETag-based bypass for incremental S3 ingestion (#14628) (#14677)
### What problem does this PR solve?

S3-family connector syncs currently re-download every in-window object
just so we can compute `xxhash128(blob)` and compare against
`Document.content_hash`. Anything that bumps `LastModified` without
changing bytes (`aws s3 cp` touches, bucket re-encryption, etc.) pays
full bandwidth and re-parses files that didn't actually change. #14628
covers the broader incremental-ingestion redesign; this PR is the first
slice.

The fix is a pre-listing short-circuit. `BlobStorageConnector` (S3 / R2
/ GCS / OCI / S3-compat) now implements a new `FingerprintConnector`
interface: `list_keys()` paginates `list_objects_v2` and yields
`KeyRecord(key, fingerprint)` where `fingerprint = xxhash128(ETag)`. The
orchestrator joins those against the connector's existing `{doc_id:
content_hash}` map and only calls `get_value(key)` when the fingerprint
differs. Unchanged keys are skipped entirely — no `GetObject`, no
re-parse.

No DDL. xxhash128(ETag) is 32 hex chars and reuses the existing
`Document.content_hash` column per @yingfeng's suggestion; the connector
decides at listing time whether to populate it. Local uploads and
connectors that don't opt in fall through to the existing post-download
`xxhash128(blob)` path with no behavior change.

This is PR-1 of a 4-PR series — full design lives on #14628. Subsequent
PRs extend tier 1 to local FS / WebDAV / Dropbox / Seafile / RDBMS
(PR-2), wire up tier 2 cursor connectors with `SyncLogs.next_checkpoint`
(PR-3), and unify deletion via `KeyRecord(deleted=True)` reconciliation
(PR-4). Holding those back keeps this PR additive and reviewable on its
own.

#### Files touched

- `common/data_source/models.py` — new `KeyRecord`; optional
`fingerprint` on `Document`
- `common/data_source/interfaces.py` — `IncrementalCapability` enum,
`FingerprintConnector` ABC
- `common/data_source/blob_connector.py` — `BlobStorageConnector`
implements `FingerprintConnector`; per-object download factored into
`_build_document_from_obj()` so `_yield_blob_objects`, `list_keys`,
`get_value` all share it
- `rag/svr/sync_data_source.py` —
`_BlobLikeBase._fingerprint_filtered_generator` does the bypass loop;
`_run_task_logic` plumbs `doc.fingerprint` into the upload dict
- `api/db/services/document_service.py` —
`list_id_content_hash_map_by_kb_and_source_type()` helper
- `api/db/services/connector_service.py` + `file_service.py` —
fingerprint flows through `duplicate_and_parse → upload_document` and
lands in `content_hash`
- `test/unit_test/common/test_blob_connector_fingerprint.py` — 14 tests
covering ETag normalization (single-part, multipart, quoted, empty),
`list_keys()` not calling `GetObject`, `get_value()` materializing with
fingerprint, deterministic/stable fingerprints, and the bypass loop
asserting `GetObject` is *not* called on a match

#### Worth flagging for review

Old `_BlobLikeBase._generate` called `poll_source(start, now)` with a
`LastModified` window when `poll_range_start` was set. New code uses
`_fingerprint_filtered_generator` (full bucket listing + fingerprint
compare) outside of explicit `reindex=1`. Strictly better for
unchanged-bucket cases since it skips `GetObject`, but it does mean
every sync now does a full `list_objects_v2` paginate. Should still be
cheap for most buckets — flagging in case anyone has a very large bucket
where the time-window filter was meaningful.

On migration: existing rows have `content_hash = xxhash128(blob)` from
the old code. The first sync after this lands sees ETag-derived
fingerprints that don't match, re-fetches every object once, and writes
the new fingerprint. From the second sync onward the bypass works as
expected. "Slow day one, fast every day after." A `fingerprint_backfill:
trust` opt-out is sketched in the design doc but not in this PR.

#### Test plan

- [x] `uv run ruff check` — clean on all 8 touched files
- [x] `uv run pytest
test/unit_test/common/test_blob_connector_fingerprint.py -v` — 14 passed
- [x] Broader unit-test suite — no regressions in anything I touched
- [ ] Manual smoke against a real S3 bucket — configure a connector, run
sync twice, expect the second sync to log `bypassed=N, fetched=0` and no
`GetObject` calls in CloudTrail / bucket access logs
- [ ] Manual smoke with `reindex=1` — confirm the full re-download path
still works

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
2026-05-09 20:03:56 +08:00
Octopus
653b00b94c fix(sync): scope document IDs per connector to prevent cross-KB collisions (#14378)
Fixes #14360

## Problem

When the same blob storage bucket is connected to multiple knowledge
bases (each through a different data source connector), the sync
pipeline hashes only the blob path
(`bucket_type:bucket_name:object_key`) to derive the document ID. Every
connector pointing at the same bucket therefore produces **identical
IDs** for the same object. The collision guard in
`FileService.upload_document` then fires for the second knowledge base:

```
Existing document id collision with another knowledge base; skipping update.
```

This makes it impossible to index the same bucket into more than one KB
simultaneously.

## Solution

Include `connector_id` in the hash input so that each connector produces
a distinct document ID even when the underlying blob path is identical:

```python
# Before
"id": hash128(doc.id),

# After
"id": hash128(f"{task['connector_id']}:{doc.id}"),
```

Because each KB connection uses its own connector (with a unique
`connector_id`), documents are now namespaced per connector and no
collision occurs.

**Note:** This is a breaking change for existing synced data sources.
After upgrading, a re-sync will create new documents with the updated ID
format. Old documents (indexed under the previous format) will remain in
the database but can be manually deleted or cleaned up via a re-sync
with reindex enabled.

## Testing

- Verified that the one-line change produces unique IDs for two
connectors pointing at the same S3 path.
- Existing unit test
`test_upload_document_skips_cross_kb_document_id_collision` continues to
pass — the collision guard in `FileService` is still valid for genuinely
colliding IDs from other sources.

---------

Co-authored-by: octo-patch <octo-patch@github.com>
2026-05-09 10:33:54 +08:00
Jack Storment
59bb184e63 feat(moodle): support deleted-file sync (#14548)
Fixes #14551 

### What problem does this PR solve?

The Moodle connector did not let the sync runner clean up indexed
documents that were deleted from the source. Other connectors such as
dropbox, seafile, webdav, and rss already do this through a slim
snapshot pass. This PR adds the same support for Moodle.

When `sync_deleted_files` is on, the runner now asks the Moodle
connector for a lightweight list of every module id that could be
indexed. The runner then compares this list with the index and removes
any indexed document whose id is not in the list.

The slim pass does not download files. It only goes through courses and
modules and yields ids. The id format matches the ids that the loader
produces, so the match is exact.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

### Notes

- `MoodleConnector` now also implements `SlimConnectorWithPermSync`.
- New `retrieve_all_slim_docs_perm_sync` yields slim docs with the same
ids the loader uses (`moodle_resource_<id>`, `moodle_forum_<id>`,
`moodle_page_<id>`, `moodle_book_<id>`, `moodle_assign_<id>`,
`moodle_quiz_<id>`).
- The `Moodle` sync class now returns `(document_generator, file_list)`
so the runner can do the cleanup. If the slim snapshot fails,
`file_list` is set back to `None` and the run continues without cleanup.
- The web data source map exposes `syncDeletedFiles` for Moodle so the
option shows up in the UI.

### How was this tested?

- `ruff check` passes on the changed Python files.
- Manual review of the produced slim ids against the ids the loader
builds in `_process_resource`, `_process_forum`, `_process_page`,
`_process_book`, and `_process_activity`.
- Behavior parity with the merged dropbox (#14476), seafile (#14499),
webdav (#14491), and rss (#14493) PRs.
2026-05-07 17:44:46 +08:00
Magicbook1108
911671cef0 Feat: enable sync deleted files for RDBMS & fix remove last file issue (#14615)
### What problem does this PR solve?

Feat: enable sync deleted files for RDBMS & fix remove last file issue

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2026-05-07 13:31:05 +08:00
buua436
5672be0652 Feat: add IMAP deleted document sync (#14539)
### What problem does this PR solve?

add IMAP deleted document sync

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-06 14:06:46 +08:00
NeedmeFordev
89961962c0 feat(dingtalk-ai-table): support deleted-file sync via slim snapshot (#14525)
### What problem does this PR solve?

Incremental DingTalk AI Table (Notable) sync did not reconcile rows
removed on the remote side with documents already in the knowledge base.
This follows the coordinated datasource work in #14362 (“sync deleted
files”).

This PR adds a **full slim snapshot**
(`retrieve_all_slim_docs_perm_sync`) that lists **current record IDs for
all sheets** without building document blobs, using the same logical
document IDs as full ingest
(`dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}`). When
**`sync_deleted_files`** is enabled on incremental runs,
`DingTalkAITable._generate` returns **`(document_generator,
file_list)`** so **`SyncBase`** can run
**`cleanup_stale_documents_for_task`** and remove KB rows that no longer
exist remotely.

Design notes:

- **`_document_id`** centralizes the ID string so slim snapshots and
**`_convert_record_to_document`** stay aligned with
**`hash128(doc.id)`** semantics used during ingestion/cleanup.
- **`end_ts`** is captured before building **`file_list`**, then
**`poll_source`** uses the same upper bound (consistent with other
Dropbox-style connectors).
- **`batch_size`** from connector config is coerced to a positive
**`int`** before constructing the connector.
- Slim snapshot failures are caught in **`_generate`**; **`file_list`**
is set to **`None`** so cleanup is skipped rather than running on
partial/error state.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

### Files changed (summary)

| Area | Change |
|------|--------|
| `common/data_source/dingtalk_ai_table_connector.py` |
`SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`,
`_document_id` shared with document conversion |
| `rag/svr/sync_data_source.py` | `DingTalkAITable._generate`: slim
snapshot + tuple return; `batch_size` validation; shared `end_ts` with
`poll_source` |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`syncDeletedFiles` for DingTalk AI Table in
`DataSourceFeatureVisibilityMap` |

Closes / relates to: #14362
2026-05-06 14:06:23 +08:00
Magicbook1108
5fd4579a2f Fix: sync data source empty list (#14530)
### What problem does this PR solve?

Fix: sync data source empty list

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-30 18:56:43 +08:00
bitloi
a69e0c73c7 feat(rss): support deleted-file sync (#14493)
### What problem does this PR solve?

Partially addresses #14362.

This PR enables syncing deleted files for RSS data sources.

Previously, RSS incremental sync only returned feed entries whose
timestamps were inside the poll window. If an entry was removed from the
RSS feed, RAGFlow had no full current RSS snapshot to pass into the
shared stale-document cleanup path, so the deleted remote entry could
remain in the knowledge base.

This PR:
- adds `retrieve_all_slim_docs_perm_sync()` to `RSSConnector`
- reuses the same `rss:<md5(stable_key)>` document ID derivation used by
normal RSS ingest
- returns `(document_generator, file_list)` for incremental RSS sync
when `sync_deleted_files` is enabled
- captures the poll end timestamp before snapshot/poll so cleanup does
not race against the same sync window
- adds start/end logs around RSS slim snapshot collection
- exposes the deleted-file sync toggle for RSS in the data source UI

Per maintainer request on related datasource PRs, this PR contains no
test-case changes. Local verification was run with an external script.

Validation:
- `uv run ruff check common/data_source/rss_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
- `git diff --check`
- `uv run python /tmp/verify_rss_deleted_sync.py --repo
/root/74/ragflow`

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-30 18:56:13 +08:00
NeedmeFordev
bedf9592ef feat(webdav): support deleted-file sync via slim snapshot (#14491)
## What problem does this PR solve?

Incremental WebDAV sync only ingested files whose modification time fell
inside the poll window; documents removed on the WebDAV server were
never removed from the knowledge base. This aligns with
[#14362](https://github.com/infiniflow/ragflow/issues/14362)
(coordinated datasource “sync deleted files” work).

This PR adds a **full-tree slim snapshot**
(`retrieve_all_slim_docs_perm_sync`) that enumerates current remote
paths **without downloading file contents**, using the same logical
document IDs as full ingest (`webdav:{base_url}:{file_path}`). When
**`sync_deleted_files`** is enabled on incremental runs, sync returns
**`(document_generator, file_list)`** so **`SyncBase`** runs
**`cleanup_stale_documents_for_task`** and removes KB rows no longer
present remotely.

Design notes:

- **`_list_files_recursive`** gains **`filter_by_mtime`**: snapshot
passes **`filter_by_mtime=False`** (full tree under **`remote_path`**);
**`poll_source`** keeps mtime-window filtering as before.
- Slim snapshot applies the same **extension** and **`size_threshold`**
rules as **`_yield_webdav_documents`** so retain IDs match what would be
indexed.
- **`end_ts`** is captured before building **`file_list`**, then
**`poll_source`** uses the same upper bound (consistent with
Dropbox-style connectors).

## Type of change

- [x] New Feature (non-breaking change which adds functionality)

## Files changed

| Area | Change |
|------|--------|
| `common/data_source/webdav_connector.py` |
`SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`,
`filter_by_mtime` on `_list_files_recursive` |
| `rag/svr/sync_data_source.py` | WebDAV `_generate`: `file_list` +
tuple return; pass **`batch_size`** from connector config |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` |
2026-04-30 17:26:27 +08:00
bitloi
17eda04b8d feat(zendesk): support deleted-file sync (#14487)
### What problem does this PR solve?

Refs #14362.

This PR enables syncing deleted files for Zendesk data sources.

Previously, Zendesk incremental sync never returned a slim remote
snapshot to the shared stale-document cleanup path, so deleted remote
Zendesk records could remain in RAGFlow. The existing Zendesk slim
snapshot also included records that ingestion intentionally skips, such
as draft articles, articles without bodies, skipped-label articles,
empty-body articles, and tickets with `status == "deleted"`.

This PR:
- exposes the deleted-file sync option for Zendesk in the data source UI
- returns Zendesk slim snapshots during incremental sync when
`sync_deleted_files` is enabled
- reuses Zendesk indexability rules so cleanup compares against the same
records ingestion can materialize
- adds start/end logs around Zendesk slim snapshot collection for
operational visibility

Per maintainer request, this PR contains no test-case changes. Manual
verification recording will be provided separately.

Validation:
- `uv run ruff check common/data_source/zendesk_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2026-04-30 14:44:05 +08:00
bitloi
8f75e52bbf feat(asana): support deleted-file sync (#14468)
### What problem does this PR solve?

Partially addresses #14362.

Adds deleted-file sync support for the Asana data source. Asana already
indexes task attachments as documents, but it did not provide the slim
document snapshot required by stale-document reconciliation, and the
sync wrapper never returned a `file_list` for cleanup.

This PR:
- adds `retrieve_all_slim_docs_perm_sync()` to `AsanaConnector`
- builds slim IDs with the same `asana:{task_id}:{attachment_gid}`
format used by indexed documents
- avoids downloading attachment blobs during the snapshot
- aborts the snapshot if Asana API errors occur, preventing partial
snapshots from deleting valid local docs
- captures the incremental poll end time before snapshotting and makes
`poll_source()` respect that boundary
- exposes the deleted-file sync toggle for Asana in the data source UI

Per maintainer request, this PR contains no test-case changes. Manual
verification recording will be provided separately.

Validation:
- `uv run ruff check common/data_source/asana_connector.py
rag/svr/sync_data_source.py`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`
- `git diff --check`

### Type of change

- [x] New Feature
2026-04-30 14:41:36 +08:00
NeedmeFordev
2932b65da6 feat(seafile): support deleted-file sync via slim snapshot (#14499)
### What problem does this PR solve?

Incremental Seafile sync only ingests files whose modification time
falls in the poll window; documents removed in Seafile were never
removed from the knowledge base. This contributes to
[#14362](https://github.com/infiniflow/ragflow/issues/14362) (datasource
“sync deleted files” coordination).

This PR adds a **slim snapshot** (`retrieve_all_slim_docs_perm_sync`)
that enumerates current remote file IDs **without downloading content**,
using the same logical IDs as full ingest
(`seafile:{repo_id}:{file_id}`). When **`sync_deleted_files`** is
enabled on incremental runs, **`SeaFile._generate`** returns
**`(document_generator, file_list)`** so **`SyncBase`** can run
**`cleanup_stale_documents_for_task`** and remove stale KB documents.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
### What changed

- **`common/data_source/seafile_connector.py`**: `SeaFileConnector`
implements **`SlimConnectorWithPermSync`**;
**`_list_files_recursive(..., filter_by_mtime=...)`** supports full-tree
listing for snapshots; **`retrieve_all_slim_docs_perm_sync()`** reuses
the same library/root scan as ingest and applies the same **size**
ceiling; logging for snapshot start/end and counts.
- **`rag/svr/sync_data_source.py`**: **`SeaFile._generate`** validates
**`batch_size`**, captures **`end_ts`** before snapshot +
**`poll_source`**, wraps slim retrieval in **`try`/`except`** (
**`file_list = None`** on failure so ingest continues), returns
**`(generator, file_list)`**.
- **`web/src/pages/user-setting/data-source/constant/index.tsx`**:
**`syncDeletedFiles`** for Seafile in
**`DataSourceFeatureVisibilityMap`**.
2026-04-30 12:05:12 +08:00
sapienza yoan
811e9826d0 perf: avoid O(n²) array growth in embedding accumulation (#14369)
### What problem does this PR solve?

Both tokenizer (`rag/flow/tokenizer/tokenizer.py`) and
`BuiltinEmbed.encode`
(`rag/llm/embedding_model.py`) currently accumulate embedding batches
via
`np.concatenate` inside the per-batch loop. `np.concatenate` allocates a
new
array and copies all existing data on every call, so accumulating N
batches
is O(N²) in both time and peak memory.

Replacing the incremental concatenate with a list-of-batches + a single
`np.vstack` at the end gives O(N) total work.

For tokenizer the title-vector broadcast `np.concatenate([vts[0]] * N)`
is
also replaced by `np.tile`, which does the same job with a single
contiguous
allocation instead of building a Python list of references.

This is purely a CPU/memory optimisation — output shape and dtype are
unchanged. Measured impact grows with document size:
  -   1k chunks (batch 512, 2 iters):    ~negligible
  -  10k chunks (20 iters):              ~10× speedup on this stage
  - 100k chunks (195 iters):             ~100× speedup, and peak RAM
drops from O(N) extra to near-zero

### Type of change

- [x] Performance Improvement

Co-authored-by: yoan sapienza <Yoan Sapienza yoan.sapienza@orange.fr Yoan Sapienza zappy@macbookpro.home>
2026-04-30 11:00:10 +08:00
Magicbook1108
de8c6ad0f3 Feat: enable sync deleted file for Discord (#14451)
### What problem does this PR solve?

Feat: enable sync deleted file for Discord

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:05:40 +08:00
bitloi
2bc8c6d35e feat(dropbox): support deleted-file sync (#14476)
### What problem does this PR solve?

Partially addresses #14362 by adding deleted-file sync support for the
Dropbox data source.

Dropbox previously did not provide the slim current-file snapshot
required by stale document reconciliation, and its sync runner returned
only document batches. As a result, enabling deleted-file sync could not
remove local documents that had been deleted from Dropbox.

This PR:
- Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`.
- Reuses Dropbox metadata traversal to collect current remote file IDs
without downloading file contents.
- Wires incremental Dropbox sync to return `(document_generator,
file_list)` when `sync_deleted_files` is enabled.
- Enables the deleted-file sync toggle for Dropbox in the data source
settings UI.
- Adds regression coverage for slim snapshots, nested folders, paginated
listings, duplicate filenames, and full reindex behavior.

Tests:
- `uv run pytest test/unit_test/common/test_dropbox_connector.py -q`
- `uv run pytest test/unit_test/rag/test_sync_data_source.py -q`
- `uv run pytest test/unit_test/common/test_dropbox_connector.py
test/unit_test/rag/test_sync_data_source.py -q`
- `uv run ruff check common/data_source/dropbox_connector.py
rag/svr/sync_data_source.py
test/unit_test/common/test_dropbox_connector.py
test/unit_test/rag/test_sync_data_source.py`
- `./node_modules/.bin/eslint
src/pages/user-setting/data-source/constant/index.tsx`

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:05:11 +08:00
Magicbook1108
db1a73b255 Feat: enable sync deleted files in gitlab (#14481)
### What problem does this PR solve?

Feat: enable sync deleted files in gitlab

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 19:04:10 +08:00
Magicbook1108
e0b3070012 Feat: enable sync deleted files for Gmail && fix google drive issues (#14462)
### What problem does this PR solve?

Feat: enable sync deleted files for Gmail && fix google drive issues

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: bill <yibie_jingnian@163.com>
Co-authored-by: balibabu <assassin_cike@163.com>
2026-04-29 17:03:56 +08:00
Magicbook1108
3b7a6eaa6c Feat: sync deleted files in Bitbucket (#14450)
### What problem does this PR solve?

Feat: sync deleted files in Bitbucket

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-29 11:29:17 +08:00
Paras Sondhi
74fa54f122 feat(google-drive): optimize memory payload and enable sync deletion (#14372)
**Addresses the Google Drive integration for #14362**

This PR completely overhauls the Google Drive sync logic to accurately
detect remote deletions, while drastically reducing the memory footprint
during the snapshot phase.

### What changed under the hood:

* **Killed the memory bloat:** Swapped out the massive document
dictionary objects for a lightweight `collections.namedtuple` (`SlimDoc
= namedtuple('SlimDoc', ['id'])`). This prevents RAM spikes during
`retrieve_all_slim_docs_perm_sync` on massive enterprise drives.
* **Flawless downstream integration:** The `SlimDoc` object relies on
simple duck typing. It perfectly delivers the `.id` attribute required
by `ConnectorService.cleanup_stale_documents_for_task`, meaning your
core `hash128` vector cleanup logic runs natively without modification.
* **Fixed the Shared Drive blindspot:** The standard API query was
missing team folders. Injected the `corpora="allDrives"` and
`includeItemsFromAllDrives=True` override flags so the connector now
accurately maps state across both personal workspaces and organizational
Shared Drives.

### Testing:
Isolated the Google API retrieval logic locally to prove the `SlimDoc`
mapping works and correctly registers state drops when a file is trashed
remotely.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Performance Improvement
2026-04-29 10:04:36 +08:00
Magicbook1108
0d18b293f5 Fix: enable sync deleted file in airtable (#14438)
### What problem does this PR solve?

Fix: enable sync deleted file in airtable

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-28 20:09:08 +08:00
Magicbook1108
18fbfafca6 Feat: enable sync deleted files for more connectors (#14353)
### What problem does this PR solve?

Feat: enable sync delted files for connectors

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-04-28 15:07:14 +08:00
Jack
872ff08304 Fix: add executor.shutdown (#14403)
### What problem does this PR solve?

Add executor shutdown in finally clause to free resources.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 22:38:43 +08:00
Idriss Sbaaoui
4303be223f Fix metadata parsing regression for upgraded v0.24 datasets (#14383)
### What problem does this PR solve?

This PR fixes issue #14371 where file parsing failed after upgrading
from v0.24.0 to v0.25.0, because metadata config could be a JSON Schema
object but was handled like a list and later caused `KeyError:
'properties'`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-27 16:18:06 +08:00
yuch85
0d87cecae2 feat: persist PDF bookmark outline as document metadata (#13287)
## Summary

PDF files often contain a bookmark/outline tree (table of contents built
into the file by the authoring tool). RAGFlow's `pdf_parser.outlines`
already extracts these `(title, depth)` tuples via pypdf, but they are
used ephemerally during chunking (`manual` parser uses them for
hierarchy detection) and then discarded.

This PR persists the outline as `doc.meta_fields["outline"]` — a JSON
array of `{"title": str, "depth": int}` objects — so downstream features
can use the structural information.

### Why this matters

- **Complementary to `toc_extraction`** — the existing `toc_extraction`
feature uses LLM calls to generate a TOC and only works for the `naive`
parser. The raw PDF outline is free (already extracted by pypdf), works
for all parsers, and captures the author's original document structure.
- **Document navigation** — frontends can render a clickable TOC from
the outline
- **Entity extraction** — the outline provides a structural map for
identifying document sections and key topics
- **Search result context** — knowing which section a chunk belongs to
helps users evaluate relevance

### Changes

| File | Change | LOC |
|------|--------|-----|
| `rag/app/naive.py` | Attach `pdf_parser.outlines` as `__outline__` on
first chunk dict | ~7 |
| `rag/app/manual.py` | Same for the manual parser | ~5 |
| `rag/svr/task_executor.py` | Extract `__outline__`, persist via
`DocMetadataService.update_document_metadata()` | ~12 |

### Design decisions

- **Transient key pattern**: The outline is passed from parser →
task_executor via `__outline__` on the first chunk dict, then removed
before indexing. This follows the same pattern as `metadata_obj` for
LLM-generated metadata.
- **No schema changes**: Uses the existing `meta_fields` JSON column on
the document table.
- **Graceful degradation**: If a PDF has no outline (common for scanned
docs), nothing is stored. If persistence fails, it logs a warning and
continues — parsing is not interrupted.

### Backward compatibility

- **Fully backward compatible** — no existing fields, behavior, or
schemas changed
- PDFs without outlines are unaffected
- Existing `meta_fields` data is preserved (merged, not overwritten)

## Test plan

- [ ] Parse a PDF with bookmarks (e.g. any multi-chapter document),
verify `meta_fields["outline"]` is populated
- [ ] Parse a PDF without bookmarks, verify no errors and no outline key
in meta_fields
- [ ] Verify existing `meta_fields` data is preserved (not overwritten)
when outline is added
- [ ] Verify `manual` parser also persists outlines
- [ ] Verify outline JSON structure: `[{"title": "Chapter 1", "depth":
0}, ...]`

Related: #9921 (Deterministic Document Access Layer)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yuch85 <yuch85.1@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-04-27 11:57:06 +08:00
yuch85
3ad3241ae0 feat: persist RAPTOR layer metadata on summary chunks (#13286)
## Summary

RAPTOR's recursive clustering builds a `layers` list tracking
`(start_idx, end_idx)` boundaries per level, but currently discards this
information — only the flat `chunks` list is returned. This makes it
impossible to distinguish leaf-level summaries from top-level ones.

This PR:
- Returns `(chunks, layers)` tuple from `raptor.py`'s `__call__`
- Annotates each RAPTOR summary chunk with `raptor_layer_int` (1 = first
summary level, 2 = summary-of-summaries, etc.)
- Adds `raptor_layer_int` to `infinity_mapping.json` (Elasticsearch
handles it via existing `*_int` dynamic template)

### Why this matters

Downstream features need to know which RAPTOR layer a summary belongs
to:
- **Retrieving the top-level document summary** for entity extraction,
search snippets, or document comparison
- **Filtering by abstraction level** — users may want only high-level
summaries or only leaf-level cluster summaries
- **RAPTOR recall quality** — #10951 reports summaries not being
recalled for definition queries; layer metadata enables targeted
retrieval

### Changes

| File | Change | LOC |
|------|--------|-----|
| `rag/raptor.py` | Return `(chunks, layers)` tuple | ~3 |
| `rag/svr/task_executor.py` | Build `chunk_layer` mapping, set
`raptor_layer_int` | ~12 |
| `conf/infinity_mapping.json` | Add `raptor_layer_int` integer field |
~1 |

### Backward compatibility

- **Additive only** — no existing fields or behavior changed
- Existing RAPTOR chunks continue to work (they'll have
`raptor_layer_int = 0` by default)
- New RAPTOR chunks get layer metadata automatically

## Test plan

- [ ] Parse a document with RAPTOR enabled, verify `raptor_layer_int` is
set on indexed chunks
- [ ] Verify `raptor_layer_int` values increase with abstraction level
(layer 1 < layer 2 < ...)
- [ ] Verify existing RAPTOR deletion (`delete by raptor_kwd`) still
works
- [ ] Verify Infinity backend accepts the new field

Fixes #7488
Related: #4104, #11191, #10951

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yuch85 <yuch85.1@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-04-27 10:20:46 +08:00
Idriss Sbaaoui
ca01c7a745 Fix blob sync: skip unsupported files before download (#14357)
### What problem does this PR solve?

Blob storage sync was downloading unsupported files first and rejecting
them later, which wasted bandwidth and made sync slower. This PR skips
unsupported extensions before download and applies `allow_images` in
blob sync. fixes #14338

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-04-24 19:22:32 +08:00