ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
Zhichang Yu	faef22c18a	Harden closed-advisory fixes (#16409 ) ## Summary - harden reopened advisory fixes across REST connector, invoke, document downloads, and markdown rendering - add targeted regression coverage for redirect-safe SSRF handling, invoke SSRF checks, document access control, and markdown sanitization - verify each referenced GHSA against the original GitHub advisory text and align the closed-advisory plan with the implemented remediation ## What changed - add tenant access checks to document download endpoints to avoid cross-tenant document disclosure - add per-hop SSRF validation, DNS pinning, redirect handling, and redirect limits to the REST API connector - ensure invoke requests validate and pin the resolved host and never follow redirects implicitly - keep the generic rate-limited request path wrapped, not just GET and POST helpers - sanitize markdown HTML before rendering in the highlight markdown component ## Validation - `cd web && npm test -- --runInBand src/components/highlight-markdown/__tests__/index.test.tsx` - `.venv/bin/python -m pytest -q test/unit_test/data_source/test_rest_api_connector.py` - targeted `test/testcases/test_web_api/...` unit additions were reviewed, but the suite cannot be executed end-to-end in this environment because parent `test/testcases/conftest.py` requires a local service on `127.0.0.1:9380` ## Notes - all GHSA entries referenced by the plan were checked against the original GitHub advisory text, not sampled - the closed-advisory plan document was updated locally during review, but is intentionally not included in this PR	2026-06-29 09:45:16 +08:00
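A minimal sketch of the per-hop SSRF check described in this commit, not the code from the PR. The helper names `is_blocked_ip` and `pin_host`, and the injected `resolve` callable, are hypothetical; the idea is that every redirect hop is re-validated and the connection is then made to the already-validated ("pinned") IP rather than resolving the hostname a second time.

```python
import ipaddress
from typing import Callable, List
from urllib.parse import urlsplit


def is_blocked_ip(ip: str) -> bool:
    """Return True for addresses a server-side fetcher should refuse to contact."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def pin_host(url: str, resolve: Callable[[str], List[str]]) -> str:
    """Validate one hop and return the IP to connect to (hypothetical helper).

    `resolve` is injected so the check can be exercised without real DNS.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"unsupported URL: {url!r}")
    ips = resolve(parts.hostname)
    if not ips or any(is_blocked_ip(ip) for ip in ips):
        raise ValueError(f"host resolves to a blocked address: {parts.hostname!r}")
    # Pin the first validated address; the caller connects to this IP directly.
    return ips[0]
```

A caller following redirects manually would call `pin_host` on each `Location` target and stop after a fixed number of hops.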
oktofeesh	c15b2b3f66	fix(connectors): enforce WebDAV numeric string size limits (#15731 ) ## Summary - Normalize WebDAV file-size metadata before applying the sync size threshold. - Enforce the same threshold for numeric string sizes in both document sync and slim snapshot paths. - Add focused WebDAV unit coverage for size parsing and over-threshold skips. ## Why Some WebDAV servers return file sizes from PROPFIND metadata as strings. The previous threshold check only handled integer values, so oversized files could still be downloaded and sent into the chunking pipeline. Closes #15724. ## Validation - `uv run --no-project --with pytest --with pytest-asyncio pytest test/unit_test/data_source/test_webdav_connector_unit.py -q` - `uvx ruff check common/data_source/webdav_connector.py test/unit_test/data_source/test_webdav_connector_unit.py` - `python -m compileall -q common/data_source/webdav_connector.py test/unit_test/data_source/test_webdav_connector_unit.py` - `git diff --check` --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 15:47:54 +08:00
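The size normalization this commit describes could look roughly like the following. This is a sketch under assumptions: `normalize_size` and `exceeds_threshold` are hypothetical names, and the real connector may treat unparseable values differently.

```python
from typing import Optional


def normalize_size(raw: object) -> Optional[int]:
    """Coerce a PROPFIND size value (int or numeric string) to an int, else None."""
    if isinstance(raw, bool):  # bool is a subclass of int; never a valid size
        return None
    if isinstance(raw, int):
        return raw if raw >= 0 else None
    if isinstance(raw, str):
        text = raw.strip()
        # isascii() guards against digit-like characters that int() rejects.
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def exceeds_threshold(raw: object, threshold: int) -> bool:
    """True when the size is known and larger than the sync threshold."""
    size = normalize_size(raw)
    return size is not None and size > threshold
```

With this shape, a size reported as the string `"20971520"` is compared numerically against the threshold instead of slipping past an integer-only check.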
gaulin-ai	8abe627e69	i18n(it): complete Italian translation (49% → 100%) (#15729 ) ## Summary Brings the Italian locale (`web/src/locales/it.ts`) from approximately 49% coverage (986 out of 2008 keys) to 100% coverage (2008/2008 keys), fully aligned with `en.ts` in structure and key count. ### What was missing Previously untranslated sections include: - `skills`, `skillSearch` — agent skills UI - `memories`, `memory` — memory management - `datasetOverview` — dataset statistics - `llmTools` — LLM tool configuration - `explore` — explore/template page - `dataflowParser` — ingestion pipeline parser settings - `flow` (complete) — agent canvas / workflow editor - `setting` connectors section — data source connectors (Google Drive, Gmail, Box, RDBMS, etc.) - Various `header`, `common`, `knowledgeBase`, `chat`, `fileManager` additions ### Translation conventions - Technical terms kept in English: RAG, LLM, API, token, chunk, embedding, prompt, dataset, agent, canvas, knowledge graph, RAPTOR, webhook, and all model/provider names (Bedrock, Tavily, SearXNG, etc.) - `{{placeholder}}` variables preserved unchanged - Informal tu form used consistently, matching the existing style - All previously correct translations preserved	2026-06-08 18:06:47 +08:00
kpdev	b0a45809ff	fix(onedrive): normalize folder_path for Graph delta URL (#15503 ) Prepend a leading slash and reject `..` segments so scoped OneDrive delta queries use `root:/path:/delta` instead of `root:path:/delta`. Fixes #15500 ### What problem does this PR solve? The OneDrive connector builds Microsoft Graph delta URLs from optional `config.folder_path`. When users enter a path without a leading slash (e.g. `Documents/Reports` instead of `/Documents/Reports`), the connector produces a malformed URL such as `root:Documents/Reports:/delta`. Per [Microsoft Graph path-based addressing](https://learn.microsoft.com/en-us/graph/onedrive-addressing-driveitems), the segment after `root:` must start with `/` (e.g. `root:/Documents/Reports:/delta`). Sync and validation then fail or return no documents, which is hard to diagnose from the UI because the optional folder field does not enforce the format. This PR normalizes `folder_path` at connector construction time (prepend `/`, trim whitespace and trailing slashes) and rejects `..` segments before any Graph request is made. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-08 09:56:47 +08:00
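A short sketch of the normalization described above, assuming hypothetical helpers `normalize_folder_path` and `delta_url`; the URL encoding step and the exact error type are assumptions, not details taken from the PR.

```python
from typing import Optional
from urllib.parse import quote

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def normalize_folder_path(folder_path: Optional[str]) -> Optional[str]:
    """Trim whitespace and trailing slashes, reject '..', prepend a leading '/'."""
    if folder_path is None:
        return None
    path = folder_path.strip().rstrip("/")
    if not path:
        return None  # empty or "/" means "whole drive"
    if any(segment == ".." for segment in path.split("/")):
        raise ValueError("folder_path must not contain '..' segments")
    return path if path.startswith("/") else "/" + path


def delta_url(drive_id: str, folder_path: Optional[str]) -> str:
    """Build the delta URL, scoped to a folder when one is configured."""
    path = normalize_folder_path(folder_path)
    if path is None:
        return f"{GRAPH_BASE}/drives/{drive_id}/root/delta"
    # Path-based addressing: the segment after 'root:' must start with '/'.
    return f"{GRAPH_BASE}/drives/{drive_id}/root:{quote(path)}:/delta"
```

For example, `delta_url("d1", "Documents/Reports")` yields a URL ending in `root:/Documents/Reports:/delta` rather than the malformed `root:Documents/Reports:/delta`.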
web-dev0521	1d7e45115b	feat(connectors): add Salesforce CRM data source connector (#15462 ) ### What problem does this PR solve? Closes #15461. RAGFlow had no way to ingest Salesforce CRM data, so support / sales teams couldn't ground responses on live Accounts, Contacts, Opportunities, Cases, or Knowledge articles. This adds a first-class Salesforce data source connector that authenticates against a Connected App via OAuth 2.0 client-credentials, queries selected SObjects via SOQL, and turns each record into an indexable document with incremental sync. Highlights - `common/data_source/salesforce_connector.py`: new `SalesforceConnector` (`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`). - OAuth 2.0 client-credentials flow; canonical `instance_url` from the token response so multi-pod orgs route correctly. - Per-object `SystemModstamp` cursor stored in `SalesforceCheckpoint.cursors` — a failure mid-object doesn't rewind sibling objects, and re-syncs only fetch changed rows. - Deterministic record-to-text formatter (sorted keys) so SOQL field reordering on the server doesn't mark every row "changed" on each poll. - `_get_json` raises on non-2xx so 429 / 5xx never silently advance the checkpoint past missing data. - `Knowledge__kav` is in the default object set but is skipped silently when the org doesn't have Salesforce Knowledge enabled (404 on describe). - Slim-doc IDs are scoped as `<Object>/<Id>` so prune deletes can't collide across object types. - `common/constants.py`, `common/data_source/config.py`, `common/data_source/__init__.py`: register `salesforce` in `FileSource` / `DocumentSource` and export `SalesforceConnector`. - `rag/svr/sync_data_source.py`: new `Salesforce(SyncBase)` class routed through `load_from_checkpoint` (poll_source would re-walk every object each run) and added to `func_factory`. 
- Frontend: - `web/src/pages/user-setting/data-source/constant/index.tsx`: new `DataSourceKey.SALESFORCE`, form fields (instance URL, client ID/secret, objects, api_version, batch size), `syncDeletedFiles` capability, default form values, and tile entry with the new icon. - `web/src/locales/{en,zh}.ts`: description + per-field tooltips. - `web/src/assets/svg/data-source/salesforce.svg`: 48x48 brand-style icon to match the other Microsoft / cloud tiles. Verification - `npm run build` (vite + esbuild) passes (1m 26s). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-05 13:24:36 +08:00
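The deterministic record-to-text formatter mentioned in this commit can be illustrated with a small sketch. `record_to_text` is a hypothetical name, and skipping the `attributes` key and null values is an assumption about what the real formatter does.

```python
from typing import Any, Dict


def record_to_text(record: Dict[str, Any]) -> str:
    """Render a record with keys in sorted order so field order never affects the output."""
    lines = []
    for key in sorted(record):
        if key == "attributes":  # assumption: per-record API metadata is not indexed
            continue
        value = record[key]
        if value is None:  # assumption: null fields are omitted
            continue
        lines.append(f"{key}: {value}")
    return "\n".join(lines)
```

Because the keys are sorted, two responses that list the same fields in a different order produce identical text, so a content hash over that text stays stable between polls.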
web-dev0521	98f2a2e60b	feat(connectors): add Azure Blob Storage data source connector (#15466 ) ### What problem does this PR solve? Closes #15465. RAGFlow supports S3, Google Cloud Storage, R2, and OCI as data sources but not Azure Blob Storage, leaving Azure users without a way to index container objects into a knowledge base. This adds a first-class Azure Blob Storage data-source connector — distinct from RAGFlow's existing Azure storage backends (`rag/utils/azure_sas_conn.py`, `rag/utils/azure_spn_conn.py`) which store RAGFlow's own files. Highlights - `common/data_source/azure_blob_connector.py`: new `AzureBlobConnector` (`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`). - Uses the existing `azure-storage-blob` dependency (already in `pyproject.toml`). - Three auth modes, tried in order of precedence: 1. Account key — `account_name` + `account_key` + `container_name`. 2. Connection string — `connection_string` + `container_name`. 3. SAS token — `container_url` + `sas_token` (same shape as `RAGFlowAzureSasBlob`). - ETag fingerprint stored per blob in `AzureBlobCheckpoint.etags` — unchanged blobs (same ETag as last run) are skipped without a download. Only new/modified blobs are fetched. - Optional `prefix` scopes indexing to a virtual folder. - `validate_connector_settings()` probes `get_container_properties()` and maps `AuthenticationFailed / 403 / ContainerNotFound` to typed connector exceptions. - Slim-doc IDs are blob names so prune reconciles correctly. - `common/constants.py`, `common/data_source/config.py`, `common/data_source/__init__.py`: register `azure_blob` in `FileSource` / `DocumentSource` and export `AzureBlobConnector`. - `rag/svr/sync_data_source.py`: new `AzureBlob(SyncBase)` class routed through `load_from_checkpoint` (ETag fingerprint owns change-detection) and added to `func_factory`. 
- Frontend: - `web/src/pages/user-setting/data-source/constant/index.tsx`: new `DataSourceKey.AZURE_BLOB`, auth-mode selector (account key / connection string / SAS token), all credential fields, prefix + batch-size, `syncDeletedFiles` capability, default form values, tile entry with icon. - `web/src/locales/{en,zh}.ts`: description + per-field tooltips for all 9 new keys. - `web/src/assets/svg/data-source/azure-blob.svg`: Azure-branded stacked-cylinders icon. Verification - `npm run build` (vite + esbuild) passes (37 s). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-04 21:06:01 +08:00
web-dev0521	cd18cfab79	feat(connector): implement Outlook data source connector (issue #15332 ) (#15333 ) ### What problem does this PR solve? Closes #15332. RAGFlow can index Gmail and generic IMAP mailboxes but had no native connector for Outlook / Microsoft 365 mail. Organisations on Microsoft 365 had no way to bring mailbox content into a knowledge base through Microsoft Graph. This PR adds a net-new Outlook data source that: - Authenticates against Microsoft Graph with the same MSAL client-credentials flow already used by the SharePoint and Teams connectors (no new auth primitives). - Pages over `/users/{id}/mailFolders/{folder}/messages/delta` per mailbox and persists `@odata.deltaLink` values in `OutlookCheckpoint.delta_links`, so incremental syncs only fetch changed messages. - Supports two scoping modes: - Tenant-wide (default): enumerates every user in the tenant via `/users` and syncs each mailbox. Requires `User.Read.All`. - Targeted: when `user_ids` is provided (comma-separated UPNs or object IDs), only those mailboxes are synced. `User.Read.All` is not needed in this mode. - Lets the caller pick the mail folder (`inbox`, `sentitems`, `archive`, ...). Defaults to `inbox`. - Maps each message to a `Document` shaped after the Gmail connector: one `TextSection` carrying `From/To/Cc/Subject` headers + body, with HTML bodies stripped to text inline (no extra dependency). - Surfaces typed errors on the validation probe: 401 → `ConnectorMissingCredentialError`, 403 → `InsufficientPermissionsError` (with `Mail.Read` / `User.Read.All` hint), 404 on a configured mailbox → `ConnectorValidationError`, 5xx → `UnexpectedValidationError`. - Skips messages flagged `@removed` by the delta semantics and messages whose `receivedDateTime` is older than `poll_range_start`. 
#### Files \| File \| Change \| \|------\|--------\| \| `common/data_source/outlook_connector.py` \| New — `OutlookConnector` (`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`) + `OutlookCheckpoint` + tiny `_strip_html` helper. \| \| `common/data_source/config.py` \| `DocumentSource.OUTLOOK = "outlook"`. \| \| `common/constants.py` \| `FileSource.OUTLOOK = "outlook"`. \| \| `common/data_source/__init__.py` \| Export `OutlookConnector`. \| \| `rag/svr/sync_data_source.py` \| `Outlook(SyncBase)` with `batch_size` normalisation, CSV/list parsing of `user_ids`; registered in `func_factory`. \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `DataSourceKey.OUTLOOK`, visibility map (`syncDeletedFiles: true`), info entry, form fields (tenant_id, client_id, client_secret, folder, user_ids, batch_size), default values. \| \| `web/src/locales/en.ts`, `web/src/locales/zh.ts` \| `outlookDescription` + 5 tooltip keys (EN + ZH). \| \| `test/unit_test/data_source/test_outlook_connector_unit.py` \| New — 19 unit tests (`p1`/`p2`/`p3`) covering auth, validation (tenant-wide vs specific user vs error paths), checkpoint helpers, user enumeration pagination, message filtering, HTML body stripping. \| #### Required Azure AD permissions - `Mail.Read` (Application, admin-granted) — always. - `User.Read.All` (Application, admin-granted) — only when `user_ids` is left blank so the connector can enumerate mailboxes. #### Out of scope - Attachment indexing. The current connector emits message body + headers; binary attachments are flagged via `metadata.has_attachments` but not pulled. Adding attachment hydration is straightforward but scoped out per the issue's "decide whether attachments are indexed in the first version" note. - Delegated (per-user) OAuth. The connector uses app-only credentials, consistent with the SharePoint / Teams precedent in this codebase. 
### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-29 21:52:29 +08:00
web-dev0521	bda2117a25	feat(connector): implement OneDrive data source connector (issue #15330 ) (#15331 ) ### What problem does this PR solve? Closes #15330. RAGFlow had no connector for OneDrive / OneDrive for Business. Users who store working documents in OneDrive could not index them into a knowledge base without manually downloading and re-uploading files. This PR adds a net-new OneDrive data source that: - Authenticates against Microsoft Graph with the same MSAL client-credentials flow already used by the SharePoint and Teams connectors (no new auth primitives). - Enumerates every drive visible to the service principal and pages through `/drives/{id}/root/delta`, persisting `@odata.deltaLink` values per drive so subsequent syncs only fetch changed items. - Optionally narrows ingestion to a sub-folder (`folder_path`) without needing a separate code path. - Surfaces typed errors on the validation probe (`GET /drives?$top=1`): 401 → `ConnectorMissingCredentialError`, 403 → `InsufficientPermissionsError` (with a `Files.Read.All` hint), 5xx → `UnexpectedValidationError`. - Filters folders, soft-deleted items, and unsupported extensions (`.pdf .docx .doc .xlsx .xls .pptx .ppt .txt .md .csv`). #### Files \| File \| Change \| \|------\|--------\| \| `common/data_source/onedrive_connector.py` \| New — `OneDriveConnector` + `OneDriveCheckpoint`. \| \| `common/data_source/config.py` \| `DocumentSource.ONEDRIVE = "onedrive"`. \| \| `common/constants.py` \| `FileSource.ONEDRIVE = "onedrive"`. \| \| `common/data_source/__init__.py` \| Export `OneDriveConnector`. \| \| `rag/svr/sync_data_source.py` \| `OneDrive(SyncBase)` with `batch_size` normalisation; registered in `func_factory`. \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `DataSourceKey.ONEDRIVE`, visibility map (`syncDeletedFiles: true`), info entry, form fields (tenant_id, client_id, client_secret, folder_path, batch_size), default values. 
\| \| `web/src/locales/en.ts`, `web/src/locales/zh.ts` \| `onedriveDescription` + 4 tooltip keys (EN + ZH). \| \| `test/unit_test/data_source/test_onedrive_connector_unit.py` \| New — 13 unit tests (`p1`/`p2`) covering auth, validation, checkpoint helpers, and document filtering. \| #### Required Azure AD permission `Files.Read.All` (Application, admin-granted). #### Out of scope - Interactive end-user OAuth (delegated permissions) — the connector uses app-only credentials, consistent with the SharePoint / Teams precedent. - Binary download of file contents — the sync layer emits `Document`s carrying `webUrl` + metadata; bytes are hydrated downstream by the parse pipeline. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-29 19:26:06 +08:00
web-dev0521	98bc9ca6ac	feat: implement Microsoft Teams data source connector (#15193 ) ### What problem does this PR solve? Closes #15191. RAGFlow shipped a Microsoft Teams connector stub (`common/data_source/teams_connector.py`) whose document-loading methods all returned `[]`, `Teams._generate()` was a `pass`, and Teams was commented out of the data-source settings UI. As a result there was no way to index Teams channel conversations into a knowledge base. This PR implements the connector end to end on top of Microsoft Graph (Office365-REST-Python-Client). It shares the MSAL client-credentials auth shape with the SharePoint connector. Backend - `common/data_source/teams_connector.py` - `load_credentials()` now builds the Graph client using an MSAL client-credentials token callback — the form `GraphClient` actually expects. (The previous stub passed a raw access-token string to `GraphClient(...)`, which is not how that client is driven.) Token acquisition is lazy, so credential loading performs no network call. - `validate_connector_settings()` lists teams via Graph. - `load_from_checkpoint()` is now a generator that pages teams → channels → messages, flattens each top-level post together with its replies into one blob-based `Document` (`extension` `.txt`/`.html`, `blob`, `size_bytes`, `doc_updated_at`). Incremental syncs are bounded by message `lastModifiedDateTime` (falling back to `createdDateTime`). Per-message errors surface as `ConnectorFailure` instead of aborting the run. - `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument` batches and the checkpoint helpers return proper `TeamsCheckpoint`s. - ACL → `ExternalAccess` mapping is intentionally left best-effort (`load_from_checkpoint_with_perm_sync` delegates to the standard load) because the sync pipeline does not currently persist `ExternalAccess`. 
- `rag/svr/sync_data_source.py` - Implemented `Teams._generate()` using the existing `CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google Drive), supporting full reindex and incremental polling from `poll_range_start`. - `TeamsConnector` is already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `TEAMS` data-source enum and added its form fields (`tenant_id`, `client_id`, `client_secret`), default values, display metadata, and a Teams icon. - Added `teamsDescription` / `teamsTenantIdTip` to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_teams_connector_unit.py`: mock-based unit tests covering credential loading (incomplete creds raise, happy path sets the Graph client, fetch-without-creds raises), post/reply flattening (incl. the HTML vs text extension), incremental `lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass; `ruff check` is clean. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 17:10:38 +08:00
web-dev0521	5de021ebb4	feat: implement Slack data source connector (#15188 ) ### What problem does this PR solve? Closes #15187. RAGFlow shipped a Slack connector (`common/data_source/slack_connector.py`) but it was never usable: `Slack._generate()` in the sync worker was a `pass` stub, the connector's document-generating code was incompatible with the current data model, and Slack was commented out of the data-source settings UI. As a result, teams had no way to index Slack channels/threads into a knowledge base. This PR completes the connector end to end. Backend - `common/data_source/slack_connector.py` - Rewrote `thread_to_doc` to produce a blob-based `Document` (`extension`/`blob`/`size_bytes`). The previous implementation built the doc with a `sections=[...]` argument and omitted the now-required `blob`/`extension`/ `size_bytes` fields, so it raised a validation error against the current `Document` model. Thread messages are now cleaned and flattened into a single UTF-8 text blob. - Added `load_from_state()` / `poll_source(start, end)` generators. The connector's checkpoint interface is a no-op stub, so both full and incremental syncs run through a single channel-iterating generator built on the existing module helpers (`get_channels`, `filter_channels`, `get_channel_messages`, `_process_message`), with per-channel thread de-duplication. - `rag/svr/sync_data_source.py` - Implemented `Slack._generate()`. Credentials are loaded via `StaticCredentialsProvider` (the connector requires `slack_bot_token` and does not support `load_credentials`). Supports full reindex and incremental polling from `poll_range_start`, plus the optional channel filter. Modeled on the Confluence/Dropbox wrappers. - `SlackConnector` was already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `SLACK` data-source enum and added its form fields (Slack bot token + optional channel filter), default values, display metadata, and a Slack icon. 
- Added `slackDescription` / `slackBotTokenTip` / `slackChannelsTip` strings to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_slack_connector_unit.py`: unit tests covering credential loading (`load_credentials` raises, `set_credentials_provider` initializes clients, missing credentials raises) and document generation (standalone message + flattened thread, blob/extension/size_bytes/metadata, and the incremental poll time window). All 5 pass; `ruff check` is clean. Required Slack scopes: `channels:read`, `channels:history`, `users:read`. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 15:46:07 +08:00
web-dev0521	c4c4e228e3	feat: implement SharePoint data source connector (#15190 ) ### What problem does this PR solve? Closes #15189. RAGFlow shipped a SharePoint connector stub (`common/data_source/sharepoint_connector.py`) whose document-loading methods all returned `[]`, `SharePoint._generate()` was a `pass`, and SharePoint was commented out of the data-source settings UI. As a result there was no way to index files stored in SharePoint document libraries. This PR implements the connector end to end on top of Microsoft Graph (Office365-REST-Python-Client). Backend - `common/data_source/sharepoint_connector.py` - `load_credentials()` now builds the Graph client using an MSAL client-credentials token callback — the form `GraphClient` actually expects. (The previous stub passed a raw access-token string to `GraphClient(...)`, which is not how that client is driven.) Token acquisition is lazy, so credential loading does no network call. - `validate_connector_settings()` resolves the configured site via Graph. - `load_from_checkpoint()` is now a generator that enumerates every document library under the site, walks folders depth-first, downloads each file, and yields blob-based `Document` objects (`extension` / `blob` / `size_bytes` / `doc_updated_at`). Incremental syncs are bounded by file `lastModifiedDateTime`. Per-file errors are surfaced as `ConnectorFailure` rather than aborting the run. - `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument` batches (no downloads) and the checkpoint helpers return proper checkpoints. - ACL → `ExternalAccess` mapping is intentionally left best-effort (`load_from_checkpoint_with_perm_sync` delegates to the standard load) because the sync pipeline does not currently persist `ExternalAccess`; this can be extended once that plumbing exists. 
- `rag/svr/sync_data_source.py` - Implemented `SharePoint._generate()` using the existing `CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google Drive), supporting full reindex and incremental polling from `poll_range_start`. - `SharePointConnector` is already exported from `common/data_source/__init__.py`. Frontend (`web/`) - Enabled the `SHAREPOINT` data-source enum and added its form fields (`site_url`, `tenant_id`, `client_id`, `client_secret`), default values, display metadata, and a SharePoint icon. - Added `sharepointDescription` / `sharepointSiteUrlTip` to `en.ts` and `zh.ts`. Tests - `test/unit_test/data_source/test_sharepoint_connector_unit.py`: mock-based unit tests covering credential loading (incomplete creds raise, happy path sets the Graph client, fetch-without-creds raises), drive traversal + file download, incremental `lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass; `ruff check` is clean. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-28 13:26:08 +08:00
dripsmvcp	ce9a4425d2	fix(imap): handle multi-address headers in _parse_singular_addr (#15006 ) Replace the RuntimeError with a warning + first-address fallback so a single email whose From header contains multiple addresses no longer crashes the entire IMAP sync task. Also add regression tests covering: - #14963: RFC 5322 quoted display names with commas (e.g. "Schlüter, Sabine" <s@x>) parsed as one address, not two. - #14964: multi-address headers warn instead of raising. Closes #14964 Refs #14963	2026-05-21 15:37:02 +08:00
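The warning-plus-fallback behaviour described in this commit can be sketched with the standard library's `email.utils.getaddresses`. The function name `parse_first_addr` is hypothetical and this is not the body of `_parse_singular_addr`; it only illustrates the two cases the regression tests cover.

```python
import logging
from email.utils import getaddresses
from typing import Optional, Tuple

logger = logging.getLogger(__name__)


def parse_first_addr(header: str) -> Optional[Tuple[str, str]]:
    """Return (display_name, address) for the first address in a header value."""
    # getaddresses understands quoted display names, so a comma inside
    # quotes does not split one address into two.
    parsed = [(name, addr) for name, addr in getaddresses([header]) if addr]
    if not parsed:
        return None
    if len(parsed) > 1:
        # Warn and fall back to the first address instead of raising.
        logger.warning("expected one address, got %d; using the first", len(parsed))
    return parsed[0]
```

A header such as `"Schlüter, Sabine" <s@example.com>` parses as a single address, while `a@example.com, b@example.com` logs a warning and returns the first entry.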
Wang Qi	6ce76e6799	Fix discord async issue (#15054 ) ### What problem does this PR solve? RuntimeError: Cannot run the event loop while another loop is running ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-20 19:21:19 +08:00
Ahmad Intisar	e994051eb9	Feature/generic api connector (#13545 ) # feat: Add Generic REST API Connector ## What problem does this PR solve? RAGFlow supports many specific data source connectors (MySQL, Slack, Google Drive, etc.), but there was no way to connect an arbitrary REST API as a data source. Users with custom or third-party APIs had to write a new connector class for each one. This PR adds a generic, configuration-driven REST API connector that lets users connect any REST API as a data source entirely through the UI — no code changes needed per API. --- ## Features ### Core Connector (`common/data_source/rest_api_connector.py`) - Implements `LoadConnector` and `PollConnector` interfaces for full and incremental sync - Configurable authentication: None, API Key (custom header), Bearer Token, Basic Auth - Pluggable pagination: Page-based, Offset-based, Cursor-based, or None - Smart page-size inference from user's query parameters to avoid duplicate/conflicting params - Configurable request delay between pages to prevent API rate limiting - Auto-detection of the items array in JSON responses (`items`, `results`, `data`, `records`, or first list found) - Advanced field mapping with dot-notation (`country.name`), array wildcards (`newsType[].name`), type hints, and default values - Optional content template rendering (`"Title: {title}\nBody: {body}"`) - HTML stripping for content fields - Stable document IDs via `hash128` from a configurable ID field or auto-generated from item content - Pydantic configuration schema with automatic coercion of UI string inputs to dicts/lists ### Backend Registration (`rag/svr/sync_data_source.py`, `common/constants.py`, `common/data_source/config.py`) - `REST_API` sync class wired into RAGFlow's `func_factory` - Full sync (`load_from_state`) and incremental polling (`poll_source`) support - Credentials and config passed from task to connector following existing patterns (MySQL, SeaFile, etc.) 
### Test Connection Endpoint (`api/apps/connector_app.py`) - `POST /v1/connector/<id>/test` validates config schema, authentication, and API connectivity without triggering a sync - Clear error messages for auth failures vs. config issues ### Frontend UI (`web/src/pages/user-setting/data-source/constant/`) - Postman-style configuration: Base URL, Query Parameters (key=value per line), Auth, Content Fields, Metadata Fields, Pagination Type - Auth-type-aware form: fields for API key header/value, Bearer token, or Basic username/password appear only when relevant - Advanced Settings toggle for: Custom Headers, Max Pages, Request Delay, Poll Timestamp Field, Request Body (POST) - Connector icon (SVG) and i18n strings (English) - "Test Connection" button to validate before syncing --- ## Controls & Safety - Configurable max pages safety cap (default: 1000, adjustable in UI) - Configurable request delay between pages (default: 0.5s, adjustable in UI) - Auth errors (401/403) fail immediately without retries; transient errors retry with exponential backoff - Diagnostic logging: auth setup confirmation, request details on failure, content field extraction status --- ## Type of change - [x] New Feature (non-breaking change which adds functionality) ## Visual Screenshots of Features <img width="482" height="510" alt="Screenshot 2026-03-11 at 5 19 52 PM" src="https://github.com/user-attachments/assets/dcb7ab4a-1622-44f3-bb02-d6f0527314c4" /> (Connector can be configured within the external data sources tab) Configuration Parameters: <img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 46 PM" src="https://github.com/user-attachments/assets/5e154e71-4ab5-4872-bfb2-04f02b73c18a" /> <img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 54 PM" src="https://github.com/user-attachments/assets/00cb14b7-0bcf-4b94-9d71-34e93369ecb2" /> Connection can be tested before attaching to dataset: <img width="981" height="681" alt="Screenshot 2026-03-11 at 5 21 40 PM" 
src="https://github.com/user-attachments/assets/aaa6eeeb-89a7-4349-bc34-2423bf8be9ee" /> Ingestion tested with API connector (works perfectly fine): <img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 22 30 PM" src="https://github.com/user-attachments/assets/afcd0d58-cadd-4152-badc-d2f14d96fbec" /> Search & Retrieval works as well with metadata flow: <img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 23 05 PM" src="https://github.com/user-attachments/assets/d41ee935-dcf7-4456-b317-22a76ca032c0" /> --------- Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-13 20:35:01 +08:00
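The dot-notation and array-wildcard field mapping listed under "Core Connector" can be illustrated as follows. This is a sketch only: `extract_field` is a hypothetical helper, and the connector's handling of type hints and nested wildcards may differ.

```python
from typing import Any


def extract_field(item: Any, path: str, default: Any = None) -> Any:
    """Resolve paths such as 'country.name' or 'newsType[].name' against a JSON item.

    A segment ending in '[]' fans out over a list; the result is then a list.
    """
    nodes = [item]
    saw_wildcard = False
    for part in path.split("."):
        is_wildcard = part.endswith("[]")
        name = part[:-2] if is_wildcard else part
        saw_wildcard = saw_wildcard or is_wildcard
        next_nodes = []
        for node in nodes:
            if not isinstance(node, dict) or name not in node:
                continue
            value = node[name]
            if is_wildcard:
                if isinstance(value, list):
                    next_nodes.extend(value)
            else:
                next_nodes.append(value)
        nodes = next_nodes
    if saw_wildcard:
        return nodes
    return nodes[0] if nodes else default
```

For instance, `extract_field(item, "newsType[].name")` collects the `name` of every element of `newsType`, while a missing scalar path falls back to the supplied default.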
Hunnyboy1217	782084780e	feat(connectors): ETag-based bypass for incremental S3 ingestion (#14628 ) (#14677 ) ### What problem does this PR solve? S3-family connector syncs currently re-download every in-window object just so we can compute `xxhash128(blob)` and compare against `Document.content_hash`. Anything that bumps `LastModified` without changing bytes (`aws s3 cp` touches, bucket re-encryption, etc.) pays full bandwidth and re-parses files that didn't actually change. #14628 covers the broader incremental-ingestion redesign; this PR is the first slice. The fix is a pre-listing short-circuit. `BlobStorageConnector` (S3 / R2 / GCS / OCI / S3-compat) now implements a new `FingerprintConnector` interface: `list_keys()` paginates `list_objects_v2` and yields `KeyRecord(key, fingerprint)` where `fingerprint = xxhash128(ETag)`. The orchestrator joins those against the connector's existing `{doc_id: content_hash}` map and only calls `get_value(key)` when the fingerprint differs. Unchanged keys are skipped entirely — no `GetObject`, no re-parse. No DDL. xxhash128(ETag) is 32 hex chars and reuses the existing `Document.content_hash` column per @yingfeng's suggestion; the connector decides at listing time whether to populate it. Local uploads and connectors that don't opt in fall through to the existing post-download `xxhash128(blob)` path with no behavior change. This is PR-1 of a 4-PR series — full design lives on #14628. Subsequent PRs extend tier 1 to local FS / WebDAV / Dropbox / Seafile / RDBMS (PR-2), wire up tier 2 cursor connectors with `SyncLogs.next_checkpoint` (PR-3), and unify deletion via `KeyRecord(deleted=True)` reconciliation (PR-4). Holding those back keeps this PR additive and reviewable on its own. 
#### Files touched - `common/data_source/models.py` — new `KeyRecord`; optional `fingerprint` on `Document` - `common/data_source/interfaces.py` — `IncrementalCapability` enum, `FingerprintConnector` ABC - `common/data_source/blob_connector.py` — `BlobStorageConnector` implements `FingerprintConnector`; per-object download factored into `_build_document_from_obj()` so `_yield_blob_objects`, `list_keys`, `get_value` all share it - `rag/svr/sync_data_source.py` — `_BlobLikeBase._fingerprint_filtered_generator` does the bypass loop; `_run_task_logic` plumbs `doc.fingerprint` into the upload dict - `api/db/services/document_service.py` — `list_id_content_hash_map_by_kb_and_source_type()` helper - `api/db/services/connector_service.py` + `file_service.py` — fingerprint flows through `duplicate_and_parse → upload_document` and lands in `content_hash` - `test/unit_test/common/test_blob_connector_fingerprint.py` — 14 tests covering ETag normalization (single-part, multipart, quoted, empty), `list_keys()` not calling `GetObject`, `get_value()` materializing with fingerprint, deterministic/stable fingerprints, and the bypass loop asserting `GetObject` is not called on a match #### Worth flagging for review Old `_BlobLikeBase._generate` called `poll_source(start, now)` with a `LastModified` window when `poll_range_start` was set. New code uses `_fingerprint_filtered_generator` (full bucket listing + fingerprint compare) outside of explicit `reindex=1`. Strictly better for unchanged-bucket cases since it skips `GetObject`, but it does mean every sync now does a full `list_objects_v2` paginate. Should still be cheap for most buckets — flagging in case anyone has a very large bucket where the time-window filter was meaningful. On migration: existing rows have `content_hash = xxhash128(blob)` from the old code. The first sync after this lands sees ETag-derived fingerprints that don't match, re-fetches every object once, and writes the new fingerprint. 
From the second sync onward the bypass works as expected. "Slow day one, fast every day after." A `fingerprint_backfill: trust` opt-out is sketched in the design doc but not in this PR. #### Test plan - [x] `uv run ruff check` — clean on all 8 touched files - [x] `uv run pytest test/unit_test/common/test_blob_connector_fingerprint.py -v` — 14 passed - [x] Broader unit-test suite — no regressions in anything I touched - [ ] Manual smoke against a real S3 bucket — configure a connector, run sync twice, expect the second sync to log `bypassed=N, fetched=0` and no `GetObject` calls in CloudTrail / bucket access logs - [ ] Manual smoke with `reindex=1` — confirm the full re-download path still works ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-05-09 20:03:56 +08:00
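The fingerprint bypass described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the PR: `KeyRecord` is a simplified stand-in, `fingerprint_etag` and `plan_fetches` are hypothetical names, the stored map is keyed by object key for brevity, and `hashlib.blake2b` with a 16-byte digest replaces `xxhash128` so the sketch runs on the standard library alone.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class KeyRecord:
    # Simplified stand-in for the KeyRecord(key, fingerprint) named in the commit.
    key: str
    fingerprint: str


def fingerprint_etag(etag: str) -> str:
    # Assumption: surrounding quotes are stripped before hashing.
    # blake2b(digest_size=16) yields 32 hex chars; it stands in for xxhash128.
    normalized = etag.strip().strip('"')
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=16).hexdigest()


def plan_fetches(listed: Iterable[KeyRecord], known: Dict[str, str]) -> List[str]:
    # Keep only keys whose fingerprint is missing from, or differs from, the stored map.
    return [rec.key for rec in listed if known.get(rec.key) != rec.fingerprint]


# Listing pass: only metadata (key + ETag) is read, no object bodies.
listing = [
    KeyRecord("a.pdf", fingerprint_etag('"etag-1"')),
    KeyRecord("b.pdf", fingerprint_etag('"etag-2-changed"')),
    KeyRecord("c.pdf", fingerprint_etag('"etag-3"')),
]
# Fingerprints stored by a previous sync.
stored = {
    "a.pdf": fingerprint_etag("etag-1"),
    "b.pdf": fingerprint_etag("etag-2"),
}
to_fetch = plan_fetches(listing, stored)
print(to_fetch)  # ['b.pdf', 'c.pdf']
```

Only `b.pdf` (changed ETag) and `c.pdf` (new key) would be downloaded; `a.pdf` is skipped without a body fetch.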
Jack Storment	59bb184e63	feat(moodle): support deleted-file sync (#14548 ) Fixes #14551 ### What problem does this PR solve? The Moodle connector did not let the sync runner clean up indexed documents that were deleted from the source. Other connectors such as dropbox, seafile, webdav, and rss already do this through a slim snapshot pass. This PR adds the same support for Moodle. When `sync_deleted_files` is on, the runner now asks the Moodle connector for a lightweight list of every module id that could be indexed. The runner then compares this list with the index and removes any indexed document whose id is not in the list. The slim pass does not download files. It only goes through courses and modules and yields ids. The id format matches the ids that the loader produces, so the match is exact. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Notes - `MoodleConnector` now also implements `SlimConnectorWithPermSync`. - New `retrieve_all_slim_docs_perm_sync` yields slim docs with the same ids the loader uses (`moodle_resource_<id>`, `moodle_forum_<id>`, `moodle_page_<id>`, `moodle_book_<id>`, `moodle_assign_<id>`, `moodle_quiz_<id>`). - The `Moodle` sync class now returns `(document_generator, file_list)` so the runner can do the cleanup. If the slim snapshot fails, `file_list` is set back to `None` and the run continues without cleanup. - The web data source map exposes `syncDeletedFiles` for Moodle so the option shows up in the UI. ### How was this tested? - `ruff check` passes on the changed Python files. - Manual review of the produced slim ids against the ids the loader builds in `_process_resource`, `_process_forum`, `_process_page`, `_process_book`, and `_process_activity`. - Behavior parity with the merged dropbox (#14476), seafile (#14499), webdav (#14491), and rss (#14493) PRs.	2026-05-07 17:44:46 +08:00
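The cleanup step that the slim snapshot enables amounts to a set difference, with one guard for a failed snapshot. The sketch below assumes a hypothetical helper `find_stale_ids`; it is not the runner's actual function, only an illustration of the comparison the commit describes.

```python
from typing import Iterable, Optional, Set


def find_stale_ids(
    indexed_ids: Iterable[str],
    snapshot_ids: Optional[Iterable[str]],
) -> Set[str]:
    # None means the slim snapshot failed: skip cleanup rather than
    # delete documents based on a partial or missing listing.
    if snapshot_ids is None:
        return set()
    # Anything indexed locally but absent from the remote snapshot is stale.
    return set(indexed_ids) - set(snapshot_ids)


indexed = {"moodle_resource_1", "moodle_page_2", "moodle_quiz_3"}
snapshot = ["moodle_resource_1", "moodle_quiz_3"]

stale = find_stale_ids(indexed, snapshot)
skipped = find_stale_ids(indexed, None)
print(stale)    # {'moodle_page_2'}
print(skipped)  # set()
```

Note the difference between `None` and an empty snapshot: an empty list is a valid answer meaning that nothing remains remotely, so every indexed id would be reported as stale.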
Magicbook1108	911671cef0	Feat: enable sync deleted files for RDBMS & fix remove last file issue (#14615 ) ### What problem does this PR solve? Feat: enable sync deleted files for RDBMS & fix remove last file issue ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-05-07 13:31:05 +08:00
buua436	5672be0652	Feat: add IMAP deleted document sync (#14539 ) ### What problem does this PR solve? add IMAP deleted document sync ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-06 14:06:46 +08:00
NeedmeFordev	89961962c0	feat(dingtalk-ai-table): support deleted-file sync via slim snapshot (#14525 ) ### What problem does this PR solve? Incremental DingTalk AI Table (Notable) sync did not reconcile rows removed on the remote side with documents already in the knowledge base. This follows the coordinated datasource work in #14362 (“sync deleted files”). This PR adds a full slim snapshot (`retrieve_all_slim_docs_perm_sync`) that lists current record IDs for all sheets without building document blobs, using the same logical document IDs as full ingest (`dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}`). When `sync_deleted_files` is enabled on incremental runs, `DingTalkAITable._generate` returns `(document_generator, file_list)` so `SyncBase` can run `cleanup_stale_documents_for_task` and remove KB rows that no longer exist remotely. Design notes: - `_document_id` centralizes the ID string so slim snapshots and `_convert_record_to_document` stay aligned with `hash128(doc.id)` semantics used during ingestion/cleanup. - `end_ts` is captured before building `file_list`, then `poll_source` uses the same upper bound (consistent with other Dropbox-style connectors). - `batch_size` from connector config is coerced to a positive `int` before constructing the connector. - Slim snapshot failures are caught in `_generate`; `file_list` is set to `None` so cleanup is skipped rather than running on partial/error state. 
### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Files changed (summary)

| Area | Change |
|------|--------|
| `common/data_source/dingtalk_ai_table_connector.py` | `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `_document_id` shared with document conversion |
| `rag/svr/sync_data_source.py` | `DingTalkAITable._generate`: slim snapshot + tuple return; `batch_size` validation; shared `end_ts` with `poll_source` |
| `web/src/pages/user-setting/data-source/constant/index.tsx` | `syncDeletedFiles` for DingTalk AI Table in `DataSourceFeatureVisibilityMap` |

Closes / relates to: #14362	2026-05-06 14:06:23 +08:00
bitloi	a69e0c73c7	feat(rss): support deleted-file sync (#14493 ) ### What problem does this PR solve? Partially addresses #14362. This PR enables syncing deleted files for RSS data sources. Previously, RSS incremental sync only returned feed entries whose timestamps were inside the poll window. If an entry was removed from the RSS feed, RAGFlow had no full current RSS snapshot to pass into the shared stale-document cleanup path, so the deleted remote entry could remain in the knowledge base. This PR: - adds `retrieve_all_slim_docs_perm_sync()` to `RSSConnector` - reuses the same `rss:<md5(stable_key)>` document ID derivation used by normal RSS ingest - returns `(document_generator, file_list)` for incremental RSS sync when `sync_deleted_files` is enabled - captures the poll end timestamp before snapshot/poll so cleanup does not race against the same sync window - adds start/end logs around RSS slim snapshot collection - exposes the deleted-file sync toggle for RSS in the data source UI Per maintainer request on related datasource PRs, this PR contains no test-case changes. Local verification was run with an external script. Validation: - `uv run ruff check common/data_source/rss_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` - `git diff --check` - `uv run python /tmp/verify_rss_deleted_sync.py --repo /root/74/ragflow` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 18:56:13 +08:00
NeedmeFordev	bedf9592ef	feat(webdav): support deleted-file sync via slim snapshot (#14491 ) ## What problem does this PR solve? Incremental WebDAV sync only ingested files whose modification time fell inside the poll window; documents removed on the WebDAV server were never removed from the knowledge base. This aligns with [#14362](https://github.com/infiniflow/ragflow/issues/14362) (coordinated datasource “sync deleted files” work). This PR adds a full-tree slim snapshot (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote paths without downloading file contents, using the same logical document IDs as full ingest (`webdav:{base_url}:{file_path}`). When `sync_deleted_files` is enabled on incremental runs, sync returns `(document_generator, file_list)` so `SyncBase` runs `cleanup_stale_documents_for_task` and removes KB rows no longer present remotely. Design notes: - `_list_files_recursive` gains `filter_by_mtime`: snapshot passes `filter_by_mtime=False` (full tree under `remote_path`); `poll_source` keeps mtime-window filtering as before. - Slim snapshot applies the same extension and `size_threshold` rules as `_yield_webdav_documents` so retain IDs match what would be indexed. - `end_ts` is captured before building `file_list`, then `poll_source` uses the same upper bound (consistent with Dropbox-style connectors). ## Type of change - [x] New Feature (non-breaking change which adds functionality) ## Files changed

| Area | Change |
|------|--------|
| `common/data_source/webdav_connector.py` | `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `filter_by_mtime` on `_list_files_recursive` |
| `rag/svr/sync_data_source.py` | WebDAV `_generate`: `file_list` + tuple return; pass `batch_size` from connector config |
| `web/src/pages/user-setting/data-source/constant/index.tsx` | `syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` |

	2026-04-30 17:26:27 +08:00
bitloi	17eda04b8d	feat(zendesk): support deleted-file sync (#14487 ) ### What problem does this PR solve? Refs #14362. This PR enables syncing deleted files for Zendesk data sources. Previously, Zendesk incremental sync never returned a slim remote snapshot to the shared stale-document cleanup path, so deleted remote Zendesk records could remain in RAGFlow. The existing Zendesk slim snapshot also included records that ingestion intentionally skips, such as draft articles, articles without bodies, skipped-label articles, empty-body articles, and tickets with `status == "deleted"`. This PR: - exposes the deleted-file sync option for Zendesk in the data source UI - returns Zendesk slim snapshots during incremental sync when `sync_deleted_files` is enabled - reuses Zendesk indexability rules so cleanup compares against the same records ingestion can materialize - adds start/end logs around Zendesk slim snapshot collection for operational visibility Per maintainer request, this PR contains no test-case changes. Manual verification recording will be provided separately. Validation: - `uv run ruff check common/data_source/zendesk_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-04-30 14:44:05 +08:00
bitloi	8f75e52bbf	feat(asana): support deleted-file sync (#14468 ) ### What problem does this PR solve? Partially addresses #14362. Adds deleted-file sync support for the Asana data source. Asana already indexes task attachments as documents, but it did not provide the slim document snapshot required by stale-document reconciliation, and the sync wrapper never returned a `file_list` for cleanup. This PR: - adds `retrieve_all_slim_docs_perm_sync()` to `AsanaConnector` - builds slim IDs with the same `asana:{task_id}:{attachment_gid}` format used by indexed documents - avoids downloading attachment blobs during the snapshot - aborts the snapshot if Asana API errors occur, preventing partial snapshots from deleting valid local docs - captures the incremental poll end time before snapshotting and makes `poll_source()` respect that boundary - exposes the deleted-file sync toggle for Asana in the data source UI Per maintainer request, this PR contains no test-case changes. Manual verification recording will be provided separately. Validation: - `uv run ruff check common/data_source/asana_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` - `git diff --check` ### Type of change - [x] New Feature	2026-04-30 14:41:36 +08:00
NeedmeFordev	2932b65da6	feat(seafile): support deleted-file sync via slim snapshot (#14499 ) ### What problem does this PR solve? Incremental Seafile sync only ingests files whose modification time falls in the poll window; documents removed in Seafile were never removed from the knowledge base. This contributes to [#14362](https://github.com/infiniflow/ragflow/issues/14362) (datasource “sync deleted files” coordination). This PR adds a slim snapshot (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote file IDs without downloading content, using the same logical IDs as full ingest (`seafile:{repo_id}:{file_id}`). When `sync_deleted_files` is enabled on incremental runs, `SeaFile._generate` returns `(document_generator, file_list)` so `SyncBase` can run `cleanup_stale_documents_for_task` and remove stale KB documents. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### What changed - `common/data_source/seafile_connector.py`: `SeaFileConnector` implements `SlimConnectorWithPermSync`; `_list_files_recursive(..., filter_by_mtime=...)` supports full-tree listing for snapshots; `retrieve_all_slim_docs_perm_sync()` reuses the same library/root scan as ingest and applies the same size ceiling; logging for snapshot start/end and counts. - `rag/svr/sync_data_source.py`: `SeaFile._generate` validates `batch_size`, captures `end_ts` before snapshot + `poll_source`, wraps slim retrieval in `try`/`except` ( `file_list = None` on failure so ingest continues), returns `(generator, file_list)`. - `web/src/pages/user-setting/data-source/constant/index.tsx`: `syncDeletedFiles` for Seafile in `DataSourceFeatureVisibilityMap`.	2026-04-30 12:05:12 +08:00
Magicbook1108	de8c6ad0f3	Feat: enable sync deleted file for Discord (#14451 ) ### What problem does this PR solve? Feat: enable sync deleted file for Discord ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:40 +08:00
bitloi	2bc8c6d35e	feat(dropbox): support deleted-file sync (#14476 ) ### What problem does this PR solve? Partially addresses #14362 by adding deleted-file sync support for the Dropbox data source. Dropbox previously did not provide the slim current-file snapshot required by stale document reconciliation, and its sync runner returned only document batches. As a result, enabling deleted-file sync could not remove local documents that had been deleted from Dropbox. This PR: - Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`. - Reuses Dropbox metadata traversal to collect current remote file IDs without downloading file contents. - Wires incremental Dropbox sync to return `(document_generator, file_list)` when `sync_deleted_files` is enabled. - Enables the deleted-file sync toggle for Dropbox in the data source settings UI. - Adds regression coverage for slim snapshots, nested folders, paginated listings, duplicate filenames, and full reindex behavior. Tests: - `uv run pytest test/unit_test/common/test_dropbox_connector.py -q` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `uv run pytest test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py -q` - `uv run ruff check common/data_source/dropbox_connector.py rag/svr/sync_data_source.py test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:11 +08:00
Magicbook1108	db1a73b255	Feat: enable sync deleted files in gitlab (#14481 ) ### What problem does this PR solve? Feat: enable sync deleted files in gitlab ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:04:10 +08:00
Magicbook1108	e0b3070012	Feat: enable sync deleted files for Gmail && fix google drive issues (#14462 ) ### What problem does this PR solve? Feat: enable sync deleted files for Gmail && fix google drive issues ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: bill <yibie_jingnian@163.com> Co-authored-by: balibabu <assassin_cike@163.com>	2026-04-29 17:03:56 +08:00
Magicbook1108	3b7a6eaa6c	Feat: sync deleted files in Bitbucket (#14450 ) ### What problem does this PR solve? Feat: sync deleted files in Bitbucket ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 11:29:17 +08:00
Magicbook1108	0d18b293f5	Fix: enable sync deleted file in airtable (#14438 ) ### What problem does this PR solve? Fix: enable sync deleted file in airtable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 20:09:08 +08:00
Magicbook1108	18fbfafca6	Feat: enable sync deleted files for more connectors (#14353 ) ### What problem does this PR solve? Feat: enable sync deleted files for connectors ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-28 15:07:14 +08:00
Xing Hong	fb95136f39	Fix: validate URL scheme and resolved IP before crawling to prevent SSRF (#14090 ) ### What problem does this PR solve? The POST /upload_info?url=<url> endpoint accepted a user-supplied URL and passed it directly to AsyncWebCrawler without any validation. There were no restrictions on URL scheme, destination hostname, or resolved IP address. This allowed any authenticated user to instruct the server to make outbound HTTP requests to internal infrastructure — including RFC 1918 private networks, loopback addresses, and cloud metadata services such as http://169.254.169.254 — effectively using the server as a proxy for internal network reconnaissance or credential theft. This PR adds an SSRF guard (_validate_url_for_crawl) that runs before any crawl is initiated. It enforces an allowlist of safe schemes (http/https), resolves the hostname at validation time, and rejects any URL whose resolved IP falls within a private or reserved network range. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-25 14:30:15 +08:00
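A guard of the kind this commit describes (scheme allowlist, resolve the hostname, reject private or reserved addresses) might look roughly like the sketch below. The function name `is_url_safe_to_fetch` is hypothetical and this is not the body of `_validate_url_for_crawl`; it only illustrates the three checks using the standard library.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def is_url_safe_to_fetch(url: str) -> bool:
    parsed = urlparse(url)
    # 1. Scheme allowlist and a non-empty hostname.
    if parsed.scheme not in ALLOWED_SCHEMES or not parsed.hostname:
        return False
    # 2. Resolve the hostname at validation time.
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    if not infos:
        return False
    # 3. Reject if ANY resolved address is non-public.
    for info in infos:
        # Drop an IPv6 zone id such as "%eth0" before parsing.
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        if (
            ip.is_private
            or ip.is_loopback
            or ip.is_link_local
            or ip.is_reserved
            or ip.is_multicast
            or ip.is_unspecified
        ):
            return False
    return True


print(is_url_safe_to_fetch("http://169.254.169.254/latest/meta-data/"))  # False
print(is_url_safe_to_fetch("file:///etc/passwd"))                        # False
```

A check like this only covers the address seen at validation time; if the later request resolves the hostname again, the connection may go to a different address, so the validated IP generally needs to be reused for the actual request.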
Idriss Sbaaoui	ca01c7a745	Fix blob sync: skip unsupported files before download (#14357 ) ### What problem does this PR solve? Blob storage sync was downloading unsupported files first and rejecting them later, which wasted bandwidth and made sync slower. This PR skips unsupported extensions before download and applies `allow_images` in blob sync. fixes #14338 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-24 19:22:32 +08:00
corevibe555	e7d044413f	Fix: Google Drive connector missing new files after initial sync (#13943 ) Closes https://github.com/infiniflow/ragflow/issues/13939 ## What problem does this PR solve? The Google Drive connector fails to detect new files after the initial sync (#13939). The root cause is that `generate_time_range_filter()` applies a strict `modifiedTime > poll_range_start` cutoff when querying the Google Drive API. Files uploaded to Google Drive that retain their original `modifiedTime` (common behavior) get silently excluded if their timestamp predates the last sync's cutoff. Unlike the Confluence and Jira connectors which use a configurable time buffer (`CONFLUENCE_SYNC_TIME_BUFFER_SECONDS`) to offset `poll_range_start` backward, the Google Drive connector had no such mechanism — resulting in a razor-sharp timestamp boundary with zero tolerance for overlap. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ## Summary * New Features * Added a configurable time buffer for Google Drive synchronization to address timing delays and improve sync reliability. * Improved file detection logic to include recently created files alongside modified ones, reducing missed synchronizations.	2026-04-10 11:39:19 +08:00
Magicbook1108	8d52ef2893	Feat: enable sync deleted files for connector (#14000 ) ### What problem does this PR solve? Feat: enable sync deleted files for connectors, starting with GitHub. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ## Summary by CodeRabbit * New Features * Added "sync deleted files" feature for data sources, enabling automatic removal of files deleted from the source system. * Added multilingual support for the new sync deleted files setting across multiple languages. * UI Improvements * Improved checkbox form field rendering and layout. * Enhanced full-width display for authentication token input fields.	2026-04-09 16:40:14 +08:00
Ricardo-M-L	424aee5bec	fix: correct typos in code comments, docstrings and docs (#13931 ) ## Summary - Fix `a image` → `an image` in README and log message - Fix `colomn` → `column` in table structure recognizer comment - Fix `formated` → `formatted` in confluence connector docstring - Fix `tabel of content` → `table of contents` in TOC prompt ## Test plan - [ ] Documentation and comment changes, no functional impact 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuj <yuj@ztjzsoft.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-04-07 13:05:39 +08:00
NeedmeFordev	6b7989b4b4	Add file type validation (#13802 ) ### What problem does this PR solve? This PR fixes WebDAV sync behavior for unsupported file types ([#13795](https://github.com/infiniflow/ragflow/issues/13795)). Previously, the WebDAV connector selected files primarily by modified time (and size threshold) and could still pass unsupported extensions into the download/document-generation path. This caused unnecessary processing and inconsistent behavior compared with connectors that validate file type earlier. This change adds extension validation in two places: 1. Early filter during recursive listing to skip unsupported files before they enter the download flow. 2. Defensive filter before download/document creation to prevent unsupported files from being processed if any listing edge case slips through. It also wires `allow_images` into the WebDAV sync path so image extension handling follows connector policy. Scope is intentionally limited to WebDAV for a focused bug-fix PR. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### How was this tested? - Manual verification with mixed file types under the configured WebDAV path: - supported: `.pdf`, `.txt`, `.md` - unsupported: `.exe`, `.bin`, `.dat` - Triggered full sync and polling sync. - Confirmed unsupported files are skipped before download. - Confirmed supported files are still indexed normally. - Confirmed image handling follows `allow_images` setting. Fixes: #13795	2026-04-02 14:12:27 +08:00
KeJun	cb78ce0a7b	feat: support rss datasource (#13721 ) ### What problem does this PR solve? Supporting public RSS/Atom feed URLs as data sources for RagFlow. link https://github.com/infiniflow/ragflow/issues/12313 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-27 22:58:44 +08:00
NeedmeFordev	840cc8fbe9	fix(asana): use project memberships endpoint for project IDs in connector (#13746 ) ### What problem does this PR solve? Fixes a bug in the Asana connector where providing `Project IDs` caused sync to fail with: `project_membership: Not a recognized ID: <PROJECT_GID>` Root cause: the connector called `get_project_membership(project_gid)`, but that API expects a project membership gid, not a project gid. This PR switches to the correct project-scoped API and adds regression tests. Fixes: [#13669](https://github.com/infiniflow/ragflow/issues/13669) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Changes made - Updated `common/data_source/asana_connector.py`: - Replaced `get_project_membership(pid, ...)` with `get_project_memberships_for_project(pid, ...)` - Trimmed and filtered `asana_project_ids` parsing to avoid empty/whitespace IDs - Normalized `asana_team_id` by trimming whitespace - Used safer access for membership email extraction (`m.get("user")`) - Added `test/unit_test/common/test_asana_connector.py`: - Verifies the correct project-membership API method is called - Verifies empty `project_ids` path returns workspace emails - Verifies project/team input normalization behavior ### Compatibility / risk - Non-breaking bug fix - No API contract changes - Existing behavior for empty `Project IDs` remains unchanged	2026-03-24 20:21:31 +08:00
NeedmeFordev	c3f79dbcb0	fix(jira): prevent missed incremental updates after issue edits (#13674 ) ### What problem does this PR solve? Fixes [#13505](https://github.com/infiniflow/ragflow/issues/13505): Jira incremental sync could miss updated issues after initial sync, especially near time boundaries. Root cause: - Jira JQL uses minute-level precision for `updated` filters. - Incremental windows had no overlap buffer, so boundary updates could be skipped. - Sync log cursor tracking used a backward-facing update for `poll_range_start`. - Existing-doc updates in `upload_document` lacked a KB ownership guard for doc-id collisions. What changed: - Added Jira incremental overlap buffer (`time_buffer_seconds`, defaulting to `JIRA_SYNC_TIME_BUFFER_SECONDS`) when building JQL lower-bound time. - Preserved second-level post-filtering to avoid duplicate reprocessing while still catching boundary updates. - Improved Jira sync logging to include start/end window and overlap configuration. - Updated sync cursor tracking in `increase_docs` to keep `poll_range_start` moving forward with max update time. - Added KB ID safety check before updating existing document records in `upload_document`. Verification performed: - Python syntax compile checks passed for modified files. - Manual verification flow: 1. Run full Jira sync. 2. Edit an already-indexed Jira issue. 3. Run next incremental sync. 4. Confirm updated content is re-ingested into KB. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-18 23:31:05 +08:00
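The overlap buffer plus second-level post-filter from this commit can be illustrated as follows. This is a sketch under assumptions: `jql_lower_bound` and `post_filter` are hypothetical helper names, the buffer value is arbitrary, and the inclusive `>=` comparison in the post-filter is a guess at the boundary rule rather than the connector's confirmed behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple


def jql_lower_bound(poll_range_start: datetime, time_buffer_seconds: int) -> str:
    # Step back by the buffer, then format at minute precision,
    # since JQL `updated` filters only resolve to the minute.
    buffered = poll_range_start - timedelta(seconds=time_buffer_seconds)
    return buffered.strftime("%Y-%m-%d %H:%M")


def post_filter(
    issues: Iterable[Tuple[str, datetime]], poll_range_start: datetime
) -> List[str]:
    # Second-level filter: drop issues the widened query pulled in
    # but that were already processed before the real window start.
    return [key for key, updated in issues if updated >= poll_range_start]


start = datetime(2026, 3, 18, 10, 15, 42, tzinfo=timezone.utc)
bound = jql_lower_bound(start, time_buffer_seconds=60)
print(f'updated >= "{bound}"')  # updated >= "2026-03-18 10:14"

fetched = [
    ("PROJ-1", datetime(2026, 3, 18, 10, 14, 50, tzinfo=timezone.utc)),
    ("PROJ-2", datetime(2026, 3, 18, 10, 15, 50, tzinfo=timezone.utc)),
]
kept = post_filter(fetched, start)
print(kept)  # ['PROJ-2']
```

Without the buffer, a query bounded at `10:15` could not distinguish an edit at 10:15:50 from one at 10:15:10; widening the query and filtering afterwards at second precision avoids both the miss and the duplicate.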
Magicbook1108	675810e0cf	Refact: optimize confluence performance (#13497 ) ### What problem does this PR solve? Refact: optimize confluence performance #13494 ### Type of change - [x] Refactoring	2026-03-10 15:02:24 +08:00
Heyang Wang	c217b8f3d8	Feat: add DingTalk AI Table connector and integration for data synch… (#13413 ) ### What problem does this PR solve? Add DingTalk AI Table connector and integration for data synchronization Issue #13400 ### Type of change - [x] New Feature (non-breaking change which adds functionality) Co-authored-by: wangheyang <wangheyang@corp.netease.com>	2026-03-06 21:13:23 +08:00
tunsuy	020068dd16	Fix: preserve field boundaries in chunked documents from MySQL… (#13369 ) ### What problem does this PR solve? When multiple columns are used as content columns in RDBMS connector, the generated document text gets chunked by TxtParser which strips newline delimiters during merge. This causes field names and values from different columns to be concatenated without any separator, making the content unreadable. Changes: - txt_parser.py: restore newline separator when merging adjacent text segments within a chunk, so that split sections are not directly concatenated - rdbms_connector.py: use double newline between fields and place field value on a new line after the field name bracket, giving TxtParser clearer boundaries to work with Closes #13001 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: tunsuytang <tunsuytang@tencent.com>	2026-03-04 21:42:02 +08:00
Yesid Cano Castro	d1afcc9e71	feat(seafile): add library and directory sync scope support (#13153 ) ### What problem does this PR solve? The SeaFile connector currently synchronises the entire account — every library visible to the authenticated user. This is impractical for users who only need a subset of their data indexed, especially on large SeaFile instances with many shared libraries. This PR introduces granular sync scope support, allowing users to choose between syncing their entire account, a single library, or a specific directory within a library. It also adds support for SeaFile library-scoped API tokens (`/api/v2.1/via-repo-token/` endpoints), enabling tighter access control without exposing account-level credentials. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Test
```
from seafile_connector import SeaFileConnector
import logging
import os

logging.basicConfig(level=logging.DEBUG)

# Connection settings are read from the environment.
URL = os.environ.get("SEAFILE_URL", "https://seafile.example.com")
TOKEN = os.environ.get("SEAFILE_TOKEN", "")
REPO_ID = os.environ.get("SEAFILE_REPO_ID", "")
SYNC_PATH = os.environ.get("SEAFILE_SYNC_PATH", "/Documents")
REPO_TOKEN = os.environ.get("SEAFILE_REPO_TOKEN", "")


def _test_scope(scope, repo_id=None, sync_path=None):
    print(f"\n{'=' * 50}")
    print(f"Testing scope: {scope}")
    print(f"{'=' * 50}")
    creds = {"seafile_token": TOKEN} if TOKEN else {}
    # A library-scoped token only applies to library and directory scopes.
    if REPO_TOKEN and scope in ("library", "directory"):
        creds["repo_token"] = REPO_TOKEN
    connector = SeaFileConnector(
        seafile_url=URL,
        batch_size=5,
        sync_scope=scope,
        include_shared=False,
        repo_id=repo_id,
        sync_path=sync_path,
    )
    connector.load_credentials(creds)
    connector.validate_connector_settings()
    count = 0
    for batch in connector.load_from_state():
        for doc in batch:
            count += 1
            print(f"  [{count}] {doc.semantic_identifier} "
                  f"({doc.size_bytes} bytes, {doc.extension})")
    print(f"\n-> {scope} scope: {count} document(s) found.\n")


# 1. Account scope
if TOKEN:
    _test_scope("account")
else:
    print("\nSkipping account scope (set SEAFILE_TOKEN)")

# 2. Library scope
if REPO_ID and (TOKEN or REPO_TOKEN):
    _test_scope("library", repo_id=REPO_ID)
else:
    print("\nSkipping library scope (set SEAFILE_REPO_ID + token)")

# 3. Directory scope
if REPO_ID and SYNC_PATH and (TOKEN or REPO_TOKEN):
    _test_scope("directory", repo_id=REPO_ID, sync_path=SYNC_PATH)
else:
    print("\nSkipping directory scope (set SEAFILE_REPO_ID + SEAFILE_SYNC_PATH + token)")
```
	2026-02-28 10:24:28 +08:00
Ahmad Intisar	99d1c9725c	Bug mysql connector empty content resolved: Semantic ID Issue (#13206 ) The RDBMS (MySQL/PostgreSQL) connector generates document filenames using the first 100 characters of the content column (semantic_identifier). When the content contains newline characters (\n), the resulting filename includes those newlines, for example: Category: غير صحيح كليًا\nTitle: تفنيد حقائق....txt RAGFlow's filename_type() function uses re.match(r".*\.txt$", filename) to detect file types, but . does not match newline characters by default in Python regex. This causes the regex to fail, returning FileType.OTHER, which triggers: raise RuntimeError("This type of file has not been supported yet!") As a result, all documents synced via the MySQL/PostgreSQL connector are silently discarded. The sync logs report success (e.g., "399 docs synchronized"), but zero documents actually appear in the dataset. This is the root cause of issue #13001. Root cause trace: rdbms_connector.py → _row_to_document() sets semantic_identifier from raw content (may contain \n); connector_service.py → duplicate_and_parse() uses semantic_identifier as the filename; file_service.py → upload_document() calls filename_type(filename); file_utils.py → filename_type() regex .*\.txt$ fails on newlines → returns FileType.OTHER; upload_document() raises "This type of file has not been supported yet!" Fix: Sanitize the semantic_identifier in _row_to_document() by replacing newlines and carriage returns with spaces before truncating to 100 characters. Relates to: #13001, #12817 Type of change Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-02-25 12:55:04 +08:00
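The failure mode described in this commit can be reproduced in isolation. The sketch below is illustrative only: `looks_like_txt` and `sanitize_identifier` are hypothetical helpers, not RAGFlow's `filename_type()` or `_row_to_document()`; they assume only the `.*\.txt$` pattern and the "replace newlines, then truncate to 100 characters" fix quoted in the message.

```python
import re


def looks_like_txt(filename: str) -> bool:
    # Hypothetical stand-in for the extension check quoted in the commit message.
    # Without re.DOTALL, "." does not match "\n", so ".*" cannot span a newline.
    return re.match(r".*\.txt$", filename) is not None


def sanitize_identifier(raw: str, max_len: int = 100) -> str:
    # Assumed shape of the fix: replace CR/LF with spaces, then truncate.
    return raw.replace("\r", " ").replace("\n", " ")[:max_len]


raw = "Category: A\nTitle: B"
before = looks_like_txt(raw + ".txt")
after = looks_like_txt(sanitize_identifier(raw) + ".txt")
print(before, after)  # False True
```

Sanitizing at the point where the identifier is built keeps every downstream consumer of the filename unaffected, rather than relaxing the regex in one place.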
ksufer	5a8fa7cf31	Fix #13119 : Use email.utils to fix IMAP parsing for names with commas (#13120 ) ## Type of Change - [x] Bug fix ## Description Closes #13119 The current IMAP connector uses `split(',')` to parse email headers, which crashes when a sender's display name contains a comma inside quotes (e.g., `"Doe, John" <john@example.com>`). This PR replaces the manual string splitting with Python's standard `email.utils.getaddresses`. This correctly handles RFC 5322 quoted strings and prevents the `RuntimeError: Expected a singular address`. ## Checklist - [x] I have checked the code and it works as expected. --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-02-24 19:18:55 +08:00
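A minimal sketch of why comma splitting breaks on quoted display names while `email.utils.getaddresses` does not. The header value is an invented example, and this is not the connector's actual parsing code.

```python
from email.utils import getaddresses

# Invented header: the first display name contains a comma inside quotes.
header = '"Doe, John" <john@example.com>, Jane Roe <jane@example.com>'

# Naive splitting cuts the quoted display name in half.
naive = [part.strip() for part in header.split(",")]

# getaddresses understands quoted strings and returns (name, address) pairs.
parsed = getaddresses([header])

print(len(naive))  # 3 fragments for 2 addresses
print(parsed)
```

`getaddresses` takes a list of header values, so several `To`/`Cc` headers can be parsed in a single call.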
Ahmad Intisar	5885f150ab	fix: register WebDAVConnector in data_source __init__.py (#13121 ) What problem does this PR solve? The sync_data_source.py module imports WebDAVConnector from common.data_source, but WebDAVConnector was never registered in the package's __init__.py. This causes an ImportError at startup, crashing the data sync service: ImportError: cannot import name 'WebDAVConnector' from 'common.data_source' The webdav_connector.py file already exists in the common/data_source/ directory — it just wasn't exported. This PR adds the import and registers it in __all__. Type of change Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-02-12 16:05:58 +08:00
Magicbook1108	e89fd686e2	Improve: optimize file name (with path) in box container. (#13124 ) ### What problem does this PR solve? Refactor: optimize file name (with path) in box container. ### Type of change - [x] Performance Improvement	2026-02-12 15:40:55 +08:00
MkDev11	13a6545e48	fix(rdbms): use brackets around field names to preserve distinction after chunking (#13010 ) Fix RDBMS field separation after chunking by wrapping field names in brackets (【field】: value). This ensures fields remain distinguishable even when TxtParser strips newline delimiters during chunk merging. Closes #13001 Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com>	2026-02-06 14:44:58 +08:00
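The idea behind the bracket format can be sketched as follows. `format_row` is a hypothetical helper, not the RDBMS connector's code, and collapsing newlines with `replace` is only a stand-in for whatever the chunk merger does to delimiters.

```python
import re


def format_row(row: dict) -> str:
    # Hypothetical formatter: wrap each field name in 【】 so it stays
    # recognizable even if the newline between fields is later removed.
    return "\n".join(f"【{name}】: {value}" for name, value in row.items())


row = {"title": "Intro", "body": "Hello world"}
text = format_row(row)

# Simulate a merge step that drops newline delimiters.
merged = text.replace("\n", " ")

# Field names are still recoverable from the merged text.
fields = re.findall(r"【(.+?)】", merged)
print(merged)
print(fields)
```

With plain `name: value` lines, the same merge would leave no reliable marker for where one field ends and the next begins.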
Magicbook1108	1349e6b7d1	Fix: addressing style without a default value (#13009 ) ### What problem does this PR solve? Fix: addressing style without a default value #12396 #11510 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-05 13:52:23 +08:00

95 Commits