Commit Graph

4 Commits

Author SHA1 Message Date
web-dev0521
5de021ebb4 feat: implement Slack data source connector (#15188)
### What problem does this PR solve?

Closes #15187.

RAGFlow shipped a Slack connector
(`common/data_source/slack_connector.py`) but it was never usable:
`Slack._generate()` in the sync worker was a `pass` stub, the
connector's document-generating code was incompatible with the current
data model,
and Slack was commented out of the data-source settings UI. As a result,
teams had no way to index Slack channels/threads into a knowledge base.

This PR completes the connector end to end.

**Backend**

- `common/data_source/slack_connector.py`
- Rewrote `thread_to_doc` to produce a blob-based `Document`
(`extension`/`blob`/`size_bytes`). The previous implementation built the
doc with a `sections=[...]` argument and omitted the now-required
`blob`/`extension`/ `size_bytes` fields, so it raised a validation error
against the current `Document` model. Thread messages are now cleaned
and flattened into a single UTF-8 text blob.
- Added `load_from_state()` / `poll_source(start, end)` generators. The
connector's checkpoint interface is a no-op stub, so both full and
incremental syncs run through a single channel-iterating generator built
on the existing module helpers (`get_channels`, `filter_channels`,
`get_channel_messages`, `_process_message`), with per-channel thread
de-duplication.
- `rag/svr/sync_data_source.py`
- Implemented `Slack._generate()`. Credentials are loaded via
`StaticCredentialsProvider` (the connector requires `slack_bot_token`
and does not support `load_credentials`). Supports full reindex and
incremental polling from `poll_range_start`, plus the optional channel
filter. Modeled on the Confluence/Dropbox wrappers.
- `SlackConnector` was already exported from
`common/data_source/__init__.py`.

**Frontend (`web/`)**

- Enabled the `SLACK` data-source enum and added its form fields (Slack
bot token + optional channel filter), default values, display metadata,
and a Slack icon.
- Added `slackDescription` / `slackBotTokenTip` / `slackChannelsTip`
strings to `en.ts` and `zh.ts`.

**Tests**

- `test/unit_test/data_source/test_slack_connector_unit.py`: unit tests
covering credential loading (`load_credentials` raises,
`set_credentials_provider` initializes clients, missing credentials
raises) and document generation (standalone message + flattened thread,
blob/extension/size_bytes/metadata, and the incremental poll time
window). All 5 pass; `ruff check` is clean.

Required Slack scopes: `channels:read`, `channels:history`,
`users:read`.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 15:46:07 +08:00
web-dev0521
c4c4e228e3 feat: implement SharePoint data source connector (#15190)
### What problem does this PR solve?

Closes #15189.

RAGFlow shipped a SharePoint connector stub
(`common/data_source/sharepoint_connector.py`) whose document-loading
methods all returned `[]`, `SharePoint._generate()` was a `pass`, and
SharePoint was commented out of the data-source settings UI. As a result
there was no way to index files stored in SharePoint document libraries.

This PR implements the connector end to end on top of Microsoft Graph
(Office365-REST-Python-Client).

**Backend**

- `common/data_source/sharepoint_connector.py`
- `load_credentials()` now builds the Graph client using an MSAL
client-credentials **token callback** — the form `GraphClient` actually
expects. (The previous stub passed a raw access-token string to
`GraphClient(...)`, which is not how that client is driven.) Token
acquisition is lazy, so credential loading does no network call.
- `validate_connector_settings()` resolves the configured site via
Graph.
- `load_from_checkpoint()` is now a generator that enumerates every
document library under the site, walks folders depth-first, downloads
each file, and yields blob-based `Document` objects (`extension` /
`blob` / `size_bytes` / `doc_updated_at`). Incremental syncs are bounded
by file `lastModifiedDateTime`. Per-file errors are surfaced as
`ConnectorFailure` rather than aborting the run.
- `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument`
batches (no downloads) and the checkpoint helpers return proper
checkpoints.
- ACL → `ExternalAccess` mapping is intentionally left best-effort
(`load_from_checkpoint_with_perm_sync` delegates to the standard load)
because the sync pipeline does not currently persist `ExternalAccess`;
this can be extended once that plumbing exists.
- `rag/svr/sync_data_source.py`
- Implemented `SharePoint._generate()` using the existing
`CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google
Drive), supporting full reindex and incremental polling from
`poll_range_start`.
- `SharePointConnector` is already exported from
`common/data_source/__init__.py`.

**Frontend (`web/`)**

- Enabled the `SHAREPOINT` data-source enum and added its form fields
`site_url`, `tenant_id`, `client_id`, `client_secret`), default values,
display metadata, and a SharePoint icon.
- Added `sharepointDescription` / `sharepointSiteUrlTip` to `en.ts` and
`zh.ts`.

**Tests**

- `test/unit_test/data_source/test_sharepoint_connector_unit.py`:
mock-based unit tests covering credential loading (incomplete creds
raise, happy path sets the Graph client, fetch-without-creds raises),
drive traversal + file download, incremental `lastModifiedDateTime`
filtering, and slim-doc listing. All 6 pass; `ruff check` is clean.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 13:26:08 +08:00
dripsmvcp
ce9a4425d2 fix(imap): handle multi-address headers in _parse_singular_addr (#15006)
Replace the RuntimeError with a warning + first-address fallback so a
single email whose From header contains multiple addresses no longer
crashes the entire IMAP sync task. Also add regression tests covering:

- #14963: RFC 5322 quoted display names with commas (e.g. "Schlüter,
Sabine" <s@x>) parsed as one address, not two.
- #14964: multi-address headers warn instead of raising.

Closes #14964
Refs #14963
2026-05-21 15:37:02 +08:00
Ahmad Intisar
e994051eb9 Feature/generic api connector (#13545)
# feat: Add Generic REST API Connector

## What problem does this PR solve?

RAGFlow supports many specific data source connectors (MySQL, Slack,
Google Drive, etc.), but there was no way to connect an arbitrary REST
API as a data source. Users with custom or third-party APIs had to write
a new connector class for each one.

This PR adds a **generic, configuration-driven REST API connector** that
lets users connect any REST API as a data source entirely through the UI
— no code changes needed per API.

---

## Features

### Core Connector (`common/data_source/rest_api_connector.py`)

- Implements `LoadConnector` and `PollConnector` interfaces for full and
incremental sync
- **Configurable authentication:** None, API Key (custom header), Bearer
Token, Basic Auth
- **Pluggable pagination:** Page-based, Offset-based, Cursor-based, or
None
- Smart page-size inference from user's query parameters to avoid
duplicate/conflicting params
- Configurable request delay between pages to prevent API rate limiting
- Auto-detection of the items array in JSON responses (`items`,
`results`, `data`, `records`, or first list found)
- **Advanced field mapping** with dot-notation (`country.name`), array
wildcards (`newsType[*].name`), type hints, and default values
- Optional content template rendering (`"Title: {title}\nBody: {body}"`)
- HTML stripping for content fields
- Stable document IDs via `hash128` from a configurable ID field or
auto-generated from item content
- Pydantic configuration schema with automatic coercion of UI string
inputs to dicts/lists

### Backend Registration (`rag/svr/sync_data_source.py`,
`common/constants.py`, `common/data_source/config.py`)

- `REST_API` sync class wired into RAGFlow's `func_factory`
- Full sync (`load_from_state`) and incremental polling (`poll_source`)
support
- Credentials and config passed from task to connector following
existing patterns (MySQL, SeaFile, etc.)

### Test Connection Endpoint (`api/apps/connector_app.py`)

- `POST /v1/connector/<id>/test` validates config schema,
authentication, and API connectivity without triggering a sync
- Clear error messages for auth failures vs. config issues

### Frontend UI (`web/src/pages/user-setting/data-source/constant/`)

- **Postman-style configuration:** Base URL, Query Parameters (key=value
per line), Auth, Content Fields, Metadata Fields, Pagination Type
- Auth-type-aware form: fields for API key header/value, Bearer token,
or Basic username/password appear only when relevant
- **Advanced Settings** toggle for: Custom Headers, Max Pages, Request
Delay, Poll Timestamp Field, Request Body (POST)
- Connector icon (SVG) and i18n strings (English)
- **"Test Connection"** button to validate before syncing

---

## Controls & Safety

- Configurable max pages safety cap (default: 1000, adjustable in UI)
- Configurable request delay between pages (default: 0.5s, adjustable in
UI)
- Auth errors (401/403) fail immediately without retries; transient
errors retry with exponential backoff
- Diagnostic logging: auth setup confirmation, request details on
failure, content field extraction status

---

## Type of change

- [x] New Feature (non-breaking change which adds functionality)


##Visual Screenshots of Features
<img width="482" height="510" alt="Screenshot 2026-03-11 at 5 19 52 PM"
src="https://github.com/user-attachments/assets/dcb7ab4a-1622-44f3-bb02-d6f0527314c4"
/>
(Connector can be configured within the external data sources tab)

Configuration Parameters:
<img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 46 PM"
src="https://github.com/user-attachments/assets/5e154e71-4ab5-4872-bfb2-04f02b73c18a"
/>
<img width="661" height="682" alt="Screenshot 2026-03-11 at 5 20 54 PM"
src="https://github.com/user-attachments/assets/00cb14b7-0bcf-4b94-9d71-34e93369ecb2"
/>

Connection can be tested before attaching to dataset:
<img width="981" height="681" alt="Screenshot 2026-03-11 at 5 21 40 PM"
src="https://github.com/user-attachments/assets/aaa6eeeb-89a7-4349-bc34-2423bf8be9ee"
/>

Ingestion tested with API connector (works perfectly fine):
<img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 22 30 PM"
src="https://github.com/user-attachments/assets/afcd0d58-cadd-4152-badc-d2f14d96fbec"
/>

Search & Retrieval works as well with metadata flow:
<img width="1062" height="705" alt="Screenshot 2026-03-11 at 5 23 05 PM"
src="https://github.com/user-attachments/assets/d41ee935-dcf7-4456-b317-22a76ca032c0"
/>

---------

Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-13 20:35:01 +08:00