Commit Graph

6479 Commits

Author SHA1 Message Date
writinwaters
c2597f132e Docs: Added a guide on how to ingest an RSS feed. (#15467)
### What problem does this PR solve?

Added a guide on how to ingest an RSS feed.

### Type of change

- [x] Documentation Update
2026-06-01 20:23:36 +08:00
monsterDavid
d398d617ca fix(mineru): skip page chrome blocks to prevent duplicate chunks (#15387)
## Summary
- Skip MinerU `header`, `footer`, and `page_number` blocks when
converting `content_list.json` into sections.
- Ignore unsupported block types explicitly so future MinerU output
types cannot re-emit the previous text block.

Fixes duplicate text in General/naive chunks when parsing PDFs via
MinerU (reported with repeated page headers and body text in slices).

Closes #15335

## Test plan
- [x] `pytest test/unit_test/deepdoc/parser/test_mineru_parser.py -v`
(4/4 passed)
2026-06-01 20:15:04 +08:00
oktofeesh
f0e4f2d5d8 fix(go-models): apply custom Google base URLs (#15385)
## Summary
- Add custom `base_url` support to the Google Go model driver.
- Preserve Google URL suffix configuration when creating custom base URL
driver instances.
- Validate Google chat/stream request inputs before constructing the SDK
client.
- Cover Google model listing, connection checks, base URL resolution,
and request validation with focused tests.

## What changed
- `GoogleModel.NewInstance` now returns a Google driver configured with
the supplied base URL map.
- Google SDK client creation now resolves configured base URLs through
`genai.HTTPOptions.BaseURL`.
- Base URL lookup supports configured regions, empty-region keys, and
`default` fallback.
- Google chat, streaming chat, embeddings, and model listing now reject
blank API keys before creating SDK clients.
- Google chat and streaming chat now reject blank model names locally,
and streaming chat rejects a nil sender.
- Existing message handling, embeddings, pagination, and provider errors
are preserved.

## Why
Google custom model instances could not use configured base URLs because
`NewInstance` returned `nil` and the SDK client path ignored the driver
base URL map. The request validation keeps invalid Google calls from
reaching SDK client construction with blank credentials or incomplete
chat inputs.
2026-06-01 19:24:29 +08:00
euvre
fb3bd3de02 fix(deepdoc): add English caption patterns to fix missing figure/table numbering (#15481)
### What problem does this PR solve?
## Problem

When parsing PDFs containing English figure/table captions (e.g. "Fig.
20", "Figure 20", "Table 20"), the `is_caption` method in
`TableStructureRecognizer` failed to recognize them as captions. This
caused figure numbering gaps in the parsed output (e.g. Fig. 19 → Fig.
21, skipping Fig. 20).

## Root Cause

The `is_caption` regex only matched Chinese caption formats:

```python
patt = [r"[图表]+[ 0-9::]{2,}"]
```

When the layout recognizer also failed to assign a `caption` layout type
to a given text block, English captions were entirely missed.

## Fix

Added three case-insensitive English caption patterns to `is_caption` in
`deepdoc/vision/table_structure_recognizer.py`:

- `(?i)Fig\.?\s*\d+` — matches `Fig. 20`, `Fig 20`, `FIG. 20`, etc.
- `(?i)Figure\s+\d+` — matches `Figure 20`, `FIGURE 20`, etc.
- `(?i)Table\s+\d+` — matches `Table 20`, `TABLE 20`, etc.

## Files Changed

- `deepdoc/vision/table_structure_recognizer.py` — extended `is_caption`
regex patterns


- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: noob <yixiao121314@outlook.com>
2026-06-01 19:22:11 +08:00
Wang Qi
1a6df01b53 Bug fix: Enhance embeding model to give better error message (#15346)
To resolve https://github.com/infiniflow/ragflow/issues/15343 enhance
the model embedding message to give extact failure message to customer.


# QWen

## Retrieval
<img width="3321" height="1033" alt="image"
src="https://github.com/user-attachments/assets/6b82921a-a3a7-4a33-a383-1cf316398ee2"
/>

## Chat
<img width="2241" height="311" alt="image"
src="https://github.com/user-attachments/assets/ec311365-62d5-407a-8915-5c8d72be9716"
/>


# SiliconFlow
## Retrieval
<img width="3321" height="1033" alt="image"
src="https://github.com/user-attachments/assets/ee2cd191-a27d-4729-b53d-2fbdb4e352cd"
/>

## Chat
<img width="1562" height="210" alt="image"
src="https://github.com/user-attachments/assets/10376a8e-a3f4-422f-bc2e-96f2a8a96448"
/>

# Baichuan
## Retrieval
<img width="3321" height="1107" alt="image"
src="https://github.com/user-attachments/assets/dcb5409d-f7fc-4804-b186-5e1ee11e09c4"
/>

## Chat
<img width="2241" height="311" alt="image"
src="https://github.com/user-attachments/assets/ec311365-62d5-407a-8915-5c8d72be9716"
/>


# Zhipu
zhipu is good.
2026-06-01 19:18:16 +08:00
kpdev
252cc19f93 Infer Content-Type for document image endpoint (#15368)
## Summary

Fixes [#15367](https://github.com/infiniflow/ragflow/issues/15367) —
`GET /api/v1/documents/images/<image_id>` always returned `Content-Type:
image/JPEG` even for PNG/WebP chunk images and extensioned thumbnails.

## Related Issue

Fixes #15367

## Change Type

- [x] Bug fix
- [x] Regression tests
- [ ] New feature
- [ ] Refactor

## What Changed

- Added `_detect_image_content_type_from_bytes()` —
PNG/JPEG/GIF/WebP/BMP magic-byte detection
- Added `_content_type_for_document_image()` — object-key extension via
`CONTENT_TYPE_MAP`, then magic bytes, else `application/octet-stream`
- **`get_document_image()`** — set inferred `Content-Type` instead of
hardcoded `image/JPEG`
- Also guards missing storage blob (`Image not found.`) to avoid
`make_response(None)` (same handler; complements #15365)

## Files Changed

| File | Change |
|------|--------|
| `api/apps/restful_apis/document_api.py` | MIME inference helpers +
handler update |
|
`test/testcases/test_web_api/test_document_app/test_document_metadata.py`
| 3 unit tests |

## Validation

```bash
cd /root/gittensor/ragflow
pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_content_type_from_object_extension_unit -v
pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_content_type_from_magic_bytes_unit -v
pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py::TestDocumentMetadataUnit::test_get_document_image_missing_blob_unit -v
```

## Test Plan

- [x] `.png` object key → `image/png`
- [x] Extensionless chunk key + PNG bytes → `image/png` (magic bytes)
- [x] Missing blob → 4xx `"Image not found."`
- [ ] CI green
2026-06-01 19:08:32 +08:00
kpdev
b35266e9a5 Return 4xx when file download storage blob is missing (#15371)
## Summary

Fixes [#15369](https://github.com/infiniflow/ragflow/issues/15369) —
`GET /api/v1/files/<file_id>` calls `make_response(None)` when both
primary and fallback storage lookups return empty, causing HTTP 500.

## Related Issue

Fixes #15369

## Change Type

- [x] Bug fix
- [x] Regression tests

## What Changed

- **`file_api.download()`** — after fallback `STORAGE_IMPL.get`, return
`get_error_data_result(message="This file is empty.")` when `not blob`,
matching document REST download semantics.

## Files Changed

| File | Change |
|------|--------|
| `api/apps/restful_apis/file_api.py` | Empty-blob guard before
`make_response()` |
| `test/testcases/test_web_api/test_file_app/test_file_routes_unit.py` |
Regression test |

## Validation

```bash
cd /root/gittensor/ragflow
pytest test/testcases/test_web_api/test_file_app/test_file_routes_unit.py::test_download_missing_blob_returns_error -v
pytest test/testcases/test_web_api/test_file_app/test_file_routes_unit.py::test_download_falls_back_to_document_storage -v
```

## Test Plan

- [x] Both storage paths empty → `"This file is empty."` (no
`make_response(None)`)
- [x] Existing fallback success test still passes
- [ ] CI green
2026-06-01 19:08:06 +08:00
balibabu
f194e8b4c4 Fix: The newly added model did not appear in the drop-down menu. (#15476)
### What problem does this PR solve?

Fix: The newly added model did not appear in the drop-down menu.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-01 17:56:41 +08:00
euvre
1e80419c21 fix: restore TitleChunker output for json/chunks upstream formats (#15396)
fix: restore TitleChunker output for json/chunks upstream formats

## Summary

The refactor commit e194027b (#14247) introduced two regressions that
caused `TitleChunker` to produce zero chunks when the upstream Parser
node outputs `json` or `chunks` format (e.g. PDF parsing).

## Root Cause

### 1. Dead code in `extract_line_records` (critical)

After refactor, when `payload` is `None` (which is the case for `json`
and `chunks` output formats), the method returns an empty list
immediately via `return []`, so no records are ever extracted from
structured upstream output. The original `json`/`chunks` handling code
became unreachable dead code.

### 2. Unconditional overwrite in `build_chunks_from_record_groups`

The `chunks` variable assigned in the `if` branch for markdown/text/html
formats was unconditionally overwritten by the statement below it, due
to a missing `else` keyword.

## Fix

- Remove the premature `return []` so the `json`/`chunks` branch is
reachable again.
- Add `else` branch in `build_chunks_from_record_groups` so the two
format families are handled independently.

## Test Plan

- [x] Verified no lint errors on the changed file
- [ ] Tested with a PDF document parsed via DeepDOC → TitleChunker
pipeline
- [ ] Tested with markdown input through TitleChunker
- [ ] Tested hierarchy and group chunking modes

## Impact

- Fixes the regression where documents parsed with `json`/`chunks`
output format produced no chunks from `TitleChunker`.
- No API or configuration changes. Fully backward compatible.

Signed-off-by: noob <yixiao121314@outlook.com>
2026-06-01 17:14:22 +08:00
balibabu
82202fa469 Fix: Unable to create dataset (#15472)
### What problem does this PR solve?

Fix: Unable to create dataset

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-01 15:30:52 +08:00
Wang Qi
10e8690890 GraphRAG - NER - spacy - fix spacy extraction (#14783)
Fix spacy extraction
2026-06-01 13:05:54 +08:00
sxxtony
12579dbc3d Go: implement dataset ingestion log APIs (#15421)
### What problem does this PR solve?

Part of the Python → Go API server rewrite tracked in #15240 (Dataset
ingestion section). This PR implements the three dataset ingestion
endpoints in the Go API server, mirroring the existing Python
`dataset_api_service` behaviour:

- `GET /api/v1/datasets/<dataset_id>/ingestions/summary`
- `GET /api/v1/datasets/<dataset_id>/ingestions`
- `GET /api/v1/datasets/<dataset_id>/ingestions/<log_id>`

### Type of change

- [x] Refactoring
- [x] New Feature (non-breaking change which adds functionality)

Co-authored-by: sxxtony <sxxtony@users.noreply.github.com>
2026-06-01 11:23:44 +08:00
glorydavid03023
3774916060 Go: implement Embed in GPUStack driver (#15182)
### What problem does this PR solve?

The Go GPUStack driver returned a stub error for `Embed()` even though
GPUStack exposes OpenAI-compatible embeddings on the **v1-openai** route
(not `v1/embeddings`).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-01 11:22:43 +08:00
Haruko386
2d7044b57e feat[Go] implement api/v1/thumbnails API (#15416)
### What problem does this PR solve?

As title

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality
2026-06-01 11:22:08 +08:00
Idriss Sbaaoui
da1ed6f0e7 Feat: add new tests and tescases for restful api suite (#15347)
### What problem does this PR solve?

extend restful api suite

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Other (please describe): test
2026-06-01 11:02:40 +08:00
Wang Qi
4972af4367 Fix memory empty issue (#15411)
Fix memory empty issue
2026-06-01 10:25:56 +08:00
balibabu
e13431cdc0 Fix: If the filename is too long, it overflows the confirmation box for deleting the file. (#15287)
### What problem does this PR solve?

Fix: If the filename is too long, it overflows the confirmation box for
deleting the file.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-01 10:22:56 +08:00
web-dev0521
cd18cfab79 feat(connector): implement Outlook data source connector (issue #15332) (#15333)
### What problem does this PR solve?

Closes #15332.

RAGFlow can index Gmail and generic IMAP mailboxes but had no native
connector for Outlook / Microsoft 365 mail. Organisations on Microsoft
365 had no way to bring mailbox content into a knowledge base through
Microsoft Graph.

This PR adds a net-new Outlook data source that:

- Authenticates against Microsoft Graph with the same MSAL
client-credentials flow already used by the SharePoint and Teams
  connectors (no new auth primitives).
- Pages over `/users/{id}/mailFolders/{folder}/messages/delta` per
mailbox and persists `@odata.deltaLink` values in
`OutlookCheckpoint.delta_links`, so incremental syncs only fetch changed
messages.
- Supports two scoping modes:
- **Tenant-wide** (default): enumerates every user in the tenant via
`/users` and syncs each mailbox. Requires `User.Read.All`.
- **Targeted**: when `user_ids` is provided (comma-separated UPNs or
object IDs), only those mailboxes are synced. `User.Read.All` is not
needed in this mode.
- Lets the caller pick the mail folder (`inbox`, `sentitems`, `archive`,
...). Defaults to `inbox`.
- Maps each message to a `Document` shaped after the Gmail connector:
one `TextSection` carrying `From/To/Cc/Subject` headers + body, with
HTML bodies stripped to text inline (no extra dependency).
- Surfaces typed errors on the validation probe:
401 → `ConnectorMissingCredentialError`, 403 →
`InsufficientPermissionsError` (with `Mail.Read` / `User.Read.All`
hint), 404 on a configured mailbox → `ConnectorValidationError`, 5xx →
`UnexpectedValidationError`.
- Skips messages flagged `@removed` by the delta semantics and messages
whose `receivedDateTime` is older than `poll_range_start`.

#### Files

| File | Change |
|------|--------|
| `common/data_source/outlook_connector.py` | **New** —
`OutlookConnector` (`CheckpointedConnectorWithPermSync` +
`SlimConnectorWithPermSync`) + `OutlookCheckpoint` + tiny `_strip_html`
helper. |
| `common/data_source/config.py` | `DocumentSource.OUTLOOK = "outlook"`.
|
| `common/constants.py` | `FileSource.OUTLOOK = "outlook"`. |
| `common/data_source/__init__.py` | Export `OutlookConnector`. |
| `rag/svr/sync_data_source.py` | `Outlook(SyncBase)` with `batch_size`
normalisation, CSV/list parsing of `user_ids`; registered in
`func_factory`. |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`DataSourceKey.OUTLOOK`, visibility map (`syncDeletedFiles: true`), info
entry, form fields (tenant_id, client_id, client_secret, folder,
user_ids, batch_size), default values. |
| `web/src/locales/en.ts`, `web/src/locales/zh.ts` |
`outlookDescription` + 5 tooltip keys (EN + ZH). |
| `test/unit_test/data_source/test_outlook_connector_unit.py` | **New**
— 19 unit tests (`p1`/`p2`/`p3`) covering auth, validation (tenant-wide
vs specific user vs error paths), checkpoint helpers, user enumeration
pagination, message filtering, HTML body stripping. |

#### Required Azure AD permissions

- `Mail.Read` (Application, admin-granted) — always.
- `User.Read.All` (Application, admin-granted) — only when `user_ids` is
left blank so the connector can enumerate mailboxes.

#### Out of scope

- **Attachment indexing.** The current connector emits message body +
headers; binary attachments are flagged via `metadata.has_attachments`
but not pulled. Adding attachment hydration is straightforward but
scoped out per the issue's "decide whether attachments are indexed in
the first version" note.
- **Delegated (per-user) OAuth.** The connector uses app-only
credentials, consistent with the SharePoint / Teams precedent in this
codebase.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-29 21:52:29 +08:00
Rintaro
11af34a895 fix(opensearch): repair document-metadata path broken by #14577 (#15393)
### What problem does this PR solve?

Document metadata is completely broken on the OpenSearch backend
(`DOC_ENGINE=opensearch`). Both failures were introduced by #14577,
which added
a doc-metadata dispatch surface but only validated it against
Elasticsearch.

**1. Index creation rejected (`mapper_parsing_exception`).**
`OSConnection.create_doc_meta_idx` feeds `conf/doc_meta_es_mapping.json`
verbatim to OpenSearch. That file declares a top-level `"dynamic":
"runtime"`.
Runtime fields are Elasticsearch-only; OpenSearch cannot parse the
value:

mapper_parsing_exception: Could not convert [dynamic.dynamic] to boolean
(400)

**2. `search()` signature mismatch (`TypeError`).**
`DocMetadataService` (added by #14577) calls `docStoreConn.search(...)`
with
snake_case kwargs (`select_fields=`, `index_names=`,
`knowledgebase_ids=`, …),
matching `ESConnection.search`. But `OSConnection.search` still uses
camelCase
parameters (`selectFields`, `indexNames`, `knowledgebaseIds`, …):

TypeError: OSConnection.search() got an unexpected keyword argument
'select_fields'

The UI then shows "0 fields" for every document on OpenSearch.

### Fix

1. In `OSConnection.create_doc_meta_idx`, normalize a top-level
`"dynamic": "runtime"` to `True` **for the OpenSearch request only**.
The
shared mapping file is left untouched, so the Elasticsearch backend
keeps its
runtime-field behavior. Dynamic field discovery is preserved on
OpenSearch.
2. Rename the `OSConnection.search()` parameters (and their in-method
local
uses) from camelCase to snake_case so they match `ESConnection.search()`
and
the `DocMetadataService` call sites. The change is confined to
`search()`;
`get/insert/update/delete` keep their existing positional signatures
(they
   are called positionally from `rag/nlp/search.py`).

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)

### Affected backends
OpenSearch only. Elasticsearch, Infinity and OceanBase are untouched.

### How to reproduce
1. `DOC_ENGINE=opensearch`, restart the stack.
2. Upload/parse a document, then open the dataset's document list / set
metadata.
- Before: index creation 400s (`Could not convert [dynamic.dynamic]`),
and/or
     `TypeError ... 'select_fields'`; document metadata shows 0 fields.

### Risk & backward compatibility
- ES default deployment: no change. `doc_meta_es_mapping.json` is not
modified,
  so ES still receives `"dynamic": "runtime"`.
- `search()` rename is internal; the only kwarg caller
(`DocMetadataService`)
  already uses the snake_case names this PR aligns to.

### Test plan
- [ ] `DOC_ENGINE=opensearch`: per-tenant `ragflow_doc_meta_*` index is
created
(no `mapper_parsing_exception`); document metadata reads/writes work.
- [ ] `DOC_ENGINE=elasticsearch` regression: doc-meta index still
created with
      runtime mapping; metadata unchanged.
2026-05-29 21:49:36 +08:00
Rintaro
3dfc16973c fix(opensearch): implement get_scores for KNN second-pass scoring (#15390)
### What problem does this PR solve?

On the OpenSearch backend (`DOC_ENGINE=opensearch`), every retrieval
that
performs the KNN second-pass scoring crashes with:

    AttributeError: 'OSConnection' object has no attribute 'get_scores'

**Root cause.** #14970 ("Refactor: Drop the vector fetch for ES") added
a
`get_scores()` helper to `ESConnectionBase`
(`common/doc_store/es_conn_base.py`)
and introduced `Dealer._knn_scores()` in `rag/nlp/search.py`, which
calls
`self.dataStore.get_scores(res)`. `search.py` routes Infinity and
OceanBase to
their own similarity paths via `DOC_ENGINE_INFINITY` /
`DOC_ENGINE_OCEANBASE`,
but OpenSearch sets neither flag, so it falls into the Elasticsearch
branch and
calls `get_scores`. `OSConnection` (which subclasses
`DocStoreConnection`
directly, not `ESConnectionBase`) never received that method, so any
vector-search hit triggers the crash. It reproduces with any normal
embedding
(e.g. 1024-dim mistral-embed) as soon as a KNN query returns hits.

### Fix

Add `OSConnection.get_scores()`, mirroring
`ESConnectionBase.get_scores()`.
OpenSearch hit headers expose `_score` exactly like Elasticsearch (the
existing
`OSConnection.__getSource` already reads `d["_score"]`), so the
implementation
is identical.

Scope note: Infinity and OceanBase deliberately do not use `get_scores`
(#14970 routes them elsewhere), so this fix is intentionally limited to
the
OpenSearch backend, which is the only one reaching the ES KNN-score
path.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Affected backends
OpenSearch only. Elasticsearch already implements `get_scores`; Infinity
/
OceanBase are routed away from it.

### How to reproduce
1. `DOC_ENGINE=opensearch` (docker `.env`), restart the stack.
2. Create a knowledge base with any dense embedding model and parse a
document.
3. Run a retrieval / chat over that KB -> 500 with the AttributeError
above.

### Risk & backward compatibility
None for the default Elasticsearch deployment -- the change only adds a
method
to `OSConnection`. No default values or ES/Infinity/OceanBase behavior
change.

### Test plan
- [ ] With `DOC_ENGINE=opensearch`, retrieval over a KB returns scored
chunks
      (no AttributeError).
- [ ] `DOC_ENGINE=elasticsearch` regression: retrieval unchanged.
- [ ] Empty-result path: `_knn_scores` early-returns `{}` (guarded),
get_scores
      handles an empty `hits` list gracefully.
2026-05-29 21:49:15 +08:00
jony376
a2500fed43 fix(api): move dify retrieval health check to /dify/retrieval/health (#15311)
### Related issues
Closes #15310

### What problem does this PR solve?

`/api/v1/dify/retrieval` had duplicate `GET` route registrations in
`dify_retrieval_api.py`: one for authenticated retrieval and another for
unauthenticated health checks. Sharing the same path and method created
ambiguous routing behavior and an unstable API contract for Dify
external knowledge base integration.

This PR separates concerns by moving the health-check endpoint to `GET
/api/v1/dify/retrieval/health`, while keeping retrieval on
`/api/v1/dify/retrieval`. This makes auth behavior deterministic and
prevents route shadowing/conflicts.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-29 21:47:55 +08:00
OrbisAI Security
b4c8711d51 fix: upgrade crawl4ai to 0.8.0 (CVE-2026-26217) (#15415)
## Summary
Upgrade crawl4ai from 0.7.6 to 0.8.0 to fix CVE-2026-26217.

## Vulnerability
| Field | Value |
|-------|-------|
| **ID** | CVE-2026-26217 |
| **Severity** | CRITICAL |
| **Scanner** | trivy |
| **Rule** | `CVE-2026-26217` |
| **File** | `uv.lock` |
| **Assessment** | Likely exploitable |

**Description**: Crawl4AI Has Local File Inclusion in Docker API via
file:// URLs

## Evidence

**Scanner confirmation**: trivy rule `CVE-2026-26217` flagged this
pattern.

**Production code**: This file is in the production codebase, not
test-only code.

## Threat Model Context

This is a web service - vulnerabilities in request handlers are directly
exploitable by remote attackers.

## Changes
- `pyproject.toml`
- `uv.lock`

## Verification
- [x] Build passes
- [x] Scanner re-scan confirms fix
- [x] LLM code review passed

---
*This change addresses a pattern flagged by static analysis. The code
path handles user-influenced input and the fix reduces the attack
surface against both manual and automated exploitation.*

---
*Automated security fix by [OrbisAI Security](https://orbisappsec.com)*
2026-05-29 21:38:41 +08:00
Attili-sys
a28a0c6986 File addition .rooignore (#15414)
This PR introduces a `.rooignore` file to the root of the repository to
optimize how AI coding assistants (like Roo) interact with the RAGFlow
codebase.

Currently, when AI agents index the workspace, they can waste tokens and
processing time reading through generated files, caches, large
dependency artifacts, and runtime logs. This `.rooignore` file provides
a standard configuration to exclude these irrelevant directories and
files (such as `.venv/`, `node_modules/`, `__pycache__/`, logs, and
large binaries). This significantly reduces indexing noise, prevents
accidental reads of sensitive or bulky local data, and ensures AI coding
agents remain focused strictly on relevant source code.

### Type of change

- [x] Other (please describe): Developer Experience (DX) / AI Tooling
configuration
2026-05-29 20:37:44 +08:00
Hz_
539d38bc20 fix: backfill missing api token beta values (#15405)
### What problem does this PR solve?

This PR updates `SystemService.ListAPITokens` to lazily backfill missing
`beta` values for API tokens, matching the Python behavior of
`/api/v1/system/tokens`.

### Type of change
  
  - When an API token has an empty `beta`, generate a new one.
  - Persist the generated `beta` back to the `api_token` table.
  - Keep the handler/routing unchanged.
- `GET /api/v1/system/tokens` now returns tokens with `beta` filled in
for older records that were missing it.
  - This aligns Go behavior with the Python implementation.
2026-05-29 20:04:10 +08:00
oktofeesh
be28177955 fix(go-models): harden Hunyuan embedding validation (#15249)
## Summary
- Validate Hunyuan embedding model name and API key before building
requests.
- Reuse region-aware base URL validation for embedding requests.
- Replace the stale unsupported Embed test with happy-path and
validation coverage.

## What changed
- Added early Hunyuan Embed validation for missing model names and API
keys.
- Routed Embed through the same base URL region guard used by the other
Hunyuan methods.
- Updated Hunyuan tests to configure the embedding suffix and cover
Embed success plus invalid inputs.

## Why
Hunyuan Embed is implemented, but the existing test still expected it to
be unsupported and could panic before returning a normal validation
error. This keeps the implemented embedding path aligned with the
current driver behavior and prevents nil input panics.

Closes #15087
Refs #14736
2026-05-29 19:50:01 +08:00
galuis116
d1f6594618 Fix: JWT algorithm-confusion in OIDC ID token verification (#15181)
### What problem does this PR solve?

Closes #15180.

`OIDCClient.parse_id_token` in `api/apps/auth/oidc.py` read the JWT
signing
algorithm from the **unverified** JWT header and passed it through to
`jwt.decode(..., algorithms=[alg], ...)` as the trust anchor. This is
the
textbook JWT algorithm-confusion vulnerability (CWE-345 / CWE-347). Any
unauthenticated client capable of reaching the OIDC callback could take
over
an arbitrary account on any RAGFlow deployment with OIDC login enabled:

1. **`alg: "none"`** — present a JWT with `{"alg": "none"}` and no
   signature segment → `jwt.decode(..., algorithms=["none"])` → PyJWT's
   `NoneAlgorithm` accepts the token without verification → login as any
   user.
2. **RSA / HMAC confusion** — fetch the public RSA key from the
provider's
   JWKS (it's public), forge a JWT with `{"alg": "HS256"}` HMAC-signed
   using the public-key bytes as the secret → `jwt.decode(...,
   algorithms=["HS256"], key=public_key)` → verifier accepts → login as
   any user. (Modern PyJWT independently refuses to use a PEM-formatted
   key as an HMAC secret, which mitigates this leg for PEM key formats;
the fix here is the only mitigation for raw / DER / JWK octet keys and
   for older PyJWT versions.)

### What changed

**`api/apps/auth/oidc.py`:**

- New module constants `_ALLOWED_OIDC_SIGNING_ALGS` (asymmetric-only:
  `RS*`, `ES*`, `PS*`, `EdDSA` — explicitly excludes `none` and `HS*`)
  and `_DEFAULT_OIDC_SIGNING_ALGS = ("RS256",)` (the OIDC Core 1.0 §2
  spec default).
- New helper `_resolve_id_token_signing_algs(metadata)` — intersects the
  provider's advertised `id_token_signing_alg_values_supported` from
`/.well-known/openid-configuration` with the safe allowlist; falls back
  to RS256 when the field is missing or contains only unsafe values.
- `OIDCClient.__init__` now stores the resolved allowlist on
  `self.id_token_signing_algs` — pinned once, from a trusted source, at
  construction time.
- `parse_id_token` no longer calls `jwt.get_unverified_header` and no
  longer reads `alg` from the JWT header. It passes
  `self.id_token_signing_algs` to `jwt.decode(..., algorithms=...)`.
  `PyJWKClient.get_signing_key_from_jwt` still reads the `kid` from the
  header internally for JWKS lookup — that's fine, `kid` is not a
  security decision; the signature still proves which key was actually
  used.


**`test/testcases/test_web_api/test_auth_app/test_oidc_client_unit.py`:**

- Existing `test_parse_id_token_success_and_error` drops its
`jwt.get_unverified_header` mock (no longer called by `parse_id_token`).
- `_metadata` and `_make_client` helpers grew an optional `signing_algs`
parameter so tests can configure what the discovery document advertises.
- New `TestSSRFValidation` / algorithm-confusion regression block (7
  tests):
  - `test_id_token_signing_algs_default_to_rs256_when_metadata_missing`
  - `test_id_token_signing_algs_intersect_metadata_with_safe_allowlist`
  - `test_id_token_signing_algs_fall_back_when_only_unsafe_advertised`
  - `test_id_token_signing_algs_ignores_non_string_entries`
  - `test_id_token_signing_algs_handles_non_list_metadata_field`
  - `test_parse_id_token_passes_pinned_algorithms_to_jwt_decode` —
    sabotages `jwt.get_unverified_header` to raise on call, proving the
    verification path never consults the unverified header.
- `test_parse_id_token_rejects_alg_none` — uses real PyJWT to encode an
    `alg: "none"` token; `parse_id_token` raises `ValueError("Error
    parsing ID Token: …")` instead of accepting it.
  - `test_parse_id_token_rejects_hs256_when_allowlist_is_asymmetric` —
    uses real PyJWT to forge an `alg: "HS256"` token with a non-PEM
    shared secret (so PyJWT's incidental PEM-as-HMAC refusal isn't what
    blocks it); `parse_id_token` raises because `HS256` is not in the
    pinned allowlist.

Sanity-checked end-to-end with real PyJWT outside the project test
runner:

- `alg=none` forged token + `algorithms=["RS256"]` →
`InvalidAlgorithmError` ✓
- `alg=HS256` forged token + `algorithms=["RS256"]` →
`InvalidAlgorithmError` ✓
- Same `alg=HS256` token + `algorithms=["HS256"]` → **accepted**
({'sub': 'admin'})
  — confirming the attack path was real before the fix.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: galuis116 <contact@duerrimports.com>
2026-05-29 19:37:01 +08:00
kpdev
cb1ea5a47f Validate chunk image_base64 before doc-store write (#15364)
## Summary

Fixes [#15363](https://github.com/infiniflow/ragflow/issues/15363) —
`add_chunk` / `update_chunk` indexed chunks with `image_id` before
validating or storing `image_base64`, leaving orphan chunks on invalid
input.

## Related Issue

Fixes #15363

## Change Type

- [x] Bug fix
- [x] Regression tests

## What Changed

- Added `_decode_chunk_image_base64()` — strict base64 decode with
structured 4xx errors
- Added `_store_chunk_image_or_error()` — catches `store_chunk_image`
failures
- **`add_chunk` / `update_chunk`**: decode + store image **before**
`docStoreConn.insert` / `update`; only set `img_id` after successful
storage

## Files Changed

| File | Change |
|------|--------|
| `api/apps/restful_apis/chunk_api.py` | Helpers + reorder image
handling |
| `test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py`
| 3 regression tests |

## Validation

```bash
cd /root/gittensor/ragflow
pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_add_chunk_invalid_image_base64_does_not_index_chunk -v
pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_update_chunk_invalid_image_base64_does_not_update_chunk -v
pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py::test_restful_add_chunk_valid_image_base64_stores_before_insert -v
pytest test/testcases/test_web_api/test_chunk_app/test_chunk_routes_unit.py -v
```

## Test Plan

- [x] Invalid `image_base64` on add → 4xx, no doc-store insert
- [x] Invalid `image_base64` on update → 4xx, no doc-store update
- [x] Valid PNG base64 on add → image stored, chunk indexed with
`img_id`
- [ ] CI green
2026-05-29 19:36:46 +08:00
Dexterity
04aa8d04e8 fix(go-models): raise SSE scanner buffer so large stream chunks are not dropped (#15382)
### Summary

Closes #15381 

Every provider in `internal/entity/models/` reads its streaming response
with `bufio.NewScanner(resp.Body)` and iterates over `scanner.Scan()`.
The default `bufio.Scanner` maximum token size is 64KB, so when an
upstream sends a single SSE `data:` line larger than 64KB (long content
deltas, large tool or function call argument blobs, bundled
`reasoning_content`, or providers that emit a whole message in one
event) `scanner.Scan()` returns `false` and `scanner.Err()` returns
`bufio.ErrTooLong`. Streaming chat then ends with an error partway
through the response.

This change adds `scanner.Buffer(make([]byte, 64*1024), 1024*1024)`
immediately after every SSE scanner that was still bare, raising the cap
to 1MB. 1MB is the value already used for streaming chat in `openai.go`,
`modelscope.go`, `groq.go`, `mistral.go`, `xai.go` and the other already
patched providers (the 8MB cap in the repo is reserved for TTS and
embedding paths), so this simply converges the remaining providers onto
the established pattern. Nothing else changes: line parsing, `data:`
prefix handling, `[DONE]` detection, JSON unmarshalling, error handling,
and the existing `scanner.Err()` checks all stay the same.

Providers covered (23 scanners across 22 files): 302ai, aliyun,
baichuan, baidu, cohere, deepinfra, deepseek, gitee, huggingface,
lmstudio, minimax (the chat scanner, whose TTS scanner was already
bumped), moonshot, nvidia, ollama, openrouter, orcarouter, paddleocr,
siliconflow, tokenhub, vllm, volcengine, xunfei, zhipu-ai. `jiekouai.go`
is excluded because it is covered by the in flight #15337.

A table driven regression test (`sse_scanner_buffer_test.go`) streams a
single 128KB `data:` content delta followed by `data: [DONE]` through an
`httptest` server and asserts that `ChatStreamlyWithSender` delivers the
full content with no error across a representative subset of providers.
Without the buffer fix the test fails with `bufio.Scanner: token too
long`.

This PR also removes three duplicate declarations of the package level
`roundTripperFunc` test helper that several recently merged provider PRs
each added independently, which had left the `internal/entity/models`
test package unable to compile. The helper now lives in a single place
and is shared.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-29 19:34:00 +08:00
monsterDavid
53bb2bd9e8 fix(metadata): preserve empty AND results across filter conditions (#15386)
## Summary
- Fix `meta_filter()` AND logic so an empty result from an early
condition is not overwritten when a later condition matches.
- Add regression tests for empty-first AND, successful AND intersection,
and OR behavior after an empty first condition.

Fixes incorrect `/retrieval` metadata filtering when multiple AND
conditions are used and the first condition matches no documents.

Closes #15360

## Test plan
- [x] `pytest test/unit_test/common/test_metadata_filter_operators.py
-v` (19/19 passed)
2026-05-29 19:33:26 +08:00
bitloi
2d229dd8aa fix(go): resolve custom base_url for empty default region (#15043)
### What problem does this PR solve?

Fixes custom `base_url` resolution when a model instance has no
configured region.

Some drivers read custom base URLs from `BaseURL[""]` when
`apiConfig.Region` is empty, while others normalize empty region to
`"default"` and read `BaseURL["default"]`. This PR adds the `"default"`
alias only for empty-region custom base URLs while preserving the
existing empty-region key.

Closes #15042

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-29 19:33:09 +08:00
Haruko386
d766e49128 feat[Go]: implement /system/stats and refactor /system/config/log (#15407)
### What problem does this PR solve?

As title

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2026-05-29 19:32:21 +08:00
Hz_
d2f0a18f42 fix: persist logout access token invalidation (#15397)
### What this PR fixes

This PR fixes an issue in the Python backend where user logout did not
reliably persist the invalidated access_token to the database.
Although the logout endpoint returned success and logged that the token
had been invalidated, the user.access_token value could remain
unchanged in the database, which meant the previous login token could
stay valid longer than expected.

  ### What changed

  - Resolve the real user object before updating the token
  - Persist the invalidated access_token before calling logout_user()
- Return a server error if the token update is not written successfully

  ### Impact

- Logging out now correctly replaces the stored access_token with an
INVALID_... value
  - The previous login session is properly invalidated
- The change is limited to the logout flow and is intentionally small in
scope
2026-05-29 19:31:45 +08:00
Alexander Laurent
faa9c5469e feat: add Go MCP server delete API (#15262)
## What

#15240
Implementation for DELETE /api/v1/mcp/servers/:mcp_id
2026-05-29 19:29:55 +08:00
Hz_
09e91a8e61 Fix user registration initialization in Go API (#15349)
### What problem does this PR solve?

This PR fixes several behavior gaps in the Go implementation of the user
registration API.

### Type of change

- Make `nickname` required for user registration.
- Align registration error messages and response data with expected API
behavior.
- Handle password decryption errors for registration more consistently.
- Generate UUID v1-style IDs for new users, access tokens, tenants,
user-tenant records, and root files.
- Initialize default user fields during registration, including:
  - language
  - color schema
  - timezone
  - last login time
- Create user, tenant, user-tenant relation, tenant LLM records, and
root folder in a single DB transaction.
- Initialize default tenant LLM records from configured default models.
- Avoid partial registration data when one creation step fails.
- Use locale-based default language fallback for user profile responses.
2026-05-29 19:29:23 +08:00
呆萌闷油瓶
658ff06ca4 feat: add 4 new models for siliconflow (#15383)
### What problem does this PR solve?

Added 4 new models:
deepseek-ai/DeepSeek-V4-Pro
deepseek-ai/DeepSeek-V4-Flash
Pro/moonshotai/Kimi-K2.6
Pro/zai-org/GLM-5.1

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-29 19:28:29 +08:00
web-dev0521
bda2117a25 feat(connector): implement OneDrive data source connector (issue #15330) (#15331)
### What problem does this PR solve?

Closes #15330.

RAGFlow had no connector for OneDrive / OneDrive for Business. Users who
store working documents in OneDrive could not index them into a
knowledge base without manually downloading and re-uploading files.

This PR adds a net-new OneDrive data source that:

- Authenticates against Microsoft Graph with the same MSAL
client-credentials flow already used by the SharePoint and Teams
connectors (no new auth primitives).
- Enumerates every drive visible to the service principal and pages
through `/drives/{id}/root/delta`, persisting `@odata.deltaLink` values
per drive so subsequent syncs only fetch changed items.
- Optionally narrows ingestion to a sub-folder (`folder_path`) without
needing a separate code path.
- Surfaces typed errors on the validation probe (`GET /drives?$top=1`):
401 → `ConnectorMissingCredentialError`, 403 →
`InsufficientPermissionsError` (with a `Files.Read.All` hint), 5xx →
`UnexpectedValidationError`.
- Filters folders, soft-deleted items, and unsupported extensions (`.pdf
.docx .doc .xlsx .xls .pptx .ppt .txt .md .csv`).

#### Files

| File | Change |
|------|--------|
| `common/data_source/onedrive_connector.py` | **New** —
`OneDriveConnector` + `OneDriveCheckpoint`. |
| `common/data_source/config.py` | `DocumentSource.ONEDRIVE =
"onedrive"`. |
| `common/constants.py` | `FileSource.ONEDRIVE = "onedrive"`. |
| `common/data_source/__init__.py` | Export `OneDriveConnector`. |
| `rag/svr/sync_data_source.py` | `OneDrive(SyncBase)` with `batch_size`
normalisation; registered in `func_factory`. |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`DataSourceKey.ONEDRIVE`, visibility map (`syncDeletedFiles: true`),
info entry, form fields (tenant_id, client_id, client_secret,
folder_path, batch_size), default values. |
| `web/src/locales/en.ts`, `web/src/locales/zh.ts` |
`onedriveDescription` + 4 tooltip keys (EN + ZH). |
| `test/unit_test/data_source/test_onedrive_connector_unit.py` | **New**
— 13 unit tests (`p1`/`p2`) covering auth, validation, checkpoint
helpers, and document filtering. |

#### Required Azure AD permission

`Files.Read.All` (Application, admin-granted).

#### Out of scope

- Interactive end-user OAuth (delegated permissions) — the connector
uses app-only credentials, consistent with the SharePoint / Teams
precedent.
- Binary download of file contents — the sync layer emits `Document`s
carrying `webUrl` + metadata; bytes are hydrated downstream by the parse
pipeline.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-29 19:26:06 +08:00
buua436
bd6251f462 Fix: default OpenAI chat completions to non-stream (#15394)
### What problem does this PR solve?

default OpenAI chat completions to non-stream when `stream` is omitted
https://github.com/infiniflow/ragflow/issues/15356
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-29 17:47:47 +08:00
Lynn
dc4b82523b Feat: tenant llm provider (#14595)
### What problem does this PR solve?

Python implementation of the Go-based model_provider API suite.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: bill <yibie_jingnian@163.com>
2026-05-29 17:39:41 +08:00
glorydavid03023
b79f79d9b9 fix(go-models): harden Novita default transport handling (#15350)
## Summary
- Harden `NewNovitaModel` to avoid panics when `http.DefaultTransport`
is a custom non-`*http.Transport` RoundTripper.
- Fallback to a safe transport (`ProxyFromEnvironment`) while preserving
existing pooling/timeout settings.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-05-29 14:28:46 +08:00
bitloi
ea3a5dba11 fix: validate custom model inputs (#15200)
### What problem does this PR solve?

Closes #15199.

The add-custom-model endpoint is routed through
`/api/v1/providers/:provider_name/instances/:instance_name/models`, but
the handler previously trusted `provider_name` and `instance_name` from
the JSON body instead of the path target. A request could therefore hit
one provider/instance URL while operating on a different body
provider/instance.

The same handler only rejected `model_types` when the slice was nil. An
empty array passed validation and reached
`ModelProviderService.AddCustomModel`, where `request.ModelTypes[0]`
could panic.

This PR makes the path provider/instance authoritative, rejects
mismatched body values, rejects missing or empty `model_types`, and adds
a service-level guard so direct service callers cannot hit the same
panic path.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-29 10:15:01 +08:00
web-dev0521
550bdf215c feat(go-api): implement tenant member management (issue #15294) (#15295)
## Summary

Ports the Python `tenant_api` team/member management endpoints to Go,
adding 4 endpoints under `/api/v1/tenants/:tenant_id/`:

- `GET /tenants/:tenant_id/users` — list non-owner members with user
details (owner only)
- `POST /tenants/:tenant_id/users` — invite a user by email; creates
invite-role join record (owner only)
- `DELETE /tenants/:tenant_id/users` — remove a member by `user_id`;
owner can remove anyone, members can remove themselves
- `PATCH /tenants/:tenant_id` — accept a pending invitation,
transitioning role `invite → normal`

Closes #15294
2026-05-29 10:13:09 +08:00
Haruko386
834236a3ec feat[Go]: implement /api/v1/system/status GET (#15348)
### What problem does this PR solve?

As title

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2026-05-29 10:12:12 +08:00
oktofeesh
58eb957c30 fix(go-models): harden JieKouAI driver requests (#15337)
## Summary
- Harden JieKouAI request validation before outbound provider calls
- Force non-streaming and streaming chat methods to use their expected
stream modes
- Make model listing use a bodyless GET and parse model responses
without panics

Closes #14736

---------

Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-05-29 10:09:27 +08:00
nickmopen
e023c165b6 Fix(kb): enforce tenant authorization on UpdateMetadataSetting (#15268) (#15270)
## Summary

Closes #15268.

The `UpdateMetadataSetting` handler at `internal/handler/kb.go:126`
retrieved the authenticated user via `GetUser(c)` but discarded the user
object (`_, errorCode, errorMessage := GetUser(c)`), then forwarded the
caller-supplied `kb_id` straight to the service layer with no ownership
check. Any authenticated user could mutate the `parser_config` /
metadata of any knowledge base in the system by guessing or harvesting a
`kb_id` — a classic IDOR (CWE-284, OWASP A01).

This is the only handler in `internal/handler/kb.go` missing the check;
every sibling (`ListTags`, `ListTagsFromKbs`, `RenameTag`,
`KnowledgeGraph`, `DeleteKnowledgeGraph`, `GetMeta`, `GetBasicInfo`)
already calls `h.kbService.Accessible(kbID, user.ID)`. The same
defensive check on the document preview endpoint was added in PR #14625
— this PR closes the matching gap on the KB metadata endpoint.

---------

Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-05-29 10:08:55 +08:00
glorydavid03023
7fc909acc9 fix(go-models): harden ModelScope default transport handling (#15339)
## Summary
- Harden `NewModelScopeModel` to avoid panics when
`http.DefaultTransport` is a custom non-`*http.Transport` RoundTripper.
- Fallback to a safe transport (`ProxyFromEnvironment`) while preserving
existing pooling/timeout settings.
- Add `TestModelScopeNewModelWithCustomDefaultTransport` regression
coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-05-28 19:41:11 +08:00
web-dev0521
0a7662cf3e feat(go-api): implement GET /api/v1/agents list endpoint (issue #15328) (#15329)
## Summary

Closes: #15328 
- Implements `GET /api/v1/agents` — the agent/canvas listing endpoint
needed to complete the Home dashboard tile in `web/src/pages/home/`.
- Mirrors Python `api/apps/restful_apis/agent_api.py::list_agents`
exactly: tenant-join auth, optional `owner_ids` guard, keyword filter,
pagination, ordering, and `canvas_category` filter (default:
`agent_canvas`).
- **Scope:** read-only list only. Full agent CRUD and canvas runtime are
explicitly out of scope (separate slice of #15240).
2026-05-28 19:40:54 +08:00
web-dev0521
f80ec17fc5 feat(go-api): implement connector (data source) management endpoints (#15274)
## Summary

Ports the connector (data source) management endpoints that power
`web/src/pages/user-setting/data-source/` from Python
(`api/apps/restful_apis/connector_api.py`) to Go. Previously only `GET
/connectors` (list) was implemented in Go; this adds the rest of the
lifecycle.

Closes #15273 (subtask of #15240).

## Endpoints implemented

All under base path `/api/v1` (mirrors the Python routes):

| Method | Path | Description |
|--------|------|-------------|
| POST | `/connectors/{connector_id}/test` | Validate stored credentials
|

`GET /connectors` (list) was already present and is unchanged.

---------

Co-authored-by: Jin Hai <haijin.chn@gmail.com>
2026-05-28 19:40:15 +08:00
web-dev0521
98bc9ca6ac feat: implement Microsoft Teams data source connector (#15193)
### What problem does this PR solve?

Closes #15191.

RAGFlow shipped a Microsoft Teams connector stub
(`common/data_source/teams_connector.py`) whose document-loading methods
all returned `[]`, `Teams._generate()` was a `pass`, and Teams was
commented out of the data-source settings UI. As a result there was no
way to index Teams channel conversations into a knowledge base.

This PR implements the connector end to end on top of Microsoft Graph
(Office365-REST-Python-Client). It shares the MSAL client-credentials
auth shape with the SharePoint connector.

**Backend**

- `common/data_source/teams_connector.py`
- `load_credentials()` now builds the Graph client using an MSAL
client-credentials **token callback** — the form `GraphClient` actually
expects. (The previous stub passed a raw access-token string to
`GraphClient(...)`, which is not how that client is driven.) Token
acquisition is lazy, so credential loading performs no network call.
  - `validate_connector_settings()` lists teams via Graph.
- `load_from_checkpoint()` is now a generator that pages teams →
channels → messages, flattens each top-level post together with its
replies into one blob-based `Document` (`extension` `.txt`/`.html`,
`blob`, `size_bytes`, `doc_updated_at`). Incremental syncs are bounded
by message `lastModifiedDateTime` (falling back to `createdDateTime`).
Per-message errors surface as `ConnectorFailure` instead of aborting the
run.
- `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument`
batches and the checkpoint helpers return proper `TeamsCheckpoint`s.
- ACL → `ExternalAccess` mapping is intentionally left best-effort
(`load_from_checkpoint_with_perm_sync` delegates to the standard load)
because the sync pipeline does not currently persist `ExternalAccess`.
- `rag/svr/sync_data_source.py`
- Implemented `Teams._generate()` using the existing
`CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google
Drive), supporting full reindex and incremental polling from
`poll_range_start`.
- `TeamsConnector` is already exported from
`common/data_source/__init__.py`.

**Frontend (`web/`)**

- Enabled the `TEAMS` data-source enum and added its form fields
(`tenant_id`, `client_id`, `client_secret`), default values, display
metadata, and a Teams icon.
- Added `teamsDescription` / `teamsTenantIdTip` to `en.ts` and `zh.ts`.

**Tests**

- `test/unit_test/data_source/test_teams_connector_unit.py`: mock-based
unit tests covering credential loading (incomplete creds raise, happy
path sets the Graph client, fetch-without-creds raises), post/reply
flattening (incl. the HTML vs text extension), incremental
`lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass;
`ruff check` is clean.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 17:10:38 +08:00
glorydavid03023
b7d88f0b09 fix(go-models): harden Voyage default transport handling (#15341)
## Summary
- Harden `NewVoyageModel` to avoid panics when `http.DefaultTransport`
is a custom non-`*http.Transport` RoundTripper.
- Fallback to a safe transport (`ProxyFromEnvironment`) while preserving
existing pooling/timeout settings.
- Add `TestVoyageNewModelWithCustomDefaultTransport` regression
coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-28 16:46:58 +08:00
glorydavid03023
ff9aa4e2c7 fix(go-models): harden LongCat default transport handling (#15340)
## Summary
- Harden `NewLongCatModel` to avoid panics when `http.DefaultTransport`
is a custom non-`*http.Transport` RoundTripper.
- Fallback to a safe transport (`ProxyFromEnvironment`) while preserving
existing pooling/timeout settings.
- Add `TestLongCatNewModelWithCustomDefaultTransport` regression
coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-28 16:45:59 +08:00