Commit Graph

1517 Commits

Author SHA1 Message Date
Jack
3eff41361b fix: prevent None values in auto-metadata from causing KeyError (#15842)
## Problem

When users configure auto-metadata for a dataset, parsing crashes with:

```
KeyError: 'properties' in gen_metadata → schema["properties"]
```

## Root Cause

Pydantic `AutoMetadataField` defaults `enum` and `description` to `None`
when the frontend omits these fields:

```python
class AutoMetadataField(Base):
    enum: Annotated[list[str] | None, Field(default=None)]
    description: Annotated[str | None, Field(default=None)]
```

These `None` values propagate through the call chain and cause two
crashes:
2026-06-09 19:10:48 +08:00
euvre
f97d6396b4 fix: BaiduYiyan API key validation fails in set_api_key (#15828)
### What problem does this PR solve?

When setting the API key for the BaiduYiyan provider, all model
validations fail with the error "Fail to access model using this api
key. No valid response received".

**Root cause:**

1. `BaiduYiyanChat` in `rag/llm/chat_model.py` does not override
`async_chat_streamly()`. The `verify_api_key()` function uses
`mdl.async_chat_streamly()` to validate, but `BaiduYiyanChat` inherits
`Base.async_chat_streamly()` which uses the OpenAI client, not the Baidu
Qianfan SDK (qianfan). Since BaiduYiyan has no OpenAI-compatible
base_url, validation always fails.

2. `verify_api_key()` in `provider_api_service.py` does not format the
raw API key string into the JSON format (`{"yiyan_ak": "...",
"yiyan_sk": "..."}`) that `BaiduYiyanChat.__init__()` expects via
`json.loads(key)`.

**Fix:**

1. Add `async_chat_streamly()` method to `BaiduYiyanChat` using the
qianfan SDK, consistent with the existing `chat_streamly()` method.
2. Add BaiduYiyan API key formatting in `provider_api_service.py`
`verify_api_key()` to match the format expected by
`BaiduYiyanChat.__init__()`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2026-06-09 19:05:58 +08:00
buua436
7b8d6f34b3 fix: force image parser json output (#15847)
### What problem does this PR solve?
Force image parser runtime output format to JSON so downstream chunking
reads OCR results from the JSON output and image parser chunks can be
displayed.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-06-09 19:02:37 +08:00
buua436
c1496ffd43 fix: propagate memory tenant id in task collect (#15837)
### What problem does this PR solve?
Propagate `tenant_id` from memory task messages into task collection so
refactored task execution can build a valid context.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-09 17:47:48 +08:00
Jonathan Chang
c586292993 feat: Implement checkpoint/resume support for GraphRAG community extraction and entity resolution (#15523)
## Summary

This PR adds checkpoint/resume support for the GraphRAG
`extract_community` and `resolve_entities` stages.

The implementation stores successful intermediate results in the
document store so interrupted ingestion can resume without repeating
already-completed LLM work. Checkpoints are loaded before each stage,
reused when available, saved after successful batch/community
processing, and cleaned up after the stage completes successfully.
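
The load, reuse, save, and clean-up cycle described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `CheckpointStore`, `run_stage`, and `summarize` are hypothetical names, and an in-memory dict stands in for the document store.

```python
from typing import Callable, Dict, List


class CheckpointStore:
    """Hypothetical in-memory stand-in for the document store."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, str]] = {}

    def load(self, stage: str) -> Dict[str, str]:
        return dict(self._data.get(stage, {}))

    def save(self, stage: str, key: str, result: str) -> None:
        self._data.setdefault(stage, {})[key] = result

    def clear(self, stage: str) -> None:
        self._data.pop(stage, None)


def run_stage(stage: str, items: List[str], work: Callable[[str], str],
              store: CheckpointStore) -> Dict[str, str]:
    done = store.load(stage)              # load checkpoints before the stage
    results: Dict[str, str] = {}
    for item in items:
        if item in done:                  # reuse finished work, skip the LLM call
            results[item] = done[item]
            continue
        result = work(item)               # may raise and interrupt the stage
        store.save(stage, item, result)   # save only after success
        results[item] = result
    store.clear(stage)                    # clean up once the whole stage succeeds
    return results


# Usage: the first run is interrupted at "c2"; the second run resumes.
calls: List[str] = []
fail_on = {"c2"}


def summarize(community: str) -> str:
    calls.append(community)
    if community in fail_on:
        raise RuntimeError("simulated interruption")
    return f"report for {community}"


store = CheckpointStore()
try:
    run_stage("extract_community", ["c1", "c2", "c3"], summarize, store)
except RuntimeError:
    pass
fail_on.clear()
reports = run_stage("extract_community", ["c1", "c2", "c3"], summarize, store)
print(calls)  # "c1" is processed only once across both runs
```

Saving after each successful item, rather than once at the end, is what lets the second run skip `c1`.
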
## Related Issue
Closes: #15518
## Change Type
- [x] Feature
- [x] Bug fix
- [x] Test
- [ ] Refactor
- [ ] Documentation
- [ ] Breaking change
## Real Behavior Proof

Validation commands run locally:

```bash
uv run python -m py_compile \
  rag/graphrag/checkpoints.py \
  rag/graphrag/general/community_reports_extractor.py \
  rag/graphrag/entity_resolution.py \
  rag/graphrag/general/index.py \
  test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:

```text
Passed
```

```bash
uv run pytest test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:

```text
4 passed
```

```bash
uv run pytest \
  test/unit_test/rag/graphrag/test_phase_markers.py \
  test/unit_test/rag/graphrag/test_graphrag_utils.py \
  test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:

```text
95 passed
```

```bash
git diff --check
```
Result:

```text
Passed
```

## Checklist

- [x] Implemented checkpoint/resume support for `extract_community`.
- [x] Implemented checkpoint/resume support for `resolve_entities`.
- [x] Avoided touching unrelated API behavior.
- [x] Added unit tests for the new checkpoint helper logic.
- [x] Verified Python syntax compilation.
- [x] Ran related GraphRAG unit tests successfully.
- [x] Ran `git diff --check`.
- [ ] Ran full project test suite.

---------

Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-06-09 15:34:47 +08:00
Wang Qi
93e4f6bc09 Fix: Add bge as embedding (#15784)
Fix: Add bge as embedding
2026-06-09 09:31:24 +08:00
Yash Raj Pandey
f2aadd3871 Fix: is_english() returns False for any list argument (broken language detection) (#15489)
### What problem does this PR solve?

`is_english()` in `rag/nlp/__init__.py` compiles a **single-character**
regex class and `fullmatch`es it against each item:

```python
pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]")   # no quantifier
...
eng = sum(1 for t in texts if pattern.fullmatch(t.strip()))
```

For a **string** argument the text is first split into single characters
(`texts = list(texts)`), so each `fullmatch` sees one character and
works. But for a **list** argument each item is a whole multi-character
string, and `fullmatch` of a one-character pattern against a
multi-character string always fails — so `is_english()` returns `False`
for **any** list, regardless of content.

```python
is_english("This is English")                              # True   (ok)
is_english(["The quick brown fox jumps.", "Hello world."]) # False  (bug — should be True)
is_english(["这是中文。"])                                    # False  (right answer, wrong reason)
```

Many call sites pass lists and were therefore silently always-`False`,
e.g.:

- `rag/llm/chat_model.py:1088`, `rag/llm/cv_model.py:168,1155` —
`is_english([ans])` when an answer is truncated at `max_tokens`, so an
English reply gets the Chinese "······由于长度的原因,回答被截断了,要继续吗?" continuation
suffix instead of the English one.
- `rag/app/book.py` — `remove_contents_table(...,
eng=is_english([...sections...]))`, so English books have their contents
table stripped in Chinese mode.
- `common/doc_store/es_conn_base.py:339`,
`rag/utils/opensearch_conn.py:733` — `is_english(txt.split())` in
highlight handling.
- plus `rag/app/qa.py`, `rag/flow/parser/utils.py`,
`common/doc_store/infinity_conn_base.py`.

### Fix

Add a `+` quantifier so an all-English multi-character item matches:

```python
pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]+")
```

The string path is unchanged (single characters still match) and
non-English lists still return `False`. Adds
`test/unit_test/rag/test_is_english.py`; the two list cases fail before
this change and pass after.
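
The difference between the two patterns can be checked in isolation. The sketch below only exercises `re.fullmatch` with the character class quoted above; it does not reproduce the rest of `is_english()`.

```python
import re

CHAR_CLASS = r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]"
single = re.compile(CHAR_CLASS)        # old pattern: exactly one character
multi = re.compile(CHAR_CLASS + "+")   # fixed pattern: one or more characters

item = "Hello world."
single_hit = bool(single.fullmatch(item))   # whole item vs. a one-character pattern
multi_hit = bool(multi.fullmatch(item))
chinese_hit = bool(multi.fullmatch("这是中文。"))
char_hits = (bool(single.fullmatch("H")), bool(multi.fullmatch("H")))

print(single_hit, multi_hit, chinese_hit, char_hits)
# False True False (True, True)
```
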

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Used the Claude CLI while working on this.
2026-06-08 20:25:23 +08:00
Lynn
b9f06e6095 Feat: model list (#15774)
### What problem does this PR solve?

Support model list for VolcEngine.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-08 20:18:00 +08:00
Wang Qi
c5d0060e0b Delete not supported model providers list (#15783)
Delete not supported model providers list
2026-06-08 20:06:03 +08:00
Wang Qi
8e4fba6cd2 Fix OpenRouter key JSONDecodeError (#15776)
Fix OpenRouter key JSONDecodeError
2026-06-08 19:19:10 +08:00
euvre
d9a04ef702 fix: support auto mode in table parser document metadata aggregation (#15780)
### What problem does this PR solve?

Table parser metadata aggregation previously only ran when
`table_column_mode` was set to `manual`. In auto mode (default), all
columns default to `"both"` role, meaning they should also be aggregated
into document-level metadata for UI/chat filters. Additionally, the task
snapshot could be stale — `table_column_names` are written to KB
`parser_config` during `chunk()` but the task may have been created
before that.

Changes:
- Renames `aggregate_table_manual_doc_metadata` →
`aggregate_table_doc_metadata`
- Supports both `"manual"` and `"auto"` `table_column_mode` (defaults to
`"auto"`)
- Reloads `table_column_names` from KB DB when missing from task
snapshot
- Removes the manual-only guard in `task_executor` and refactored
`post_processor`
- Updates all tests with new function name and adds auto mode test cases

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-08 19:08:23 +08:00
euvre
2c64febc93 feat: add ModelMeta implementations for Xinference, LocalAI, BaiduYiyan, and Tencent Cloud (#15752)
### What problem does this PR solve?

This PR adds `ModelMeta` implementations for four additional LLM/RAG
ecosystem platforms, building on the ModelMeta infrastructure introduced
in #15711.

Currently, only `Ollama` and `VolcEngine` have `ModelMeta` classes that
enable remote model list fetching. This PR extends that support to four
more platforms.

### Changes

Added four new `ModelMeta` subclasses in `rag/llm/model_meta.py`:

| Platform | `_FACTORY_NAME` | Has model list | Has full model info | Approach |
|----------|-----------------|----------------|---------------------|----------|
| **Xinference** | `"Xinference"` |  |  | Parses `model_type` and `context_length` from `/v1/models` response. Maps 6 model types (LLM/embedding/rerank/image/TTS/speech2text). |
| **LocalAI** | `"LocalAI"` |  |  | Uses Ollama-compatible `GET /api/tags` + `POST /api/show` endpoints. Returns capabilities (completion/embedding/vision/tools/thinking) and `general.context_length`. |
| **BaiduYiyan** | `"BaiduYiyan"` |  |  | Uses Qianfan SDK static model catalog + `get_model_info()` for `max_input_tokens`. Returns 60 models (56 chat + 4 embedding) with real context lengths. |
| **Tencent Cloud** | `"Tencent Cloud"` |  |  | `NotImplementedError` — uses SDK-based SID/SK HMAC signing, no model list REST API available. |

All classes are automatically discovered and registered via the existing
`__init__.py` mechanism — no additional configuration needed.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-08 19:05:25 +08:00
天海蒼灆
17f27b9df2 fix(browser): show resolved variables in workflow run log input (#15325)
### What problem does this PR solve?

Browser parsed sys.query from prompts but never called set_input_value,
so node_finished inputs displayed null in the agent orchestration run
log.
Additionally, Browser’s tenant-model path could trigger unsupported
structured-output modes (response_format/tool_choice) for some
OpenAI-compatible providers (notably DeepSeek thinking models), causing
step failures.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-08 18:12:56 +08:00
Rintaro
453ade288c fix(opensearch): keep "id" in _source on insert so document metadata isn't empty (#15473)
### What problem does this PR solve?

Follow-up to #15393. After #15393 fixed the OpenSearch `search()`
signature and
the doc-meta mapping, document metadata still renders as **"0 fields"**
for every
document on the OpenSearch backend (`DOC_ENGINE=opensearch`).

**Root cause.** `OSConnection.insert()` pops `id` out of the document
before indexing:

    meta_id = d_copy.pop("id", "")  # id used as _id, then DROPPED from _source

so the stored `_source` never contains an `id` field. But the doc-meta
read path
filters and sorts on that field:

- `DocMetadataService.get_metadata_for_documents()` builds
`condition = {"kb_id": kb_id, "id": doc_ids}` -> `OSConnection.search()`
emits
  `Q("terms", id=doc_ids)` (a term query on the `id` field), and
- `_search_metadata()` sorts with `order_by.asc("id")`.

With `id` absent from `_source`, the terms filter matches nothing, so
`get_metadata_for_documents()` returns an empty map and the UI shows "0
fields"
-- even though the metadata was written correctly (it is visible via a
kb_id-only query).

`ESConnection.insert()` already keeps `id` (`d_copy.get("id", "")`) with
the
comment *"also keep 'id' as a regular field for sorting"*. This is a
plain
OpenSearch-only divergence (`pop()` vs `get()`).

### Fix

Mirror Elasticsearch: use `get("id")` instead of `pop("id")` so `id`
survives in
`_source`. The doc-meta mapping already declares `id` as `keyword`, so
the field
is searchable/sortable once populated.
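
The `pop()` versus `get()` divergence is easy to reproduce with a plain dict. The helper below is a hypothetical stand-in for the relevant lines of `insert()`, not the actual method.

```python
def split_id(doc, keep_id):
    """Return (_id, _source) the way the two backends derive them."""
    d_copy = dict(doc)
    if keep_id:
        meta_id = d_copy.get("id", "")   # Elasticsearch style: id stays in _source
    else:
        meta_id = d_copy.pop("id", "")   # old OpenSearch style: id is removed
    return meta_id, d_copy


doc = {"id": "doc-1", "kb_id": "kb-9", "meta_fields": {"author": "x"}}

popped_id, popped_source = split_id(doc, keep_id=False)
kept_id, kept_source = split_id(doc, keep_id=True)

print("id" in popped_source)  # False: a terms filter on id cannot match
print("id" in kept_source)    # True
```
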

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)

### Affected backends
OpenSearch only. Elasticsearch already keeps `id`; Infinity / OceanBase
unaffected.

### How to reproduce
1. `DOC_ENGINE=opensearch`, create a KB, upload/parse a document, set
metadata.
2. Open the document list -> every document shows "0 fields" (the
metadata exists
in the `ragflow_doc_meta_*` index but its `_source` has no `id` field).

### Risk & backward compatibility
`insert()` is shared with the main chunk index; keeping `id` in
`_source` brings
OpenSearch in line with Elasticsearch (which already does this), so it
is parity,
not new behavior. No default / ES / Infinity / OceanBase behavior
change.

Note: affects new inserts only. Existing `ragflow_doc_meta_*` indices
created
before this change have no `id` in `_source`; re-sync metadata, or
backfill once
with `_update_by_query` (`ctx._source.id = ctx._id`).
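
A one-off backfill request might look like the sketch below. It only builds and prints the request. The `must_not`/`exists` clause that limits the update to documents still missing `id` is an assumption added here, not something the PR specifies.

```python
import json

INDEX_PATTERN = "ragflow_doc_meta_*"

backfill_body = {
    # Painless script from the note above: copy the document _id into _source.id
    "script": {"lang": "painless", "source": "ctx._source.id = ctx._id"},
    # Assumption: touch only documents that do not have the field yet
    "query": {"bool": {"must_not": [{"exists": {"field": "id"}}]}},
}

print(f"POST /{INDEX_PATTERN}/_update_by_query")
print(json.dumps(backfill_body, indent=2))
```
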

### Test plan
- [ ] OpenSearch: after the fix the document list shows correct metadata
field
      counts (not "0 fields"); metadata filter/sort by id works.
- [ ] Elasticsearch regression: unchanged.
2026-06-08 17:31:04 +08:00
seekmistar01
68b9360536 fix(nlp): tokenize content_tks by whitespace in FulltextQueryer.paragraph (#15721)
## Summary
Closes #15720

`FulltextQueryer.paragraph` normalized its `content_tks` token string
with `[c.strip() for c in content_tks.strip() ...]`, which iterates the
string **character by character** — `"machine learning model"` becomes
20 single characters instead of 3 tokens. Those single chars are fed to
`tw.weights(..., preprocess=False)`, producing meaningless term weights
and a garbage `MatchTextExpr`.

`paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature),
so tag retrieval/scoring is silently broken for tag-enabled knowledge
bases. Every other method in this file tokenizes with `.split()` — this
is a `.strip()`-vs-`.split()` typo.
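
The effect of the typo can be reproduced on its own. The trailing `if c.strip()` filter is an assumption standing in for the part of the comprehension elided above.

```python
content_tks = "machine learning model"

# Buggy form: iterating a string yields characters, not tokens.
per_char = [c.strip() for c in content_tks.strip() if c.strip()]

# Fixed form: split() yields whitespace-separated tokens.
per_token = [c.strip() for c in content_tks.split() if c.strip()]

print(len(per_char))   # 20
print(per_token)       # ['machine', 'learning', 'model']
```
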

## Change
- `rag/nlp/query.py` — change `content_tks.strip()` to
`content_tks.split()` in the `paragraph` token-normalization line.

## Why it's safe
- The caller passes a space-separated token string; `.split()` recovers
the real tokens, matching the contract of `tw.weights` and the
`.split()` tokenization used by the sibling methods (`similarity`,
`question`).
- No behavior depends on the per-character expansion.

## Verification
- `python -m py_compile rag/nlp/query.py` — OK.
- Demonstrated: `"machine learning model"` → 20 single-character entries
before, 3 real tokens after. No test references `paragraph`.

Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:16:30 +08:00
Wang Qi
4bbd59823a Add OpenRouter OpenAI API compatible list models (#15764)
Add OpenRouter OpenAI API compatible list models
1. openrouter
2. OpenAI API compatible
3. VLLM
4. LM Studio

2026-06-08 16:42:17 +08:00
Danut Matei
e2b0da9eea fix(opensearch): keep the BM25 leg in hybrid search (#15760)
### What problem does this PR solve?

Fixes the OpenSearch side of #10747: hybrid search drops the keyword
(BM25) leg and
ends up doing plain vector search.

When a search has both a text and a vector leg, `OSConnection.search()`
throws the text
query away:

    del q["query"]
    q["query"] = {"knn": knn_query}

The text clause only stays on as a filter inside the knn query, so it
narrows the
candidate set but doesn't count towards scoring. So hybrid search on
OpenSearch behaves
like plain vector search, unlike the Elasticsearch backend.

What I changed:

- when both legs are present, send a real hybrid query
`{"hybrid": {"queries": [bm25, {"knn": ...}]}}` and let a
normalization-processor
  search pipeline score and combine the two legs
- only the actual filters (kb_id, available_int, ...) go in the knn
filter, not the
  text must clause
- create the pipeline on startup if it's missing, so there's no separate
provisioning
step. name and weights can be set under `os:` in service_conf.yaml, or
via
`OS_HYBRID_PIPELINE`; defaults are `ragflow_hybrid_pipeline` and `[0.5,
0.5]`
- normalization-processor needs OpenSearch 2.10+. on older clusters, or
when the
pipeline can't be created, log a warning and fall back to vector-only
instead of
  pointing at a pipeline that doesn't exist
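
The query shapes listed above can be sketched as plain request bodies. This is an illustration, not the PR's code: `build_search` and `pipeline_definition` are hypothetical helpers, `q_768_vec` is an illustrative field name, and the `min_max` / `arithmetic_mean` techniques are assumptions, since only the weights are stated here.

```python
from typing import Any, Dict, List, Optional, Tuple

PIPELINE_NAME = "ragflow_hybrid_pipeline"


def pipeline_definition(weights: List[float]) -> Dict[str, Any]:
    """Body for PUT /_search/pipeline/<name> (needs OpenSearch 2.10+)."""
    return {
        "description": "Normalize and combine BM25 and knn scores",
        "phase_results_processors": [{
            "normalization-processor": {
                "normalization": {"technique": "min_max"},  # assumed technique
                "combination": {
                    "technique": "arithmetic_mean",         # assumed technique
                    "parameters": {"weights": list(weights)},
                },
            }
        }],
    }


def build_search(text_query: Optional[Dict[str, Any]], knn_field: str,
                 vector: Optional[List[float]], k: int,
                 filters: List[Dict[str, Any]],
                 pipeline_ready: bool) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (request body, query-string params) for one search."""
    knn_clause = None
    if vector is not None:
        inner: Dict[str, Any] = {"vector": vector, "k": k}
        if filters:  # only real filters go here, never the text clause
            inner["filter"] = {"bool": {"filter": filters}}
        knn_clause = {"knn": {knn_field: inner}}

    if text_query is not None and knn_clause is not None:
        if pipeline_ready:
            body = {"query": {"hybrid": {"queries": [text_query, knn_clause]}}}
            return body, {"search_pipeline": PIPELINE_NAME}
        return {"query": knn_clause}, {}        # vector-only fallback
    if knn_clause is not None:
        return {"query": knn_clause}, {}
    if text_query is None:
        raise ValueError("need a text query, a vector, or both")
    return {"query": {"bool": {"must": [text_query], "filter": filters}}}, {}


bm25 = {"query_string": {"query": "keyword", "fields": ["content_ltks"]}}
filters = [{"terms": {"kb_id": ["kb-1"]}}]
hybrid_body, hybrid_params = build_search(bm25, "q_768_vec", [0.1, 0.2], 3, filters, True)
fallback_body, fallback_params = build_search(bm25, "q_768_vec", [0.1, 0.2], 3, filters, False)
```

The sketch assumes the caller has already attached any filters it wants to the BM25 clause; it only decides what goes into the knn leg.
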

This is only the hybrid-search fix; `create_doc_meta_idx` is already on
main.

Testing (there's no OpenSearch path in CI): added a unit test
(`test/unit_test/rag/utils/test_opensearch_hybrid_search.py`, no
services needed) that
checks the query built in each case — hybrid + pipeline param for
text+vector, plain knn
for vector-only, plain bool for text-only, the knn filter never carrying
the text
query_string, and the vector-only fallback when the pipeline isn't
available. Also ran
it against a real OpenSearch 2.19.1 container with a doc that matches
the keyword but
sits outside the knn top-k: pure knn returns `['D1','D2','D5']` (keyword
doc missing),
the hybrid query returns `['A','D1','D2','D5']` (keyword doc present).

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: Danut Matei <matei.danut.dm@gmail.com>
2026-06-08 16:17:47 +08:00
buua436
6bf7056422 feat: add placeholder model metas (#15753)
### What problem does this PR solve?

add placeholder model metas

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-08 14:54:59 +08:00
cleanjunc
38f9ea5fec fix(rerank): normalize reranker scores onto a single scale before hybrid blend (#15429)
### What problem does this PR solve?

Closes #15428

The hybrid score in `rag/nlp/search.py` (`rerank_by_model`) blends
reranker similarity with token similarity on a fixed `[0, 1]` scale:

```python
return tkweight * np.array(tksim) + vtweight * vtsim + rank_fea  # tkweight=0.3, vtweight=0.7
```

The reranker implementations did not agree on that scale. Only three of
roughly 17 providers normalized their output, and `NvidiaRerank`
returned raw, unbounded logits. Weighted at `0.7`, a negative logit
could push a genuinely relevant chunk below pure keyword matches, and
its magnitude swamped `tksim`, which lives in `[0, 1]`. The practical
effect was that the same query produced differently scaled scores
depending on the configured reranker, and logit based providers degraded
retrieval quality instead of improving it.

This PR enforces a single scoring contract in one place:

- `Base.similarity` is now the only public entry point. It
short-circuits empty input and guarantees a normalized result. Each
provider implements its raw scoring in `_compute_rank`, which removes
sixteen duplicated empty input guards and the three scattered
normalization calls.
- Normalization is range aware. Providers that already return calibrated
`[0, 1]` relevance scores (Cohere, Jina, Voyage, and others) keep their
absolute magnitudes, so `similarity_threshold` filtering and the
reported `vector_similarity` stay meaningful. Only out-of-range output
such as NVIDIA logits is min-max rescaled into `[0, 1]`.
- The twelve leftover `[DEBUG ...]` prints in `rerank_by_model`,
introduced in #14231, are removed. They ran on every retrieval, added
per chunk overhead, and leaked queries, keywords, and document content
to stdout and logs.
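
A range-aware normalizer along these lines could look like the sketch below. `normalize_scores` is a hypothetical helper, and mapping a flat out-of-range batch to `0.5` is an assumption; the value the PR actually uses is not stated here.

```python
from typing import List, Sequence


def normalize_scores(raw: Sequence[float]) -> List[float]:
    scores = [float(s) for s in raw]
    if not scores:                       # short-circuit empty input
        return []
    lo, hi = min(scores), max(scores)
    if lo >= 0.0 and hi <= 1.0:          # already calibrated: keep magnitudes
        return scores
    if hi == lo:                         # flat out-of-range batch (assumed value)
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]   # min-max into [0, 1]


def blend(tksim: Sequence[float], vtsim: Sequence[float],
          tkweight: float = 0.3, vtweight: float = 0.7) -> List[float]:
    return [tkweight * t + vtweight * v for t, v in zip(tksim, vtsim)]


calibrated = normalize_scores([0.2, 0.9])         # unchanged
rescaled = normalize_scores([-4.0, 0.0, 4.0])     # logits mapped into [0, 1]

# With raw logits, a relevant chunk with a negative logit scores below zero;
# after normalization every blended score stays within [0, 1].
print(blend([0.8, 0.1, 0.4], [-4.0, 0.0, 4.0]))
print(blend([0.8, 0.1, 0.4], rescaled))
```
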

A new regression suite in
`test/unit_test/rag/llm/test_rerank_normalization.py` covers logit
rescaling (positive, negative, and flat batches), preservation of
already calibrated scores, ordering, empty input handling, and the per
provider HTTP path. It also asserts that no provider overrides
`similarity()`, so the contract cannot silently drift.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-08 11:53:22 +08:00
cleanjunc
91983106f2 fix(retrieval): keep rerank window aligned to page_size for deep pagination (#15434)
### What problem does this PR solve?

Closes #15433

Reranked retrieval drops results and returns short pages once pagination
crosses the first candidate block, for the common page sizes 10 and 30.

In `rag/nlp/search.py`, the candidate window (`RERANK_LIMIT`) is rounded
up to a multiple of `page_size` to keep block based pagination aligned,
and then clamped back to 64:

```python
RERANK_LIMIT = math.ceil(64 / page_size) * page_size if page_size > 1 else 1  # e.g. 70 for page_size=10
RERANK_LIMIT = max(30, RERANK_LIMIT)
if rerank_mdl and top > 0:
    RERANK_LIMIT = min(RERANK_LIMIT, top, 64)  # clamps back to 64, breaking the multiple
```

`RERANK_LIMIT` is used both as the backend block size (`page =
global_offset // RERANK_LIMIT`) and as the modulus that slices a page
out of a reranked block (`begin = global_offset % RERANK_LIMIT`). When
it stops being a multiple of `page_size`, the block that gets fetched
and the slice taken from it no longer agree. With `page_size=10` and
`top=1024`, page 7 returns only 4 of 10 results and the head of the next
block is never shown on any page. This happens whenever the result set
spans more than one block, which is the default.

**Fix**

The window math is moved into a small reusable helper,
`Dealer._rerank_window`, which:

- targets a pool of about 64 candidates,
- bounds it by `top` when a reranker is active, and
- always rounds to a whole number of pages, so the window stays an exact
multiple of `page_size`.

The call site becomes a single line, and the alignment invariant now
lives in one documented place. Behavior is unchanged on every path that
was already aligned (the non reranked path and any `top` that already
produced a page multiple).
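
The alignment invariant can be sketched as follows. `rerank_window` is a hypothetical reimplementation of the three rules above, not the body of `Dealer._rerank_window`, and `results_on_page` assumes each fetched block is full.

```python
import math


def rerank_window(page_size: int, top: int, has_reranker: bool,
                  target: int = 64) -> int:
    page_size = max(1, page_size)
    pool = target                                # aim for about 64 candidates
    if has_reranker and top > 0:
        pool = min(pool, top)                    # bound by top when reranking
    pages = max(1, math.ceil(pool / page_size))
    return pages * page_size                     # always a whole number of pages


def results_on_page(page_no: int, page_size: int, window: int) -> int:
    """How many results a page can slice out of its (full) block."""
    offset = (page_no - 1) * page_size
    begin = offset % window
    end = min(begin + page_size, window)         # a block holds `window` items
    return end - begin


new_10 = rerank_window(10, 1024, True)
new_30 = rerank_window(30, 1024, True)
print(new_10, new_30)                            # 70 90
print(results_on_page(7, 10, 64))                # 4  (old clamped window)
print(results_on_page(7, 10, new_10))            # 10 (aligned window)
```
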

**Verification**

A simulation of the full retrieval path (per block rerank, similarity
threshold filter, and the exact `page // window` and `offset % window`
math) confirms the fix loses nothing where the old code lost real
results:

```
ps=10 top=1024:  new window=70  dropped_valid=0   |  old window=64  dropped_valid=16
ps=30 top=1024:  new window=90  dropped_valid=0   |  old window=64  dropped_valid=66
```

New unit tests in `test/unit_test/rag/test_search_pagination.py` cover
the alignment invariant, cross block pagination (every candidate
surfaced once, in order, no gaps, no short interior pages), the reported
regression, and parity with the old window on the previously correct
paths. All 114 cases pass and `ruff check` is clean.

Fixes the reranked deep pagination data loss described above.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-08 11:53:12 +08:00
qinling0210
c960dc2a4c Refine handling of POST /api/v1/datasets/search in GO (#15583)
### What problem does this PR solve?

Refine handling of POST /api/v1/datasets/search in GO

### Type of change

- [x] Refactoring
2026-06-08 11:49:37 +08:00
Lynn
b05d5a5228 Feat: get model list from remote (#15711)
### What problem does this PR solve?

Feat:
- Get model list from remote provider. 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-08 11:02:40 +08:00
web-dev0521
1d7e45115b feat(connectors): add Salesforce CRM data source connector (#15462)
### What problem does this PR solve?

Closes #15461.

RAGFlow had no way to ingest Salesforce CRM data, so support / sales
teams couldn't ground responses on live Accounts, Contacts,
Opportunities, Cases, or Knowledge articles. This adds a first-class
Salesforce data source connector that authenticates against a Connected
App via OAuth 2.0 client-credentials, queries selected SObjects via
SOQL, and turns each record into an indexable document with incremental
sync.

**Highlights**
- `common/data_source/salesforce_connector.py`: new
  `SalesforceConnector` (`CheckpointedConnectorWithPermSync` +
  `SlimConnectorWithPermSync`).
  - OAuth 2.0 client-credentials flow; canonical `instance_url` from the
    token response so multi-pod orgs route correctly.
  - Per-object `SystemModstamp` cursor stored in
    `SalesforceCheckpoint.cursors` — a failure mid-object doesn't rewind
    sibling objects, and re-syncs only fetch changed rows.
  - Deterministic record-to-text formatter (sorted keys) so SOQL field
    reordering on the server doesn't mark every row "changed" on each poll.
  - `_get_json` raises on non-2xx so 429 / 5xx never silently advance the
    checkpoint past missing data.
  - `Knowledge__kav` is in the default object set but is skipped silently
    when the org doesn't have Salesforce Knowledge enabled (404 on
    describe).
  - Slim-doc IDs are scoped as `<Object>/<Id>` so prune deletes can't
    collide across object types.
- `common/constants.py`, `common/data_source/config.py`,
  `common/data_source/__init__.py`: register `salesforce` in `FileSource`
  / `DocumentSource` and export `SalesforceConnector`.
- `rag/svr/sync_data_source.py`: new `Salesforce(SyncBase)` class routed
  through `load_from_checkpoint` (poll_source would re-walk every object
  each run) and added to `func_factory`.
- Frontend:
  - `web/src/pages/user-setting/data-source/constant/index.tsx`: new
    `DataSourceKey.SALESFORCE`, form fields (instance URL, client ID/secret,
    objects, api_version, batch size), `syncDeletedFiles` capability,
    default form values, and tile entry with the new icon.
  - `web/src/locales/{en,zh}.ts`: description + per-field tooltips.
  - `web/src/assets/svg/data-source/salesforce.svg`: 48x48 brand-style
    icon to match the other Microsoft / cloud tiles.
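
The sorted-keys formatter mentioned in the highlights can be illustrated with a small sketch. `record_to_text` and `fingerprint` are hypothetical helpers; the connector's real output format is not shown here.

```python
import hashlib
from typing import Any, Dict


def record_to_text(record: Dict[str, Any]) -> str:
    """Render a record with keys in sorted order so field order never matters."""
    return "\n".join(f"{key}: {record[key]}" for key in sorted(record))


def fingerprint(record: Dict[str, Any]) -> str:
    return hashlib.sha256(record_to_text(record).encode("utf-8")).hexdigest()


# The same row, returned by the server with fields in a different order.
first_poll = {"Name": "Acme", "Id": "001", "Industry": "Energy"}
second_poll = {"Industry": "Energy", "Id": "001", "Name": "Acme"}

print(record_to_text(first_poll))
print(fingerprint(first_poll) == fingerprint(second_poll))  # True
```
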

**Verification**
- `npm run build` (vite + esbuild) passes (1m 26s).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-05 13:24:36 +08:00
Lynn
794c1f4b25 Fix: volc engine and other json key factories (#15653)
### What problem does this PR solve?

Fix:
- VolcEngine adapt to new api_key format
- Save dict api_key as json

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-05 09:45:44 +08:00
web-dev0521
98f2a2e60b feat(connectors): add Azure Blob Storage data source connector (#15466)
### What problem does this PR solve?

Closes #15465.

RAGFlow supports S3, Google Cloud Storage, R2, and OCI as data sources
but not Azure Blob Storage, leaving Azure users without a way to index
container objects into a knowledge base. This adds a first-class Azure
Blob Storage data-source connector — distinct from RAGFlow's existing
Azure storage *backends* (`rag/utils/azure_sas_conn.py`,
`rag/utils/azure_spn_conn.py`) which store RAGFlow's own files.

**Highlights**
- `common/data_source/azure_blob_connector.py`: new `AzureBlobConnector`
  (`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`).
  - Uses the existing `azure-storage-blob` dependency (already in
    `pyproject.toml`).
  - Three auth modes, tried in order of precedence:
    1. **Account key** — `account_name` + `account_key` + `container_name`.
    2. **Connection string** — `connection_string` + `container_name`.
    3. **SAS token** — `container_url` + `sas_token` (same shape as
       `RAGFlowAzureSasBlob`).
  - ETag fingerprint stored per blob in `AzureBlobCheckpoint.etags` —
    unchanged blobs (same ETag as last run) are skipped without a download.
    Only new/modified blobs are fetched.
  - Optional `prefix` scopes indexing to a virtual folder.
  - `validate_connector_settings()` probes `get_container_properties()`
    and maps `AuthenticationFailed / 403 / ContainerNotFound` to typed
    connector exceptions.
  - Slim-doc IDs are blob names so prune reconciles correctly.
- `common/constants.py`, `common/data_source/config.py`,
  `common/data_source/__init__.py`: register `azure_blob` in `FileSource`
  / `DocumentSource` and export `AzureBlobConnector`.
- `rag/svr/sync_data_source.py`: new `AzureBlob(SyncBase)` class routed
  through `load_from_checkpoint` (ETag fingerprint owns change-detection)
  and added to `func_factory`.
- Frontend:
  - `web/src/pages/user-setting/data-source/constant/index.tsx`: new
    `DataSourceKey.AZURE_BLOB`, auth-mode selector (account key / connection
    string / SAS token), all credential fields, prefix + batch-size,
    `syncDeletedFiles` capability, default form values, tile entry with
    icon.
  - `web/src/locales/{en,zh}.ts`: description + per-field tooltips for all
    9 new keys.
  - `web/src/assets/svg/data-source/azure-blob.svg`: Azure-branded
    stacked-cylinders icon.
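
The ETag-based skip described in the highlights amounts to a dictionary comparison. The sketch below uses plain dicts in place of the Azure listing and of `AzureBlobCheckpoint.etags`; `blobs_to_fetch` is a hypothetical helper.

```python
from typing import Dict, List


def blobs_to_fetch(listing: Dict[str, str], etags: Dict[str, str]) -> List[str]:
    """Names whose ETag is new or differs from the one stored last run."""
    return [name for name, etag in sorted(listing.items())
            if etags.get(name) != etag]


checkpoint_etags = {"a.pdf": "e1", "b.pdf": "old"}
listing = {"a.pdf": "e1", "b.pdf": "e2", "c.pdf": "e3"}   # blob name -> current ETag

pending = blobs_to_fetch(listing, checkpoint_etags)
print(pending)  # ['b.pdf', 'c.pdf']

for name in pending:
    # ... download and index the blob here, then record its ETag ...
    checkpoint_etags[name] = listing[name]

print(blobs_to_fetch(listing, checkpoint_etags))  # []
```

Recording the ETag only after a blob has been processed means an interrupted run retries that blob next time.
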

**Verification**
- `npm run build` (vite + esbuild) passes (37 s).

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-06-04 21:06:01 +08:00
Jack
b363146997 refactor: overhaul task executor with layered architecture and comprehensive test suite (#15471)
## Summary

Decomposes the monolithic `task_executor.py` (1945 lines) into a 6-layer
architecture with clear separation of concerns. The refactored code is
functionally equivalent to the original, verified through 400 passing
tests and a production-vs-dry-run comparison framework.

## Architecture

```
entry (task_manager)
  └─ orchestration (task_handler)
       ├─ services (chunk_service, embedding_service, dataflow_service, raptor_service, post_processor)
       │    └─ utilities (chunk_builder, chunk_post_processor, embedding_utils)
       └─ infrastructure (task_context, recording_context, interceptor)
```

Key design decisions:
- **TaskContext** — typed facade over raw task dict, injects rate
limiters + callbacks via composition
- **RecordingContext + Comparator** — enables side-by-side production vs
dry-run execution for safe migration
- **NullRecordingContext** — zero-allocation no-op for production, uses
`__slots__`
- **WriteOperationInterceptor** — FIFO replay of a previous run's function
return values for comparison mode
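
Two of these ideas, the no-op recorder and the FIFO replay, can be sketched in a few lines. The classes below reuse the names from the list, but their methods (`record`, `replay`) are assumptions made for illustration, not the PR's actual interfaces.

```python
from collections import deque
from typing import Any, Deque, Iterable, Tuple


class NullRecordingContext:
    """No-op recorder: empty __slots__ means instances carry no __dict__."""

    __slots__ = ()

    def record(self, key: str, value: Any) -> None:
        return None


class WriteOperationInterceptor:
    """Replays recorded (function name, return value) pairs in FIFO order."""

    def __init__(self, recorded: Iterable[Tuple[str, Any]]) -> None:
        self._queue: Deque[Tuple[str, Any]] = deque(recorded)

    def replay(self, func_name: str) -> Any:
        name, value = self._queue.popleft()
        if name != func_name:
            raise RuntimeError(f"expected call to {name}, got {func_name}")
        return value


ctx = NullRecordingContext()
ctx.record("chunks", [1, 2, 3])            # silently ignored in production

interceptor = WriteOperationInterceptor([("insert_chunks", True), ("update_doc", 3)])
first = interceptor.replay("insert_chunks")
second = interceptor.replay("update_doc")
print(first, second)  # True 3
```
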

## Migration Strategy

The original `handle_task()` in `task_executor.py` uses a 3-way switch
via `TE_RUN_MODE`:
- `TE_RUN_MODE=0` (default) → runs refactored code
- `TE_RUN_MODE=1` → runs both original + refactored, compares all
intermediate results
- `TE_RUN_MODE=2` → runs original code (fallback)

The comparison mode (`TE_RUN_MODE=1`) records ~40 intermediate values
(chunks, vectors, token counts, func return values) from the production
run and replays them during dry-run, then uses `ContextComparator` to
report mismatches.
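
The three-way switch can be pictured as a small dispatcher. This sketch simplifies the comparison mode to "run both and compare the final results"; the callables and the choice to return the original result in mode 1 are assumptions, not the behavior of the real `handle_task()`.

```python
import os
from typing import Any, Callable, Mapping, Optional


def handle_task(task: dict,
                run_original: Callable[[dict], Any],
                run_refactored: Callable[[dict], Any],
                compare: Callable[[Any, Any], None],
                env: Optional[Mapping[str, str]] = None) -> Any:
    env = os.environ if env is None else env
    mode = env.get("TE_RUN_MODE", "0")
    if mode == "2":                      # fallback: original code only
        return run_original(task)
    if mode == "1":                      # comparison: run both, report mismatches
        expected = run_original(task)
        actual = run_refactored(task)
        compare(expected, actual)
        return expected                  # assumption: original result is returned
    return run_refactored(task)          # default: refactored code


mismatches = []


def compare(expected: Any, actual: Any) -> None:
    if expected != actual:
        mismatches.append((expected, actual))


result = handle_task({"id": "t1"},
                     run_original=lambda t: {"chunks": 3},
                     run_refactored=lambda t: {"chunks": 3},
                     compare=compare,
                     env={"TE_RUN_MODE": "1"})
print(result, mismatches)  # {'chunks': 3} []
```
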

## Functional Equivalence Fixes

All divergences between original and refactored code were identified and
fixed:
- Timeout decorators (handle/build_chunks/raptor/embedding)
- NullRecordingContext leak in finally block causing RuntimeError
- MinIO None-binary check with proper FileNotFoundError
- Dataflow dispatch after embedding binding + init_kb
- Memory task missing return after processing
- RAPTOR checkpoint progress reporting
- Tag cache (get_tags_from_cache/set_tags_to_cache) restoration
- dataflow_id correction in _load_dsl
- Language default Chinese, dead code guard removal
- embed_chunks made async with proper thread_pool_exec
- Full GraphRAG default configuration (10 parameters)
- Hardcoded q_768_vec fallback removal in RAPTOR

## Test Changes

- 20 new tests covering table parser manual mode, tag cache, embedding
edge cases, RAPTOR checkpoint, dataflow_id correction, storage binary
None, cancel cleanup, metadata=None boundary
- Unified `make_task_context`/`make_task_dict` factories eliminated 10+
duplicated helpers
- DataflowService tests migrated from internal method mocks to IO
boundary mocks (real orchestration code executes)
- Parametrized duplicate build_chunks post-processor tests
- 7 raptor tests modernized to @pytest.mark.asyncio
- Mock count per test reduced through boundary-level mocking strategy

**Test count: 400 passing, 0 warnings, 0 skips**

## Files Changed

| File | Change |
|------|--------|
| `rag/svr/task_executor.py` | +1 line (NullRecordingContext fix) |
| `rag/svr/task_executor_refactor/task_handler.py` | Orchestration
layer, 8 logic fixes |
| `rag/svr/task_executor_refactor/chunk_service.py` | +timeout +
None-check |
| `rag/svr/task_executor_refactor/embedding_service.py` | sync→async
rewrite |
| `rag/svr/task_executor_refactor/dataflow_service.py` | dataflow_id fix
+ timeout |
| `rag/svr/task_executor_refactor/raptor_service.py` | checkpoint fix +
assert |
| `rag/svr/task_executor_refactor/chunk_post_processor.py` | tag cache
restore |
| `rag/svr/task_executor_refactor/task_context.py` | language default
fix |
| `test/.../conftest.py` | +294 lines shared helpers |
| `test/.../*.py` | 15 test files refactored, 20 new tests |

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 17:18:31 +08:00
VictorECDSA
ff5971448b [Fix] naive: force-merge short markdown headers to prevent separate chunks (#15488)
## Problem

When uploading `.md` files with `parser=naive` and `delimiter="\n"`,
markdown headers (e.g., `## Quick Travel`) become separate chunks with
very short content (16-18 characters). This causes retrieval issues:
when the header is matched, the corresponding body text is not included
in the chunk.
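
One way to picture the force-merge is the sketch below. It is not the code from the PR: `merge_short_headers` and the `max_header_len` threshold are hypothetical, and the sketch assumes chunks arrive as plain strings.

```python
import re

HEADER_RE = re.compile(r"^#{1,6}\s+\S")


def merge_short_headers(chunks, max_header_len=64):
    """Attach short header-only chunks to the chunk that follows them."""
    merged, pending = [], []
    for chunk in chunks:
        text = chunk.strip()
        is_header = HEADER_RE.match(text) and "\n" not in text
        if is_header and len(text) <= max_header_len:
            pending.append(text)  # hold the header until body text arrives
            continue
        merged.append("\n".join(pending + [chunk]))
        pending = []
    if pending:  # trailing headers with no body stay together as one chunk
        merged.append("\n".join(pending))
    return merged
```

With this shape, a match on `## Quick Travel` returns the header and its body in the same chunk.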

## Related Issues

Closes #15487

## Checklist

- [x] Code changes are minimal and focused
- [x] Unit tests added (12/12 passed)
- [x] No breaking changes
2026-06-03 10:49:28 +08:00
Wang Qi
d41373cfa9 Feature: Add the new anthropic and voyage models (#15516)
Add the new Anthropic and Voyage models. Strip certain unsupported keys
for Opus 4.7 and 4.8.

Co-authored-by: Idriss Sbaaoui <112825897+6ba3i@users.noreply.github.com>
2026-06-02 17:29:18 +08:00
Aeovy
600590cd18 Fix: disable thinking to avoid potential infinite loops in Qwen3.5/Qwen3.6 models (#15101)
### What problem does this PR solve?

This PR fixes the issue where Qwen3.5/Qwen3.6 series models may spend
excessive time on simple document-parsing tasks, such as Auto Metadata
extraction, keyword extraction, question generation, and image
description when using the MinerU parser.

For these tasks, Qwen3.5/Qwen3.6 models may perform unnecessary
reasoning by default, which can lead to very long response times, high
token consumption, and, in some cases, potential infinite output loops.

Since Qwen3.5/Qwen3.6 multimodal models are instantiated as `CvModel`
when configured as `image2text`, the existing `enable_thinking=False`
logic in `chat_model.py` does not apply to them. This PR adds the
corresponding handling for the CV/image-to-text model path as well.

This helps reduce unnecessary thinking time, avoid potential infinite
loops, and improve parsing efficiency without noticeably affecting
output quality for these simple extraction and image-description tasks.

Fixes #15083.
2026-06-02 13:21:35 +08:00
kpdev
a4bc066f74 fix(rag): id2image parsing for hyphenated storage object keys (#15117) (#15118)
### What problem does this PR solve?

Fixes #15117.

Chunk images are stored with `img_id = f"{bucket}-{objname}"` in
`image2id()` (`rag/utils/base64_image.py`). When loading via
`id2image()`, the code used `image_id.split("-")` and required exactly
two segments. Object keys that contain hyphens (e.g. `page-1.jpg`)
produce more than two segments, so `id2image` returns `None` and chunk
image previews fail even though the blob exists.
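
A minimal sketch of a hyphen-safe parse, assuming the bucket segment itself contains no hyphen (the description above does not say how the PR handles hyphenated bucket names). `split_image_id` is a hypothetical helper name.

```python
def split_image_id(image_id):
    """Split "<bucket>-<objname>" at the first hyphen only."""
    bucket, sep, objname = image_id.partition("-")
    if not sep or not bucket or not objname:
        return None  # malformed id: no separator or an empty segment
    return bucket, objname
```

Because only the first hyphen is treated as the separator, an object key such as `page-1.jpg` survives intact.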

This is the same parsing issue as #15115 (HTTP thumbnail route); this PR
fixes the indexing/retrieval path.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Test plan

- [x] `pytest test/unit_test/rag/utils/test_base64_image.py`
- [ ] Manual: index a chunk with an `objname` containing hyphens and
confirm `img_id` resolves to an image in retrieval

2026-06-02 10:52:51 +08:00
nickmopen
bebf6ed244 fix(llm): strip non-generation keys from gen_conf for LiteLLM providers (#15427) (#15432)
### What problem does this PR solve?

Fixes #15427.

All LiteLLM-routed chats fail with:

- Anthropic: `litellm.BadRequestError: AnthropicException -
{"type":"invalid_request_error","message":"model_type: Extra inputs are
not permitted"}`
- OpenAI: `litellm.BadRequestError: OpenAIException - Unknown parameter:
'model_type'`

This is a regression from v0.25.4.

#### Root cause

A chat assistant's `llm_setting` is forwarded to the model as
`gen_conf`. `llm_setting` can legitimately carry RAGFlow-internal
metadata such as `model_type` (the chat REST APIs in
`api/apps/restful_apis/` read it back out of `llm_setting`), so that key
ends up inside `gen_conf`.

`Base._clean_conf` (OpenAI-compatible providers) already **whitelists**
the keys it forwards, so direct-OpenAI providers were unaffected.
`LiteLLMBase._clean_conf` only dropped `max_tokens` and passed
everything else straight through to `litellm.acompletion`, which
forwarded `model_type` to the upstream provider — and Anthropic / OpenAI
reject it. Because both Claude and GPT route through LiteLLM, every chat
broke.

#### Fix

- Extract the allowed-key set into a shared `ALLOWED_GEN_CONF_KEYS`
constant and reuse it in `Base._clean_conf`.
- Apply the same whitelist in `LiteLLMBase._clean_conf`, plus the
LiteLLM-specific reasoning params (`thinking`, `reasoning_effort`,
`extra_body`) that the model-family policies inject for reasoning
models.

This covers all four LiteLLM completion paths (`async_chat`,
`async_chat_streamly`, `async_chat_with_tools`,
`async_chat_streamly_with_tools`), since they all route through
`_clean_conf`.
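
The whitelist idea can be sketched as below. The contents of `ALLOWED_GEN_CONF_KEYS` are not listed in this message, so the key set here is illustrative only, and `clean_conf` is a free function standing in for the two `_clean_conf` methods.

```python
# Assumption: illustrative key set; the real constant may differ.
ALLOWED_GEN_CONF_KEYS = frozenset(
    {"temperature", "top_p", "presence_penalty", "frequency_penalty", "stop"}
)
# Reasoning params named in the fix description for the LiteLLM path.
LITELLM_REASONING_KEYS = frozenset({"thinking", "reasoning_effort", "extra_body"})


def clean_conf(gen_conf, extra_allowed=frozenset()):
    """Keep only whitelisted keys; everything else is dropped."""
    allowed = ALLOWED_GEN_CONF_KEYS | extra_allowed
    return {k: v for k, v in gen_conf.items() if k in allowed}
```

Since `max_tokens` and `model_type` are not in either set, both are removed without a special case.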

#### Tests

Adds `test/unit_test/rag/llm/test_clean_conf_whitelist.py` covering both
backends: `model_type` (and other stray keys) are dropped, genuine
generation params and `thinking` survive, `max_tokens` is removed, and
the whitelist invariants hold.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Added test cases
2026-06-02 10:04:11 +08:00
Wang Qi
1a6df01b53 Bug fix: Enhance embedding model to give better error message (#15346)
To resolve https://github.com/infiniflow/ragflow/issues/15343, enhance
the embedding model's error message to give the exact failure message to
the customer.


# QWen

## Retrieval
<img width="3321" height="1033" alt="image"
src="https://github.com/user-attachments/assets/6b82921a-a3a7-4a33-a383-1cf316398ee2"
/>

## Chat
<img width="2241" height="311" alt="image"
src="https://github.com/user-attachments/assets/ec311365-62d5-407a-8915-5c8d72be9716"
/>


# SiliconFlow
## Retrieval
<img width="3321" height="1033" alt="image"
src="https://github.com/user-attachments/assets/ee2cd191-a27d-4729-b53d-2fbdb4e352cd"
/>

## Chat
<img width="1562" height="210" alt="image"
src="https://github.com/user-attachments/assets/10376a8e-a3f4-422f-bc2e-96f2a8a96448"
/>

# Baichuan
## Retrieval
<img width="3321" height="1107" alt="image"
src="https://github.com/user-attachments/assets/dcb5409d-f7fc-4804-b186-5e1ee11e09c4"
/>

## Chat
<img width="2241" height="311" alt="image"
src="https://github.com/user-attachments/assets/ec311365-62d5-407a-8915-5c8d72be9716"
/>


# Zhipu
zhipu is good.
2026-06-01 19:18:16 +08:00
euvre
1e80419c21 fix: restore TitleChunker output for json/chunks upstream formats (#15396)

## Summary

The refactor commit e194027b (#14247) introduced two regressions that
caused `TitleChunker` to produce zero chunks when the upstream Parser
node outputs `json` or `chunks` format (e.g. PDF parsing).

## Root Cause

### 1. Dead code in `extract_line_records` (critical)

After refactor, when `payload` is `None` (which is the case for `json`
and `chunks` output formats), the method returns an empty list
immediately via `return []`, so no records are ever extracted from
structured upstream output. The original `json`/`chunks` handling code
became unreachable dead code.

### 2. Unconditional overwrite in `build_chunks_from_record_groups`

The `chunks` variable assigned in the `if` branch for markdown/text/html
formats was unconditionally overwritten by the statement below it, due
to a missing `else` keyword.

## Fix

- Remove the premature `return []` so the `json`/`chunks` branch is
reachable again.
- Add `else` branch in `build_chunks_from_record_groups` so the two
format families are handled independently.

## Test Plan

- [x] Verified no lint errors on the changed file
- [ ] Tested with a PDF document parsed via DeepDOC → TitleChunker
pipeline
- [ ] Tested with markdown input through TitleChunker
- [ ] Tested hierarchy and group chunking modes

## Impact

- Fixes the regression where documents parsed with `json`/`chunks`
output format produced no chunks from `TitleChunker`.
- No API or configuration changes. Fully backward compatible.

Signed-off-by: noob <yixiao121314@outlook.com>
2026-06-01 17:14:22 +08:00
Wang Qi
10e8690890 GraphRAG - NER - spacy - fix spacy extraction (#14783)
Fix spacy extraction
2026-06-01 13:05:54 +08:00
web-dev0521
cd18cfab79 feat(connector): implement Outlook data source connector (issue #15332) (#15333)
### What problem does this PR solve?

Closes #15332.

RAGFlow can index Gmail and generic IMAP mailboxes but had no native
connector for Outlook / Microsoft 365 mail. Organisations on Microsoft
365 had no way to bring mailbox content into a knowledge base through
Microsoft Graph.

This PR adds a net-new Outlook data source that:

- Authenticates against Microsoft Graph with the same MSAL
client-credentials flow already used by the SharePoint and Teams
  connectors (no new auth primitives).
- Pages over `/users/{id}/mailFolders/{folder}/messages/delta` per
mailbox and persists `@odata.deltaLink` values in
`OutlookCheckpoint.delta_links`, so incremental syncs only fetch changed
messages.
- Supports two scoping modes:
- **Tenant-wide** (default): enumerates every user in the tenant via
`/users` and syncs each mailbox. Requires `User.Read.All`.
- **Targeted**: when `user_ids` is provided (comma-separated UPNs or
object IDs), only those mailboxes are synced. `User.Read.All` is not
needed in this mode.
- Lets the caller pick the mail folder (`inbox`, `sentitems`, `archive`,
...). Defaults to `inbox`.
- Maps each message to a `Document` shaped after the Gmail connector:
one `TextSection` carrying `From/To/Cc/Subject` headers + body, with
HTML bodies stripped to text inline (no extra dependency).
- Surfaces typed errors on the validation probe:
401 → `ConnectorMissingCredentialError`, 403 →
`InsufficientPermissionsError` (with `Mail.Read` / `User.Read.All`
hint), 404 on a configured mailbox → `ConnectorValidationError`, 5xx →
`UnexpectedValidationError`.
- Skips messages flagged `@removed` by the delta semantics and messages
whose `receivedDateTime` is older than `poll_range_start`.
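
The delta paging described above can be sketched as follows. This is an illustration under assumptions: `fetch_page` is a hypothetical callable that returns the parsed JSON of one Graph response, and `delta_links` is a plain dict standing in for `OutlookCheckpoint.delta_links`.

```python
def sync_folder(fetch_page, start_url, delta_links, folder_key):
    """Page through a delta query and remember the final deltaLink."""
    # Resume from the stored deltaLink if one exists, else start fresh.
    url = delta_links.get(folder_key, start_url)
    messages = []
    while url:
        page = fetch_page(url)
        for item in page.get("value", []):
            if "@removed" in item:
                continue  # deleted upstream; nothing to index
            messages.append(item)
        if "@odata.deltaLink" in page:
            delta_links[folder_key] = page["@odata.deltaLink"]
        url = page.get("@odata.nextLink")  # absent on the last page
    return messages
```

The next sync for the same mailbox folder starts from the stored link, so only changed messages come back.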

#### Files

| File | Change |
|------|--------|
| `common/data_source/outlook_connector.py` | **New** —
`OutlookConnector` (`CheckpointedConnectorWithPermSync` +
`SlimConnectorWithPermSync`) + `OutlookCheckpoint` + tiny `_strip_html`
helper. |
| `common/data_source/config.py` | `DocumentSource.OUTLOOK = "outlook"`.
|
| `common/constants.py` | `FileSource.OUTLOOK = "outlook"`. |
| `common/data_source/__init__.py` | Export `OutlookConnector`. |
| `rag/svr/sync_data_source.py` | `Outlook(SyncBase)` with `batch_size`
normalisation, CSV/list parsing of `user_ids`; registered in
`func_factory`. |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`DataSourceKey.OUTLOOK`, visibility map (`syncDeletedFiles: true`), info
entry, form fields (tenant_id, client_id, client_secret, folder,
user_ids, batch_size), default values. |
| `web/src/locales/en.ts`, `web/src/locales/zh.ts` |
`outlookDescription` + 5 tooltip keys (EN + ZH). |
| `test/unit_test/data_source/test_outlook_connector_unit.py` | **New**
— 19 unit tests (`p1`/`p2`/`p3`) covering auth, validation (tenant-wide
vs specific user vs error paths), checkpoint helpers, user enumeration
pagination, message filtering, HTML body stripping. |

#### Required Azure AD permissions

- `Mail.Read` (Application, admin-granted) — always.
- `User.Read.All` (Application, admin-granted) — only when `user_ids` is
left blank so the connector can enumerate mailboxes.

#### Out of scope

- **Attachment indexing.** The current connector emits message body +
headers; binary attachments are flagged via `metadata.has_attachments`
but not pulled. Adding attachment hydration is straightforward but
scoped out per the issue's "decide whether attachments are indexed in
the first version" note.
- **Delegated (per-user) OAuth.** The connector uses app-only
credentials, consistent with the SharePoint / Teams precedent in this
codebase.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-29 21:52:29 +08:00
Rintaro
11af34a895 fix(opensearch): repair document-metadata path broken by #14577 (#15393)
### What problem does this PR solve?

Document metadata is completely broken on the OpenSearch backend
(`DOC_ENGINE=opensearch`). Both failures were introduced by #14577,
which added
a doc-metadata dispatch surface but only validated it against
Elasticsearch.

**1. Index creation rejected (`mapper_parsing_exception`).**
`OSConnection.create_doc_meta_idx` feeds `conf/doc_meta_es_mapping.json`
verbatim to OpenSearch. That file declares a top-level `"dynamic":
"runtime"`.
Runtime fields are Elasticsearch-only; OpenSearch cannot parse the
value:

    mapper_parsing_exception: Could not convert [dynamic.dynamic] to boolean (400)

**2. `search()` signature mismatch (`TypeError`).**
`DocMetadataService` (added by #14577) calls `docStoreConn.search(...)`
with
snake_case kwargs (`select_fields=`, `index_names=`,
`knowledgebase_ids=`, …),
matching `ESConnection.search`. But `OSConnection.search` still uses
camelCase
parameters (`selectFields`, `indexNames`, `knowledgebaseIds`, …):

    TypeError: OSConnection.search() got an unexpected keyword argument 'select_fields'

The UI then shows "0 fields" for every document on OpenSearch.

### Fix

1. In `OSConnection.create_doc_meta_idx`, normalize a top-level
`"dynamic": "runtime"` to `True` **for the OpenSearch request only**.
The
shared mapping file is left untouched, so the Elasticsearch backend
keeps its
runtime-field behavior. Dynamic field discovery is preserved on
OpenSearch.
2. Rename the `OSConnection.search()` parameters (and their in-method
local
uses) from camelCase to snake_case so they match `ESConnection.search()`
and
the `DocMetadataService` call sites. The change is confined to
`search()`;
`get/insert/update/delete` keep their existing positional signatures
(they
   are called positionally from `rag/nlp/search.py`).
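
Step 1 might look roughly like the sketch below. `normalize_for_opensearch` is a hypothetical helper, and where exactly the `dynamic` key sits in the mapping file is an assumption, so the sketch checks both the root and a `mappings` object.

```python
import copy


def normalize_for_opensearch(mapping):
    """Return a copy where a top-level "dynamic": "runtime" becomes True."""
    patched = copy.deepcopy(mapping)  # the shared ES mapping stays untouched
    # Assumption: the key may sit at the root or directly under "mappings".
    for node in (patched, patched.get("mappings")):
        if isinstance(node, dict) and node.get("dynamic") == "runtime":
            node["dynamic"] = True
    return patched
```

Working on a deep copy is what keeps the Elasticsearch backend on its runtime-field behavior.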

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)

### Affected backends
OpenSearch only. Elasticsearch, Infinity and OceanBase are untouched.

### How to reproduce
1. `DOC_ENGINE=opensearch`, restart the stack.
2. Upload/parse a document, then open the dataset's document list / set
metadata.
- Before: index creation 400s (`Could not convert [dynamic.dynamic]`),
and/or
     `TypeError ... 'select_fields'`; document metadata shows 0 fields.

### Risk & backward compatibility
- ES default deployment: no change. `doc_meta_es_mapping.json` is not
modified,
  so ES still receives `"dynamic": "runtime"`.
- `search()` rename is internal; the only kwarg caller
(`DocMetadataService`)
  already uses the snake_case names this PR aligns to.

### Test plan
- [ ] `DOC_ENGINE=opensearch`: per-tenant `ragflow_doc_meta_*` index is
created
(no `mapper_parsing_exception`); document metadata reads/writes work.
- [ ] `DOC_ENGINE=elasticsearch` regression: doc-meta index still
created with
      runtime mapping; metadata unchanged.
2026-05-29 21:49:36 +08:00
Rintaro
3dfc16973c fix(opensearch): implement get_scores for KNN second-pass scoring (#15390)
### What problem does this PR solve?

On the OpenSearch backend (`DOC_ENGINE=opensearch`), every retrieval
that
performs the KNN second-pass scoring crashes with:

    AttributeError: 'OSConnection' object has no attribute 'get_scores'

**Root cause.** #14970 ("Refactor: Drop the vector fetch for ES") added
a
`get_scores()` helper to `ESConnectionBase`
(`common/doc_store/es_conn_base.py`)
and introduced `Dealer._knn_scores()` in `rag/nlp/search.py`, which
calls
`self.dataStore.get_scores(res)`. `search.py` routes Infinity and
OceanBase to
their own similarity paths via `DOC_ENGINE_INFINITY` /
`DOC_ENGINE_OCEANBASE`,
but OpenSearch sets neither flag, so it falls into the Elasticsearch
branch and
calls `get_scores`. `OSConnection` (which subclasses
`DocStoreConnection`
directly, not `ESConnectionBase`) never received that method, so any
vector-search hit triggers the crash. It reproduces with any normal
embedding
(e.g. 1024-dim mistral-embed) as soon as a KNN query returns hits.

### Fix

Add `OSConnection.get_scores()`, mirroring
`ESConnectionBase.get_scores()`.
OpenSearch hit headers expose `_score` exactly like Elasticsearch (the
existing
`OSConnection.__getSource` already reads `d["_score"]`), so the
implementation
is identical.
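
A minimal sketch of such a helper, written as a free function. The return shape (a mapping from `_id` to `_score`) is an assumption; the description only states that the method reads `_score` from the hits and mirrors the Elasticsearch version.

```python
def get_scores(res):
    """Map each hit's _id to its _score; empty or missing hits give {}."""
    hits = (res or {}).get("hits", {}).get("hits", [])
    return {hit["_id"]: hit["_score"] for hit in hits}
```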

Scope note: Infinity and OceanBase deliberately do not use `get_scores`
(#14970 routes them elsewhere), so this fix is intentionally limited to
the
OpenSearch backend, which is the only one reaching the ES KNN-score
path.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Affected backends
OpenSearch only. Elasticsearch already implements `get_scores`; Infinity
/
OceanBase are routed away from it.

### How to reproduce
1. `DOC_ENGINE=opensearch` (docker `.env`), restart the stack.
2. Create a knowledge base with any dense embedding model and parse a
document.
3. Run a retrieval / chat over that KB -> 500 with the AttributeError
above.

### Risk & backward compatibility
None for the default Elasticsearch deployment -- the change only adds a
method
to `OSConnection`. No default values or ES/Infinity/OceanBase behavior
change.

### Test plan
- [ ] With `DOC_ENGINE=opensearch`, retrieval over a KB returns scored
chunks
      (no AttributeError).
- [ ] `DOC_ENGINE=elasticsearch` regression: retrieval unchanged.
- [ ] Empty-result path: `_knn_scores` early-returns `{}` (guarded),
get_scores
      handles an empty `hits` list gracefully.
2026-05-29 21:49:15 +08:00
呆萌闷油瓶
658ff06ca4 feat: add 4 new models for siliconflow (#15383)
### What problem does this PR solve?

Added 4 new models:
deepseek-ai/DeepSeek-V4-Pro
deepseek-ai/DeepSeek-V4-Flash
Pro/moonshotai/Kimi-K2.6
Pro/zai-org/GLM-5.1

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-29 19:28:29 +08:00
web-dev0521
bda2117a25 feat(connector): implement OneDrive data source connector (issue #15330) (#15331)
### What problem does this PR solve?

Closes #15330.

RAGFlow had no connector for OneDrive / OneDrive for Business. Users who
store working documents in OneDrive could not index them into a
knowledge base without manually downloading and re-uploading files.

This PR adds a net-new OneDrive data source that:

- Authenticates against Microsoft Graph with the same MSAL
client-credentials flow already used by the SharePoint and Teams
connectors (no new auth primitives).
- Enumerates every drive visible to the service principal and pages
through `/drives/{id}/root/delta`, persisting `@odata.deltaLink` values
per drive so subsequent syncs only fetch changed items.
- Optionally narrows ingestion to a sub-folder (`folder_path`) without
needing a separate code path.
- Surfaces typed errors on the validation probe (`GET /drives?$top=1`):
401 → `ConnectorMissingCredentialError`, 403 →
`InsufficientPermissionsError` (with a `Files.Read.All` hint), 5xx →
`UnexpectedValidationError`.
- Filters folders, soft-deleted items, and unsupported extensions (`.pdf
.docx .doc .xlsx .xls .pptx .ppt .txt .md .csv`).
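
The filtering and the status mapping can be sketched as below. The exception classes are local stand-ins that only share names with the ones mentioned above, and `is_indexable` and `check_probe_status` are hypothetical helpers. The sketch assumes drive items carry `folder` and `deleted` facets as in Microsoft Graph.

```python
import os

SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls",
    ".pptx", ".ppt", ".txt", ".md", ".csv",
}


class ConnectorMissingCredentialError(Exception):
    pass


class InsufficientPermissionsError(Exception):
    pass


class UnexpectedValidationError(Exception):
    pass


def is_indexable(item):
    """Skip folders, soft-deleted items, and unsupported extensions."""
    if "folder" in item or "deleted" in item:
        return False
    ext = os.path.splitext(item.get("name", ""))[1].lower()
    return ext in SUPPORTED_EXTENSIONS


def check_probe_status(status):
    """Translate the validation probe's HTTP status into a typed error."""
    if status == 401:
        raise ConnectorMissingCredentialError("OneDrive")
    if status == 403:
        raise InsufficientPermissionsError("grant Files.Read.All (Application)")
    if status >= 500:
        raise UnexpectedValidationError(f"Graph returned HTTP {status}")
```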

#### Files

| File | Change |
|------|--------|
| `common/data_source/onedrive_connector.py` | **New** —
`OneDriveConnector` + `OneDriveCheckpoint`. |
| `common/data_source/config.py` | `DocumentSource.ONEDRIVE =
"onedrive"`. |
| `common/constants.py` | `FileSource.ONEDRIVE = "onedrive"`. |
| `common/data_source/__init__.py` | Export `OneDriveConnector`. |
| `rag/svr/sync_data_source.py` | `OneDrive(SyncBase)` with `batch_size`
normalisation; registered in `func_factory`. |
| `web/src/pages/user-setting/data-source/constant/index.tsx` |
`DataSourceKey.ONEDRIVE`, visibility map (`syncDeletedFiles: true`),
info entry, form fields (tenant_id, client_id, client_secret,
folder_path, batch_size), default values. |
| `web/src/locales/en.ts`, `web/src/locales/zh.ts` |
`onedriveDescription` + 4 tooltip keys (EN + ZH). |
| `test/unit_test/data_source/test_onedrive_connector_unit.py` | **New**
— 13 unit tests (`p1`/`p2`) covering auth, validation, checkpoint
helpers, and document filtering. |

#### Required Azure AD permission

`Files.Read.All` (Application, admin-granted).

#### Out of scope

- Interactive end-user OAuth (delegated permissions) — the connector
uses app-only credentials, consistent with the SharePoint / Teams
precedent.
- Binary download of file contents — the sync layer emits `Document`s
carrying `webUrl` + metadata; bytes are hydrated downstream by the parse
pipeline.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-29 19:26:06 +08:00
Lynn
dc4b82523b Feat: tenant llm provider (#14595)
### What problem does this PR solve?

Python implementation of the Go-based model_provider API suite.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: bill <yibie_jingnian@163.com>
2026-05-29 17:39:41 +08:00
web-dev0521
98bc9ca6ac feat: implement Microsoft Teams data source connector (#15193)
### What problem does this PR solve?

Closes #15191.

RAGFlow shipped a Microsoft Teams connector stub
(`common/data_source/teams_connector.py`) whose document-loading methods
all returned `[]`, `Teams._generate()` was a `pass`, and Teams was
commented out of the data-source settings UI. As a result there was no
way to index Teams channel conversations into a knowledge base.

This PR implements the connector end to end on top of Microsoft Graph
(Office365-REST-Python-Client). It shares the MSAL client-credentials
auth shape with the SharePoint connector.

**Backend**

- `common/data_source/teams_connector.py`
- `load_credentials()` now builds the Graph client using an MSAL
client-credentials **token callback** — the form `GraphClient` actually
expects. (The previous stub passed a raw access-token string to
`GraphClient(...)`, which is not how that client is driven.) Token
acquisition is lazy, so credential loading performs no network call.
  - `validate_connector_settings()` lists teams via Graph.
- `load_from_checkpoint()` is now a generator that pages teams →
channels → messages, flattens each top-level post together with its
replies into one blob-based `Document` (`extension` `.txt`/`.html`,
`blob`, `size_bytes`, `doc_updated_at`). Incremental syncs are bounded
by message `lastModifiedDateTime` (falling back to `createdDateTime`).
Per-message errors surface as `ConnectorFailure` instead of aborting the
run.
- `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument`
batches and the checkpoint helpers return proper `TeamsCheckpoint`s.
- ACL → `ExternalAccess` mapping is intentionally left best-effort
(`load_from_checkpoint_with_perm_sync` delegates to the standard load)
because the sync pipeline does not currently persist `ExternalAccess`.
- `rag/svr/sync_data_source.py`
- Implemented `Teams._generate()` using the existing
`CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google
Drive), supporting full reindex and incremental polling from
`poll_range_start`.
- `TeamsConnector` is already exported from
`common/data_source/__init__.py`.
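
The incremental bound on messages can be sketched as follows. `modified_in_window` is a hypothetical helper; inclusive bounds and timestamps that `datetime.fromisoformat` can parse once the trailing `Z` is rewritten are both assumptions.

```python
from datetime import datetime, timezone


def _parse(stamp):
    # Rewrite the trailing "Z" so older Python versions can parse it.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def modified_in_window(message, start, end):
    """Prefer lastModifiedDateTime; fall back to createdDateTime."""
    stamp = message.get("lastModifiedDateTime") or message.get("createdDateTime")
    if not stamp:
        return False  # no usable timestamp, so the message is skipped
    return start <= _parse(stamp) <= end
```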

**Frontend (`web/`)**

- Enabled the `TEAMS` data-source enum and added its form fields
(`tenant_id`, `client_id`, `client_secret`), default values, display
metadata, and a Teams icon.
- Added `teamsDescription` / `teamsTenantIdTip` to `en.ts` and `zh.ts`.

**Tests**

- `test/unit_test/data_source/test_teams_connector_unit.py`: mock-based
unit tests covering credential loading (incomplete creds raise, happy
path sets the Graph client, fetch-without-creds raises), post/reply
flattening (incl. the HTML vs text extension), incremental
`lastModifiedDateTime` filtering, and slim-doc listing. All 6 pass;
`ruff check` is clean.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 17:10:38 +08:00
web-dev0521
5de021ebb4 feat: implement Slack data source connector (#15188)
### What problem does this PR solve?

Closes #15187.

RAGFlow shipped a Slack connector
(`common/data_source/slack_connector.py`) but it was never usable:
`Slack._generate()` in the sync worker was a `pass` stub, the
connector's document-generating code was incompatible with the current
data model,
and Slack was commented out of the data-source settings UI. As a result,
teams had no way to index Slack channels/threads into a knowledge base.

This PR completes the connector end to end.

**Backend**

- `common/data_source/slack_connector.py`
- Rewrote `thread_to_doc` to produce a blob-based `Document`
(`extension`/`blob`/`size_bytes`). The previous implementation built the
doc with a `sections=[...]` argument and omitted the now-required
`blob`/`extension`/ `size_bytes` fields, so it raised a validation error
against the current `Document` model. Thread messages are now cleaned
and flattened into a single UTF-8 text blob.
- Added `load_from_state()` / `poll_source(start, end)` generators. The
connector's checkpoint interface is a no-op stub, so both full and
incremental syncs run through a single channel-iterating generator built
on the existing module helpers (`get_channels`, `filter_channels`,
`get_channel_messages`, `_process_message`), with per-channel thread
de-duplication.
- `rag/svr/sync_data_source.py`
- Implemented `Slack._generate()`. Credentials are loaded via
`StaticCredentialsProvider` (the connector requires `slack_bot_token`
and does not support `load_credentials`). Supports full reindex and
incremental polling from `poll_range_start`, plus the optional channel
filter. Modeled on the Confluence/Dropbox wrappers.
- `SlackConnector` was already exported from
`common/data_source/__init__.py`.
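
A rough sketch of the flattening and the per-channel thread de-duplication, assuming Slack message dicts with the usual `ts`, `thread_ts`, `user`, and `text` fields. The helper names and the returned dict are illustrative and do not reproduce the real `Document` model.

```python
def unique_thread_keys(messages):
    """Yield each thread key once; standalone messages key on their own ts."""
    seen = set()
    for message in messages:
        key = message.get("thread_ts") or message["ts"]
        if key in seen:
            continue
        seen.add(key)
        yield key


def thread_to_blob(messages):
    """Flatten a thread into one UTF-8 text blob with size metadata."""
    lines = [
        f"{m.get('user', 'unknown')}: {m.get('text', '').strip()}"
        for m in messages
    ]
    blob = "\n".join(lines).encode("utf-8")
    return {"extension": ".txt", "blob": blob, "size_bytes": len(blob)}
```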

**Frontend (`web/`)**

- Enabled the `SLACK` data-source enum and added its form fields (Slack
bot token + optional channel filter), default values, display metadata,
and a Slack icon.
- Added `slackDescription` / `slackBotTokenTip` / `slackChannelsTip`
strings to `en.ts` and `zh.ts`.

**Tests**

- `test/unit_test/data_source/test_slack_connector_unit.py`: unit tests
covering credential loading (`load_credentials` raises,
`set_credentials_provider` initializes clients, missing credentials
raises) and document generation (standalone message + flattened thread,
blob/extension/size_bytes/metadata, and the incremental poll time
window). All 5 pass; `ruff check` is clean.

Required Slack scopes: `channels:read`, `channels:history`,
`users:read`.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 15:46:07 +08:00
web-dev0521
c4c4e228e3 feat: implement SharePoint data source connector (#15190)
### What problem does this PR solve?

Closes #15189.

RAGFlow shipped a SharePoint connector stub
(`common/data_source/sharepoint_connector.py`) whose document-loading
methods all returned `[]`, `SharePoint._generate()` was a `pass`, and
SharePoint was commented out of the data-source settings UI. As a result
there was no way to index files stored in SharePoint document libraries.

This PR implements the connector end to end on top of Microsoft Graph
(Office365-REST-Python-Client).

**Backend**

- `common/data_source/sharepoint_connector.py`
- `load_credentials()` now builds the Graph client using an MSAL
client-credentials **token callback** — the form `GraphClient` actually
expects. (The previous stub passed a raw access-token string to
`GraphClient(...)`, which is not how that client is driven.) Token
acquisition is lazy, so credential loading does no network call.
- `validate_connector_settings()` resolves the configured site via
Graph.
- `load_from_checkpoint()` is now a generator that enumerates every
document library under the site, walks folders depth-first, downloads
each file, and yields blob-based `Document` objects (`extension` /
`blob` / `size_bytes` / `doc_updated_at`). Incremental syncs are bounded
by file `lastModifiedDateTime`. Per-file errors are surfaced as
`ConnectorFailure` rather than aborting the run.
- `retrieve_all_slim_docs_perm_sync()` yields id-only `SlimDocument`
batches (no downloads) and the checkpoint helpers return proper
checkpoints.
- ACL → `ExternalAccess` mapping is intentionally left best-effort
(`load_from_checkpoint_with_perm_sync` delegates to the standard load)
because the sync pipeline does not currently persist `ExternalAccess`;
this can be extended once that plumbing exists.
- `rag/svr/sync_data_source.py`
- Implemented `SharePoint._generate()` using the existing
`CheckpointOutputWrapper` pattern (same shape as Confluence/Jira/Google
Drive), supporting full reindex and incremental polling from
`poll_range_start`.
- `SharePointConnector` is already exported from
`common/data_source/__init__.py`.
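
The traversal and the per-file failure handling can be sketched as below. `list_children` and `download` are hypothetical callables standing in for the Graph calls, and the yielded dicts stand in for `Document` and `ConnectorFailure`.

```python
def walk_files(list_children, root_id):
    """Yield file items depth-first using an explicit stack of folder ids."""
    stack = [root_id]
    while stack:
        folder_id = stack.pop()
        for item in list_children(folder_id):
            if "folder" in item:
                stack.append(item["id"])  # descend into it later
            else:
                yield item


def load_documents(files, download):
    """Download each file; a failure is reported and the run continues."""
    for item in files:
        try:
            blob = download(item)
        except Exception as exc:  # surfaced per file, never aborts the run
            yield {"failure": item["id"], "error": str(exc)}
            continue
        yield {"id": item["id"], "blob": blob, "size_bytes": len(blob)}
```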

**Frontend (`web/`)**

- Enabled the `SHAREPOINT` data-source enum and added its form fields
(`site_url`, `tenant_id`, `client_id`, `client_secret`), default values,
display metadata, and a SharePoint icon.
- Added `sharepointDescription` / `sharepointSiteUrlTip` to `en.ts` and
`zh.ts`.

**Tests**

- `test/unit_test/data_source/test_sharepoint_connector_unit.py`:
mock-based unit tests covering credential loading (incomplete creds
raise, happy path sets the Graph client, fetch-without-creds raises),
drive traversal + file download, incremental `lastModifiedDateTime`
filtering, and slim-doc listing. All 6 pass; `ruff check` is clean.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-05-28 13:26:08 +08:00
Jack
f0cb7a544b Refactor: Task Executor (#15154)
### What problem does this PR solve?

1. Break the huge function into smaller pieces
2. Add unit tests for the smaller functions
3. Layered design
    a. infra layer - task_context.py, recording_context.py,
    write_operation_interceptor.py, ...
    b. service layer - *_service.py
    c. business layer - task_handler.py
4. Default behavior: use the refactored version; switch back to the
original version by changing an environment variable
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Performance Improvement

---------

Co-authored-by: Liu An <asiro@qq.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-05-27 21:54:17 +08:00
Wang Qi
619b971785 Fix: empty file with better message (#15232)
Fix: empty file with better message
2026-05-26 12:28:53 +08:00
wdeveloper16
4b36801b53 fix: resolve asyncio correctness issues (fire-and-forget tasks, event loop nesting) (#14761)
## Summary

Fixes the confirmed asyncio anti-patterns from #14755. Only the three
verified bugs are addressed; patterns already correctly using
`asyncio.new_event_loop()` in a fresh thread are left untouched.

### Changes

**`api/apps/restful_apis/tenant_api.py` — fire-and-forget
`send_invite_email`**

`asyncio.create_task()` was called without storing the `Task` reference.
CPython's GC can collect an unfinished task, silently cancelling it and
swallowing exceptions. Fixed by storing the task in a module-level
`_background_tasks: set[Task]` with a `done_callback` to discard it on
completion — the standard Python idiom for safe background tasks.

**`api/apps/restful_apis/agent_api.py` — fire-and-forget
`background_run`**

Same root cause in the webhook "Immediately" execution path. Same fix
applied.

**`rag/llm/chat_model.py` (`LocalLLM._stream_response`) —
`asyncio.get_event_loop()` on running loop**

`asyncio.get_event_loop()` returns Quart's running event loop when
called from an async context. Calling `loop.run_until_complete()` on it
raises `RuntimeError` because that loop is already running.
Replaced with `asyncio.new_event_loop()` so the generator
uses a dedicated fresh loop, closed in a `finally` block.
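
The pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `LocalLLM._stream_response` body: `fake_token_stream`, `_next_or_done`, and `stream_response_sync` are hypothetical names, and the generator is assumed to be consumed from a thread that has no running event loop.

```python
import asyncio
from typing import AsyncIterator, Iterator

_DONE = object()  # sentinel marking the end of the async stream


async def fake_token_stream() -> AsyncIterator[str]:
    # Hypothetical stand-in for an async streaming model call.
    for token in ("Hello", ", ", "world"):
        await asyncio.sleep(0)
        yield token


async def _next_or_done(agen):
    # Fetch the next item, or return the sentinel when the stream ends.
    try:
        return await agen.__anext__()
    except StopAsyncIteration:
        return _DONE


def stream_response_sync() -> Iterator[str]:
    """Drive an async generator from sync code on a private event loop."""
    loop = asyncio.new_event_loop()  # fresh loop, never the server's loop
    agen = fake_token_stream()
    try:
        while True:
            item = loop.run_until_complete(_next_or_done(agen))
            if item is _DONE:
                break
            yield item
    finally:
        # Runs on exhaustion and on early close of the generator.
        loop.run_until_complete(agen.aclose())
        loop.close()
```

Usage: `list(stream_response_sync())` collects the tokens. Because the loop is created and closed inside the generator, nothing leaks into whatever loop the server itself runs.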

## What was NOT changed

- `llm_service._sync_from_async_stream` and
`evaluation_service._sync_from_async_gen`: both already correctly use
`asyncio.new_event_loop()` inside a fresh thread.
- `llm_service._run_coroutine_sync`: only caller is `rag/app/resume.py`
(sync context), so `thread.join()` is correct there.
- `requests` in agent tools: sync methods dispatched through thread
pools; httpx migration is a separate, larger refactor.

## Test plan

- [ ] Invite a team member and confirm the email is sent with no task
warnings in logs.
- [ ] Trigger a webhook agent in "Immediately" mode; confirm canvas
state is persisted after background run.
- [ ] Verify `LocalLLM` (Jina backend) chat and streaming work
end-to-end.

Closes #14755

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-05-25 22:45:40 +08:00
Wang Qi
7e6844118b Fix search vector_similarity_weight (#15108)
### What problem does this PR solve?

Fix search vector_similarity_weight

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-22 16:05:13 +08:00
Wang Qi
a9ec78cb9c Refactor: enhance retry and timeout (#14983)
### What problem does this PR solve?

1. Enhance retry and timeout, and adjust the default timeout
2. NER: spacy do not batch chunks
3. Extract `_has_cancel_and_exit`
4. Enhance log messages

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
2026-05-22 13:16:39 +08:00
buua436
04bdb41909 Fix: guard missing task language (#15136)
### What problem does this PR solve?

Guard against a missing task language.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-05-22 11:46:38 +08:00
Wang Qi
c5a46fda44 Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a different event loop (#15100)
Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a
different event loop
2026-05-21 19:23:41 +08:00