Commit Graph

160 Commits

Author SHA1 Message Date
philluiz2323
bf18b59264 fix(agent): enforce tenant ownership on agentbots completions/inputs (#15457)
### What problem does this PR solve?

Fixes #15456.

The SDK agent-bot routes `POST /api/v1/agentbots/<agent_id>/completions`
and `GET /api/v1/agentbots/<agent_id>/inputs`
(`api/apps/restful_apis/bot_api.py`) authenticate the caller with a beta
API token — which only yields the caller's `tenant_id` — but then load
and run the agent named in the URL **without verifying the agent belongs
to the caller's tenant**. `UserCanvasService.get_agent_dsl_with_release`
even accepts a `tenant_id` it never uses, and `begin_inputs` calls
`get_by_id` directly. Any holder of a single valid beta token could
therefore run another tenant's agent (leaking its DSL/prompts/tool
config) or read another tenant's agent metadata and begin input form,
just by substituting a victim `agent_id`.

This PR adds the project's existing ownership gate,
`UserCanvasService.accessible(agent_id, tenant_id)`, to both endpoints
right after token authentication — mirroring the checks already enforced
on the equivalent first-party routes in
`api/apps/restful_apis/agent_api.py` (lines 75/578/775) and on the
sibling `chatbot_completions` / `create_agent_session` /
`delete_agent_session` handlers in the same file. On failure it returns
the same `Can't find agent by ID: <id>` message already used by
`begin_inputs`, so it does not reveal whether an `agent_id` exists in
another tenant.

Added a regression test
(`test/unit_test/api/apps/restful_apis/test_agentbots_access_control.py`,
following the existing stubbed-loader pattern from
`test_get_agent_session.py`) asserting that an inaccessible `agent_id`
is rejected before the agent is loaded (`begin_inputs`) or executed
(`completions`), and that an accessible agent still proceeds.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 13:08:23 +08:00
jony376
74c2944ca1 fix(agent): bind session_id to path agent_id on GET/DELETE agent sessions (#15374)
## Related issues

Closes #15128

### What problem does this PR solve?

`GET` and `DELETE` `/api/v1/agents/<agent_id>/sessions/<session_id>`
verified canvas access for `agent_id` in the URL but loaded/deleted
sessions only by `session_id`, without checking `conv.dialog_id ==
agent_id`.

Any user with access to **any** agent could read or delete another
agent's `API4Conversation` session (messages, references, DSL, etc.)
when they knew the session UUID.

Agent completions in the same file already enforce this binding; chat
sessions do too — these two routes were inconsistent.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Changes

| File | Change |
|------|--------|
| `api/apps/restful_apis/agent_api.py` | Require `conv.dialog_id ==
agent_id` in `get_agent_session` and `delete_agent_session_item`; return
generic `"Session not found!"` on mismatch |
| `test/unit_test/api/apps/restful_apis/test_get_agent_session.py` | Add
IDOR regression tests for GET/DELETE; fix success fixture to include
`dialog_id`; track `delete_by_id` calls |

### Test plan

- [x] Unit tests added for GET/DELETE IDOR and success paths
- [ ] `pytest
test/unit_test/api/apps/restful_apis/test_get_agent_session.py`

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 12:58:29 +08:00
seekmistar01
2589b6e26d fix(agent): Switch no longer matches an empty condition (all([]) is True) (#15644)
## Summary
Fixes the agent `Switch` component matching an **empty/all-skipped
condition** unconditionally because `all([]) is True`.

## Root cause
`res` only accumulates for items with a non-empty `cpn_id` (blank ones
`continue`). For a condition with empty `items` (or all-blank `cpn_id`),
`res == []`, and `if all(res):` is `True`, so the Switch routes to that
condition's `to` target before reaching the else/`end_cpn_ids` branch.

## Fix
```diff
-            if all(res):
+            if res and all(res):
```
An empty result set no longer counts as a match; genuinely-satisfied
"and" conditions still route (the real `all(res)` path is preserved).

## Files changed
- `agent/component/switch.py`
- `test/unit_test/agent/component/test_switch_empty_condition.py` (new)

## Verification
- `ruff check` / `ruff format --check` — clean
- Added unit tests (mirroring the existing `_FakeCanvas` component-test
pattern): an empty/all-skipped "and" condition now falls through to
`end_cpn_ids`; a genuinely-satisfied "and" condition still routes to its
target.
- Local full pytest not run (heavy RAG deps); CI validates.

## Note
Implemented with LLM assistance (model: claude-opus-4-8).

Closes #15643

---------

Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 12:15:29 +08:00
philluiz2323
56ece6848b fix: guard SSRF in ExeSQL agent tool DB host (#15609)
### What problem does this PR solve?

Closes #15608.

The ExeSQL agent tool (`agent/tools/exesql.py`) opens database
connections to a node-author-controlled host/port with no SSRF
validation. The sibling `test_db_connection` endpoint already validates
the host via `common.ssrf_guard.assert_host_is_safe` (added by PR
#14860), but the tool that actually performs the connection at agent run
time was left unguarded — so the guard is bypassed simply by running the
agent. An agent author can point the host at `127.0.0.1`,
`169.254.169.254` (cloud metadata), or any internal RFC1918 host/port,
turning ExeSQL into an internal port-scanner / metadata-fetch primitive.

### Fix

Mirror the accepted endpoint guard: validate (and resolve) the host
once, before the `db_type` dispatch, and connect to the validated public
IP so a later DNS change cannot rebind the host to an internal address.

- Add `from common.ssrf_guard import assert_host_is_safe`.
- `safe_host = assert_host_is_safe(self._param.host)` before the
dispatch (rejects loopback, link-local/metadata, RFC1918, and
unresolvable hosts).
- Substitute the validated IP into all 6 driver branches: mysql/mariadb,
oceanbase, postgres, mssql, trino, IBM DB2.

Adds `test/unit_test/agent/tools/test_exesql_ssrf.py` covering loopback,
link-local/metadata, RFC1918, and empty-host rejection (before any
connection), plus an allowed host dialing the validated IP.

### Validation

- `python3 -m py_compile agent/tools/exesql.py`
- `ruff check agent/tools/exesql.py
test/unit_test/agent/tools/test_exesql_ssrf.py`
- `pytest test/unit_test/agent/tools/test_exesql_ssrf.py` — 5 passed

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 12:11:21 +08:00
jiashi19
db188cc705 Feat/agent thinking switch (#15446)
### What problem does this PR solve?

This PR adds an Agent LLM setting to control thinking mode for official
providers that expose a thinking switch.

Related to #12842.  
Closes #15445.

Some providers expose thinking controls through provider-specific
request fields, but Agent LLM settings did not have a unified option for
users to enable or disable thinking mode.

This PR adds a `Thinking` selector with:

- System default
- Enabled
- Disabled
<img width="452" height="278" alt="8566b0b4-0546-4c8a-913d-f9bbd38319f6"
src="https://github.com/user-attachments/assets/25b497f7-1ba0-4bfe-940d-6fe79287d6ab"
/>
<img width="471" height="971" alt="8a0a6bee-f45f-48d5-bd83-17af260de3db"
src="https://github.com/user-attachments/assets/41ad43c1-5087-48f1-bf37-f2ca14c2be2f"
/>
Initial support is limited to the verified official providers:

- Qwen / DashScope: `enable_thinking`
- Kimi / Moonshot: `thinking.type`
- GLM / ZHIPU-AI: `thinking.type`

For LiteLLM-based providers, provider-specific fields are forwarded
through `extra_body` before `drop_params` filtering so the request
parameters are preserved.



### Type of change

- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: jiashi <jiashi19@outlook.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 12:02:55 +08:00
Harsh Kashyap
ab1015db2e fix(agent): restore be_output and test DeepL error return (#16363)
## Summary

#16332 fixed the missing `return` in DeepL's except branch, but
`ComponentBase.be_output` was removed during the agent refactor (#9113)
while several components still call it. DeepL (and other tools) would
raise `AttributeError` before any error message could be returned.

- Restore `ComponentBase.be_output` as `pd.DataFrame([{"content": v}])`
(same as pre-refactor behavior)
- Add regression test that `_run` returns the `**Error**:` message when
translation fails

Related to #16329

## Test plan

- [x] `test_run_returns_error_on_translation_failure`
- [x] Existing `test_deepl.py` check() tests still pass

---------

Co-authored-by: Harsh Kashyap <harshkashyap@Harshs-MacBook-Pro.local>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 11:54:40 +08:00
cleanjunc
ef42fd771d fix(agent): add HTTP timeout to external API tools (#15436)
### What problem does this PR solve?

Closes #15435 

Several agent tools call external HTTP APIs through `requests` with no
request timeout. When an upstream host accepts the connection but never
responds (a slow or overloaded API, a half open connection, a stuck load
balancer), the call blocks forever. These tools run inside agent canvas
execution, so a single stalled socket freezes the entire agent run with
no recovery.

Ten call sites were affected:

- `agent/tools/qweather.py` (4 calls)
- `agent/tools/jin10.py` (4 calls)
- `agent/tools/tushare.py` (1 call)
- `agent/tools/github.py` (1 call)

The `github.py` tool already carried the `@timeout` decorator from
`common/connection_utils.py`, but that does not protect against this
case. In the default configuration the decorator waits on its result
queue with no timeout, and a daemon thread blocked inside a socket read
cannot be killed, so the run still hangs. The per request timeout added
here is what actually bounds the call.

This is the same bug class as the merged Go stream timeout fix,
surfacing in the Python tool layer.

Changes:

- Pass `timeout=DEFAULT_TIMEOUT` on all 10 calls, reusing the existing
shared constant in `common/http_client.py` (configurable via
`HTTP_CLIENT_TIMEOUT`) so there is one source of truth rather than
scattered literals.
- Add an AST based unit test at
`test/unit_test/agent/tools/test_http_timeout.py` that scans every tool
module and fails if any `requests` or `httpx` request call omits a
`timeout`, guarding current and future call sites.

Verification:

- Reproduced the indefinite block against a stalling local server, and
confirmed that adding a timeout raises `ReadTimeout` promptly.
- Confirmed the `@timeout` decorator does not interrupt a blocked no
timeout request in its default configuration.
- The new test flags exactly the 10 original call sites on the pre fix
code and passes (22 modules) after the fix.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 11:53:49 +08:00
Khaostica
d4a0e711ae feat(agent): add Pipeline chunker component for pre-chunking workflows (#14773) (#15068)
### What problem does this PR solve?

Closes #14773.

Today, Pipeline (`rag/flow/`) chunking strategies only run as part of a
dataset ingestion that always embeds and indexes the result. There is no
way to drive Pipeline-style chunking from an Agent workflow without
paying that vectorization/persistence cost.

This PR adds a single new Agent component, `PipelineChunker`, that:

- Takes one or more file references (from `Begin` / `UserFillUp`
uploads) as input.
- Runs the existing `rag.app.*` chunking strategies (`naive`, `paper`,
`qa`, `manual`, `book`, `presentation`, `laws`, `table`, `one`, `email`,
`picture`, `audio`, `resume`, `tag`) against each file.
- Emits the resulting chunks as `chunks: list[str]` and `chunks_full:
list[dict]` for downstream Agent nodes.
- Performs **no embedding and no persistence** — chunks live only in
canvas variables for the duration of the run, exactly as requested in
the issue.

The component is auto-discovered by `agent/component/__init__.py`; no
registry edits required. Chunker functions are imported lazily so the
component itself does not pull `deepdoc` / OCR / VLM at
component-discovery time. File resolution mirrors the existing
`ExcelProcessor` convention.

Out of scope for this PR (potential follow-ups):

- Vectorization / KB persistence (explicit ask in the issue).
- Frontend canvas UI for the new component.
- Bridging to the newer Pydantic-based `rag/flow/chunker/TokenChunker`
(consumes a parser node's structured output rather than a raw file — a
separate, larger feature).

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---

## Files changed

- `agent/component/pipeline_chunker.py` — new component (~180 lines)
- `test/unit_test/agent/test_pipeline_chunker.py` — unit tests (~120
lines)

## Test plan

- [x] `ruff check` on changed files — clean.
- [x] `ruff format` applied to the new component file.
- [x] `python -m py_compile` on both new files — both compile.
- [x] New unit test file carries `pytestmark = pytest.mark.p2` so it
runs under marker-filtered CI.
- [x] Every new function, method, and class has a docstring (CodeRabbit
80% docstring-coverage gate).
- [x] `python -m pytest test/unit_test/agent/test_pipeline_chunker.py -x
-q` — **7 passed in 1.95s** locally. Tests stub
`api.db.services.file_service` and `rag.app.*` so they exercise the
parameter validation and parser-id lookup table without requiring the
full backend / model stack.

## Manual integration plan (post-merge)

1. Drop the component into an Agent canvas after a `Begin` node with a
file input.
2. Set `parser_id = "naive"` (or any other strategy) and reference the
file input in `inputs`.
3. Wire the `chunks` output into a downstream `LLM` / `Message` /
`Iteration` node — chunks are available as plain text without any
embedding or KB write.

Co-authored-by: John Baillie <johnbaillie2007@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 11:52:58 +08:00
Zhichang Yu
c4fe68eaa0 Harden closed-advisory fixes (#16409)
## Summary
- harden reopened advisory fixes across REST connector, invoke, document
downloads, and markdown rendering
- add targeted regression coverage for redirect-safe SSRF handling,
invoke SSRF checks, document access control, and markdown sanitization
- verify each referenced GHSA against the original GitHub advisory text
and align the closed-advisory plan with the implemented remediation

## What changed
- add tenant access checks to document download endpoints to avoid
cross-tenant document disclosure
- add per-hop SSRF validation, DNS pinning, redirect handling, and
redirect limits to the REST API connector
- ensure invoke requests validate and pin the resolved host and never
follow redirects implicitly
- keep the generic rate-limited request path wrapped, not just GET and
POST helpers
- sanitize markdown HTML before rendering in the highlight markdown
component

## Validation
- `cd web && npm test -- --runInBand
src/components/highlight-markdown/__tests__/index.test.tsx`
- `.venv/bin/python -m pytest -q
test/unit_test/data_source/test_rest_api_connector.py`
- targeted `test/testcases/test_web_api/...` unit additions were
reviewed, but the suite cannot be executed end-to-end in this
environment because parent `test/testcases/conftest.py` requires a local
service on `127.0.0.1:9380`

## Notes
- all GHSA entries referenced by the plan were checked against the
original GitHub advisory text, not sampled
- the closed-advisory plan document was updated locally during review,
but is intentionally not included in this PR
2026-06-28 11:17:54 +08:00
Zhichang Yu
730f33b1f9 fix(security): address 93 CodeQL code-scanning alerts across 61 files (#16407)
## Summary

Resolves all 93 open alerts at
https://github.com/infiniflow/ragflow/security/code-scanning by rule:

| Rule | Count | Treatment |
|------|-------|-----------|
| py/clear-text-logging-sensitive-data | 23 | Real fix — log scrubbing |
| go/path-injection | 15 | Real fix where possible, suppression with
rationale |
| go/request-forgery | 8 | Suppression with rationale
(operator-controlled URLs) |
| go/clear-text-logging | 10 | Real fix — log scrubbing |
| go/unsafe-quoting | 5 | Real fix — escape or refactor |
| go/sql-injection | 3 | Real fix — orderby whitelist + CodeQL comment |
| go/uncontrolled-allocation-size | 2 | Real fix — cap to 1024 |
| go/incorrect-integer-conversion | 3 | Real fix — ParseInt + range
check |
| go/insecure-hostkeycallback | 1 | Real fix — known_hosts file |
| go/disabled-certificate-check | 2 | Suppression with rationale |
| go/command-injection | 1 | Suppression (sanitized via shq()) |
| go/email-injection | 1 | Suppression with rationale |
| go/cookie-httponly-not-set | 1 | Suppression (SPA bootstrap) |
| js/stack-trace-exposure | 1 | Real fix — generic client message |
| js/prototype-pollution-utility | 1 | Real fix — reject
__proto__/constructor/prototype |
| py/weak-sensitive-data-hashing | 1 | Real fix — MD5 → SHA-256 |
| py/incomplete-url-substring-sanitization | 3 | Real fix —
urlparse(hostname) |
| py/paramiko-missing-host-key-validation | 1 | Real fix —
load_system_host_keys + RejectPolicy |
| cpp/integer-multiplication-cast-to-long | 2 | Real fix — cast to
size_t |

## Real fixes (with measurable security improvement)

**SSH host key verification (Go + Python)**  
Replace `InsecureIgnoreHostKey()` / `paramiko.AutoAddPolicy()` with
proper host key verification against a known_hosts file (configurable
via `SSH_KNOWN_HOSTS` env / `known_hosts` config field; fail-closed when
unset). Loads `~/.ssh/known_hosts` first via `load_system_host_keys()`
so existing setups keep working.

**SQL injection in `user_canvas`**  
Add `userCanvasOrderableColumns` whitelist + `userCanvasOrderClause`
helper. Both `GetList()` and `ListByTenantIDs()` now route the
user-supplied `orderby` query param through the helper, defaulting to
`create_time` on miss.

**SQL injection in `pipeline_operation_log`**  
Existing whitelist documented via CodeQL comment.

**Real SQL injection in `infinity/chunk.go:931`**  
Escape `'` → `''` on user-controlled `questionText` before splicing into
`filter_fulltext(...)` SQL filter.

**Real SQL injection in `elasticsearch/sql.go:75`**  
Defense-in-depth escape on tokenizer output before splicing into
`MATCH(...)`.

**Python code injection in `result_protocol.go`**  
Replace raw JSON literal embedding into Python/JS expressions with
base64 + `json.loads` / `JSON.parse(Buffer.from(...,
'base64').toString('utf8'))`. Eliminates both the unsafe-quoting sink
and the brittleness of mixing JSON true/false/null with Python syntax.

**URL substring check bypass in `embedding_model.py`**  
Replace `if "dashscope-intl.aliyuncs.com" in u` with
`urlparse(u).hostname == "dashscope-intl.aliyuncs.com"` so a base_url
like `https://attacker.example/?u=dashscope-intl.aliyuncs.com` cannot
bypass the routing.

**Prototype pollution in `setNestedValue` (TS)**  
Reject `__proto__`/`constructor`/`prototype` keys before any assignment.

**Integer overflow**  
- scrypt params via `ParseInt` + non-positive check
(`internal/common/password.go`)
- `topN` and `n` caps to 1024 (retrieval_service.go, dataset.go)
- `nalloc*statesize` cast to `size_t` (cpp/re2/onepass.cc)

**Cookie httponly**  
Set explicitly with rationale: this is the OAuth bootstrap cookie
intentionally read by the SPA.

**Stack trace exposure**  
Replace `error.message` in HTTP 500 response with generic `"internal
error"`; full error still logged server-side via `console.error`.

**Weak hashing**  
MD5 → SHA-256 for deterministic `conv_id` derivation
(`conversation_service.py`).

**Log scrubbing**  
Remove or redact user-controlled / sensitive content from clear-text
logs across 8 ingestion parsers, `llm_service.py` ×11,
`tenant_llm_service.py` ×7, `misc_utils.py` ×4, `redis_conn.py` ×10,
`conftest.py` ×4, `init_data.py`, `dataset_api_service.py`,
`generator.py`, `mysql_migration.py`, `cli.go`, `user_command.go`,
`pdf_parser.go`. Most patterns converted to parameterized logging
(`logging.info("...: %d", n)`) or static messages.

## CodeQL suppressions (each with rationale)

For alerts where the data flow is genuinely safe but CodeQL can't see
the context — operator-controlled URLs, sanitized inputs, etc. — I added
`// codeql[go/<rule>] <rationale>` annotations rather than dismissing
them, so future readers can audit the rationale inline:

- `internal/agent/component/invoke.go:135` — Invoke is a generic canvas
HTTP client
- `internal/service/langfuse.go` ×2 — host is per-tenant operator config
- `internal/service/file.go:1184` — already SSRF-guarded by
`assertURLSafe`
- `internal/utility/mcp_client.go` ×3 — already `AssertURLSafe` +
IP-pinned
- `internal/entity/models/bedrock.go` — sigv4-signed request, URL can't
be tampered
- `internal/service/deep_researcher.go:269` — `callback` is SSE display
string, not SQL
- `internal/engine/infinity/chunk.go:346` — UUIDs can't contain `'` (RFC
4122)
- `internal/cli/common_command.go` ×2 — CLI trusts operator-configured
URL
- `internal/utility/smtp.go:194` — msg is server-built, not user form
input
- `internal/entity/models/*` ×14 (path-injection) — audio file paths are
caller-supplied

## Test plan

-  All 13 modified Go packages build cleanly
-  663 tests pass across `internal/agent/sandbox`, `internal/common`,
`internal/agent/component`, `internal/engine/infinity`, `internal/dao`
-  All 11 modified Python files parse via `ast.parse`
-  TypeScript `tsc --noEmit` clean on the modified
`use-provider-fields.tsx`
-  `node --check` clean on the modified JS file

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-27 19:48:29 +08:00
Zhichang Yu
70546ea406 feat(go-agent): Ported retrieval node, added Keenable web search tool (#16396)
Ported retrieval node, added Keenable web search tool
- [x] New Feature (non-breaking change which adds functionality)
2026-06-26 22:55:49 +08:00
Yash Raj Pandey
dd2c88b768 fix(excel_parser): keep zero-valued cells when building Excel text chunks (#16287) 2026-06-26 09:30:09 +08:00
Harsh Kashyap
c7052f4dd1 fix(rag/nlp): treat string input as one phrase in is_english (#16308) 2026-06-25 20:07:09 +08:00
Wang Qi
5defb4e7d6 Revert "fix(deepdoc): keep zero and false Excel cells in __call__" (#16366)
Reverts infiniflow/ragflow#16318
2026-06-25 19:56:47 +08:00
Harsh Kashyap
8d3c3f868c fix(api): validate immutable document fields when value is zero (#16309) 2026-06-25 19:29:12 +08:00
Harsh Kashyap
66d86154ab fix(deepdoc): accept GFM table separators with one or more dashes (#16319) 2026-06-25 19:25:57 +08:00
cleanjunc
e8bb534b90 fix: naive_merge splits oversized sections and counts overlap tokens correctly (#15802) 2026-06-25 19:19:38 +08:00
Harsh Kashyap
0af5d43e8d fix(deepdoc): keep zero and false Excel cells in __call__ (#16318) 2026-06-25 19:12:57 +08:00
Harsh Kashyap
49312cace3 fix(api): align use_sql Markdown separator with Source header (#16317) 2026-06-25 19:00:01 +08:00
Yash Raj Pandey
091417980e fix(html_parser): preserve original text when splitting oversized blocks (#16052)
### Bug

`RAGFlowHtmlParser.chunk_block()` splits an oversized block by slicing
the **tokenized** string and storing the joined tokens:

```python
tks_str = rag_tokenizer.tokenize(block)
...
tokens = tks_str.split(" ")
while start < len(tokens):
    chunks.append(" ".join(tokens[start:start + chunk_token_num]))  # tokenized form, not source
```

On the default (Elasticsearch) backend `rag_tokenizer.tokenize`
transforms text: it lowercases/stems Latin words and inserts spaces
between CJK characters. So any text block longer than `chunk_token_num`
is stored as garbled, lowercased, space-segmented text instead of the
source content. The small-block branch correctly stores the original
`block`, so only oversized blocks are corrupted. Affects HTML and EPUB
ingestion (both go through `chunk_block`), degrading retrieved chunks
and the answers generated from them.

### Real tokenizer behavior (infinity-sdk 0.7.0, ES backend)

```
tokenize("Hello World FOO Bar Baz Qux Jumps")  -> "hello world foo bar baz qux jump"   # lowercased + stemmed
tokenize("你好世界这是一个测试")                 -> "你好世界 这 是 一个 测试"            # spaces inserted
```

### Fix

Split the **original** text: break it into atoms (whitespace-delimited
runs for space-separated scripts, per-character for spaceless scripts
such as Chinese) and pack them into pieces of at most `chunk_token_num`
tokens. This preserves the source characters and still splits scripts
that have no whitespace — a plain whitespace split would leave CJK as
one un-splittable chunk.

### Proof (real tokenizer, before/after)

Running the old vs new split against the real `infinity.rag_tokenizer`:

```
ENGLISH "Hello World FOO Bar Baz Qux Lazy Dogs"  (chunk_token_num=4)
  OLD: ['hello world foo bar', 'baz qux jump over', 'lazi dog']          # lowercased + stemmed
  NEW: ['Hello World FOO Bar ', 'Baz Qux Jumps Over ', 'Lazy Dogs']      # preserved; each <= 4 tokens
  NEW preserves text exactly: True

CHINESE "你好世界这是一个测试用例需要被切分成多个块"  (chunk_token_num=3)
  OLD: ['你好世界 这 是', '一个 测试用例 需要', ...]                      # spurious spaces
  NEW: ['你好世', '界这是', '一个测', ...]                               # preserved; each <= 3 tokens
  NEW preserves text exactly: True
```

### Tests

Added `test/unit_test/deepdoc/parser/test_html_parser.py` (English +
Chinese oversized blocks, plus small-block merge). Before the fix the
two oversized tests fail (English shows lowercasing, Chinese shows
inserted spaces); after the fix all pass. `ruff check` clean.
2026-06-25 16:43:35 +08:00
Muhammad Furqan
fe14cc35cf fix(agent/tools): DeepL component fails validation and drops errors (#16332)
### What problem does this PR solve?

`DeepLParam.check()` validated `self.top_n`, but DeepL has no such
parameter (it is not defined on the param class or its base), so
`check()` always raised `AttributeError` and a DeepL component could
never pass validation. Removed the bogus `top_n` check.

Also fixed the `_run` except branch, which computed
`be_output("**Error**...")` but never returned it, silently dropping the
error message.

Closes #16329

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Add test cases

### Testing

Added `test/unit_test/agent/component/test_deepl.py` covering
`DeepLParam.check()` with valid defaults and rejection of invalid
source/target languages.
2026-06-25 14:40:56 +08:00
Harsh Kashyap
b9445c67e2 fix(agent): coerce None Switch inputs before string operators (#16320)
## Summary
- Coerce `None` canvas values to `""` before string comparison operators
in `Switch.process_operator`.
- Prevents `AttributeError` when upstream components yield `None` and
the Switch uses contains/start with/end with.

## Test plan
- [x] `.v/bin/python -m ruff check agent/component/switch.py
test/unit_test/agent/component/test_switch.py`
- [x] `.v/bin/python -m pytest
test/unit_test/agent/component/test_switch.py -q` (3 passed)

Fixes #16315

---------

Co-authored-by: Harsh Kashyap <harshkashyap@Harshs-MacBook-Pro.local>
2026-06-25 14:18:24 +08:00
helloxjade
1b2da645c3 fix: deduplicate markdown table chunks (#16143) 2026-06-24 13:22:57 +08:00
minion1227
14565b289a Fix: docx parsing raises ValueError on 'Heading' styles (#16284) 2026-06-24 13:16:16 +08:00
minion1227
0c19190daf Fix: MCP document metadata cache can loop forever when documents returns an empty docs page (#16285) 2026-06-24 13:09:48 +08:00
Harsh Kashyap
b4a8a90c73 fix(rag/raptor): handle max_cluster edge case in GMM cluster selection (#16199)
### What problem does this PR solve?
`_get_optimal_clusters` in `rag/raptor.py` had two edge-case issues in
GMM cluster-count selection:
1. It used `np.arange(1, max_clusters)`, which never evaluates the
upper-bound candidate (`max_clusters`).
2. When effective `max_clusters` becomes `1`, the candidate list was
empty and `argmin` crashed.

This PR makes candidate evaluation inclusive (`1..max_clusters`) and
guards the single-cluster case by returning `1` directly.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)

### Validation
- `pytest test/unit_test/rag/test_raptor_psi_tree_builder.py
--config-file pyproject.toml -q`
- `ruff check rag/raptor.py
test/unit_test/rag/test_raptor_psi_tree_builder.py`

### Tests added
- Regression test for `max_cluster == 1` path (no crash, returns 1)
- Regression test verifying upper-bound candidate is evaluated and can
be selected

_AI-assistance disclosure: parts of this change (bug triage and test
scaffolding) were drafted with AI assistance and fully reviewed and
verified by me._

---------

Co-authored-by: Harsh Kashyap <harshkashyap@Harshs-MacBook-Pro.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-23 21:07:26 +08:00
VincentLambert
11e14a8353 fix: propagate contextvars through thread_pool_exec (#16247)
## Problem

`thread_pool_exec()` dispatches work via `loop.run_in_executor()`, which
submits the callable with a plain `executor.submit(func, *args)` and
does **not** copy the caller's `contextvars.Context`. So a `ContextVar`
set in the async caller is not visible inside the function running in
the worker thread.

This differs from `asyncio.to_thread()`, which runs the callable inside
a copied context. `run_in_executor()` has never propagated context
(verified on Python 3.12 and 3.13) — so this is a pre-existing gap in
the helper, **not** a regression or a Python-version compatibility
issue.

Concretely, any code that sets a `ContextVar` in async code and reads it
inside a function dispatched via `thread_pool_exec` (request tracing,
per-task state, Langfuse trace propagation, etc.) silently loses that
context.

## Fix

Copy the current context before submitting and run the callable inside
it with `ctx.run()`, matching what `asyncio.to_thread()` does:

```python
async def thread_pool_exec(func, *args, **kwargs):
    loop = asyncio.get_running_loop()
    ctx = contextvars.copy_context()
    if kwargs:
        inner = functools.partial(func, *args, **kwargs)
        return await loop.run_in_executor(_thread_pool_executor(), ctx.run, inner)
    return await loop.run_in_executor(_thread_pool_executor(), ctx.run, func, *args)
```

This explicitly **adds** ContextVar propagation to the helper (it does
not restore any prior behavior). Backward-compatible.

## Tests

`TestThreadPoolExec` covers propagation, the kwargs path, per-call
isolation and the unset-default case.

> Note: the branch name still contains `python313` for historical
reasons; the change is unrelated to any Python version.
2026-06-23 15:17:42 +08:00
Zhichang Yu
3f805a64f1 feat(agent): align Go agent behavior with Python (except retrieval component) (#16225)
## Summary

Aligns the **Go agent runtime/canvas/components/tools** behavior with
the **Python `agent/` implementation** so the same stored canvas DSL
produces the same execution result on either side. Every component,
tool, and runtime primitive in `internal/agent/` is now driven by the
same semantics as its Python counterpart — variable resolution, template
substitution, control flow, error reporting, retry/cancel, and stream
event shapes.

The **retrieval component is the one explicit exception** in this PR. It
is being reworked in a separate change and is excluded from this
alignment pass; the wrapper slot (`universe_a_wrappers.go →
newRetrievalComponent`) is preserved.

## Scope of alignment

### Components (all aligned with `agent/component/`)
`Begin` · `Message` · `LLM` (incl. ChatTemplateKwargs,
MessageHistoryWindowSize, VisualFiles, Cite, OutputStructure,
JSONOutput, TopP, MaxRetries, DelayAfterError, credentials) · `Agent`
(react + tool artifact capture + `Reset()` interface-assert) · `Switch`
(12/12 operators, Python-equivalent semantics) · `Categorize` · `Invoke`
· `Iteration` · `Loop` (macro-expansion through `workflowx.AddLoopNode`)
· `UserFillUp` (Python-equivalent interrupt/resume via eino
`compose.Interrupt`/`ResumeWithData`) · `FillUp` · `DataOperations` ·
`ListOperations` · `StringTransform` · `VariableAggregator` ·
`VariableAssigner` · `Browser` (full stagehand runtime parity) ·
`DocsGenerator` · `ExcelProcessor`.

### Tools (all aligned with `agent/tools/`)
`Retrieval` (wrapper slot only — logic out of scope) · `MCPToolAdapter`
(streamable-HTTP) · `CodeExec` (sandbox bridge with
`code_exec_contract.go` matching Python contract) · `AkShare` · `ArXiv`
· `Crawler` · `DeepL` · `DuckDuckGo` · `Email` · `ExeSQL` · `GitHub` ·
`Google` · `GoogleScholar` · `Jin10` · `PubMed` · `QWeather` · `SearXNG`
· `Tavily` · `Tushare` · `Wencai` · `Wikipedia` · `YahooFinance` —
uniform `eino tool.InvokableTool` interface, SSRF protection, shared
HTTP client.

### Canvas execution engine (`internal/agent/canvas/`)
Aligned with Python's `agent/canvas.py`:
- **Scheduler** (`scheduler.go`): state pre/post handlers, node lambdas,
per-component timeout resolver (4-level: per-class env → per-class table
→ uniform env → 600s fallback), `legacyNoOpNames`.
- **Loop subgraph** (`loop_subgraph.go`): Python-equivalent
`AddLoopNode` macro expansion + condition translation.
- **Multibranch** (`multibranch.go`): `Switch` / `Categorize` routing
via `compose.NewGraphMultiBranch` — same branch selection semantics as
Python.
- **Parallel subgraph** (`parallel_subgraph.go`): matches Python's
parallel fan-out contract.
- **Interrupt/Resume** (`interrupt_resume.go`): `UserFillUpNodeBody` /
`IsInterruptError` / `ExtractInterruptContexts` — replaces the
deprecated Python sentinel chain with eino's native interrupt API,
preserving the same external behavior.
- **Checkpoint** (`checkpoint_store.go`): `RedisCheckPointStore`
Get/Set/Delete, with business metadata (status / canvas_id /
parent_run_id) on a parallel Redis Hash.
- **RunTracker** (`run_tracker.go`): Start / MarkSucceeded / MarkFailed
/ MarkCancelled / AttachCheckpoint — same lifecycle as the Python run
record.
- **Cancel** (`cancel.go`): Redis pub/sub watch.
- **Stream** (`stream.go`): SSE channel with `messages` / `waiting` /
`errors` / `done` events, same shape as Python's `agent.canvas.RunEvent`
payload.

### DSL bridge (`internal/agent/dsl/`)
- `normalize.go`: v1↔v2 collapsed into a single wire format — Python and
Go consume the same stored JSON.
- `reset.go`: per-run state reset matches Python's `Canvas.reset()`
semantics.
- Testdata mirrors Python's `agent_msg.json` / `all.json` / etc.

### Runtime (`internal/agent/runtime/`)
- `CanvasState` / `NewCanvasState` / `GetVar` / `SetVar` / `ReadVars`:
same `{{cpn_id@param}}` resolution model.
- `ResolveTemplate` (regex fast path + gonja fallback) — Python
Jinja-style semantics.
- `selector.go`, `metrics.go`, `component.go`: shared runtime contracts.

## Out of scope (intentionally)

- **`Retrieval` component logic** — wrapped only; full parity lands in a
follow-up PR.
- **Frontend** — only minor dsl-bridge / canvas UX fixes ride along.
- **CLI / admin / model registry** — orthogonal to agent behavior.

## How alignment is verified

`internal/service/agent_run_e2e_test.go` exercises the **full production
chain** against real Python-shaped DSL fixtures:
```
loadCanvasForUser → versionDAO.GetLatest → decodeCanvasFromDSL →
canvas.Compile → cc.Workflow.Invoke → answer extraction
```
using in-memory SQLite + miniredis (no Docker). Covers:
- `TestRunAgent_RealCanvas_BeginMessage` — happy path, `{{sys.query}}`
resolution
- `TestRunAgent_RealCanvas_WaitForUserResume` — two-run resume cycle
(Python-equivalent)
- `TestRunAgent_RealCanvas_CompileFails` — unknown component name →
sanitized error (Python-equivalent)
- `TestRunAgent_RealCanvas_InvokeFails` — unresolvable template ref
(Python-equivalent)
- `TestRunAgent_RunTracker_AttachCheckpoint_CallSequence` —
Start→AttachCheckpoint→MarkSucceeded lifecycle

`internal/handler/agent_test.go` — SSE streaming parity (`Content-Type:
text/event-stream`, `data: {…}\n\n`, trailing `data: [DONE]\n\n`,
OpenAI-compatible non-stream `choices`).

`internal/agent/canvas/fixture_compile_test.go` + per-component tests
pin the Python-equivalent outputs.

```
go test -count=1 -v -run 'TestRunAgent_RealCanvas|TestRunAgent_RunTracker' ./internal/service/
```

## Design reference

`docs/develop/agent-go-port-design.md` (1329 lines, last cross-checked
2026-06-17) — module layout, per-component / per-tool inventory,
corner-case catalogue, and the actionable backlog (Section 14, including
the retrieval alignment follow-up).

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-22 11:58:29 +08:00
Manan Bansal
70c0121b78 Fix: preserve tables when parsing DOCX with the laws parser (#16008) (#16155)
## What

Fixes #16008 — tables contained in a DOCX are silently dropped when the
document is parsed with the **laws** chunking method.

## Root cause

`Docx.__call__` in `rag/app/laws.py` iterated `self.doc.paragraphs`,
which only yields paragraph elements. Tables are separate `tbl` blocks
in the document body, so they were never visited and were lost from the
output. (The `naive` parser already handles tables by iterating the
document body.)

## Changes

- Iterate `self.doc._element.body` so tables are visited in document
order alongside paragraphs.
- Add a `__table_to_html` helper that renders each table to HTML,
including merged-cell `colspan` detection (mirrors the `naive` parser's
logic).
- Inject each table into the section tree with a sentinel level deeper
than any heading, so `Node.build_tree` merges it into its **enclosing
section** — keeping the chapter/article title path as retrieval context
rather than producing an orphaned chunk.
- Guard the `h2_level` computation against an empty heading set, so a
tables-only or empty DOCX no longer raises `IndexError`.

This keeps the laws parser's hierarchical chunking **and** adds table
extraction, so users no longer have to choose between losing structure
(naive) or losing tables (laws).

## Tests

Adds `test/unit_test/rag/test_laws_docx_tables.py` covering:
- table content is preserved and carries its section title path,
- merged adjacent cells collapse to `colspan`,
- tables-only document does not crash,
- empty document returns `[]`.

All four pass; `ruff check` / `ruff format` are clean.
2026-06-22 09:46:44 +08:00
jaso0n0818
a70c7e8cc7 fix(deepdoc): attach lone header lines to the following section when delimiter is set (#16109)
## Summary
Fixes #15487 — lone markdown headers are no longer isolated as empty
chunks when a custom `delimiter` is set.

- Merge consecutive lone headers before attaching to the following prose
body
- Skip code fences, tables, lists, and blockquotes via
`_is_attachable_body()`
- Unit tests include the `# Title / ## Intro / Body` regression from
CodeRabbit review

## Validation
- `pytest test/unit_test/deepdoc/parser/test_markdown_parser.py` (11
passed locally)

Closes #15487
2026-06-18 14:24:09 +08:00
xu haiLong
a9ddcae0b3 Fix: MCP dataset discovery fails due to REST API max page size limit … (#16148)
Fix #16146
2026-06-18 09:39:37 +08:00
Zhichang Yu
e45659868a feat(agent): ship the Go agent canvas port — eino interrupt/resume + Redis check-pointing (#16035)
Replaces the Python agent canvas runtime with a Go implementation that
runs inside `cmd/server_main`.

The canvas compiles into an eino Workflow that pauses on wait-for-user
via native Interrupt/Resume (no sentinel flag) and resumes from a
Redis-backed CheckPointStore.

All 21 Python agent components and ~35 tools are ported with functional
parity.

Sandbox providers now read their JSON config from the admin-panel
system_settings table with env fallback.

234 files / +35,413 / -6,111. All Go files are gofmt-clean (CI gate
added); drops the v2 DSL E2E step and the gap-analysis plan (both
redundant after the port ships).

## Type of change

- [x] Refactoring
- [x] New feature
- [x] Bug fix

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-06-17 13:24:03 +08:00
galuis116
6bfaa3f21e Fix: SSRF in markdown parser remote image fetch (#15438)
### What problem does this PR solve?

`rag/app/naive.py` `Markdown.load_images_from_urls` fetched image URLs
parsed
straight out of an untrusted uploaded markdown document via a raw
`requests.get`,
with no SSRF validation. Markdown chunking always reaches this path
(`return_section_images=True`), so any authenticated user who uploads a
`.md`/`.markdown`/`.mdx` file to a knowledge base could make the server
issue
requests to internal services or cloud-metadata endpoints, e.g.
`![x](http://169.254.169.254/latest/meta-data/...)`. The `image/`
Content-Type
check only gates decoding — the outbound request (the SSRF) always
fires.

This was the one user-controlled fetch site missed by the project's
existing
SSRF-hardening (`common/ssrf_guard.py`, already applied to the crawler,
SearXNG,
RSS connector, MCP/document APIs, and OAuth avatar download).

The fix validates and DNS-pins every hop with
`common.ssrf_guard.assert_url_is_safe`
before connecting, and follows redirects manually so each redirect
target is
re-validated (closing the DNS-rebinding / redirect-bypass window),
mirroring
`common/data_source/rss_connector.py`. Blocked URLs are skipped and
logged like
any other unreachable image, so legitimate public images are unaffected.
Adds a
regression test at `test/unit_test/rag/app/test_markdown_image_ssrf.py`.

Closes #15437 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Ubuntu <ubuntu@ubuntu-2204.linuxvmimages.local>
Co-authored-by: galuis116 <galuis116@users.noreply.github.com>
2026-06-16 18:54:55 +08:00
Lynn
70792de899 Fix: v0.26.1 model provider (#16073)
### What problem does this PR solve?

Fix:
- Pass session_id to langfuse.
- Get correct status for add model_type.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-16 16:21:43 +08:00
buua436
8e235b7b95 fix: add legacy chat/completions mode (#16014)
### What problem does this PR solve?
Adds a legacy mode for /chat/completions that restores v0.23.0-style
output by converting start_to_think/end_to_think back into raw
<think></think> markers and streaming cumulative answer text.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-16 10:34:06 +08:00
dripsmvcp
53d4d9b3bd fix(api): return 4xx not 500 when attachment blob is missing (#15509)
Guard the agent-attachment download against a missing or empty storage blob so the caller gets a structured 4xx (`Document not found!`) instead of an HTTP 500. Same bug class as #15365 on document preview.
Resolve #15502
2026-06-15 15:41:49 +08:00
Carl Harris
a2de880b6d fix(profile): enforce profile name validation and input constraints (#15694)
### What problem does this PR solve?

The Profile **Name** field currently lacks application-level validation
and allows users to save excessively long names and unsupported special
characters.

While the database enforces a maximum length of 100 characters, neither
the frontend nor backend validates nickname format before persistence.
This can result in inconsistent user data, poor user experience, and UI
layout issues when long names wrap across multiple lines.

This PR introduces consistent frontend and backend validation for
profile names, enforces length and character constraints, provides clear
validation feedback, and prevents invalid values from being saved.

Fixes #15693

### Type of change

* [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-12 11:13:18 +08:00
Jonathan Chang
de06c9a60b feat: Langfuse session grouping for multi-turn chat traces (#15679)
## Summary

This PR passes `session_id` into Langfuse trace observations so
multi-turn chat messages can be grouped under the same session in
Langfuse.

Changes include:
- Propagate `session_id` from chat/session APIs into
`dialog_service.async_chat`.
- Pass `session_id` into Langfuse `start_observation(...)`.
- Share Langfuse `trace_context` with chat, embedding, rerank, and TTS
model bundles where applicable.
- Add unit coverage to verify Langfuse observations receive
`session_id`.
- Update affected test stubs for the new optional Langfuse context
arguments.

## Related Issue
Closes: #15636 

## Change Type
- [x] Feature
- [x] Bug fix
- [x] Test
- [ ] Refactor
- [ ] Documentation
- [ ] Breaking change

## Real Behavior Proof

Before this change:

- Langfuse observations were created without `session_id`.
- Multi-turn chat traces could not be grouped by session in Langfuse.

After this change:

- Chat/session flows pass `session_id` into `async_chat`.
- Langfuse observations include `session_id`.
- Related model bundles receive shared trace context and session
metadata.

Validation result:

```bash
uv run python -m py_compile \
  api/db/services/tenant_llm_service.py \
  api/db/services/llm_service.py \
  api/db/services/dialog_service.py \
  api/db/services/conversation_service.py \
  api/apps/restful_apis/chat_api.py \
  test/unit_test/api/db/services/test_dialog_service_final_answer.py \
  test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py
```
Passed.

```bash
uv run pytest \
  test/unit_test/api/db/services/test_dialog_service_final_answer.py \
  test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py -q
```
Result:

```text
11 passed in 16.89s
```

```bash
git diff --check
```
Passed.
## Checklist

- [x] Analyzed the issue requirement.
- [x] Checked existing Langfuse trace integration.
- [x] Implemented only the requested session grouping behavior.
- [x] Added/updated unit tests.
- [x] Ran focused tests successfully.
- [x] Ran Python compile validation.
- [x] Ran whitespace diff validation.
2026-06-12 10:18:06 +08:00
Dexterity
bde2b1fc6d fix(llm): correct error handling, token accounting, and truncation in embedding providers (#15424)
### Summary

Closes #15423

`rag/llm/embedding_model.py` hosts about 40 embedding providers that
shared several defects affecting indexing reliability, cost accounting,
and error visibility. This PR fixes four concrete bugs.

**Masked, inconsistent errors (27 sites).** Nearly every provider ran
`log_exception(_e, res)` followed by `raise Exception(f"Error: {res}")`.
Because `log_exception` always raises, the second line was dead code,
and the surfaced exception varied with whether the SDK response exposed
a `.text` attribute. Every failure path now raises a single
`EmbeddingError` that includes the underlying response detail, so the
cause of a failed embedding is consistent and visible.

**Fabricated token counts.** `LocalAIEmbed` returned a hardcoded `1024`
and `OllamaEmbed` added `128` per text. These values feed `used_tokens`
and therefore billing and usage tracking. Both now report the real count
from the API (Ollama `prompt_eval_count`, LocalAI `usage`) and fall back
to a local token count only when the server omits it.

**Truncation overshoot.** The `8196` limit used by Mistral and Bedrock
exceeded the standard `8192` ceiling and could push boundary sized
inputs past the model limit. Limits are corrected to `8192` and made
intentional per provider, and providers that rely on server side
truncation now request it explicitly (Ollama `truncate=True`, Cohere
`truncate="END"`).

**Missing batching on Zhipu and Ollama.** Both issued one request per
text. They now batch like the other OpenAI compatible providers, turning
N round trips into `ceil(N / batch_size)`. Batched results are realigned
by response `index` so a chunk always keeps its own vector.

A shared `Base._batched_encode` helper owns the batch loop, optional
truncation, result accumulation, and the single error path. It is the
mechanism that lets these fixes live in one place instead of across 27
duplicated sites. The public `encode()` and `encode_queries()` contract
stays the same, so existing callers are unaffected.

Tests covering all four fixes are added under
`test/unit_test/rag/llm/test_embedding_model.py`.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-11 19:29:46 +08:00
bohdansolovie
381091df71 fix(dialog): guard async_ask() against empty or invalid kb_ids (#15530)
Fixes #15529 .

### Problem

`async_ask()` accessed `kbs[0]` without verifying that
`KnowledgebaseService.get_by_ids()` returned any knowledge bases. Empty
or stale `kb_ids` raised `IndexError`, which surfaced as HTTP 500 on
search/bot SSE endpoints.

### Fix

- Add an early guard when `kbs` is empty, yielding a final SSE error
event (consistent with `gen_mindmap()` in the same module).
- Add regression tests for empty `kb_ids` and deleted/invalid KB IDs.

### Test plan

- [ ] `pytest
test/unit_test/api/db/services/test_dialog_service_final_answer.py -k
"async_ask_empty or async_ask_stale"`
- [ ] Manual: `POST /api/v1/searchbots/ask` with invalid `kb_ids`
returns SSE error, not HTTP 500

---------

Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-06-11 15:52:59 +08:00
oktofeesh
c15b2b3f66 fix(connectors): enforce WebDAV numeric string size limits (#15731)
## Summary
- Normalize WebDAV file-size metadata before applying the sync size
threshold.
- Enforce the same threshold for numeric string sizes in both document
sync and slim snapshot paths.
- Add focused WebDAV unit coverage for size parsing and over-threshold
skips.

## Why
Some WebDAV servers return file sizes from PROPFIND metadata as strings.
The previous threshold check only handled integer values, so oversized
files could still be downloaded and sent into the chunking pipeline.

Closes #15724.

## Validation
- `uv run --no-project --with pytest --with pytest-asyncio pytest
test/unit_test/data_source/test_webdav_connector_unit.py -q`
- `uvx ruff check common/data_source/webdav_connector.py
test/unit_test/data_source/test_webdav_connector_unit.py`
- `python -m compileall -q common/data_source/webdav_connector.py
test/unit_test/data_source/test_webdav_connector_unit.py`
- `git diff --check`

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 15:47:54 +08:00
bohdansolovie
47fb462e46 fix(api): guard dataset delete when File2Document row is missing (#15533)
## Summary
Fixes #15532 — `delete_datasets()` crashes with `IndexError` when a
document has no `File2Document` row.
`delete_datasets()` in `dataset_api_service.py` called
`File2DocumentService.get_by_document_id()` and immediately accessed
`f2d[0].file_id` without checking whether the lookup returned any rows.
Documents created via API ingestion or connector sync may exist without
a linked file record, causing dataset deletion to abort with HTTP 500.
This PR mirrors the existing guard already used in `file_service.py` and
`document_api_service.py`.
2026-06-11 15:18:08 +08:00
Jack
0d3e410826 fix: strip Ollama-style tag suffix from LocalAI model names (#15908)
## Summary

LocalAI exposes two API surfaces with conflicting naming conventions:
- `GET /api/tags` returns model names with `:latest` suffix (Ollama
format)
- `POST /v1/chat/completions` expects names without `:latest` (OpenAI
format)

RAGFlow discovered models via `/api/tags` and stored the tagged name,
then used it with `/v1/chat/completions`, causing a 404 error because
LocalAI didn't recognize `model:latest`.

## Fix

In `LocalAI.get_model_list()`, strip the tag suffix from model names
using `model["name"].rsplit(":", 1)[0]`, so stored names match what the
OpenAI-compatible endpoints expect.
2026-06-10 19:05:05 +08:00
Yingfeng
cf5cca5cbb Fix wrong unit test path (#15864) 2026-06-09 22:48:33 +08:00
cleanjunc
88e4d6bddb Fix: restore GraphRAG entity ranking by indexing pagerank and n-hop paths (#15797)
### Summary

Closes #15795 

Knowledge-graph queries rank entities by `pagerank * sim` in `KGSearch`,
but the entity chunks written at index time stopped carrying the values
that ranking depends on. `graph_node_to_chunk` only stored
`entity_type`, `description`, and `source_id`, dropping the node
`pagerank` and the n-hop neighbour paths, while `search.py` still read
them back as `rank_flt` and `n_hop_with_weight`.

The producer of these fields, `update_nodes_pagerank_nhop_neighbour`,
was removed in #6513, but the read side in `KGSearch` was never updated.
The result is that on every knowledge-graph query:

- `pagerank` resolves to `0`, so the `pagerank * sim` sort key is `0`
for every entity and selection falls back to arbitrary order.
- Every displayed entity score is `0.00`.
- The n-hop relation-enrichment block is dead code because `n_hop_ents`
is always empty, leaving `merge_tuples` and `is_continuous_subsequence`
orphaned.

This PR restores the missing index-time fields so the documented `P(E|Q)
= pagerank * sim` ranking and the n-hop enrichment work again.

What changed:

- `graph_node_to_chunk` now writes `rank_flt` from the node pagerank and
`n_hop_with_weight` from the recomputed n-hop neighbour paths.
- Reintroduced the n-hop path computation (`n_neighbor`) in
`rag/graphrag/utils.py`, reusing the previously orphaned `merge_tuples`
/ `is_continuous_subsequence` helpers, with a direction-agnostic
edge-weight lookup for undirected graphs. `set_graph` computes the paths
per added or updated node and passes them through.
- `KGSearch` now selects `n_hop_with_weight` in the entity keyword
search so Infinity and OceanBase return it (Elasticsearch and OpenSearch
already read it from `_source`), and the read is hardened against
missing keys or empty strings before `json.loads`.
- Added the `n_hop_with_weight` column to OceanBase, including the
`EXTRA_COLUMNS` migration entry so existing tables get it. The other
engines already map both fields via dynamic templates or the Infinity
mapping.

Scope note: pagerank and n-hop are re-indexed for the added or updated
nodes in each pass, consistent with the existing incremental indexing
design.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

### Testing

Added unit tests in
`test/unit_test/rag/graphrag/test_graphrag_utils.py`:

- `n_neighbor`: path and weight shape, one-hop vs two-hop, isolated
nodes, missing weights, and direction-agnostic lookup.
- `graph_node_to_chunk`: `rank_flt` populated from pagerank and
defaulting to `0`, `n_hop_with_weight` serialized and defaulting to an
empty list.

```
uv run pytest test/unit_test/rag/graphrag/   # 106 passed
uv run ruff check rag/graphrag/ rag/utils/ob_conn.py
```
2026-06-09 20:50:45 +08:00
Jack
3eff41361b fix: prevent None values in auto-metadata from causing KeyError (#15842)
## Problem

When users configure auto-metadata for a dataset, parsing crashes with:

```
KeyError: 'properties' in gen_metadata → schema["properties"]
```

## Root Cause

Pydantic `AutoMetadataField` defaults `enum` and `description` to `None`
when the frontend omits these fields:

```python
class AutoMetadataField(Base):
    enum: Annotated[list[str] | None, Field(default=None)]
    description: Annotated[str | None, Field(default=None)]
```

These `None` values propagate through the call chain and cause two
crashes:
2026-06-09 19:10:48 +08:00
Jonathan Chang
c586292993 feat: Implement checkpoint/resume support for GraphRAG community extraction and entity resolution (#15523)
## Summary

This PR adds checkpoint/resume support for the GraphRAG
`extract_community` and `resolve_entities` stages.

The implementation stores successful intermediate results in the
document store so interrupted ingestion can resume without repeating
already-completed LLM work. Checkpoints are loaded before each stage,
reused when available, saved after successful batch/community
processing, and cleaned up after the stage completes successfully.
## Related Issue
Closes: #15518
## Change Type
- [x] Feature
- [x] Bug fix
- [x] Test
- [ ] Refactor
- [ ] Documentation
- [ ] Breaking change
## Real Behavior Proof

Validation commands run locally:

```bash
uv run python -m py_compile \
  rag/graphrag/checkpoints.py \
  rag/graphrag/general/community_reports_extractor.py \
  rag/graphrag/entity_resolution.py \
  rag/graphrag/general/index.py \
  test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:

```text
Passed
```

```bash
uv run pytest test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:

```text
4 passed
```

```bash
uv run pytest \
  test/unit_test/rag/graphrag/test_phase_markers.py \
  test/unit_test/rag/graphrag/test_graphrag_utils.py \
  test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:

```text
95 passed
```

```bash
git diff --check
```
Result:

```text
Passed
```

## Checklist

- [x] Implemented checkpoint/resume support for `extract_community`.
- [x] Implemented checkpoint/resume support for `resolve_entities`.
- [x] Avoided touching unrelated API behavior.
- [x] Added unit tests for the new checkpoint helper logic.
- [x] Verified Python syntax compilation.
- [x] Ran related GraphRAG unit tests successfully.
- [x] Ran `git diff --check`.
- [ ] Ran full project test suite.

---------

Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-06-09 15:34:47 +08:00
Yash Raj Pandey
f2aadd3871 Fix: is_english() returns False for any list argument (broken language detection) (#15489)
### What problem does this PR solve?

`is_english()` in `rag/nlp/__init__.py` compiles a **single-character**
regex class and `fullmatch`es it against each item:

```python
pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]")   # no quantifier
...
eng = sum(1 for t in texts if pattern.fullmatch(t.strip()))
```

For a **string** argument the text is first split into single characters
(`texts = list(texts)`), so each `fullmatch` sees one character and
works. But for a **list** argument each item is a whole multi-character
string, and `fullmatch` of a one-character pattern against a
multi-character string always fails — so `is_english()` returns `False`
for **any** list, regardless of content.

```python
is_english("This is English")                              # True   (ok)
is_english(["The quick brown fox jumps.", "Hello world."]) # False  (bug — should be True)
is_english(["这是中文。"])                                    # False  (right answer, wrong reason)
```

Many call sites pass lists and were therefore silently always-`False`,
e.g.:

- `rag/llm/chat_model.py:1088`, `rag/llm/cv_model.py:168,1155` —
`is_english([ans])` when an answer is truncated at `max_tokens`, so an
English reply gets the Chinese "······由于长度的原因,回答被截断了,要继续吗?" continuation
suffix instead of the English one.
- `rag/app/book.py` — `remove_contents_table(...,
eng=is_english([...sections...]))`, so English books have their contents
table stripped in Chinese mode.
- `common/doc_store/es_conn_base.py:339`,
`rag/utils/opensearch_conn.py:733` — `is_english(txt.split())` in
highlight handling.
- plus `rag/app/qa.py`, `rag/flow/parser/utils.py`,
`common/doc_store/infinity_conn_base.py`.

### Fix

Add a `+` quantifier so an all-English multi-character item matches:

```python
pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]+")
```

The string path is unchanged (single characters still match) and
non-English lists still return `False`. Adds
`test/unit_test/rag/test_is_english.py`; the two list cases fail before
this change and pass after.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Used the Claude CLI while working on this.
2026-06-08 20:25:23 +08:00
euvre
d9a04ef702 fix: support auto mode in table parser document metadata aggregation (#15780)
### What problem does this PR solve?

Table parser metadata aggregation previously only ran when
`table_column_mode` was set to `manual`. In auto mode (default), all
columns default to `"both"` role, meaning they should also be aggregated
into document-level metadata for UI/chat filters. Additionally, the task
snapshot could be stale — `table_column_names` are written to KB
`parser_config` during `chunk()` but the task may have been created
before that.

Changes:
- Renames `aggregate_table_manual_doc_metadata` →
`aggregate_table_doc_metadata`
- Supports both `"manual"` and `"auto"` `table_column_mode` (defaults to
`"auto"`)
- Reloads `table_column_names` from KB DB when missing from task
snapshot
- Removes the manual-only guard in `task_executor` and refactored
`post_processor`
- Updates all tests with new function name and adds auto mode test cases

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-08 19:08:23 +08:00
天海蒼灆
17f27b9df2 fix(browser): show resolved variables in workflow run log input (#15325)
### What problem does this PR solve?

Browser parsed sys.query from prompts but never called set_input_value,
so node_finished inputs displayed null in the agent orchestration run
log.
Additionally, Browser’s tenant-model path could trigger unsupported
structured-output modes (response_format/tool_choice) for some
OpenAI-compatible providers (notably DeepSeek thinking models), causing
step failures.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-08 18:12:56 +08:00