### What problem does this PR solve?
`DeepLParam.check()` validated `self.top_n`, but DeepL has no such
parameter (it is not defined on the param class or its base), so
`check()` always raised `AttributeError` and a DeepL component could
never pass validation. Removed the bogus `top_n` check.
Also fixed the `_run` except branch, which computed
`be_output("**Error**...")` but never returned it, silently dropping the
error message.
Closes#16329
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Add test cases
### Testing
Added `test/unit_test/agent/component/test_deepl.py` covering
`DeepLParam.check()` with valid defaults and rejection of invalid
source/target languages.
### What problem does this PR solve?
Fixes the PubMed tool always emitting `Authors: Unknown Authors`. The
`safe_find` closure in `_format_pubmed_content` was hardcoded to search
from the article root, so the per-author `LastName`/`ForeName` lookups
never matched.
`safe_find` now accepts an optional `base` node (defaults to `child`,
preserving the existing field lookups), and the author loop passes the
current `<Author>` element.
Closes#16328
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Add test cases
### Testing
Added `test/testcases/test_web_api/test_canvas_app/test_pubmed_unit.py`
covering per-author parsing, intact title/journal/DOI fields, and the
no-authors fallback.
Before: `Authors: Unknown Authors`
After: `Authors: Furqan Khan, Jane Smith`
### What problem does this PR solve?
`_get_optimal_clusters` in `rag/raptor.py` had two edge-case issues in
GMM cluster-count selection:
1. It used `np.arange(1, max_clusters)`, which never evaluates the
upper-bound candidate (`max_clusters`).
2. When effective `max_clusters` becomes `1`, the candidate list was
empty and `argmin` crashed.
This PR makes candidate evaluation inclusive (`1..max_clusters`) and
guards the single-cluster case by returning `1` directly.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Validation
- `pytest test/unit_test/rag/test_raptor_psi_tree_builder.py
--config-file pyproject.toml -q`
- `ruff check rag/raptor.py
test/unit_test/rag/test_raptor_psi_tree_builder.py`
### Tests added
- Regression test for `max_cluster == 1` path (no crash, returns 1)
- Regression test verifying upper-bound candidate is evaluated and can
be selected
_AI-assistance disclosure: parts of this change (bug triage and test
scaffolding) were drafted with AI assistance and fully reviewed and
verified by me._
---------
Co-authored-by: Harsh Kashyap <harshkashyap@Harshs-MacBook-Pro.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
## Problem
`thread_pool_exec()` dispatches work via `loop.run_in_executor()`, which
submits the callable with a plain `executor.submit(func, *args)` and
does **not** copy the caller's `contextvars.Context`. So a `ContextVar`
set in the async caller is not visible inside the function running in
the worker thread.
This differs from `asyncio.to_thread()`, which runs the callable inside
a copied context. `run_in_executor()` has never propagated context
(verified on Python 3.12 and 3.13) — so this is a pre-existing gap in
the helper, **not** a regression or a Python-version compatibility
issue.
Concretely, any code that sets a `ContextVar` in async code and reads it
inside a function dispatched via `thread_pool_exec` (request tracing,
per-task state, Langfuse trace propagation, etc.) silently loses that
context.
## Fix
Copy the current context before submitting and run the callable inside
it with `ctx.run()`, matching what `asyncio.to_thread()` does:
```python
async def thread_pool_exec(func, *args, **kwargs):
loop = asyncio.get_running_loop()
ctx = contextvars.copy_context()
if kwargs:
inner = functools.partial(func, *args, **kwargs)
return await loop.run_in_executor(_thread_pool_executor(), ctx.run, inner)
return await loop.run_in_executor(_thread_pool_executor(), ctx.run, func, *args)
```
This explicitly **adds** ContextVar propagation to the helper (it does
not restore any prior behavior). Backward-compatible.
## Tests
`TestThreadPoolExec` covers propagation, the kwargs path, per-call
isolation and the unset-default case.
> Note: the branch name still contains `python313` for historical
reasons; the change is unrelated to any Python version.
## What
Fixes#16008 — tables contained in a DOCX are silently dropped when the
document is parsed with the **laws** chunking method.
## Root cause
`Docx.__call__` in `rag/app/laws.py` iterated `self.doc.paragraphs`,
which only yields paragraph elements. Tables are separate `tbl` blocks
in the document body, so they were never visited and were lost from the
output. (The `naive` parser already handles tables by iterating the
document body.)
## Changes
- Iterate `self.doc._element.body` so tables are visited in document
order alongside paragraphs.
- Add a `__table_to_html` helper that renders each table to HTML,
including merged-cell `colspan` detection (mirrors the `naive` parser's
logic).
- Inject each table into the section tree with a sentinel level deeper
than any heading, so `Node.build_tree` merges it into its **enclosing
section** — keeping the chapter/article title path as retrieval context
rather than producing an orphaned chunk.
- Guard the `h2_level` computation against an empty heading set, so a
tables-only or empty DOCX no longer raises `IndexError`.
This keeps the laws parser's hierarchical chunking **and** adds table
extraction, so users no longer have to choose between losing structure
(naive) or losing tables (laws).
## Tests
Adds `test/unit_test/rag/test_laws_docx_tables.py` covering:
- table content is preserved and carries its section title path,
- merged adjacent cells collapse to `colspan`,
- tables-only document does not crash,
- empty document returns `[]`.
All four pass; `ruff check` / `ruff format` are clean.
## Summary
Fixes#15487 — lone markdown headers are no longer isolated as empty
chunks when a custom `delimiter` is set.
- Merge consecutive lone headers before attaching to the following prose
body
- Skip code fences, tables, lists, and blockquotes via
`_is_attachable_body()`
- Unit tests include the `# Title / ## Intro / Body` regression from
CodeRabbit review
## Validation
- `pytest test/unit_test/deepdoc/parser/test_markdown_parser.py` (11
passed locally)
Closes#15487
### What problem does this PR solve?
- Update version tags in README files (including translations) from
v0.26.0 to v0.26.1
- Modify Docker image references and documentation to reflect new
version
- Update version badges and image descriptions
- Maintain consistency across all language variants of README files
### Type of change
- [x] Documentation Update
Replaces the Python agent canvas runtime with a Go implementation that
runs inside `cmd/server_main`.
The canvas compiles into an eino Workflow that pauses on wait-for-user
via native Interrupt/Resume (no sentinel flag) and resumes from a
Redis-backed CheckPointStore.
All 21 Python agent components and ~35 tools are ported with functional
parity.
Sandbox providers now read their JSON config from the admin-panel
system_settings table with env fallback.
234 files / +35,413 / -6,111. All Go files are gofmt-clean (CI gate
added); drops the v2 DSL E2E step and the gap-analysis plan (both
redundant after the port ships).
## Type of change
- [x] Refactoring
- [x] New feature
- [x] Bug fix
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
### What problem does this PR solve?
`rag/app/naive.py` `Markdown.load_images_from_urls` fetched image URLs
parsed
straight out of an untrusted uploaded markdown document via a raw
`requests.get`,
with no SSRF validation. Markdown chunking always reaches this path
(`return_section_images=True`), so any authenticated user who uploads a
`.md`/`.markdown`/`.mdx` file to a knowledge base could make the server
issue
requests to internal services or cloud-metadata endpoints, e.g.
``. The `image/`
Content-Type
check only gates decoding — the outbound request (the SSRF) always
fires.
This was the one user-controlled fetch site missed by the project's
existing
SSRF-hardening (`common/ssrf_guard.py`, already applied to the crawler,
SearXNG,
RSS connector, MCP/document APIs, and OAuth avatar download).
The fix validates and DNS-pins every hop with
`common.ssrf_guard.assert_url_is_safe`
before connecting, and follows redirects manually so each redirect
target is
re-validated (closing the DNS-rebinding / redirect-bypass window),
mirroring
`common/data_source/rss_connector.py`. Blocked URLs are skipped and
logged like
any other unreachable image, so legitimate public images are unaffected.
Adds a
regression test at `test/unit_test/rag/app/test_markdown_image_ssrf.py`.
Closes#15437
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Ubuntu <ubuntu@ubuntu-2204.linuxvmimages.local>
Co-authored-by: galuis116 <galuis116@users.noreply.github.com>
### What problem does this PR solve?
Fix:
- Pass session_id to langfuse.
- Get correct status for add model_type.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Adds a legacy mode for /chat/completions that restores v0.23.0-style
output by converting start_to_think/end_to_think back into raw
<think></think> markers and streaming cumulative answer text.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Guard the agent-attachment download against a missing or empty storage blob so the caller gets a structured 4xx (`Document not found!`) instead of an HTTP 500. Same bug class as #15365 on document preview.
Resolve#15502
### What problem does this PR solve?
| # | Method | Endpoint | Description | Git Equivalent |
|---|--------|----------|-------------|----------------|
| 1 | `POST` | `/api/v1/{prefix}/{folder_id}/commits` | Create a
snapshot commit with file changes (add/modify/delete/rename) | `git add`
+ `git commit` |
| 2 | `GET` | `/api/v1/{prefix}/{folder_id}/commits` | List commit
history (paginated) | `git log` |
| 3 | `GET` | `/api/v1/{prefix}/{folder_id}/commits/{commit_id}` | Get
commit detail with file changes | `git show` |
| 4 | `GET` | `/api/v1/{prefix}/{folder_id}/commits/{commit_id}/files` |
List file changes in a commit | `git show --name-status` |
| 5 | `GET` |
`/api/v1/{prefix}/{folder_id}/commits/diff?from=...&to=...` | Compare
two commits and return differences | `git diff` |
| 6 | `GET` | `/api/v1/{prefix}/{folder_id}/changes` | Get uncommitted
changes (add/modify/delete) | `git status` |
| 7 | `GET` | `/api/v1/{prefix}/{folder_id}/commits/{commit_id}/tree` |
Get the folder tree snapshot at commit time | `git ls-tree` |
| 8 | `GET` |
`/api/v1/{prefix}/{folder_id}/commits/{commit_id}/files/{file_id}/content`
| Get a file's content as it existed in a specific commit | `git show
HEAD:file` |
| 9 | `GET` | `/api/v1/{prefix}/{file_id}/versions` | Get version
history for a specific file across all commits | `git log -- file` |
Where `{prefix}/{id}` can be:
- `folders/{folder_id}` — direct folder access
- `workspaces/{workspace_id}` — alias of `folders/{folder_id}`
- `datasets/{dataset_id}` — resolves to the dataset's folder
- `memories/{memory_id}` — resolves to the memory's folder
- `skills/{skill_id}` — resolves to the skill's folder
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
### What problem does this PR solve?
The Profile **Name** field currently lacks application-level validation
and allows users to save excessively long names and unsupported special
characters.
While the database enforces a maximum length of 100 characters, neither
the frontend nor backend validates nickname format before persistence.
This can result in inconsistent user data, poor user experience, and UI
layout issues when long names wrap across multiple lines.
This PR introduces consistent frontend and backend validation for
profile names, enforces length and character constraints, provides clear
validation feedback, and prevents invalid values from being saved.
Fixes#15693
### Type of change
* [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
This PR passes `session_id` into Langfuse trace observations so
multi-turn chat messages can be grouped under the same session in
Langfuse.
Changes include:
- Propagate `session_id` from chat/session APIs into
`dialog_service.async_chat`.
- Pass `session_id` into Langfuse `start_observation(...)`.
- Share Langfuse `trace_context` with chat, embedding, rerank, and TTS
model bundles where applicable.
- Add unit coverage to verify Langfuse observations receive
`session_id`.
- Update affected test stubs for the new optional Langfuse context
arguments.
## Related Issue
Closes: #15636
## Change Type
- [x] Feature
- [x] Bug fix
- [x] Test
- [ ] Refactor
- [ ] Documentation
- [ ] Breaking change
## Real Behavior Proof
Before this change:
- Langfuse observations were created without `session_id`.
- Multi-turn chat traces could not be grouped by session in Langfuse.
After this change:
- Chat/session flows pass `session_id` into `async_chat`.
- Langfuse observations include `session_id`.
- Related model bundles receive shared trace context and session
metadata.
Validation result:
```bash
uv run python -m py_compile \
api/db/services/tenant_llm_service.py \
api/db/services/llm_service.py \
api/db/services/dialog_service.py \
api/db/services/conversation_service.py \
api/apps/restful_apis/chat_api.py \
test/unit_test/api/db/services/test_dialog_service_final_answer.py \
test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py
```
Passed.
```bash
uv run pytest \
test/unit_test/api/db/services/test_dialog_service_final_answer.py \
test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py -q
```
Result:
```text
11 passed in 16.89s
```
```bash
git diff --check
```
Passed.
## Checklist
- [x] Analyzed the issue requirement.
- [x] Checked existing Langfuse trace integration.
- [x] Implemented only the requested session grouping behavior.
- [x] Added/updated unit tests.
- [x] Ran focused tests successfully.
- [x] Ran Python compile validation.
- [x] Ran whitespace diff validation.
### Summary
Closes#15423
`rag/llm/embedding_model.py` hosts about 40 embedding providers that
shared several defects affecting indexing reliability, cost accounting,
and error visibility. This PR fixes four concrete bugs.
**Masked, inconsistent errors (27 sites).** Nearly every provider ran
`log_exception(_e, res)` followed by `raise Exception(f"Error: {res}")`.
Because `log_exception` always raises, the second line was dead code,
and the surfaced exception varied with whether the SDK response exposed
a `.text` attribute. Every failure path now raises a single
`EmbeddingError` that includes the underlying response detail, so the
cause of a failed embedding is consistent and visible.
**Fabricated token counts.** `LocalAIEmbed` returned a hardcoded `1024`
and `OllamaEmbed` added `128` per text. These values feed `used_tokens`
and therefore billing and usage tracking. Both now report the real count
from the API (Ollama `prompt_eval_count`, LocalAI `usage`) and fall back
to a local token count only when the server omits it.
**Truncation overshoot.** The `8196` limit used by Mistral and Bedrock
exceeded the standard `8192` ceiling and could push boundary sized
inputs past the model limit. Limits are corrected to `8192` and made
intentional per provider, and providers that rely on server side
truncation now request it explicitly (Ollama `truncate=True`, Cohere
`truncate="END"`).
**Missing batching on Zhipu and Ollama.** Both issued one request per
text. They now batch like the other OpenAI compatible providers, turning
N round trips into `ceil(N / batch_size)`. Batched results are realigned
by response `index` so a chunk always keeps its own vector.
A shared `Base._batched_encode` helper owns the batch loop, optional
truncation, result accumulation, and the single error path. It is the
mechanism that lets these fixes live in one place instead of across 27
duplicated sites. The public `encode()` and `encode_queries()` contract
stays the same, so existing callers are unaffected.
Tests covering all four fixes are added under
`test/unit_test/rag/llm/test_embedding_model.py`.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
- Update version tags in README files (including translations) from
v0.25.6 to v0.26.0
- Modify Docker image references and documentation to reflect new
version
- Update version badges and image descriptions
- Maintain consistency across all language variants of README files
### Type of change
- [x] Documentation Update
Fixes#15529 .
### Problem
`async_ask()` accessed `kbs[0]` without verifying that
`KnowledgebaseService.get_by_ids()` returned any knowledge bases. Empty
or stale `kb_ids` raised `IndexError`, which surfaced as HTTP 500 on
search/bot SSE endpoints.
### Fix
- Add an early guard when `kbs` is empty, yielding a final SSE error
event (consistent with `gen_mindmap()` in the same module).
- Add regression tests for empty `kb_ids` and deleted/invalid KB IDs.
### Test plan
- [ ] `pytest
test/unit_test/api/db/services/test_dialog_service_final_answer.py -k
"async_ask_empty or async_ask_stale"`
- [ ] Manual: `POST /api/v1/searchbots/ask` with invalid `kb_ids`
returns SSE error, not HTTP 500
---------
Co-authored-by: Wang Qi <wangq8@outlook.com>
### What problem does this PR solve?
`POST /api/v1/datasets/{dataset_id}/documents/stop`
(`stop_parse_documents`) cancels parsing tasks and sets `run` to
`CANCEL`, but it does **not** remove chunks already indexed in the doc
store or reset `progress` / `chunk_num`. REST callers can end up with a
“cancelled” document that still returns partial chunks in `GET
.../chunks` and in retrieval.
Legacy `DELETE /api/v1/datasets/{dataset_id}/chunks` (`stop_parsing`)
already performs full cleanup: it resets counters and calls
`docStoreConn.delete`. This PR aligns the newer stop endpoint with that
behavior so both paths leave the dataset consistent.
Fixes [#15788](https://github.com/infiniflow/ragflow/issues/15788).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### Changes
- Update `stop_parse_documents` in `document_api.py` to reset `progress`
and `chunk_num` to `0` and delete partial chunks via
`docStoreConn.delete` after `cancel_all_task_of`.
- Add unit test `test_stop_parse_documents_cleans_partial_chunks` to
assert counters reset and doc store delete is invoked.
### Test plan
- [x] Unit test: `pytest
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py::TestDocRoutesUnit::test_stop_parse_documents_cleans_partial_chunks
-v`
- [ ] Manual: upload a slow document, start parse, call `POST
.../documents/stop` while `RUNNING`, verify `GET .../chunks` returns
zero chunks and UI `chunk_count` is 0
- [ ] Control: legacy `DELETE .../chunks` behavior unchanged
---------
Co-authored-by: Wang Qi <wangq8@outlook.com>
## Summary
- Normalize WebDAV file-size metadata before applying the sync size
threshold.
- Enforce the same threshold for numeric string sizes in both document
sync and slim snapshot paths.
- Add focused WebDAV unit coverage for size parsing and over-threshold
skips.
## Why
Some WebDAV servers return file sizes from PROPFIND metadata as strings.
The previous threshold check only handled integer values, so oversized
files could still be downloaded and sent into the chunking pipeline.
Closes#15724.
## Validation
- `uv run --no-project --with pytest --with pytest-asyncio pytest
test/unit_test/data_source/test_webdav_connector_unit.py -q`
- `uvx ruff check common/data_source/webdav_connector.py
test/unit_test/data_source/test_webdav_connector_unit.py`
- `python -m compileall -q common/data_source/webdav_connector.py
test/unit_test/data_source/test_webdav_connector_unit.py`
- `git diff --check`
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
## Summary
Fixes [#15585](https://github.com/infiniflow/ragflow/issues/15585).
- Route markdown preview through the shared `request` client (same as
txt/image previewers) so `Authorization` headers and interceptors are
applied consistently.
- Add a unit test covering `AUTH_BETA` token loading for embedded search
auth.
## Root cause
Search result preview for `.md`/`.mdx` used raw `fetch`, which did not
apply the same auth path as other preview types. That led to `401` on
`GET /api/v1/documents/{id}/preview` even when the user was logged in or
using an embedded search `auth` query param.
## Test plan
- [ ] Log in, run a search, open a markdown citation link — preview
loads (no 401).
- [ ] Open an embedded shared search URL with `auth` query param,
preview a markdown file — preview loads.
- [ ] Confirm PDF/txt preview still works in the same search UI.
---------
Co-authored-by: MkDev11 <89318445+bitloi@users.noreply.github.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
## Summary
Fixes#15532 — `delete_datasets()` crashes with `IndexError` when a
document has no `File2Document` row.
`delete_datasets()` in `dataset_api_service.py` called
`File2DocumentService.get_by_document_id()` and immediately accessed
`f2d[0].file_id` without checking whether the lookup returned any rows.
Documents created via API ingestion or connector sync may exist without
a linked file record, causing dataset deletion to abort with HTTP 500.
This PR mirrors the existing guard already used in `file_service.py` and
`document_api_service.py`.
## Summary
LocalAI exposes two API surfaces with conflicting naming conventions:
- `GET /api/tags` returns model names with `:latest` suffix (Ollama
format)
- `POST /v1/chat/completions` expects names without `:latest` (OpenAI
format)
RAGFlow discovered models via `/api/tags` and stored the tagged name,
then used it with `/v1/chat/completions`, causing a 404 error because
LocalAI didn't recognize `model:latest`.
## Fix
In `LocalAI.get_model_list()`, strip the tag suffix from model names
using `model["name"].rsplit(":", 1)[0]`, so stored names match what the
OpenAI-compatible endpoints expect.
### Summary
Closes#15795
Knowledge-graph queries rank entities by `pagerank * sim` in `KGSearch`,
but the entity chunks written at index time stopped carrying the values
that ranking depends on. `graph_node_to_chunk` only stored
`entity_type`, `description`, and `source_id`, dropping the node
`pagerank` and the n-hop neighbour paths, while `search.py` still read
them back as `rank_flt` and `n_hop_with_weight`.
The producer of these fields, `update_nodes_pagerank_nhop_neighbour`,
was removed in #6513, but the read side in `KGSearch` was never updated.
The result is that on every knowledge-graph query:
- `pagerank` resolves to `0`, so the `pagerank * sim` sort key is `0`
for every entity and selection falls back to arbitrary order.
- Every displayed entity score is `0.00`.
- The n-hop relation-enrichment block is dead code because `n_hop_ents`
is always empty, leaving `merge_tuples` and `is_continuous_subsequence`
orphaned.
This PR restores the missing index-time fields so the documented `P(E|Q)
= pagerank * sim` ranking and the n-hop enrichment work again.
What changed:
- `graph_node_to_chunk` now writes `rank_flt` from the node pagerank and
`n_hop_with_weight` from the recomputed n-hop neighbour paths.
- Reintroduced the n-hop path computation (`n_neighbor`) in
`rag/graphrag/utils.py`, reusing the previously orphaned `merge_tuples`
/ `is_continuous_subsequence` helpers, with a direction-agnostic
edge-weight lookup for undirected graphs. `set_graph` computes the paths
per added or updated node and passes them through.
- `KGSearch` now selects `n_hop_with_weight` in the entity keyword
search so Infinity and OceanBase return it (Elasticsearch and OpenSearch
already read it from `_source`), and the read is hardened against
missing keys or empty strings before `json.loads`.
- Added the `n_hop_with_weight` column to OceanBase, including the
`EXTRA_COLUMNS` migration entry so existing tables get it. The other
engines already map both fields via dynamic templates or the Infinity
mapping.
Scope note: pagerank and n-hop are re-indexed for the added or updated
nodes in each pass, consistent with the existing incremental indexing
design.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Testing
Added unit tests in
`test/unit_test/rag/graphrag/test_graphrag_utils.py`:
- `n_neighbor`: path and weight shape, one-hop vs two-hop, isolated
nodes, missing weights, and direction-agnostic lookup.
- `graph_node_to_chunk`: `rank_flt` populated from pagerank and
defaulting to `0`, `n_hop_with_weight` serialized and defaulting to an
empty list.
```
uv run pytest test/unit_test/rag/graphrag/ # 106 passed
uv run ruff check rag/graphrag/ rag/utils/ob_conn.py
```
## Problem
When users configure auto-metadata for a dataset, parsing crashes with:
```
KeyError: 'properties' in gen_metadata → schema["properties"]
```
## Root Cause
Pydantic `AutoMetadataField` defaults `enum` and `description` to `None`
when the frontend omits these fields:
```python
class AutoMetadataField(Base):
enum: Annotated[list[str] | None, Field(default=None)]
description: Annotated[str | None, Field(default=None)]
```
These `None` values propagate through the call chain and cause two
crashes:
## Summary
This PR adds checkpoint/resume support for the GraphRAG
`extract_community` and `resolve_entities` stages.
The implementation stores successful intermediate results in the
document store so interrupted ingestion can resume without repeating
already-completed LLM work. Checkpoints are loaded before each stage,
reused when available, saved after successful batch/community
processing, and cleaned up after the stage completes successfully.
## Related Issue
Closes: #15518
## Change Type
- [x] Feature
- [x] Bug fix
- [x] Test
- [ ] Refactor
- [ ] Documentation
- [ ] Breaking change
## Real Behavior Proof
Validation commands run locally:
```bash
uv run python -m py_compile \
rag/graphrag/checkpoints.py \
rag/graphrag/general/community_reports_extractor.py \
rag/graphrag/entity_resolution.py \
rag/graphrag/general/index.py \
test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:
```text
Passed
```
```bash
uv run pytest test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:
```text
4 passed
```
```bash
uv run pytest \
test/unit_test/rag/graphrag/test_phase_markers.py \
test/unit_test/rag/graphrag/test_graphrag_utils.py \
test/unit_test/rag/graphrag/test_checkpoints.py
```
Result:
```text
95 passed
```
```bash
git diff --check
```
Result:
```text
Passed
```
## Checklist
- [x] Implemented checkpoint/resume support for `extract_community`.
- [x] Implemented checkpoint/resume support for `resolve_entities`.
- [x] Avoided touching unrelated API behavior.
- [x] Added unit tests for the new checkpoint helper logic.
- [x] Verified Python syntax compilation.
- [x] Ran related GraphRAG unit tests successfully.
- [x] Ran `git diff --check`.
- [ ] Ran full project test suite.
---------
Co-authored-by: Wang Qi <wangq8@outlook.com>
### What problem does this PR solve?
`is_english()` in `rag/nlp/__init__.py` compiles a **single-character**
regex class and `fullmatch`es it against each item:
```python
pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]") # no quantifier
...
eng = sum(1 for t in texts if pattern.fullmatch(t.strip()))
```
For a **string** argument the text is first split into single characters
(`texts = list(texts)`), so each `fullmatch` sees one character and
works. But for a **list** argument each item is a whole multi-character
string, and `fullmatch` of a one-character pattern against a
multi-character string always fails — so `is_english()` returns `False`
for **any** list, regardless of content.
```python
is_english("This is English") # True (ok)
is_english(["The quick brown fox jumps.", "Hello world."]) # False (bug — should be True)
is_english(["这是中文。"]) # False (right answer, wrong reason)
```
Many call sites pass lists and were therefore silently always-`False`,
e.g.:
- `rag/llm/chat_model.py:1088`, `rag/llm/cv_model.py:168,1155` —
`is_english([ans])` when an answer is truncated at `max_tokens`, so an
English reply gets the Chinese "······由于长度的原因,回答被截断了,要继续吗?" continuation
suffix instead of the English one.
- `rag/app/book.py` — `remove_contents_table(...,
eng=is_english([...sections...]))`, so English books have their contents
table stripped in Chinese mode.
- `common/doc_store/es_conn_base.py:339`,
`rag/utils/opensearch_conn.py:733` — `is_english(txt.split())` in
highlight handling.
- plus `rag/app/qa.py`, `rag/flow/parser/utils.py`,
`common/doc_store/infinity_conn_base.py`.
### Fix
Add a `+` quantifier so an all-English multi-character item matches:
```python
pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]+")
```
The string path is unchanged (single characters still match) and
non-English lists still return `False`. Adds
`test/unit_test/rag/test_is_english.py`; the two list cases fail before
this change and pass after.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Used the Claude CLI while working on this.
### What problem does this PR solve?
Table parser metadata aggregation previously only ran when
`table_column_mode` was set to `manual`. In auto mode (default), all
columns default to `"both"` role, meaning they should also be aggregated
into document-level metadata for UI/chat filters. Additionally, the task
snapshot could be stale — `table_column_names` are written to KB
`parser_config` during `chunk()` but the task may have been created
before that.
Changes:
- Renames `aggregate_table_manual_doc_metadata` →
`aggregate_table_doc_metadata`
- Supports both `"manual"` and `"auto"` `table_column_mode` (defaults to
`"auto"`)
- Reloads `table_column_names` from KB DB when missing from task
snapshot
- Removes the manual-only guard in `task_executor` and refactored
`post_processor`
- Updates all tests with new function name and adds auto mode test cases
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Browser parsed sys.query from prompts but never called set_input_value,
so node_finished inputs displayed null in the agent orchestration run
log.
Additionally, Browser’s tenant-model path could trigger unsupported
structured-output modes (response_format/tool_choice) for some
OpenAI-compatible providers (notably DeepSeek thinking models), causing
step failures.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
### What problem does this PR solve?
`RAGFlowExcelParser.html()` iterates `(len(rows) - 1) // chunk_rows + 1`
times. `rows[0]` is the header, so `len(rows) - 1` is the data-row
count. When that count is an exact multiple of `chunk_rows`, the `+ 1`
over-counts by one: the final iteration's data slice is empty, but the
header row is still appended — producing a chunk that contains only the
table header and no data.
This is reachable via `rag/app/naive.py` (`html4excel`, `chunk_rows=12`)
and `rag/app/one.py`. A sheet with 12/24/36… data rows (or 256/512… with
the default `chunk_rows=256`) produces an extra
`<table><caption>…</caption><tr><th>…</th></tr></table>` chunk. It is
non-empty, so it passes the `if _` filter and gets indexed as a real
(empty) chunk.
| data rows (chunk_rows=12) | before | after |
|---|---|---|
| 12 | 2 chunks (1 header-only) | 1 |
| 24 | 3 chunks (1 header-only) | 2 |
| 13 | 2 (unchanged) | 2 |
### Fix
Iterate `ceil(n_data / chunk_rows)` times instead of `n_data //
chunk_rows + 1`. Adds
`test/unit_test/deepdoc/parser/test_excel_parser.py`; the
header-only-chunk cases fail before this change and pass after.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Used the Claude CLI while working on this.
### What problem does this PR solve?
Refine the stream parsing for `<think>` / `</think>` so MiniMax and
DeepSeek-style chunking both flush in the right order without mixing
think and answer buffers.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fixes the OpenSearch side of #10747: hybrid search drops the keyword
(BM25) leg and
ends up doing plain vector search.
When a search has both a text and a vector leg, `OSConnection.search()`
throws the text
query away:
del q["query"]
q["query"] = {"knn": knn_query}
The text clause only stays on as a filter inside the knn query, so it
narrows the
candidate set but doesn't count towards scoring. So hybrid search on
OpenSearch behaves
like plain vector search, unlike the Elasticsearch backend.
What I changed:
- when both legs are present, send a real hybrid query
`{"hybrid": {"queries": [bm25, {"knn": ...}]}}` and let a
normalization-processor
search pipeline score and combine the two legs
- only the actual filters (kb_id, available_int, ...) go in the knn
filter, not the
text must clause
- create the pipeline on startup if it's missing, so there's no separate
provisioning
step. name and weights can be set under `os:` in service_conf.yaml, or
via
`OS_HYBRID_PIPELINE`; defaults are `ragflow_hybrid_pipeline` and `[0.5,
0.5]`
- normalization-processor needs OpenSearch 2.10+. on older clusters, or
when the
pipeline can't be created, log a warning and fall back to vector-only
instead of
pointing at a pipeline that doesn't exist
This is only the hybrid-search fix; `create_doc_meta_idx` is already on
main.
Testing (there's no OpenSearch path in CI): added a unit test
(`test/unit_test/rag/utils/test_opensearch_hybrid_search.py`, no
services needed) that
checks the query built in each case — hybrid + pipeline param for
text+vector, plain knn
for vector-only, plain bool for text-only, the knn filter never carrying
the text
query_string, and the vector-only fallback when the pipeline isn't
available. Also ran
it against a real OpenSearch 2.19.1 container with a doc that matches
the keyword but
sits outside the knn top-k: pure knn returns `['D1','D2','D5']` (keyword
doc missing),
the hybrid query returns `['A','D1','D2','D5']` (keyword doc present).
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Signed-off-by: Danut Matei <matei.danut.dm@gmail.com>
### What problem does this PR solve?
Closes#15428
The hybrid score in `rag/nlp/search.py` (`rerank_by_model`) blends
reranker similarity with token similarity on a fixed `[0, 1]` scale:
```python
return tkweight * np.array(tksim) + vtweight * vtsim + rank_fea # tkweight=0.3, vtweight=0.7
```
The reranker implementations did not agree on that scale. Only three of
roughly 17 providers normalized their output, and `NvidiaRerank`
returned raw, unbounded logits. Weighted at `0.7`, a negative logit
could push a genuinely relevant chunk below pure keyword matches, and
its magnitude swamped `tksim`, which lives in `[0, 1]`. The practical
effect was that the same query produced differently scaled scores
depending on the configured reranker, and logit based providers degraded
retrieval quality instead of improving it.
This PR enforces a single scoring contract in one place:
- `Base.similarity` is now the only public entry point. It
short-circuits empty input and guarantees a normalized result. Each
provider implements its raw scoring in `_compute_rank`, which removes
sixteen duplicated empty input guards and the three scattered
normalization calls.
- Normalization is range aware. Providers that already return calibrated
`[0, 1]` relevance scores (Cohere, Jina, Voyage, and others) keep their
absolute magnitudes, so `similarity_threshold` filtering and the
reported `vector_similarity` stay meaningful. Only out-of-range output
such as NVIDIA logits is min-max rescaled into `[0, 1]`.
- The twelve leftover `[DEBUG ...]` prints in `rerank_by_model`,
introduced in #14231, are removed. They ran on every retrieval, added
per chunk overhead, and leaked queries, keywords, and document content
to stdout and logs.
A new regression suite in
`test/unit_test/rag/llm/test_rerank_normalization.py` covers logit
rescaling (positive, negative, and flat batches), preservation of
already calibrated scores, ordering, empty input handling, and the per
provider HTTP path. It also asserts that no provider overrides
`similarity()`, so the contract cannot silently drift.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Closes#15433
Reranked retrieval drops results and returns short pages once pagination
crosses the first candidate block, for the common page sizes 10 and 30.
In `rag/nlp/search.py`, the candidate window (`RERANK_LIMIT`) is rounded
up to a multiple of `page_size` to keep block based pagination aligned,
and then clamped back to 64:
```python
RERANK_LIMIT = math.ceil(64 / page_size) * page_size if page_size > 1 else 1 # e.g. 70 for page_size=10
RERANK_LIMIT = max(30, RERANK_LIMIT)
if rerank_mdl and top > 0:
RERANK_LIMIT = min(RERANK_LIMIT, top, 64) # clamps back to 64, breaking the multiple
```
`RERANK_LIMIT` is used both as the backend block size (`page =
global_offset // RERANK_LIMIT`) and as the modulus that slices a page
out of a reranked block (`begin = global_offset % RERANK_LIMIT`). When
it stops being a multiple of `page_size`, the block that gets fetched
and the slice taken from it no longer agree. With `page_size=10` and
`top=1024`, page 7 returns only 4 of 10 results and the head of the next
block is never shown on any page. This happens whenever the result set
spans more than one block, which is the default.
**Fix**
The window math is moved into a small reusable helper,
`Dealer._rerank_window`, which:
- targets a pool of about 64 candidates,
- bounds it by `top` when a reranker is active, and
- always rounds to a whole number of pages, so the window stays an exact
multiple of `page_size`.
The call site becomes a single line, and the alignment invariant now
lives in one documented place. Behavior is unchanged on every path that
was already aligned (the non reranked path and any `top` that already
produced a page multiple).
**Verification**
A simulation of the full retrieval path (per block rerank, similarity
threshold filter, and the exact `page // window` and `offset % window`
math) confirms the fix loses nothing where the old code lost real
results:
```
ps=10 top=1024: new window=70 dropped_valid=0 | old window=64 dropped_valid=16
ps=30 top=1024: new window=90 dropped_valid=0 | old window=64 dropped_valid=66
```
New unit tests in `test/unit_test/rag/test_search_pagination.py` cover
the alignment invariant, cross block pagination (every candidate
surfaced once, in order, no gaps, no short interior pages), the reported
regression, and parity with the old window on the previously correct
paths. All 114 cases pass and `ruff check` is clean.
Fixes the reranked deep pagination data loss described above.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Prepend a leading slash and reject `..` segments so scoped OneDrive
delta queries use `root:/path:/delta` instead of `root:path:/delta`.
Fixes#15500
### What problem does this PR solve?
The OneDrive connector builds Microsoft Graph delta URLs from optional
`config.folder_path`. When users enter a path without a leading slash
(e.g. `Documents/Reports` instead of `/Documents/Reports`), the
connector produces a malformed URL such as
`root:Documents/Reports:/delta`. Per [Microsoft Graph path-based
addressing](https://learn.microsoft.com/en-us/graph/onedrive-addressing-driveitems),
the segment after `root:` must start with `/` (e.g.
`root:/Documents/Reports:/delta`). Sync and validation then fail or
return no documents, which is hard to diagnose from the UI because the
optional folder field does not enforce the format.
This PR normalizes `folder_path` at connector construction time (prepend
`/`, trim whitespace and trailing slashes) and rejects `..` segments
before any Graph request is made.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## What
- make `Switch` ignore conditions that have no evaluable items
- add a regression for blank `cpn_id` items falling through to the else
branch
- keep the existing non-empty `and` condition behavior covered
Fixes#15643.
## Verified
- `python -m py_compile agent\component\switch.py
test\unit_test\agent\component\test_switch.py`
- `python -m pytest test\unit_test\agent\component\test_switch.py -q` ->
`2 passed`
- `python -m ruff check agent\component\switch.py
test\unit_test\agent\component\test_switch.py`
- `git diff --check`
I also checked `python -m ruff format --check` on the touched files. It
would reformat pre-existing style in `agent/component/switch.py` beyond
this bug fix, so I kept the patch scoped instead of reformatting the
whole file.