ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
Harsh Kashyap	c7052f4dd1	fix(rag/nlp): treat string input as one phrase in is_english (#16308)	2026-06-25 20:07:09 +08:00
Wang Qi	5defb4e7d6	Revert "fix(deepdoc): keep zero and false Excel cells in __call__" (#16366) Reverts infiniflow/ragflow#16318	2026-06-25 19:56:47 +08:00
Harsh Kashyap	8d3c3f868c	fix(api): validate immutable document fields when value is zero (#16309)	2026-06-25 19:29:12 +08:00
Harsh Kashyap	66d86154ab	fix(deepdoc): accept GFM table separators with one or more dashes (#16319)	2026-06-25 19:25:57 +08:00
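The separator rule named in the entry above (each delimiter cell needs only one or more dashes, with optional alignment colons) can be illustrated with a small regular expression. This is a sketch, not the project's parser code: `is_table_separator` is a hypothetical name, and requiring at least one pipe on the line is this sketch's own simplification to keep a bare `---` from being read as a table row.

```python
import re

# One delimiter cell: optional colon, one or more dashes, optional colon.
_CELL = r"\s*:?-+:?\s*"
# A delimiter row: cells separated by pipes, outer pipes optional.
_SEPARATOR = re.compile(rf"^\s*\|?{_CELL}(\|{_CELL})*\|?\s*$")


def is_table_separator(line: str) -> bool:
    """Return True if `line` looks like a GFM table delimiter row.

    Hypothetical helper: a pipe is required so that a bare thematic
    break such as '---' is not treated as a one-column table.
    """
    return "|" in line and bool(_SEPARATOR.match(line))
```

With this pattern, `|-|-|` is accepted just like `| --- | --- |`, while a header row such as `| a | b |` is rejected because its cells contain no dashes.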
cleanjunc	e8bb534b90	fix: naive_merge splits oversized sections and counts overlap tokens correctly (#15802)	2026-06-25 19:19:38 +08:00
Harsh Kashyap	0af5d43e8d	fix(deepdoc): keep zero and false Excel cells in __call__ (#16318)	2026-06-25 19:12:57 +08:00
Harsh Kashyap	49312cace3	fix(api): align use_sql Markdown separator with Source header (#16317)	2026-06-25 19:00:01 +08:00
Yash Raj Pandey	091417980e	fix(html_parser): preserve original text when splitting oversized blocks (#16052 ) ### Bug `RAGFlowHtmlParser.chunk_block()` splits an oversized block by slicing the tokenized string and storing the joined tokens: ```python tks_str = rag_tokenizer.tokenize(block) ... tokens = tks_str.split(" ") while start < len(tokens): chunks.append(" ".join(tokens[start:start + chunk_token_num])) # tokenized form, not source ``` On the default (Elasticsearch) backend `rag_tokenizer.tokenize` transforms text: it lowercases/stems Latin words and inserts spaces between CJK characters. So any text block longer than `chunk_token_num` is stored as garbled, lowercased, space-segmented text instead of the source content. The small-block branch correctly stores the original `block`, so only oversized blocks are corrupted. Affects HTML and EPUB ingestion (both go through `chunk_block`), degrading retrieved chunks and the answers generated from them. ### Real tokenizer behavior (infinity-sdk 0.7.0, ES backend) ``` tokenize("Hello World FOO Bar Baz Qux Jumps") -> "hello world foo bar baz qux jump" # lowercased + stemmed tokenize("你好世界这是一个测试") -> "你好世界这是一个测试" # spaces inserted ``` ### Fix Split the original text: break it into atoms (whitespace-delimited runs for space-separated scripts, per-character for spaceless scripts such as Chinese) and pack them into pieces of at most `chunk_token_num` tokens. This preserves the source characters and still splits scripts that have no whitespace — a plain whitespace split would leave CJK as one un-splittable chunk. 
### Proof (real tokenizer, before/after) Running the old vs new split against the real `infinity.rag_tokenizer`: ``` ENGLISH "Hello World FOO Bar Baz Qux Lazy Dogs" (chunk_token_num=4) OLD: ['hello world foo bar', 'baz qux jump over', 'lazi dog'] # lowercased + stemmed NEW: ['Hello World FOO Bar ', 'Baz Qux Jumps Over ', 'Lazy Dogs'] # preserved; each <= 4 tokens NEW preserves text exactly: True CHINESE "你好世界这是一个测试用例需要被切分成多个块" (chunk_token_num=3) OLD: ['你好世界这是', '一个测试用例需要', ...] # spurious spaces NEW: ['你好世', '界这是', '一个测', ...] # preserved; each <= 3 tokens NEW preserves text exactly: True ``` ### Tests Added `test/unit_test/deepdoc/parser/test_html_parser.py` (English + Chinese oversized blocks, plus small-block merge). Before the fix the two oversized tests fail (English shows lowercasing, Chinese shows inserted spaces); after the fix all pass. `ruff check` clean.	2026-06-25 16:43:35 +08:00
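The atom-packing idea in the entry above can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the PR: `split_preserving_text` is a hypothetical name, every non-whitespace atom is counted as one token (the real fix may count with the project's tokenizer), and only the U+4E00 to U+9FFF range is treated as a spaceless script.

```python
import re

# Atoms: a single CJK ideograph, or a run of other non-space characters
# together with its trailing whitespace, or a standalone whitespace run.
# Every character of the input belongs to exactly one atom, so joining
# the atoms always reproduces the source text.
_ATOM = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+\s*|\s+")


def split_preserving_text(block: str, chunk_token_num: int) -> list[str]:
    """Pack atoms of `block` into pieces of at most `chunk_token_num` atoms.

    Assumption: one non-whitespace atom == one token.
    """
    pieces: list[str] = []
    current: list[str] = []
    used = 0
    for atom in _ATOM.findall(block):
        cost = 0 if atom.isspace() else 1
        if current and used + cost > chunk_token_num:
            pieces.append("".join(current))
            current, used = [], 0
        current.append(atom)
        used += cost
    if current:
        pieces.append("".join(current))
    return pieces


# Original casing and spacing are kept:
# split_preserving_text("Hello World FOO Bar Baz Qux Jumps Over Lazy Dogs", 4)
# -> ['Hello World FOO Bar ', 'Baz Qux Jumps Over ', 'Lazy Dogs']
```

Because the pieces are slices of the source rather than joined tokenizer output, concatenating them gives back the original block exactly.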
Muhammad Furqan	fe14cc35cf	fix(agent/tools): DeepL component fails validation and drops errors (#16332 ) ### What problem does this PR solve? `DeepLParam.check()` validated `self.top_n`, but DeepL has no such parameter (it is not defined on the param class or its base), so `check()` always raised `AttributeError` and a DeepL component could never pass validation. Removed the bogus `top_n` check. Also fixed the `_run` except branch, which computed `be_output("Error...")` but never returned it, silently dropping the error message. Closes #16329 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Add test cases ### Testing Added `test/unit_test/agent/component/test_deepl.py` covering `DeepLParam.check()` with valid defaults and rejection of invalid source/target languages.	2026-06-25 14:40:56 +08:00
Muhammad Furqan	3747a6bfeb	fix(agent/tools): PubMed tool always returns "Unknown Authors" (#16330 ) ### What problem does this PR solve? Fixes the PubMed tool always emitting `Authors: Unknown Authors`. The `safe_find` closure in `_format_pubmed_content` was hardcoded to search from the article root, so the per-author `LastName`/`ForeName` lookups never matched. `safe_find` now accepts an optional `base` node (defaults to `child`, preserving the existing field lookups), and the author loop passes the current `<Author>` element. Closes #16328 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Add test cases ### Testing Added `test/testcases/test_web_api/test_canvas_app/test_pubmed_unit.py` covering per-author parsing, intact title/journal/DOI fields, and the no-authors fallback. Before: `Authors: Unknown Authors` After: `Authors: Furqan Khan, Jane Smith`	2026-06-25 14:34:37 +08:00
Harsh Kashyap	b9445c67e2	fix(agent): coerce None Switch inputs before string operators (#16320 ) ## Summary - Coerce `None` canvas values to `""` before string comparison operators in `Switch.process_operator`. - Prevents `AttributeError` when upstream components yield `None` and the Switch uses contains/start with/end with. ## Test plan - [x] `.v/bin/python -m ruff check agent/component/switch.py test/unit_test/agent/component/test_switch.py` - [x] `.v/bin/python -m pytest test/unit_test/agent/component/test_switch.py -q` (3 passed) Fixes #16315 --------- Co-authored-by: Harsh Kashyap <harshkashyap@Harshs-MacBook-Pro.local>	2026-06-25 14:18:24 +08:00
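The coercion described in the entry above amounts to normalising `None` to an empty string before any string method is called. The sketch below is a standalone illustration: `compare_as_text` is a hypothetical function, not `Switch.process_operator`, and only the three operators named in the commit message are covered.

```python
def compare_as_text(value, operator: str, expected) -> bool:
    """Apply a string operator after coercing None to ''.

    Hypothetical helper; operator names follow the commit message.
    """
    text = "" if value is None else str(value)
    target = "" if expected is None else str(expected)
    if operator == "contains":
        return target in text
    if operator == "start with":
        return text.startswith(target)
    if operator == "end with":
        return text.endswith(target)
    raise ValueError(f"unsupported operator: {operator}")
```

Without the coercion, `None.startswith(...)` raises `AttributeError`; with it, an upstream `None` simply fails to match a non-empty target.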
kpdev	68d2ca0ff1	fix(api): use dataset-owner tenant for legacy /chunks docstore cleanup (#15961)	2026-06-24 14:24:40 +08:00
helloxjade	1b2da645c3	fix: deduplicate markdown table chunks (#16143)	2026-06-24 13:22:57 +08:00
minion1227	14565b289a	Fix: docx parsing raises ValueError on 'Heading' styles (#16284)	2026-06-24 13:16:16 +08:00
minion1227	0c19190daf	Fix: MCP document metadata cache can loop forever when documents returns an empty docs page (#16285)	2026-06-24 13:09:48 +08:00
Harsh Kashyap	b4a8a90c73	fix(rag/raptor): handle max_cluster edge case in GMM cluster selection (#16199 ) ### What problem does this PR solve? `_get_optimal_clusters` in `rag/raptor.py` had two edge-case issues in GMM cluster-count selection: 1. It used `np.arange(1, max_clusters)`, which never evaluates the upper-bound candidate (`max_clusters`). 2. When effective `max_clusters` becomes `1`, the candidate list was empty and `argmin` crashed. This PR makes candidate evaluation inclusive (`1..max_clusters`) and guards the single-cluster case by returning `1` directly. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Validation - `pytest test/unit_test/rag/test_raptor_psi_tree_builder.py --config-file pyproject.toml -q` - `ruff check rag/raptor.py test/unit_test/rag/test_raptor_psi_tree_builder.py` ### Tests added - Regression test for `max_cluster == 1` path (no crash, returns 1) - Regression test verifying upper-bound candidate is evaluated and can be selected _AI-assistance disclosure: parts of this change (bug triage and test scaffolding) were drafted with AI assistance and fully reviewed and verified by me._ --------- Co-authored-by: Harsh Kashyap <harshkashyap@Harshs-MacBook-Pro.local> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-23 21:07:26 +08:00
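Both edge cases in the entry above come down to how the candidate list is built. The sketch below shows an inclusive range plus a single-cluster guard. `pick_cluster_count` and the `score` callback are hypothetical stand-ins: the real `_get_optimal_clusters` fits mixture models, whereas here any lower-is-better function is accepted so the selection logic can be read on its own.

```python
from typing import Callable


def pick_cluster_count(max_clusters: int, score: Callable[[int], float]) -> int:
    """Return the candidate in 1..max_clusters with the lowest score.

    Sketch only: `score` stands in for whatever criterion the caller uses.
    """
    if max_clusters <= 1:
        # Guard: nothing to compare, so skip scoring entirely.
        return 1
    # range(1, max_clusters) would stop one short of the upper bound,
    # which is the off-by-one described above.
    candidates = range(1, max_clusters + 1)
    return min(candidates, key=score)
```

With `range(1, max_clusters)` and `max_clusters == 1` the candidate sequence is empty, which is the situation that made `argmin` fail before the fix.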
VincentLambert	11e14a8353	fix: propagate contextvars through thread_pool_exec (#16247) ## Problem `thread_pool_exec()` dispatches work via `loop.run_in_executor()`, which submits the callable with a plain `executor.submit(func, *args)` and does not copy the caller's `contextvars.Context`. So a `ContextVar` set in the async caller is not visible inside the function running in the worker thread. This differs from `asyncio.to_thread()`, which runs the callable inside a copied context. `run_in_executor()` has never propagated context (verified on Python 3.12 and 3.13), so this is a pre-existing gap in the helper, not a regression or a Python-version compatibility issue. Concretely, any code that sets a `ContextVar` in async code and reads it inside a function dispatched via `thread_pool_exec` (request tracing, per-task state, Langfuse trace propagation, etc.) silently loses that context. ## Fix Copy the current context before submitting and run the callable inside it with `ctx.run()`, matching what `asyncio.to_thread()` does: ```python async def thread_pool_exec(func, *args, **kwargs): loop = asyncio.get_running_loop() ctx = contextvars.copy_context() if kwargs: inner = functools.partial(func, *args, **kwargs) return await loop.run_in_executor(_thread_pool_executor(), ctx.run, inner) return await loop.run_in_executor(_thread_pool_executor(), ctx.run, func, *args) ``` This explicitly adds ContextVar propagation to the helper (it does not restore any prior behavior). Backward-compatible. ## Tests `TestThreadPoolExec` covers propagation, the kwargs path, per-call isolation and the unset-default case. > Note: the branch name still contains `python313` for historical reasons; the change is unrelated to any Python version.	2026-06-23 15:17:42 +08:00
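The difference described in the entry above can be reproduced with a few lines of standard-library code. This is a self-contained sketch, not the project's helper: `request_id`, `_pool`, `run_plain`, and `run_with_context` are hypothetical names chosen for the demonstration.

```python
import asyncio
import contextvars
import functools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-request variable and a small private pool for the demo.
request_id = contextvars.ContextVar("request_id", default="unset")
_pool = ThreadPoolExecutor(max_workers=2)


def read_request_id() -> str:
    """Runs in a worker thread and reports what it can see."""
    return request_id.get()


async def run_plain(func, *args):
    # Plain dispatch: the worker does not run inside the caller's context.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, func, *args)


async def run_with_context(func, *args, **kwargs):
    # Copy the caller's context and execute the callable inside the copy.
    loop = asyncio.get_running_loop()
    ctx = contextvars.copy_context()
    call = functools.partial(func, *args, **kwargs)
    return await loop.run_in_executor(_pool, ctx.run, call)


async def demo():
    request_id.set("req-42")
    plain = await run_plain(read_request_id)
    propagated = await run_with_context(read_request_id)
    return plain, propagated
```

Running `asyncio.run(demo())` returns a pair: the second element is the value set in the async caller, while the first is expected to be the default on the Python versions the commit mentions (3.12 and 3.13), since the worker thread does not run in the caller's context.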
Zhichang Yu	3f805a64f1	feat(agent): align Go agent behavior with Python (except retrieval component) (#16225 ) ## Summary Aligns the Go agent runtime/canvas/components/tools behavior with the Python `agent/` implementation so the same stored canvas DSL produces the same execution result on either side. Every component, tool, and runtime primitive in `internal/agent/` is now driven by the same semantics as its Python counterpart — variable resolution, template substitution, control flow, error reporting, retry/cancel, and stream event shapes. The retrieval component is the one explicit exception in this PR. It is being reworked in a separate change and is excluded from this alignment pass; the wrapper slot (`universe_a_wrappers.go → newRetrievalComponent`) is preserved. ## Scope of alignment ### Components (all aligned with `agent/component/`) `Begin` · `Message` · `LLM` (incl. ChatTemplateKwargs, MessageHistoryWindowSize, VisualFiles, Cite, OutputStructure, JSONOutput, TopP, MaxRetries, DelayAfterError, credentials) · `Agent` (react + tool artifact capture + `Reset()` interface-assert) · `Switch` (12/12 operators, Python-equivalent semantics) · `Categorize` · `Invoke` · `Iteration` · `Loop` (macro-expansion through `workflowx.AddLoopNode`) · `UserFillUp` (Python-equivalent interrupt/resume via eino `compose.Interrupt`/`ResumeWithData`) · `FillUp` · `DataOperations` · `ListOperations` · `StringTransform` · `VariableAggregator` · `VariableAssigner` · `Browser` (full stagehand runtime parity) · `DocsGenerator` · `ExcelProcessor`. 
### Tools (all aligned with `agent/tools/`) `Retrieval` (wrapper slot only — logic out of scope) · `MCPToolAdapter` (streamable-HTTP) · `CodeExec` (sandbox bridge with `code_exec_contract.go` matching Python contract) · `AkShare` · `ArXiv` · `Crawler` · `DeepL` · `DuckDuckGo` · `Email` · `ExeSQL` · `GitHub` · `Google` · `GoogleScholar` · `Jin10` · `PubMed` · `QWeather` · `SearXNG` · `Tavily` · `Tushare` · `Wencai` · `Wikipedia` · `YahooFinance` — uniform `eino tool.InvokableTool` interface, SSRF protection, shared HTTP client. ### Canvas execution engine (`internal/agent/canvas/`) Aligned with Python's `agent/canvas.py`: - Scheduler (`scheduler.go`): state pre/post handlers, node lambdas, per-component timeout resolver (4-level: per-class env → per-class table → uniform env → 600s fallback), `legacyNoOpNames`. - Loop subgraph (`loop_subgraph.go`): Python-equivalent `AddLoopNode` macro expansion + condition translation. - Multibranch (`multibranch.go`): `Switch` / `Categorize` routing via `compose.NewGraphMultiBranch` — same branch selection semantics as Python. - Parallel subgraph (`parallel_subgraph.go`): matches Python's parallel fan-out contract. - Interrupt/Resume (`interrupt_resume.go`): `UserFillUpNodeBody` / `IsInterruptError` / `ExtractInterruptContexts` — replaces the deprecated Python sentinel chain with eino's native interrupt API, preserving the same external behavior. - Checkpoint (`checkpoint_store.go`): `RedisCheckPointStore` Get/Set/Delete, with business metadata (status / canvas_id / parent_run_id) on a parallel Redis Hash. - RunTracker (`run_tracker.go`): Start / MarkSucceeded / MarkFailed / MarkCancelled / AttachCheckpoint — same lifecycle as the Python run record. - Cancel (`cancel.go`): Redis pub/sub watch. - Stream (`stream.go`): SSE channel with `messages` / `waiting` / `errors` / `done` events, same shape as Python's `agent.canvas.RunEvent` payload. 
### DSL bridge (`internal/agent/dsl/`) - `normalize.go`: v1↔v2 collapsed into a single wire format — Python and Go consume the same stored JSON. - `reset.go`: per-run state reset matches Python's `Canvas.reset()` semantics. - Testdata mirrors Python's `agent_msg.json` / `all.json` / etc. ### Runtime (`internal/agent/runtime/`) - `CanvasState` / `NewCanvasState` / `GetVar` / `SetVar` / `ReadVars`: same `{{cpn_id@param}}` resolution model. - `ResolveTemplate` (regex fast path + gonja fallback) — Python Jinja-style semantics. - `selector.go`, `metrics.go`, `component.go`: shared runtime contracts. ## Out of scope (intentionally) - `Retrieval` component logic — wrapped only; full parity lands in a follow-up PR. - Frontend — only minor dsl-bridge / canvas UX fixes ride along. - CLI / admin / model registry — orthogonal to agent behavior. ## How alignment is verified `internal/service/agent_run_e2e_test.go` exercises the full production chain against real Python-shaped DSL fixtures: ``` loadCanvasForUser → versionDAO.GetLatest → decodeCanvasFromDSL → canvas.Compile → cc.Workflow.Invoke → answer extraction ``` using in-memory SQLite + miniredis (no Docker). Covers: - `TestRunAgent_RealCanvas_BeginMessage` — happy path, `{{sys.query}}` resolution - `TestRunAgent_RealCanvas_WaitForUserResume` — two-run resume cycle (Python-equivalent) - `TestRunAgent_RealCanvas_CompileFails` — unknown component name → sanitized error (Python-equivalent) - `TestRunAgent_RealCanvas_InvokeFails` — unresolvable template ref (Python-equivalent) - `TestRunAgent_RunTracker_AttachCheckpoint_CallSequence` — Start→AttachCheckpoint→MarkSucceeded lifecycle `internal/handler/agent_test.go` — SSE streaming parity (`Content-Type: text/event-stream`, `data: {…}\n\n`, trailing `data: [DONE]\n\n`, OpenAI-compatible non-stream `choices`). `internal/agent/canvas/fixture_compile_test.go` + per-component tests pin the Python-equivalent outputs. 
``` go test -count=1 -v -run 'TestRunAgent_RealCanvas|TestRunAgent_RunTracker' ./internal/service/ ``` ## Design reference `docs/develop/agent-go-port-design.md` (1329 lines, last cross-checked 2026-06-17): module layout, per-component / per-tool inventory, corner-case catalogue, and the actionable backlog (Section 14, including the retrieval alignment follow-up). --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-06-22 11:58:29 +08:00
Manan Bansal	70c0121b78	Fix: preserve tables when parsing DOCX with the laws parser (#16008 ) (#16155 ) ## What Fixes #16008 — tables contained in a DOCX are silently dropped when the document is parsed with the laws chunking method. ## Root cause `Docx.__call__` in `rag/app/laws.py` iterated `self.doc.paragraphs`, which only yields paragraph elements. Tables are separate `tbl` blocks in the document body, so they were never visited and were lost from the output. (The `naive` parser already handles tables by iterating the document body.) ## Changes - Iterate `self.doc._element.body` so tables are visited in document order alongside paragraphs. - Add a `__table_to_html` helper that renders each table to HTML, including merged-cell `colspan` detection (mirrors the `naive` parser's logic). - Inject each table into the section tree with a sentinel level deeper than any heading, so `Node.build_tree` merges it into its enclosing section — keeping the chapter/article title path as retrieval context rather than producing an orphaned chunk. - Guard the `h2_level` computation against an empty heading set, so a tables-only or empty DOCX no longer raises `IndexError`. This keeps the laws parser's hierarchical chunking and adds table extraction, so users no longer have to choose between losing structure (naive) or losing tables (laws). ## Tests Adds `test/unit_test/rag/test_laws_docx_tables.py` covering: - table content is preserved and carries its section title path, - merged adjacent cells collapse to `colspan`, - tables-only document does not crash, - empty document returns `[]`. All four pass; `ruff check` / `ruff format` are clean.	2026-06-22 09:46:44 +08:00
Lynn	47bd9dd049	Fix: replace tenant_llm apis (#16131 ) Replace tenant_llm apis with provider-instance apis.	2026-06-18 16:38:32 +08:00
jaso0n0818	a70c7e8cc7	fix(deepdoc): attach lone header lines to the following section when delimiter is set (#16109 ) ## Summary Fixes #15487 — lone markdown headers are no longer isolated as empty chunks when a custom `delimiter` is set. - Merge consecutive lone headers before attaching to the following prose body - Skip code fences, tables, lists, and blockquotes via `_is_attachable_body()` - Unit tests include the `# Title / ## Intro / Body` regression from CodeRabbit review ## Validation - `pytest test/unit_test/deepdoc/parser/test_markdown_parser.py` (11 passed locally) Closes #15487	2026-06-18 14:24:09 +08:00
xu haiLong	a9ddcae0b3	Fix: MCP dataset discovery fails due to REST API max page size limit … (#16148 ) Fix #16146	2026-06-18 09:39:37 +08:00
Liu An	4379269374	Docs: Update version references to v0.26.1 in READMEs and docs (#16158 ) ### What problem does this PR solve? - Update version tags in README files (including translations) from v0.26.0 to v0.26.1 - Modify Docker image references and documentation to reflect new version - Update version badges and image descriptions - Maintain consistency across all language variants of README files ### Type of change - [x] Documentation Update	2026-06-17 19:35:32 +08:00
Zhichang Yu	e45659868a	feat(agent): ship the Go agent canvas port — eino interrupt/resume + Redis check-pointing (#16035 ) Replaces the Python agent canvas runtime with a Go implementation that runs inside `cmd/server_main`. The canvas compiles into an eino Workflow that pauses on wait-for-user via native Interrupt/Resume (no sentinel flag) and resumes from a Redis-backed CheckPointStore. All 21 Python agent components and ~35 tools are ported with functional parity. Sandbox providers now read their JSON config from the admin-panel system_settings table with env fallback. 234 files / +35,413 / -6,111. All Go files are gofmt-clean (CI gate added); drops the v2 DSL E2E step and the gap-analysis plan (both redundant after the port ships). ## Type of change - [x] Refactoring - [x] New feature - [x] Bug fix 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-06-17 13:24:03 +08:00
galuis116	6bfaa3f21e	Fix: SSRF in markdown parser remote image fetch (#15438 ) ### What problem does this PR solve? `rag/app/naive.py` `Markdown.load_images_from_urls` fetched image URLs parsed straight out of an untrusted uploaded markdown document via a raw `requests.get`, with no SSRF validation. Markdown chunking always reaches this path (`return_section_images=True`), so any authenticated user who uploads a `.md`/`.markdown`/`.mdx` file to a knowledge base could make the server issue requests to internal services or cloud-metadata endpoints, e.g. `![x](http://169.254.169.254/latest/meta-data/...)`. The `image/` Content-Type check only gates decoding — the outbound request (the SSRF) always fires. This was the one user-controlled fetch site missed by the project's existing SSRF-hardening (`common/ssrf_guard.py`, already applied to the crawler, SearXNG, RSS connector, MCP/document APIs, and OAuth avatar download). The fix validates and DNS-pins every hop with `common.ssrf_guard.assert_url_is_safe` before connecting, and follows redirects manually so each redirect target is re-validated (closing the DNS-rebinding / redirect-bypass window), mirroring `common/data_source/rss_connector.py`. Blocked URLs are skipped and logged like any other unreachable image, so legitimate public images are unaffected. Adds a regression test at `test/unit_test/rag/app/test_markdown_image_ssrf.py`. Closes #15437 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Ubuntu <ubuntu@ubuntu-2204.linuxvmimages.local> Co-authored-by: galuis116 <galuis116@users.noreply.github.com>	2026-06-16 18:54:55 +08:00
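The per-hop validation described in the entry above can be illustrated without any network access. The sketch below is not `common.ssrf_guard.assert_url_is_safe` and does no DNS pinning: `is_blocked_address`, `hop_is_safe`, and `first_unsafe_hop` are hypothetical helpers, and the `resolve` callback is an assumed stand-in for a real resolver so the logic can be exercised offline.

```python
import ipaddress
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit


def is_blocked_address(ip: str) -> bool:
    """True for addresses a server-side fetcher should not reach."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local   # includes 169.254.169.254 (cloud metadata)
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def hop_is_safe(url: str, resolve: Callable[[str], list]) -> bool:
    """Check scheme and every resolved address for one URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    addresses = resolve(parts.hostname)
    # An unresolvable host is treated as unsafe in this sketch.
    return bool(addresses) and not any(is_blocked_address(ip) for ip in addresses)


def first_unsafe_hop(chain: Iterable[str], resolve: Callable[[str], list]) -> Optional[str]:
    """Validate the initial URL and each redirect target in order."""
    for url in chain:
        if not hop_is_safe(url, resolve):
            return url
    return None
```

Checking only the first URL is not enough: a public host can redirect to an internal address, so every hop in the chain goes through the same test.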
Lynn	70792de899	Fix: v0.26.1 model provider (#16073 ) ### What problem does this PR solve? Fix: - Pass session_id to langfuse. - Get correct status for add model_type. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-16 16:21:43 +08:00
buua436	8e235b7b95	fix: add legacy chat/completions mode (#16014 ) ### What problem does this PR solve? Adds a legacy mode for /chat/completions that restores v0.23.0-style output by converting start_to_think/end_to_think back into raw <think></think> markers and streaming cumulative answer text. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-16 10:34:06 +08:00
dripsmvcp	53d4d9b3bd	fix(api): return 4xx not 500 when attachment blob is missing (#15509 ) Guard the agent-attachment download against a missing or empty storage blob so the caller gets a structured 4xx (`Document not found!`) instead of an HTTP 500. Same bug class as #15365 on document preview. Resolve #15502	2026-06-15 15:41:49 +08:00
Yingfeng	b5bea72e4b	Add git-like file commit API (#15978 ) ### What problem does this PR solve? \| # \| Method \| Endpoint \| Description \| Git Equivalent \| \|---\|--------\|----------\|-------------\|----------------\| \| 1 \| `POST` \| `/api/v1/{prefix}/{folder_id}/commits` \| Create a snapshot commit with file changes (add/modify/delete/rename) \| `git add` + `git commit` \| \| 2 \| `GET` \| `/api/v1/{prefix}/{folder_id}/commits` \| List commit history (paginated) \| `git log` \| \| 3 \| `GET` \| `/api/v1/{prefix}/{folder_id}/commits/{commit_id}` \| Get commit detail with file changes \| `git show` \| \| 4 \| `GET` \| `/api/v1/{prefix}/{folder_id}/commits/{commit_id}/files` \| List file changes in a commit \| `git show --name-status` \| \| 5 \| `GET` \| `/api/v1/{prefix}/{folder_id}/commits/diff?from=...&to=...` \| Compare two commits and return differences \| `git diff` \| \| 6 \| `GET` \| `/api/v1/{prefix}/{folder_id}/changes` \| Get uncommitted changes (add/modify/delete) \| `git status` \| \| 7 \| `GET` \| `/api/v1/{prefix}/{folder_id}/commits/{commit_id}/tree` \| Get the folder tree snapshot at commit time \| `git ls-tree` \| \| 8 \| `GET` \| `/api/v1/{prefix}/{folder_id}/commits/{commit_id}/files/{file_id}/content` \| Get a file's content as it existed in a specific commit \| `git show HEAD:file` \| \| 9 \| `GET` \| `/api/v1/{prefix}/{file_id}/versions` \| Get version history for a specific file across all commits \| `git log -- file` \| Where `{prefix}/{id}` can be: - `folders/{folder_id}` — direct folder access - `workspaces/{workspace_id}` — alias of `folders/{folder_id}` - `datasets/{dataset_id}` — resolves to the dataset's folder - `memories/{memory_id}` — resolves to the memory's folder - `skills/{skill_id}` — resolves to the skill's folder ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update	2026-06-15 11:19:56 +08:00
Zhichang Yu	3fa15c0e2f	feat(agent): Go port — canvas engine, 22 components, DSL v2, 13 endpoints (#15952 ) Ports the agent canvas subsystem from Python to Go. ## What's included ### Canvas Engine (Phase 0/1) - State engine, scheduler, variable resolver, Redis checkpoint store, cancel protocol - 209 tests across canvas / component / io packages ### 22 Components (P0–P4) \| Tier \| Components \| \|---\|---\| \| P0 T1+T2+T3 \| LLM, Agent, ExitLoop, Switch, Categorize, Begin, Message, Invoke \| \| P1 T3 \| VariableAggregator, VariableAssigner, StringTransform, ListOperations, DataOperations \| \| P2 T3 \| Iteration, IterationItem, Loop, LoopItem \| \| P3 T3 \| UserFillUp, Fillup \| \| P4 T5 \| Browser, ExcelProcessor, DocsGenerator \| ### DSL v2 Schema (Phase 2.5) - Typed v2 in-memory model with v1-to-v2 auto-detect converter - v1 legacy field stripping per plan §2.11.7 ### HTTP Endpoints & Bug Fixes (Plans PR1–PR3) - DELETE SQL bug fix: gorm v2 `Where("id = ?", id).Delete(...)` pattern - CreateAgent validation: title/DSL required, duplicate check, 103 envelope - 13 new endpoints: templates, prompts, tags, sessions CRUD, chat/completions (SSE + non-stream stubs), rerun, test_db_connection, logs, webhook/logs - 756 Go unit tests (745 → 756, +18) - 17 → 0 Python integration test failures (test_agents.py + test_session_management/) ### Tools 21 eino tools: HTTPHelper, search tools, financial/data tools, mandatory stubs ### Infrastructure OTel observability, NATS message queue, DeepDoc gRPC client, SSRF guards, IDOR mitigation	2026-06-12 22:58:28 +08:00
Carl Harris	a2de880b6d	fix(profile): enforce profile name validation and input constraints (#15694 ) ### What problem does this PR solve? The Profile Name field currently lacks application-level validation and allows users to save excessively long names and unsupported special characters. While the database enforces a maximum length of 100 characters, neither the frontend nor backend validates nickname format before persistence. This can result in inconsistent user data, poor user experience, and UI layout issues when long names wrap across multiple lines. This PR introduces consistent frontend and backend validation for profile names, enforces length and character constraints, provides clear validation feedback, and prevents invalid values from being saved. Fixes #15693 ### Type of change * [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-12 11:13:18 +08:00
Jonathan Chang	de06c9a60b	feat: Langfuse session grouping for multi-turn chat traces (#15679 ) ## Summary This PR passes `session_id` into Langfuse trace observations so multi-turn chat messages can be grouped under the same session in Langfuse. Changes include: - Propagate `session_id` from chat/session APIs into `dialog_service.async_chat`. - Pass `session_id` into Langfuse `start_observation(...)`. - Share Langfuse `trace_context` with chat, embedding, rerank, and TTS model bundles where applicable. - Add unit coverage to verify Langfuse observations receive `session_id`. - Update affected test stubs for the new optional Langfuse context arguments. ## Related Issue Closes: #15636 ## Change Type - [x] Feature - [x] Bug fix - [x] Test - [ ] Refactor - [ ] Documentation - [ ] Breaking change ## Real Behavior Proof Before this change: - Langfuse observations were created without `session_id`. - Multi-turn chat traces could not be grouped by session in Langfuse. After this change: - Chat/session flows pass `session_id` into `async_chat`. - Langfuse observations include `session_id`. - Related model bundles receive shared trace context and session metadata. Validation result: ```bash uv run python -m py_compile \ api/db/services/tenant_llm_service.py \ api/db/services/llm_service.py \ api/db/services/dialog_service.py \ api/db/services/conversation_service.py \ api/apps/restful_apis/chat_api.py \ test/unit_test/api/db/services/test_dialog_service_final_answer.py \ test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py ``` Passed. ```bash uv run pytest \ test/unit_test/api/db/services/test_dialog_service_final_answer.py \ test/unit_test/api/db/services/test_dialog_service_use_sql_source_columns.py -q ``` Result: ```text 11 passed in 16.89s ``` ```bash git diff --check ``` Passed. ## Checklist - [x] Analyzed the issue requirement. - [x] Checked existing Langfuse trace integration. - [x] Implemented only the requested session grouping behavior. 
- [x] Added/updated unit tests. - [x] Ran focused tests successfully. - [x] Ran Python compile validation. - [x] Ran whitespace diff validation.	2026-06-12 10:18:06 +08:00
Dexterity	bde2b1fc6d	fix(llm): correct error handling, token accounting, and truncation in embedding providers (#15424 ) ### Summary Closes #15423 `rag/llm/embedding_model.py` hosts about 40 embedding providers that shared several defects affecting indexing reliability, cost accounting, and error visibility. This PR fixes four concrete bugs. Masked, inconsistent errors (27 sites). Nearly every provider ran `log_exception(_e, res)` followed by `raise Exception(f"Error: {res}")`. Because `log_exception` always raises, the second line was dead code, and the surfaced exception varied with whether the SDK response exposed a `.text` attribute. Every failure path now raises a single `EmbeddingError` that includes the underlying response detail, so the cause of a failed embedding is consistent and visible. Fabricated token counts. `LocalAIEmbed` returned a hardcoded `1024` and `OllamaEmbed` added `128` per text. These values feed `used_tokens` and therefore billing and usage tracking. Both now report the real count from the API (Ollama `prompt_eval_count`, LocalAI `usage`) and fall back to a local token count only when the server omits it. Truncation overshoot. The `8196` limit used by Mistral and Bedrock exceeded the standard `8192` ceiling and could push boundary sized inputs past the model limit. Limits are corrected to `8192` and made intentional per provider, and providers that rely on server side truncation now request it explicitly (Ollama `truncate=True`, Cohere `truncate="END"`). Missing batching on Zhipu and Ollama. Both issued one request per text. They now batch like the other OpenAI compatible providers, turning N round trips into `ceil(N / batch_size)`. Batched results are realigned by response `index` so a chunk always keeps its own vector. A shared `Base._batched_encode` helper owns the batch loop, optional truncation, result accumulation, and the single error path. 
It is the mechanism that lets these fixes live in one place instead of across 27 duplicated sites. The public `encode()` and `encode_queries()` contract stays the same, so existing callers are unaffected. Tests covering all four fixes are added under `test/unit_test/rag/llm/test_embedding_model.py`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-11 19:29:46 +08:00
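The batching and index realignment described in the entry above can be sketched as follows. This is an illustration, not `Base._batched_encode`: the function names are hypothetical, and the response shape (a list of items with `index` and `embedding` keys) is an assumption modelled on OpenAI-style embedding responses.

```python
import math


def batched_encode(texts, embed_batch, batch_size=16):
    """Embed `texts` in batches and place each vector by its response index.

    `embed_batch` is a hypothetical callable that takes a list of strings
    and returns items like {"index": i, "embedding": [...]}, possibly out
    of order. Returns (vectors, number_of_requests).
    """
    vectors = [None] * len(texts)
    calls = 0
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for item in embed_batch(batch):
            # `index` is relative to the batch, so offset by `start`.
            vectors[start + item["index"]] = item["embedding"]
        calls += 1
    return vectors, calls


def fake_embed_batch(batch):
    """Stand-in provider: returns results in reverse order on purpose."""
    items = [{"index": i, "embedding": [len(text)]} for i, text in enumerate(batch)]
    return list(reversed(items))


vectors, calls = batched_encode(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
# Five texts in batches of two take ceil(5 / 2) requests.
assert calls == math.ceil(5 / 2)
```

Placing results by `index` rather than by arrival order is what keeps each chunk paired with its own vector when a provider returns a batch out of order.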
Liu An	92c4b7688b	Docs: Update version references to v0.26.0 in READMEs and docs (#15941 ) ### What problem does this PR solve? - Update version tags in README files (including translations) from v0.25.6 to v0.26.0 - Modify Docker image references and documentation to reflect new version - Update version badges and image descriptions - Maintain consistency across all language variants of README files ### Type of change - [x] Documentation Update	2026-06-11 18:34:26 +08:00
bohdansolovie	381091df71	fix(dialog): guard async_ask() against empty or invalid kb_ids (#15530 ) Fixes #15529 . ### Problem `async_ask()` accessed `kbs[0]` without verifying that `KnowledgebaseService.get_by_ids()` returned any knowledge bases. Empty or stale `kb_ids` raised `IndexError`, which surfaced as HTTP 500 on search/bot SSE endpoints. ### Fix - Add an early guard when `kbs` is empty, yielding a final SSE error event (consistent with `gen_mindmap()` in the same module). - Add regression tests for empty `kb_ids` and deleted/invalid KB IDs. ### Test plan - [ ] `pytest test/unit_test/api/db/services/test_dialog_service_final_answer.py -k "async_ask_empty or async_ask_stale"` - [ ] Manual: `POST /api/v1/searchbots/ask` with invalid `kb_ids` returns SSE error, not HTTP 500 --------- Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-11 15:52:59 +08:00
kpdev	de18313f97	fix(api): POST /documents/stop removes partial chunks and resets counters (#15789 ) ### What problem does this PR solve? `POST /api/v1/datasets/{dataset_id}/documents/stop` (`stop_parse_documents`) cancels parsing tasks and sets `run` to `CANCEL`, but it does not remove chunks already indexed in the doc store or reset `progress` / `chunk_num`. REST callers can end up with a “cancelled” document that still returns partial chunks in `GET .../chunks` and in retrieval. Legacy `DELETE /api/v1/datasets/{dataset_id}/chunks` (`stop_parsing`) already performs full cleanup: it resets counters and calls `docStoreConn.delete`. This PR aligns the newer stop endpoint with that behavior so both paths leave the dataset consistent. Fixes [#15788](https://github.com/infiniflow/ragflow/issues/15788). ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes - Update `stop_parse_documents` in `document_api.py` to reset `progress` and `chunk_num` to `0` and delete partial chunks via `docStoreConn.delete` after `cancel_all_task_of`. - Add unit test `test_stop_parse_documents_cleans_partial_chunks` to assert counters reset and doc store delete is invoked. ### Test plan - [x] Unit test: `pytest test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py::TestDocRoutesUnit::test_stop_parse_documents_cleans_partial_chunks -v` - [ ] Manual: upload a slow document, start parse, call `POST .../documents/stop` while `RUNNING`, verify `GET .../chunks` returns zero chunks and UI `chunk_count` is 0 - [ ] Control: legacy `DELETE .../chunks` behavior unchanged --------- Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-11 15:51:32 +08:00
oktofeesh	c15b2b3f66	fix(connectors): enforce WebDAV numeric string size limits (#15731 ) ## Summary - Normalize WebDAV file-size metadata before applying the sync size threshold. - Enforce the same threshold for numeric string sizes in both document sync and slim snapshot paths. - Add focused WebDAV unit coverage for size parsing and over-threshold skips. ## Why Some WebDAV servers return file sizes from PROPFIND metadata as strings. The previous threshold check only handled integer values, so oversized files could still be downloaded and sent into the chunking pipeline. Closes #15724. ## Validation - `uv run --no-project --with pytest --with pytest-asyncio pytest test/unit_test/data_source/test_webdav_connector_unit.py -q` - `uvx ruff check common/data_source/webdav_connector.py test/unit_test/data_source/test_webdav_connector_unit.py` - `python -m compileall -q common/data_source/webdav_connector.py test/unit_test/data_source/test_webdav_connector_unit.py` - `git diff --check` --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 15:47:54 +08:00
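A rough sketch of the size normalization this commit describes, assuming the size arrives either as an integer or as a numeric string from PROPFIND metadata. The helper names `normalize_size` and `exceeds_threshold` are hypothetical and are not taken from `webdav_connector.py`.

```python
from typing import Optional


def normalize_size(raw) -> Optional[int]:
    """Return the size in bytes as an int, or None if it cannot be parsed."""
    if isinstance(raw, bool):  # bool is a subclass of int; reject it explicitly
        return None
    if isinstance(raw, int):
        return raw if raw >= 0 else None
    if isinstance(raw, str):
        text = raw.strip()
        # isascii() guards against Unicode digits that int() would reject.
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def exceeds_threshold(raw, threshold: int) -> bool:
    """Apply the same limit to integer sizes and numeric-string sizes."""
    size = normalize_size(raw)
    return size is not None and size > threshold
```

In this sketch an unparseable size is treated as "not over the threshold"; a stricter policy could skip such files instead.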
monsterDavid	a851228ded	fix(preview): authenticate markdown document preview requests (#15589 ) ## Summary Fixes [#15585](https://github.com/infiniflow/ragflow/issues/15585). - Route markdown preview through the shared `request` client (same as txt/image previewers) so `Authorization` headers and interceptors are applied consistently. - Add a unit test covering `AUTH_BETA` token loading for embedded search auth. ## Root cause Search result preview for `.md`/`.mdx` used raw `fetch`, which did not apply the same auth path as other preview types. That led to `401` on `GET /api/v1/documents/{id}/preview` even when the user was logged in or using an embedded search `auth` query param. ## Test plan - [ ] Log in, run a search, open a markdown citation link — preview loads (no 401). - [ ] Open an embedded shared search URL with `auth` query param, preview a markdown file — preview loads. - [ ] Confirm PDF/txt preview still works in the same search UI. --------- Co-authored-by: MkDev11 <89318445+bitloi@users.noreply.github.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-11 15:46:20 +08:00
bohdansolovie	47fb462e46	fix(api): guard dataset delete when File2Document row is missing (#15533 ) ## Summary Fixes #15532 — `delete_datasets()` crashes with `IndexError` when a document has no `File2Document` row. `delete_datasets()` in `dataset_api_service.py` called `File2DocumentService.get_by_document_id()` and immediately accessed `f2d[0].file_id` without checking whether the lookup returned any rows. Documents created via API ingestion or connector sync may exist without a linked file record, causing dataset deletion to abort with HTTP 500. This PR mirrors the existing guard already used in `file_service.py` and `document_api_service.py`.	2026-06-11 15:18:08 +08:00
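The guard pattern from this commit, shown in isolation. This is a sketch only: `linked_file_id` is a hypothetical helper, and the row object is a stand-in for whatever `File2DocumentService.get_by_document_id()` returns.

```python
from types import SimpleNamespace


def linked_file_id(f2d_rows):
    """Return the linked file id, or None when no File2Document row exists."""
    if not f2d_rows:
        # Documents created via API ingestion or connector sync may have
        # no linked file record, so indexing [0] blindly would raise IndexError.
        return None
    return f2d_rows[0].file_id


row = SimpleNamespace(file_id="file-123")  # stand-in for a File2Document row
print(linked_file_id([row]))
print(linked_file_id([]))
```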
Jack	0d3e410826	fix: strip Ollama-style tag suffix from LocalAI model names (#15908 ) ## Summary LocalAI exposes two API surfaces with conflicting naming conventions: - `GET /api/tags` returns model names with `:latest` suffix (Ollama format) - `POST /v1/chat/completions` expects names without `:latest` (OpenAI format) RAGFlow discovered models via `/api/tags` and stored the tagged name, then used it with `/v1/chat/completions`, causing a 404 error because LocalAI didn't recognize `model:latest`. ## Fix In `LocalAI.get_model_list()`, strip the tag suffix from model names using `model["name"].rsplit(":", 1)[0]`, so stored names match what the OpenAI-compatible endpoints expect.	2026-06-10 19:05:05 +08:00
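The `rsplit` expression quoted in this commit behaves as shown below. The wrapper name `strip_tag` is an assumption for illustration; only the `rsplit(":", 1)[0]` expression comes from the commit message.

```python
def strip_tag(name: str) -> str:
    """Drop an Ollama-style ':tag' suffix, keeping everything before the last colon."""
    return name.rsplit(":", 1)[0]


print(strip_tag("llama3:latest"))  # -> llama3
print(strip_tag("llama3"))         # -> llama3 (no colon, name unchanged)
print(strip_tag("org/model:7b"))   # -> org/model
```

Because the split is on the last colon, any name that contains a colon for another reason would also be shortened, which is worth keeping in mind when reusing the expression elsewhere.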
Lynn	478c9846a1	Fix: model list (#15860 ) ### What problem does this PR solve? Remove tenant_llm call in rag. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 14:59:57 +08:00
Yingfeng	cf5cca5cbb	Fix wrong unit test path (#15864 )	2026-06-09 22:48:33 +08:00
cleanjunc	88e4d6bddb	Fix: restore GraphRAG entity ranking by indexing pagerank and n-hop paths (#15797 ) ### Summary Closes #15795 Knowledge-graph queries rank entities by `pagerank * sim` in `KGSearch`, but the entity chunks written at index time stopped carrying the values that ranking depends on. `graph_node_to_chunk` only stored `entity_type`, `description`, and `source_id`, dropping the node `pagerank` and the n-hop neighbour paths, while `search.py` still read them back as `rank_flt` and `n_hop_with_weight`. The producer of these fields, `update_nodes_pagerank_nhop_neighbour`, was removed in #6513, but the read side in `KGSearch` was never updated. The result is that on every knowledge-graph query: - `pagerank` resolves to `0`, so the `pagerank * sim` sort key is `0` for every entity and selection falls back to arbitrary order. - Every displayed entity score is `0.00`. - The n-hop relation-enrichment block is dead code because `n_hop_ents` is always empty, leaving `merge_tuples` and `is_continuous_subsequence` orphaned. This PR restores the missing index-time fields so the documented `P(E|Q) = pagerank * sim` ranking and the n-hop enrichment work again. What changed: - `graph_node_to_chunk` now writes `rank_flt` from the node pagerank and `n_hop_with_weight` from the recomputed n-hop neighbour paths. - Reintroduced the n-hop path computation (`n_neighbor`) in `rag/graphrag/utils.py`, reusing the previously orphaned `merge_tuples` / `is_continuous_subsequence` helpers, with a direction-agnostic edge-weight lookup for undirected graphs. `set_graph` computes the paths per added or updated node and passes them through. - `KGSearch` now selects `n_hop_with_weight` in the entity keyword search so Infinity and OceanBase return it (Elasticsearch and OpenSearch already read it from `_source`), and the read is hardened against missing keys or empty strings before `json.loads`. 
- Added the `n_hop_with_weight` column to OceanBase, including the `EXTRA_COLUMNS` migration entry so existing tables get it. The other engines already map both fields via dynamic templates or the Infinity mapping. Scope note: pagerank and n-hop are re-indexed for the added or updated nodes in each pass, consistent with the existing incremental indexing design. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Testing Added unit tests in `test/unit_test/rag/graphrag/test_graphrag_utils.py`: - `n_neighbor`: path and weight shape, one-hop vs two-hop, isolated nodes, missing weights, and direction-agnostic lookup. - `graph_node_to_chunk`: `rank_flt` populated from pagerank and defaulting to `0`, `n_hop_with_weight` serialized and defaulting to an empty list. ``` uv run pytest test/unit_test/rag/graphrag/ # 106 passed uv run ruff check rag/graphrag/ rag/utils/ob_conn.py ```	2026-06-09 20:50:45 +08:00
Jack	3eff41361b	fix: prevent None values in auto-metadata from causing KeyError (#15842 ) ## Problem When users configure auto-metadata for a dataset, parsing crashes with: ``` KeyError: 'properties' in gen_metadata → schema["properties"] ``` ## Root Cause Pydantic `AutoMetadataField` defaults `enum` and `description` to `None` when the frontend omits these fields: ```python class AutoMetadataField(Base): enum: Annotated[list[str] | None, Field(default=None)] description: Annotated[str | None, Field(default=None)] ``` These `None` values propagate through the call chain and cause two crashes:	2026-06-09 19:10:48 +08:00
Jonathan Chang	c586292993	feat: Implement checkpoint/resume support for GraphRAG community extraction and entity resolution (#15523 ) ## Summary This PR adds checkpoint/resume support for the GraphRAG `extract_community` and `resolve_entities` stages. The implementation stores successful intermediate results in the document store so interrupted ingestion can resume without repeating already-completed LLM work. Checkpoints are loaded before each stage, reused when available, saved after successful batch/community processing, and cleaned up after the stage completes successfully. ## Related Issue Closes: #15518 ## Change Type - [x] Feature - [x] Bug fix - [x] Test - [ ] Refactor - [ ] Documentation - [ ] Breaking change ## Real Behavior Proof Validation commands run locally: ```bash uv run python -m py_compile \ rag/graphrag/checkpoints.py \ rag/graphrag/general/community_reports_extractor.py \ rag/graphrag/entity_resolution.py \ rag/graphrag/general/index.py \ test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text Passed ``` ```bash uv run pytest test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text 4 passed ``` ```bash uv run pytest \ test/unit_test/rag/graphrag/test_phase_markers.py \ test/unit_test/rag/graphrag/test_graphrag_utils.py \ test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text 95 passed ``` ```bash git diff --check ``` Result: ```text Passed ``` ## Checklist - [x] Implemented checkpoint/resume support for `extract_community`. - [x] Implemented checkpoint/resume support for `resolve_entities`. - [x] Avoided touching unrelated API behavior. - [x] Added unit tests for the new checkpoint helper logic. - [x] Verified Python syntax compilation. - [x] Ran related GraphRAG unit tests successfully. - [x] Ran `git diff --check`. - [ ] Ran full project test suite. --------- Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-09 15:34:47 +08:00
Yash Raj Pandey	f2aadd3871	Fix: is_english() returns False for any list argument (broken language detection) (#15489 ) ### What problem does this PR solve? `is_english()` in `rag/nlp/__init__.py` compiles a single-character regex class and `fullmatch`es it against each item: ```python pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]") # no quantifier ... eng = sum(1 for t in texts if pattern.fullmatch(t.strip())) ``` For a string argument the text is first split into single characters (`texts = list(texts)`), so each `fullmatch` sees one character and works. But for a list argument each item is a whole multi-character string, and `fullmatch` of a one-character pattern against a multi-character string always fails — so `is_english()` returns `False` for any list, regardless of content. ```python is_english("This is English") # True (ok) is_english(["The quick brown fox jumps.", "Hello world."]) # False (bug — should be True) is_english(["这是中文。"]) # False (right answer, wrong reason) ``` Many call sites pass lists and were therefore silently always-`False`, e.g.: - `rag/llm/chat_model.py:1088`, `rag/llm/cv_model.py:168,1155` — `is_english([ans])` when an answer is truncated at `max_tokens`, so an English reply gets the Chinese "······由于长度的原因，回答被截断了，要继续吗？" continuation suffix instead of the English one. - `rag/app/book.py` — `remove_contents_table(..., eng=is_english([...sections...]))`, so English books have their contents table stripped in Chinese mode. - `common/doc_store/es_conn_base.py:339`, `rag/utils/opensearch_conn.py:733` — `is_english(txt.split())` in highlight handling. - plus `rag/app/qa.py`, `rag/flow/parser/utils.py`, `common/doc_store/infinity_conn_base.py`. ### Fix Add a `+` quantifier so an all-English multi-character item matches: ```python pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]+") ``` The string path is unchanged (single characters still match) and non-English lists still return `False`. 
Adds `test/unit_test/rag/test_is_english.py`; the two list cases fail before this change and pass after. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Used the Claude CLI while working on this.	2026-06-08 20:25:23 +08:00
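The effect of the missing quantifier can be reproduced with the two patterns quoted in the commit message. The variable and function names below are illustrative only; this does not reproduce the rest of `is_english()`.

```python
import re

# Pattern before the fix: matches exactly one allowed character.
single_char = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]")
# Pattern after the fix: matches one or more allowed characters.
one_or_more = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]+")


def matches(pattern: re.Pattern, text: str) -> bool:
    """True when the whole stripped text is made of allowed characters."""
    return pattern.fullmatch(text.strip()) is not None


print(matches(single_char, "H"))             # True: one character
print(matches(single_char, "Hello world."))  # False: multi-character item
print(matches(one_or_more, "Hello world."))  # True after adding '+'
print(matches(one_or_more, "这是中文。"))      # False: non-English text
```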
euvre	d9a04ef702	fix: support auto mode in table parser document metadata aggregation (#15780 ) ### What problem does this PR solve? Table parser metadata aggregation previously only ran when `table_column_mode` was set to `manual`. In auto mode (default), all columns default to `"both"` role, meaning they should also be aggregated into document-level metadata for UI/chat filters. Additionally, the task snapshot could be stale — `table_column_names` are written to KB `parser_config` during `chunk()` but the task may have been created before that. Changes: - Renames `aggregate_table_manual_doc_metadata` → `aggregate_table_doc_metadata` - Supports both `"manual"` and `"auto"` `table_column_mode` (defaults to `"auto"`) - Reloads `table_column_names` from KB DB when missing from task snapshot - Removes the manual-only guard in `task_executor` and refactored `post_processor` - Updates all tests with new function name and adds auto mode test cases ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-08 19:08:23 +08:00
天海蒼灆	17f27b9df2	fix(browser): show resolved variables in workflow run log input (#15325 ) ### What problem does this PR solve? Browser parsed `sys.query` from prompts but never called `set_input_value`, so `node_finished` inputs displayed `null` in the agent orchestration run log. Additionally, Browser’s tenant-model path could trigger unsupported structured-output modes (`response_format`/`tool_choice`) for some OpenAI-compatible providers (notably DeepSeek thinking models), causing step failures. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-08 18:12:56 +08:00
Yash Raj Pandey	14c460a525	Fix: Excel parser emits a spurious header-only chunk at exact chunk_rows multiples (#15490 ) ### What problem does this PR solve? `RAGFlowExcelParser.html()` iterates `(len(rows) - 1) // chunk_rows + 1` times. `rows[0]` is the header, so `len(rows) - 1` is the data-row count. When that count is an exact multiple of `chunk_rows`, the `+ 1` over-counts by one: the final iteration's data slice is empty, but the header row is still appended, producing a chunk that contains only the table header and no data. This is reachable via `rag/app/naive.py` (`html4excel`, `chunk_rows=12`) and `rag/app/one.py`. A sheet with 12/24/36… data rows (or 256/512… with the default `chunk_rows=256`) produces an extra `<table><caption>…</caption><tr><th>…</th></tr></table>` chunk. It is non-empty, so it passes the `if _` filter and gets indexed as a real (header-only) chunk.

| data rows (chunk_rows=12) | before | after |
|---|---|---|
| 12 | 2 chunks (1 header-only) | 1 |
| 24 | 3 chunks (1 header-only) | 2 |
| 13 | 2 (unchanged) | 2 |

### Fix Iterate `ceil(n_data / chunk_rows)` times instead of `n_data // chunk_rows + 1`. Adds `test/unit_test/deepdoc/parser/test_excel_parser.py`; the header-only-chunk cases fail before this change and pass after. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Used the Claude CLI while working on this.	2026-06-08 17:16:45 +08:00
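The off-by-one in the iteration count can be checked with a few lines of arithmetic. The function names are hypothetical; the two formulas are the ones quoted in the commit message, with `n_data` standing for `len(rows) - 1`.

```python
import math


def chunk_count_before(n_data: int, chunk_rows: int) -> int:
    """Old formula: over-counts by one when n_data is a multiple of chunk_rows."""
    return n_data // chunk_rows + 1


def chunk_count_after(n_data: int, chunk_rows: int) -> int:
    """Fixed formula: the number of non-empty data slices."""
    return math.ceil(n_data / chunk_rows)


for n_data in (12, 24, 13):
    print(n_data, chunk_count_before(n_data, 12), chunk_count_after(n_data, 12))
# 12 2 1
# 24 3 2
# 13 2 2
```

The two formulas agree whenever `n_data` is not a multiple of `chunk_rows`, which is why the extra chunk only appeared at exact multiples.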
buua436	c8c890b06c	fix: refine think stream parsing (#15745 ) ### What problem does this PR solve? Refine the stream parsing for `<think>` / `</think>` so MiniMax and DeepSeek-style chunking both flush in the right order without mixing think and answer buffers. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-08 16:53:22 +08:00

414 Commits