Files

51 lines
2.1 KiB
Python
Raw Permalink Normal View History

fix(llm): correct error handling, token accounting, and truncation in embedding providers (#15424) ### Summary Closes #15423 `rag/llm/embedding_model.py` hosts about 40 embedding providers that shared several defects affecting indexing reliability, cost accounting, and error visibility. This PR fixes four concrete bugs. **Masked, inconsistent errors (27 sites).** Nearly every provider ran `log_exception(_e, res)` followed by `raise Exception(f"Error: {res}")`. Because `log_exception` always raises, the second line was dead code, and the surfaced exception varied with whether the SDK response exposed a `.text` attribute. Every failure path now raises a single `EmbeddingError` that includes the underlying response detail, so the cause of a failed embedding is consistent and visible. **Fabricated token counts.** `LocalAIEmbed` returned a hardcoded `1024` and `OllamaEmbed` added `128` per text. These values feed `used_tokens` and therefore billing and usage tracking. Both now report the real count from the API (Ollama `prompt_eval_count`, LocalAI `usage`) and fall back to a local token count only when the server omits it. **Truncation overshoot.** The `8196` limit used by Mistral and Bedrock exceeded the standard `8192` ceiling and could push boundary sized inputs past the model limit. Limits are corrected to `8192` and made intentional per provider, and providers that rely on server side truncation now request it explicitly (Ollama `truncate=True`, Cohere `truncate="END"`). **Missing batching on Zhipu and Ollama.** Both issued one request per text. They now batch like the other OpenAI compatible providers, turning N round trips into `ceil(N / batch_size)`. Batched results are realigned by response `index` so a chunk always keeps its own vector. A shared `Base._batched_encode` helper owns the batch loop, optional truncation, result accumulation, and the single error path. It is the mechanism that lets these fixes live in one place instead of across 27 duplicated sites. The public `encode()` and `encode_queries()` contract stays the same, so existing callers are unaffected. Tests covering all four fixes are added under `test/unit_test/rag/llm/test_embedding_model.py`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-11 07:29:46 -04:00
#
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""Shared setup for RAGFlow unit tests.
Several parsers and the chunking pipeline tokenize text with NLTK, which needs
the ``punkt_tab`` and ``wordnet`` data sets. Production provisions these via
``download_deps.py`` (into ``nltk_data``, exported as ``NLTK_DATA`` by
``docker/launch_backend_service.sh``) and ``api.validation`` at startup, but the
unit-test runner has neither. Without the data, tokenizer-backed tests such as
``test_epub_parser`` and ``test_dataflow_service`` fail with
``LookupError: Resource 'punkt_tab' not found``. Make sure the data is reachable
before any test imports a tokenizer: reuse a provisioned ``nltk_data`` directory
when present, and download only what is still missing.
"""
import os
import nltk
# Reuse data already fetched by download_deps.py (the directory the app exports
# as NLTK_DATA) so provisioned environments do not download it again.
_REPO_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), os.pardir, os.pardir))
feat(agent): align Go agent behavior with Python (except retrieval component) (#16225) ## Summary Aligns the **Go agent runtime/canvas/components/tools** behavior with the **Python `agent/` implementation** so the same stored canvas DSL produces the same execution result on either side. Every component, tool, and runtime primitive in `internal/agent/` is now driven by the same semantics as its Python counterpart — variable resolution, template substitution, control flow, error reporting, retry/cancel, and stream event shapes. The **retrieval component is the one explicit exception** in this PR. It is being reworked in a separate change and is excluded from this alignment pass; the wrapper slot (`universe_a_wrappers.go → newRetrievalComponent`) is preserved. ## Scope of alignment ### Components (all aligned with `agent/component/`) `Begin` · `Message` · `LLM` (incl. ChatTemplateKwargs, MessageHistoryWindowSize, VisualFiles, Cite, OutputStructure, JSONOutput, TopP, MaxRetries, DelayAfterError, credentials) · `Agent` (react + tool artifact capture + `Reset()` interface-assert) · `Switch` (12/12 operators, Python-equivalent semantics) · `Categorize` · `Invoke` · `Iteration` · `Loop` (macro-expansion through `workflowx.AddLoopNode`) · `UserFillUp` (Python-equivalent interrupt/resume via eino `compose.Interrupt`/`ResumeWithData`) · `FillUp` · `DataOperations` · `ListOperations` · `StringTransform` · `VariableAggregator` · `VariableAssigner` · `Browser` (full stagehand runtime parity) · `DocsGenerator` · `ExcelProcessor`. ### Tools (all aligned with `agent/tools/`) `Retrieval` (wrapper slot only — logic out of scope) · `MCPToolAdapter` (streamable-HTTP) · `CodeExec` (sandbox bridge with `code_exec_contract.go` matching Python contract) · `AkShare` · `ArXiv` · `Crawler` · `DeepL` · `DuckDuckGo` · `Email` · `ExeSQL` · `GitHub` · `Google` · `GoogleScholar` · `Jin10` · `PubMed` · `QWeather` · `SearXNG` · `Tavily` · `Tushare` · `Wencai` · `Wikipedia` · `YahooFinance` — uniform `eino tool.InvokableTool` interface, SSRF protection, shared HTTP client. ### Canvas execution engine (`internal/agent/canvas/`) Aligned with Python's `agent/canvas.py`: - **Scheduler** (`scheduler.go`): state pre/post handlers, node lambdas, per-component timeout resolver (4-level: per-class env → per-class table → uniform env → 600s fallback), `legacyNoOpNames`. - **Loop subgraph** (`loop_subgraph.go`): Python-equivalent `AddLoopNode` macro expansion + condition translation. - **Multibranch** (`multibranch.go`): `Switch` / `Categorize` routing via `compose.NewGraphMultiBranch` — same branch selection semantics as Python. - **Parallel subgraph** (`parallel_subgraph.go`): matches Python's parallel fan-out contract. - **Interrupt/Resume** (`interrupt_resume.go`): `UserFillUpNodeBody` / `IsInterruptError` / `ExtractInterruptContexts` — replaces the deprecated Python sentinel chain with eino's native interrupt API, preserving the same external behavior. - **Checkpoint** (`checkpoint_store.go`): `RedisCheckPointStore` Get/Set/Delete, with business metadata (status / canvas_id / parent_run_id) on a parallel Redis Hash. - **RunTracker** (`run_tracker.go`): Start / MarkSucceeded / MarkFailed / MarkCancelled / AttachCheckpoint — same lifecycle as the Python run record. - **Cancel** (`cancel.go`): Redis pub/sub watch. - **Stream** (`stream.go`): SSE channel with `messages` / `waiting` / `errors` / `done` events, same shape as Python's `agent.canvas.RunEvent` payload. ### DSL bridge (`internal/agent/dsl/`) - `normalize.go`: v1↔v2 collapsed into a single wire format — Python and Go consume the same stored JSON. - `reset.go`: per-run state reset matches Python's `Canvas.reset()` semantics. - Testdata mirrors Python's `agent_msg.json` / `all.json` / etc. ### Runtime (`internal/agent/runtime/`) - `CanvasState` / `NewCanvasState` / `GetVar` / `SetVar` / `ReadVars`: same `{{cpn_id@param}}` resolution model. - `ResolveTemplate` (regex fast path + gonja fallback) — Python Jinja-style semantics. - `selector.go`, `metrics.go`, `component.go`: shared runtime contracts. ## Out of scope (intentionally) - **`Retrieval` component logic** — wrapped only; full parity lands in a follow-up PR. - **Frontend** — only minor dsl-bridge / canvas UX fixes ride along. - **CLI / admin / model registry** — orthogonal to agent behavior. ## How alignment is verified `internal/service/agent_run_e2e_test.go` exercises the **full production chain** against real Python-shaped DSL fixtures: ``` loadCanvasForUser → versionDAO.GetLatest → decodeCanvasFromDSL → canvas.Compile → cc.Workflow.Invoke → answer extraction ``` using in-memory SQLite + miniredis (no Docker). Covers: - `TestRunAgent_RealCanvas_BeginMessage` — happy path, `{{sys.query}}` resolution - `TestRunAgent_RealCanvas_WaitForUserResume` — two-run resume cycle (Python-equivalent) - `TestRunAgent_RealCanvas_CompileFails` — unknown component name → sanitized error (Python-equivalent) - `TestRunAgent_RealCanvas_InvokeFails` — unresolvable template ref (Python-equivalent) - `TestRunAgent_RunTracker_AttachCheckpoint_CallSequence` — Start→AttachCheckpoint→MarkSucceeded lifecycle `internal/handler/agent_test.go` — SSE streaming parity (`Content-Type: text/event-stream`, `data: {…}\n\n`, trailing `data: [DONE]\n\n`, OpenAI-compatible non-stream `choices`). `internal/agent/canvas/fixture_compile_test.go` + per-component tests pin the Python-equivalent outputs. ``` go test -count=1 -v -run 'TestRunAgent_RealCanvas|TestRunAgent_RunTracker' ./internal/service/ ``` ## Design reference `docs/develop/agent-go-port-design.md` (1329 lines, last cross-checked 2026-06-17) — module layout, per-component / per-tool inventory, corner-case catalogue, and the actionable backlog (Section 14, including the retrieval alignment follow-up). --------- Co-authored-by: Claude <noreply@anthropic.com>
2026-06-22 11:58:29 +08:00
_LOCAL_NLTK_DATA = os.path.join(_REPO_ROOT, "ragflow_deps", "nltk_data")
fix(llm): correct error handling, token accounting, and truncation in embedding providers (#15424) ### Summary Closes #15423 `rag/llm/embedding_model.py` hosts about 40 embedding providers that shared several defects affecting indexing reliability, cost accounting, and error visibility. This PR fixes four concrete bugs. **Masked, inconsistent errors (27 sites).** Nearly every provider ran `log_exception(_e, res)` followed by `raise Exception(f"Error: {res}")`. Because `log_exception` always raises, the second line was dead code, and the surfaced exception varied with whether the SDK response exposed a `.text` attribute. Every failure path now raises a single `EmbeddingError` that includes the underlying response detail, so the cause of a failed embedding is consistent and visible. **Fabricated token counts.** `LocalAIEmbed` returned a hardcoded `1024` and `OllamaEmbed` added `128` per text. These values feed `used_tokens` and therefore billing and usage tracking. Both now report the real count from the API (Ollama `prompt_eval_count`, LocalAI `usage`) and fall back to a local token count only when the server omits it. **Truncation overshoot.** The `8196` limit used by Mistral and Bedrock exceeded the standard `8192` ceiling and could push boundary sized inputs past the model limit. Limits are corrected to `8192` and made intentional per provider, and providers that rely on server side truncation now request it explicitly (Ollama `truncate=True`, Cohere `truncate="END"`). **Missing batching on Zhipu and Ollama.** Both issued one request per text. They now batch like the other OpenAI compatible providers, turning N round trips into `ceil(N / batch_size)`. Batched results are realigned by response `index` so a chunk always keeps its own vector. A shared `Base._batched_encode` helper owns the batch loop, optional truncation, result accumulation, and the single error path. It is the mechanism that lets these fixes live in one place instead of across 27 duplicated sites. The public `encode()` and `encode_queries()` contract stays the same, so existing callers are unaffected. Tests covering all four fixes are added under `test/unit_test/rag/llm/test_embedding_model.py`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)
2026-06-11 07:29:46 -04:00
if os.path.isdir(_LOCAL_NLTK_DATA) and _LOCAL_NLTK_DATA not in nltk.data.path:
nltk.data.path.insert(0, _LOCAL_NLTK_DATA)
# (download name, resource path used by nltk.data.find)
_REQUIRED_NLTK_DATA = (
("punkt_tab", "tokenizers/punkt_tab"),
("wordnet", "corpora/wordnet"),
)
for _name, _find_path in _REQUIRED_NLTK_DATA:
try:
nltk.data.find(_find_path)
except LookupError:
nltk.download(_name, quiet=True)