Files
ragflow/common/token_utils.py

186 lines
6.9 KiB
Python
Raw Normal View History

#
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
feat(agent): report accurate aggregated token usage and propagate session/user + input/output to Langfuse for agent runs (#16420) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): ## Summary Agent (Canvas) runs previously did not surface token usage in the SSE stream, and RAGFlow's own Langfuse generations for agent runs were missing the prompt/completion split and the session/user correlation. This made it impossible for an external caller (or Langfuse) to reconcile an agent turn's cost with the upstream provider (e.g. OpenRouter), because a single turn can issue several distinct LLM calls (query rewriting / cross-language translation, multi-round tool reasoning, nested sub-agents, and the final answer). This PR introduces a per-run token usage sink so that **every** LLM call in a run is aggregated and reported once, and enriches Langfuse generations with the prompt/completion split plus session/user attributes. ## What changes ### 1. Per-run token usage sink (`common/token_utils.py`) - Adds two `contextvars`: `token_usage_sink` (a mutable per-run accumulator) and `langfuse_run_attrs` (session_id/user_id for the run). - Adds `record_run_token_usage(...)` (thread-safe via a lock, because `thread_pool_exec` copies the context into worker threads that share the sink dict) and `usage_from_response(...)` which extracts a `{prompt_tokens, completion_tokens, total_tokens}` split from OpenAI/OpenRouter-style responses. ### 2. Provider layer captures the prompt/completion split (`rag/llm/chat_model.py`) - `LiteLLMBase` and `Base` now store `self.last_usage` (prompt/completion/total) for the most recent chat call, in both the plain and tool-calling paths. - Streaming requests set `stream_options.include_usage = True` (LiteLLM path) so the authoritative usage arrives on the final chunk; this is read even on the usage-only chunk that carries no `choices`. - Fixes a multi-round accounting bug in `*_with_tools`: token totals were **overwritten** by each round (`total_tokens = tol`) instead of accumulated, undercounting multi-round tool conversations. Each round is now committed to a running aggregate. ### 3. LLMBundle reports usage once, per call (`api/db/services/llm_service.py`) - New `_report_usage(total_tokens)` records the call's usage into the active run sink and returns the prompt/completion/total split for Langfuse. The split is only used when it is consistent with the authoritative total; otherwise only the total is reported. - All three chat entry points (`async_chat`, `async_chat_streamly`, `async_chat_streamly_delta`) now emit `usage_details` with `input`/`output`/`total` instead of total-only. - `_start_langfuse_observation` now applies `session_id`/`user_id` from the per-run context (`langfuse_run_attrs`) so agent-run generations are correctly grouped, even though agent LLMBundles are constructed without those attributes. ### 4. Canvas installs the sink and emits the aggregate (`agent/canvas.py`) - `Canvas.run()` installs a fresh `token_usage_sink` and `langfuse_run_attrs` (from `user_id`/`session_id`) at the start of every turn. - `message_end` now includes an aggregated `usage` object: `{prompt_tokens, completion_tokens, total_tokens, calls}` covering all LLM calls in the run. ### 5. Pass session id into the run (`api/db/services/canvas_service.py`) - `completion()` forwards `session_id` to `Canvas.run()` for Langfuse session correlation. ## Why a context variable LLM calls in an agent run originate from many places that each build their own `LLMBundle` (e.g. `cross_languages`/`keyword_extraction` helpers, the Agent component, and nested sub-agents invoked as tools). A run-scoped context variable is the only non-invasive chokepoint that captures all of them exactly once, including nested agents (which run in the same async context) and thread-pool tools (the executor copies the context). ## Behavior / compatibility - No public API or wire-format removal: `message_end` gains an additional optional `usage` field; existing consumers are unaffected. - When a provider does not return authoritative usage, behavior falls back to the previous token estimate (total only, no split). - Non-agent flows (Dataflow `Pipeline`, sync `Graph.run`) are untouched. ## Testing - [x] Simple agent answer: `message_end.usage.total_tokens` matches provider usage. - [x] Agent with cross-language retrieval: aggregate equals the sum of both provider calls. - [x] Tool-calling agent (multi-round): total accumulates across rounds. - [x] Nested agent (agent-as-tool): sub-agent tokens included in the parent run total. - [x] Langfuse: agent generations show input/output split and are grouped by session/user. --------- Co-authored-by: yzc <yuzhichang@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-02 04:35:28 +03:00
import contextvars
fix(agent): enforce document access on POST /api/v1/agents/rerun (#15145) ## Related issues Closes #15144 ### What problem does this PR solve? `POST /api/v1/agents/rerun` loaded a pipeline operation log by UUID via `PipelineOperationLogService.get_documents_info` with no authorization, then wiped chunks, reset document counters, deleted tasks, and re-queued dataflow for the victim document. Any authenticated user who knew a victim's pipeline log id could disrupt parsing on documents they did not own. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes | File | Change | |------|--------| | `api/apps/restful_apis/agent_api.py` | Call `DocumentService.accessible(doc["id"], tenant_id)` before destructive rerun operations; deny with generic `"Document not found."` | | `test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py` | Unit tests: cross-tenant log rejected, missing/unauthorized same message, authorized rerun proceeds | ### Security notes - **CWE-639:** Closes cross-tenant pipeline rerun / chunk wipe via leaked log UUID. - `tenant_id` from `@add_tenant_id_to_kwargs` is `current_user.id`; `DocumentService.accessible` covers team-shared KBs. ### Test plan - [ ] `pytest test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py` - [ ] Manual: attacker cannot rerun victim pipeline log id ```bash cd ragflow uv run pytest test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py -q ``` --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 08:34:22 -07:00
import hashlib
feat(agent): report accurate aggregated token usage and propagate session/user + input/output to Langfuse for agent runs (#16420) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): ## Summary Agent (Canvas) runs previously did not surface token usage in the SSE stream, and RAGFlow's own Langfuse generations for agent runs were missing the prompt/completion split and the session/user correlation. This made it impossible for an external caller (or Langfuse) to reconcile an agent turn's cost with the upstream provider (e.g. OpenRouter), because a single turn can issue several distinct LLM calls (query rewriting / cross-language translation, multi-round tool reasoning, nested sub-agents, and the final answer). This PR introduces a per-run token usage sink so that **every** LLM call in a run is aggregated and reported once, and enriches Langfuse generations with the prompt/completion split plus session/user attributes. ## What changes ### 1. Per-run token usage sink (`common/token_utils.py`) - Adds two `contextvars`: `token_usage_sink` (a mutable per-run accumulator) and `langfuse_run_attrs` (session_id/user_id for the run). - Adds `record_run_token_usage(...)` (thread-safe via a lock, because `thread_pool_exec` copies the context into worker threads that share the sink dict) and `usage_from_response(...)` which extracts a `{prompt_tokens, completion_tokens, total_tokens}` split from OpenAI/OpenRouter-style responses. ### 2. Provider layer captures the prompt/completion split (`rag/llm/chat_model.py`) - `LiteLLMBase` and `Base` now store `self.last_usage` (prompt/completion/total) for the most recent chat call, in both the plain and tool-calling paths. - Streaming requests set `stream_options.include_usage = True` (LiteLLM path) so the authoritative usage arrives on the final chunk; this is read even on the usage-only chunk that carries no `choices`. - Fixes a multi-round accounting bug in `*_with_tools`: token totals were **overwritten** by each round (`total_tokens = tol`) instead of accumulated, undercounting multi-round tool conversations. Each round is now committed to a running aggregate. ### 3. LLMBundle reports usage once, per call (`api/db/services/llm_service.py`) - New `_report_usage(total_tokens)` records the call's usage into the active run sink and returns the prompt/completion/total split for Langfuse. The split is only used when it is consistent with the authoritative total; otherwise only the total is reported. - All three chat entry points (`async_chat`, `async_chat_streamly`, `async_chat_streamly_delta`) now emit `usage_details` with `input`/`output`/`total` instead of total-only. - `_start_langfuse_observation` now applies `session_id`/`user_id` from the per-run context (`langfuse_run_attrs`) so agent-run generations are correctly grouped, even though agent LLMBundles are constructed without those attributes. ### 4. Canvas installs the sink and emits the aggregate (`agent/canvas.py`) - `Canvas.run()` installs a fresh `token_usage_sink` and `langfuse_run_attrs` (from `user_id`/`session_id`) at the start of every turn. - `message_end` now includes an aggregated `usage` object: `{prompt_tokens, completion_tokens, total_tokens, calls}` covering all LLM calls in the run. ### 5. Pass session id into the run (`api/db/services/canvas_service.py`) - `completion()` forwards `session_id` to `Canvas.run()` for Langfuse session correlation. ## Why a context variable LLM calls in an agent run originate from many places that each build their own `LLMBundle` (e.g. `cross_languages`/`keyword_extraction` helpers, the Agent component, and nested sub-agents invoked as tools). A run-scoped context variable is the only non-invasive chokepoint that captures all of them exactly once, including nested agents (which run in the same async context) and thread-pool tools (the executor copies the context). ## Behavior / compatibility - No public API or wire-format removal: `message_end` gains an additional optional `usage` field; existing consumers are unaffected. - When a provider does not return authoritative usage, behavior falls back to the previous token estimate (total only, no split). - Non-agent flows (Dataflow `Pipeline`, sync `Graph.run`) are untouched. ## Testing - [x] Simple agent answer: `message_end.usage.total_tokens` matches provider usage. - [x] Agent with cross-language retrieval: aggregate equals the sum of both provider calls. - [x] Tool-calling agent (multi-round): total accumulates across rounds. - [x] Nested agent (agent-as-tool): sub-agent tokens included in the parent run total. - [x] Langfuse: agent generations show input/output split and are grouped by session/user. --------- Co-authored-by: yzc <yuzhichang@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-02 04:35:28 +03:00
import logging
import os
Fix: UserFillUp interactive forms not working in agent explore mode (#14589) ## Summary - **Backend**: `_iter_session_completion_events` in `agent_api.py` was filtering out `user_inputs` and `workflow_finished` SSE events, causing agents with UserFillUp components to silently fail in explore mode — the interactive form never appeared, while the same agent worked correctly in run (editor) mode. - **Frontend**: `SessionChat` component in explore mode was missing `DebugContent` children rendering inside `MessageItem`, so even if the backend forwarded the events, the form UI would not render. Added `DebugContent`, `MarkdownContent`, `useAwaitCompentData` hook, and input-disabling logic to match the run mode's `chat/box.tsx` behavior. ## What was changed ### Backend (`api/apps/restful_apis/agent_api.py`) - Line 266: Added `"user_inputs"` and `"workflow_finished"` to the allowed event filter in `_iter_session_completion_events` ### Frontend (`web/src/pages/agent/explore/components/session-chat.tsx`) - Added imports: `DebugContent`, `MarkdownContent`, `useAwaitCompentData`, `useParams` - Added `sendFormMessage` from `useSendSessionMessage()` hook - Added `useAwaitCompentData` hook for form state management - Added `DebugContent` as `MessageItem` children for the latest assistant message (renders UserFillUp form) - Added `MarkdownContent` + submitted values display for previous assistant messages - Updated `NextMessageInput` disabled states to respect `isWaitting` (form submission in progress) ## Test plan - [x] Agent with UserFillUp component (e.g., email draft with send/edit/cancel options) shows interactive form in **explore mode** - [x] Same agent continues to work correctly in **run (editor) mode** - [x] Form submission sends data back to the agent and workflow continues - [x] Input field is disabled while waiting for form submission - [ ] Agents without UserFillUp components are unaffected in explore mode 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 21:57:57 +08:00
import shutil
feat(agent): report accurate aggregated token usage and propagate session/user + input/output to Langfuse for agent runs (#16420) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): ## Summary Agent (Canvas) runs previously did not surface token usage in the SSE stream, and RAGFlow's own Langfuse generations for agent runs were missing the prompt/completion split and the session/user correlation. This made it impossible for an external caller (or Langfuse) to reconcile an agent turn's cost with the upstream provider (e.g. OpenRouter), because a single turn can issue several distinct LLM calls (query rewriting / cross-language translation, multi-round tool reasoning, nested sub-agents, and the final answer). This PR introduces a per-run token usage sink so that **every** LLM call in a run is aggregated and reported once, and enriches Langfuse generations with the prompt/completion split plus session/user attributes. ## What changes ### 1. Per-run token usage sink (`common/token_utils.py`) - Adds two `contextvars`: `token_usage_sink` (a mutable per-run accumulator) and `langfuse_run_attrs` (session_id/user_id for the run). - Adds `record_run_token_usage(...)` (thread-safe via a lock, because `thread_pool_exec` copies the context into worker threads that share the sink dict) and `usage_from_response(...)` which extracts a `{prompt_tokens, completion_tokens, total_tokens}` split from OpenAI/OpenRouter-style responses. ### 2. Provider layer captures the prompt/completion split (`rag/llm/chat_model.py`) - `LiteLLMBase` and `Base` now store `self.last_usage` (prompt/completion/total) for the most recent chat call, in both the plain and tool-calling paths. - Streaming requests set `stream_options.include_usage = True` (LiteLLM path) so the authoritative usage arrives on the final chunk; this is read even on the usage-only chunk that carries no `choices`. - Fixes a multi-round accounting bug in `*_with_tools`: token totals were **overwritten** by each round (`total_tokens = tol`) instead of accumulated, undercounting multi-round tool conversations. Each round is now committed to a running aggregate. ### 3. LLMBundle reports usage once, per call (`api/db/services/llm_service.py`) - New `_report_usage(total_tokens)` records the call's usage into the active run sink and returns the prompt/completion/total split for Langfuse. The split is only used when it is consistent with the authoritative total; otherwise only the total is reported. - All three chat entry points (`async_chat`, `async_chat_streamly`, `async_chat_streamly_delta`) now emit `usage_details` with `input`/`output`/`total` instead of total-only. - `_start_langfuse_observation` now applies `session_id`/`user_id` from the per-run context (`langfuse_run_attrs`) so agent-run generations are correctly grouped, even though agent LLMBundles are constructed without those attributes. ### 4. Canvas installs the sink and emits the aggregate (`agent/canvas.py`) - `Canvas.run()` installs a fresh `token_usage_sink` and `langfuse_run_attrs` (from `user_id`/`session_id`) at the start of every turn. - `message_end` now includes an aggregated `usage` object: `{prompt_tokens, completion_tokens, total_tokens, calls}` covering all LLM calls in the run. ### 5. Pass session id into the run (`api/db/services/canvas_service.py`) - `completion()` forwards `session_id` to `Canvas.run()` for Langfuse session correlation. ## Why a context variable LLM calls in an agent run originate from many places that each build their own `LLMBundle` (e.g. `cross_languages`/`keyword_extraction` helpers, the Agent component, and nested sub-agents invoked as tools). A run-scoped context variable is the only non-invasive chokepoint that captures all of them exactly once, including nested agents (which run in the same async context) and thread-pool tools (the executor copies the context). ## Behavior / compatibility - No public API or wire-format removal: `message_end` gains an additional optional `usage` field; existing consumers are unaffected. - When a provider does not return authoritative usage, behavior falls back to the previous token estimate (total only, no split). - Non-agent flows (Dataflow `Pipeline`, sync `Graph.run`) are untouched. ## Testing - [x] Simple agent answer: `message_end.usage.total_tokens` matches provider usage. - [x] Agent with cross-language retrieval: aggregate equals the sum of both provider calls. - [x] Tool-calling agent (multi-round): total accumulates across rounds. - [x] Nested agent (agent-as-tool): sub-agent tokens included in the parent run total. - [x] Langfuse: agent generations show input/output split and are grouped by session/user. --------- Co-authored-by: yzc <yuzhichang@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-02 04:35:28 +03:00
import threading
import tiktoken
from common.file_utils import get_project_base_directory
Fix: UserFillUp interactive forms not working in agent explore mode (#14589) ## Summary - **Backend**: `_iter_session_completion_events` in `agent_api.py` was filtering out `user_inputs` and `workflow_finished` SSE events, causing agents with UserFillUp components to silently fail in explore mode — the interactive form never appeared, while the same agent worked correctly in run (editor) mode. - **Frontend**: `SessionChat` component in explore mode was missing `DebugContent` children rendering inside `MessageItem`, so even if the backend forwarded the events, the form UI would not render. Added `DebugContent`, `MarkdownContent`, `useAwaitCompentData` hook, and input-disabling logic to match the run mode's `chat/box.tsx` behavior. ## What was changed ### Backend (`api/apps/restful_apis/agent_api.py`) - Line 266: Added `"user_inputs"` and `"workflow_finished"` to the allowed event filter in `_iter_session_completion_events` ### Frontend (`web/src/pages/agent/explore/components/session-chat.tsx`) - Added imports: `DebugContent`, `MarkdownContent`, `useAwaitCompentData`, `useParams` - Added `sendFormMessage` from `useSendSessionMessage()` hook - Added `useAwaitCompentData` hook for form state management - Added `DebugContent` as `MessageItem` children for the latest assistant message (renders UserFillUp form) - Added `MarkdownContent` + submitted values display for previous assistant messages - Updated `NextMessageInput` disabled states to respect `isWaitting` (form submission in progress) ## Test plan - [x] Agent with UserFillUp component (e.g., email draft with send/edit/cancel options) shows interactive form in **explore mode** - [x] Same agent continues to work correctly in **run (editor) mode** - [x] Form submission sends data back to the agent and workflow continues - [x] Input field is disabled while waiting for form submission - [ ] Agents without UserFillUp components are unaffected in explore mode 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 21:57:57 +08:00
def _ensure_tiktoken_cache() -> str:
cache_dir = get_project_base_directory()
os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir
bundled_encoding_path = get_project_base_directory("ragflow_deps", "cl100k_base.tiktoken")
fix(agent): enforce document access on POST /api/v1/agents/rerun (#15145) ## Related issues Closes #15144 ### What problem does this PR solve? `POST /api/v1/agents/rerun` loaded a pipeline operation log by UUID via `PipelineOperationLogService.get_documents_info` with no authorization, then wiped chunks, reset document counters, deleted tasks, and re-queued dataflow for the victim document. Any authenticated user who knew a victim's pipeline log id could disrupt parsing on documents they did not own. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes | File | Change | |------|--------| | `api/apps/restful_apis/agent_api.py` | Call `DocumentService.accessible(doc["id"], tenant_id)` before destructive rerun operations; deny with generic `"Document not found."` | | `test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py` | Unit tests: cross-tenant log rejected, missing/unauthorized same message, authorized rerun proceeds | ### Security notes - **CWE-639:** Closes cross-tenant pipeline rerun / chunk wipe via leaked log UUID. - `tenant_id` from `@add_tenant_id_to_kwargs` is `current_user.id`; `DocumentService.accessible` covers team-shared KBs. ### Test plan - [ ] `pytest test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py` - [ ] Manual: attacker cannot rerun victim pipeline log id ```bash cd ragflow uv run pytest test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py -q ``` --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 08:34:22 -07:00
encoding_url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cached_encoding_path = os.path.join(cache_dir, hashlib.sha1(encoding_url.encode()).hexdigest())
Fix: UserFillUp interactive forms not working in agent explore mode (#14589) ## Summary - **Backend**: `_iter_session_completion_events` in `agent_api.py` was filtering out `user_inputs` and `workflow_finished` SSE events, causing agents with UserFillUp components to silently fail in explore mode — the interactive form never appeared, while the same agent worked correctly in run (editor) mode. - **Frontend**: `SessionChat` component in explore mode was missing `DebugContent` children rendering inside `MessageItem`, so even if the backend forwarded the events, the form UI would not render. Added `DebugContent`, `MarkdownContent`, `useAwaitCompentData` hook, and input-disabling logic to match the run mode's `chat/box.tsx` behavior. ## What was changed ### Backend (`api/apps/restful_apis/agent_api.py`) - Line 266: Added `"user_inputs"` and `"workflow_finished"` to the allowed event filter in `_iter_session_completion_events` ### Frontend (`web/src/pages/agent/explore/components/session-chat.tsx`) - Added imports: `DebugContent`, `MarkdownContent`, `useAwaitCompentData`, `useParams` - Added `sendFormMessage` from `useSendSessionMessage()` hook - Added `useAwaitCompentData` hook for form state management - Added `DebugContent` as `MessageItem` children for the latest assistant message (renders UserFillUp form) - Added `MarkdownContent` + submitted values display for previous assistant messages - Updated `NextMessageInput` disabled states to respect `isWaitting` (form submission in progress) ## Test plan - [x] Agent with UserFillUp component (e.g., email draft with send/edit/cancel options) shows interactive form in **explore mode** - [x] Same agent continues to work correctly in **run (editor) mode** - [x] Form submission sends data back to the agent and workflow continues - [x] Input field is disabled while waiting for form submission - [ ] Agents without UserFillUp components are unaffected in explore mode 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-28 21:57:57 +08:00
if os.path.exists(bundled_encoding_path) and not os.path.exists(cached_encoding_path):
shutil.copyfile(bundled_encoding_path, cached_encoding_path)
return cache_dir
tiktoken_cache_dir = _ensure_tiktoken_cache()
os.environ["TIKTOKEN_CACHE_DIR"] = tiktoken_cache_dir
# encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
encoder = tiktoken.get_encoding("cl100k_base")
feat(agent): report accurate aggregated token usage and propagate session/user + input/output to Langfuse for agent runs (#16420) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): ## Summary Agent (Canvas) runs previously did not surface token usage in the SSE stream, and RAGFlow's own Langfuse generations for agent runs were missing the prompt/completion split and the session/user correlation. This made it impossible for an external caller (or Langfuse) to reconcile an agent turn's cost with the upstream provider (e.g. OpenRouter), because a single turn can issue several distinct LLM calls (query rewriting / cross-language translation, multi-round tool reasoning, nested sub-agents, and the final answer). This PR introduces a per-run token usage sink so that **every** LLM call in a run is aggregated and reported once, and enriches Langfuse generations with the prompt/completion split plus session/user attributes. ## What changes ### 1. Per-run token usage sink (`common/token_utils.py`) - Adds two `contextvars`: `token_usage_sink` (a mutable per-run accumulator) and `langfuse_run_attrs` (session_id/user_id for the run). - Adds `record_run_token_usage(...)` (thread-safe via a lock, because `thread_pool_exec` copies the context into worker threads that share the sink dict) and `usage_from_response(...)` which extracts a `{prompt_tokens, completion_tokens, total_tokens}` split from OpenAI/OpenRouter-style responses. ### 2. Provider layer captures the prompt/completion split (`rag/llm/chat_model.py`) - `LiteLLMBase` and `Base` now store `self.last_usage` (prompt/completion/total) for the most recent chat call, in both the plain and tool-calling paths. - Streaming requests set `stream_options.include_usage = True` (LiteLLM path) so the authoritative usage arrives on the final chunk; this is read even on the usage-only chunk that carries no `choices`. - Fixes a multi-round accounting bug in `*_with_tools`: token totals were **overwritten** by each round (`total_tokens = tol`) instead of accumulated, undercounting multi-round tool conversations. Each round is now committed to a running aggregate. ### 3. LLMBundle reports usage once, per call (`api/db/services/llm_service.py`) - New `_report_usage(total_tokens)` records the call's usage into the active run sink and returns the prompt/completion/total split for Langfuse. The split is only used when it is consistent with the authoritative total; otherwise only the total is reported. - All three chat entry points (`async_chat`, `async_chat_streamly`, `async_chat_streamly_delta`) now emit `usage_details` with `input`/`output`/`total` instead of total-only. - `_start_langfuse_observation` now applies `session_id`/`user_id` from the per-run context (`langfuse_run_attrs`) so agent-run generations are correctly grouped, even though agent LLMBundles are constructed without those attributes. ### 4. Canvas installs the sink and emits the aggregate (`agent/canvas.py`) - `Canvas.run()` installs a fresh `token_usage_sink` and `langfuse_run_attrs` (from `user_id`/`session_id`) at the start of every turn. - `message_end` now includes an aggregated `usage` object: `{prompt_tokens, completion_tokens, total_tokens, calls}` covering all LLM calls in the run. ### 5. Pass session id into the run (`api/db/services/canvas_service.py`) - `completion()` forwards `session_id` to `Canvas.run()` for Langfuse session correlation. ## Why a context variable LLM calls in an agent run originate from many places that each build their own `LLMBundle` (e.g. `cross_languages`/`keyword_extraction` helpers, the Agent component, and nested sub-agents invoked as tools). A run-scoped context variable is the only non-invasive chokepoint that captures all of them exactly once, including nested agents (which run in the same async context) and thread-pool tools (the executor copies the context). ## Behavior / compatibility - No public API or wire-format removal: `message_end` gains an additional optional `usage` field; existing consumers are unaffected. - When a provider does not return authoritative usage, behavior falls back to the previous token estimate (total only, no split). - Non-agent flows (Dataflow `Pipeline`, sync `Graph.run`) are untouched. ## Testing - [x] Simple agent answer: `message_end.usage.total_tokens` matches provider usage. - [x] Agent with cross-language retrieval: aggregate equals the sum of both provider calls. - [x] Tool-calling agent (multi-round): total accumulates across rounds. - [x] Nested agent (agent-as-tool): sub-agent tokens included in the parent run total. - [x] Langfuse: agent generations show input/output split and are grouped by session/user. --------- Co-authored-by: yzc <yuzhichang@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-02 04:35:28 +03:00
# Per-run token usage sink. An agent run (Canvas.run) installs a mutable dict here
# at the start of each turn; every LLMBundle chat call adds its provider-reported
# usage to it. This is the single chokepoint that aggregates token usage across all
# LLM calls in a run (query rewriting, cross-language translation, tool reasoning,
# and the final streamed answer) regardless of which component or helper issued the
# call. Default None means "not inside a tracked run" and callers must no-op.
token_usage_sink: contextvars.ContextVar = contextvars.ContextVar("ragflow_token_usage_sink", default=None)
# Per-run Langfuse correlating attributes (e.g. {"session_id": ..., "user_id": ...}).
# Installed by Canvas.run so RAGFlow's own Langfuse generations can be grouped by
# session and user even though the agent's LLMBundles are created without them.
langfuse_run_attrs: contextvars.ContextVar = contextvars.ContextVar("ragflow_langfuse_run_attrs", default=None)
# Guards sink mutations: concurrent tool calls (asyncio.gather + thread_pool_exec,
# which copies the context so worker threads share the same sink dict) can otherwise
# race on the read-modify-write of the counters.
_sink_lock = threading.Lock()
def record_run_token_usage(prompt_tokens: int = 0, completion_tokens: int = 0, total_tokens: int = 0) -> None:
"""Add a single LLM call's token usage to the active run sink, if any.
Safe to call from anywhere: when no run sink is installed it does nothing.
"""
sink = token_usage_sink.get()
if sink is None:
return
try:
with _sink_lock:
sink["prompt_tokens"] += int(prompt_tokens or 0)
sink["completion_tokens"] += int(completion_tokens or 0)
sink["total_tokens"] += int(total_tokens or 0)
sink["calls"] += 1
except Exception:
# Never let usage bookkeeping break a request; log at debug so a malformed
# sink or token value is still traceable without adding noise.
logging.debug("Failed to record run token usage", exc_info=True)
def usage_from_response(resp) -> dict:
"""Extract a {prompt_tokens, completion_tokens, total_tokens} split from an LLM response.
Handles OpenAI/OpenRouter-style ``resp.usage`` objects and dict variants. Missing
fields default to 0; ``total_tokens`` falls back to prompt+completion when absent.
"""
out = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
if resp is None:
return out
usage = None
try:
usage = getattr(resp, "usage", None)
if usage is None and isinstance(resp, dict):
usage = resp.get("usage")
except Exception:
usage = None
if usage is None:
return out
def _get(obj, *names):
for n in names:
try:
v = obj.get(n) if isinstance(obj, dict) else getattr(obj, n, None)
except Exception:
v = None
if v:
return int(v)
return 0
out["prompt_tokens"] = _get(usage, "prompt_tokens", "input_tokens")
out["completion_tokens"] = _get(usage, "completion_tokens", "output_tokens")
out["total_tokens"] = _get(usage, "total_tokens")
if not out["total_tokens"]:
out["total_tokens"] = out["prompt_tokens"] + out["completion_tokens"]
return out
def num_tokens_from_string(string: str) -> int:
"""Returns the number of tokens in a text string."""
try:
code_list = encoder.encode(string)
return len(code_list)
except Exception:
return 0
feat(agent): report accurate aggregated token usage and propagate session/user + input/output to Langfuse for agent runs (#16420) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): ## Summary Agent (Canvas) runs previously did not surface token usage in the SSE stream, and RAGFlow's own Langfuse generations for agent runs were missing the prompt/completion split and the session/user correlation. This made it impossible for an external caller (or Langfuse) to reconcile an agent turn's cost with the upstream provider (e.g. OpenRouter), because a single turn can issue several distinct LLM calls (query rewriting / cross-language translation, multi-round tool reasoning, nested sub-agents, and the final answer). This PR introduces a per-run token usage sink so that **every** LLM call in a run is aggregated and reported once, and enriches Langfuse generations with the prompt/completion split plus session/user attributes. ## What changes ### 1. Per-run token usage sink (`common/token_utils.py`) - Adds two `contextvars`: `token_usage_sink` (a mutable per-run accumulator) and `langfuse_run_attrs` (session_id/user_id for the run). - Adds `record_run_token_usage(...)` (thread-safe via a lock, because `thread_pool_exec` copies the context into worker threads that share the sink dict) and `usage_from_response(...)` which extracts a `{prompt_tokens, completion_tokens, total_tokens}` split from OpenAI/OpenRouter-style responses. ### 2. Provider layer captures the prompt/completion split (`rag/llm/chat_model.py`) - `LiteLLMBase` and `Base` now store `self.last_usage` (prompt/completion/total) for the most recent chat call, in both the plain and tool-calling paths. - Streaming requests set `stream_options.include_usage = True` (LiteLLM path) so the authoritative usage arrives on the final chunk; this is read even on the usage-only chunk that carries no `choices`. - Fixes a multi-round accounting bug in `*_with_tools`: token totals were **overwritten** by each round (`total_tokens = tol`) instead of accumulated, undercounting multi-round tool conversations. Each round is now committed to a running aggregate. ### 3. LLMBundle reports usage once, per call (`api/db/services/llm_service.py`) - New `_report_usage(total_tokens)` records the call's usage into the active run sink and returns the prompt/completion/total split for Langfuse. The split is only used when it is consistent with the authoritative total; otherwise only the total is reported. - All three chat entry points (`async_chat`, `async_chat_streamly`, `async_chat_streamly_delta`) now emit `usage_details` with `input`/`output`/`total` instead of total-only. - `_start_langfuse_observation` now applies `session_id`/`user_id` from the per-run context (`langfuse_run_attrs`) so agent-run generations are correctly grouped, even though agent LLMBundles are constructed without those attributes. ### 4. Canvas installs the sink and emits the aggregate (`agent/canvas.py`) - `Canvas.run()` installs a fresh `token_usage_sink` and `langfuse_run_attrs` (from `user_id`/`session_id`) at the start of every turn. - `message_end` now includes an aggregated `usage` object: `{prompt_tokens, completion_tokens, total_tokens, calls}` covering all LLM calls in the run. ### 5. Pass session id into the run (`api/db/services/canvas_service.py`) - `completion()` forwards `session_id` to `Canvas.run()` for Langfuse session correlation. ## Why a context variable LLM calls in an agent run originate from many places that each build their own `LLMBundle` (e.g. `cross_languages`/`keyword_extraction` helpers, the Agent component, and nested sub-agents invoked as tools). A run-scoped context variable is the only non-invasive chokepoint that captures all of them exactly once, including nested agents (which run in the same async context) and thread-pool tools (the executor copies the context). ## Behavior / compatibility - No public API or wire-format removal: `message_end` gains an additional optional `usage` field; existing consumers are unaffected. - When a provider does not return authoritative usage, behavior falls back to the previous token estimate (total only, no split). - Non-agent flows (Dataflow `Pipeline`, sync `Graph.run`) are untouched. ## Testing - [x] Simple agent answer: `message_end.usage.total_tokens` matches provider usage. - [x] Agent with cross-language retrieval: aggregate equals the sum of both provider calls. - [x] Tool-calling agent (multi-round): total accumulates across rounds. - [x] Nested agent (agent-as-tool): sub-agent tokens included in the parent run total. - [x] Langfuse: agent generations show input/output split and are grouped by session/user. --------- Co-authored-by: yzc <yuzhichang@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-02 04:35:28 +03:00
def total_token_count_from_response(resp):
fix(llm): handle None response in total_token_count_from_response (#10941) ### What problem does this PR solve? Fixes #10933 This PR fixes a `TypeError` in the Gemini model provider where the `total_token_count_from_response()` function could receive a `None` response object, causing the error: TypeError: argument of type 'NoneType' is not iterable **Root Cause:** The function attempted to use the `in` operator to check dictionary keys (lines 48, 54, 60) without first validating that `resp` was not `None`. When Gemini's `chat_streamly()` method returns `None`, this triggers the error. **Solution:** 1. Added a null check at the beginning of the function to return `0` if `resp is None` 2. Added `isinstance(resp, dict)` checks before all `in` operations to ensure type safety 3. This defensive programming approach prevents the TypeError while maintaining backward compatibility ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Changes Made **File:** `rag/utils/__init__.py` - Line 36-38: Added `if resp is None: return 0` check - Line 52: Added `isinstance(resp, dict)` before `'usage' in resp` - Line 58: Added `isinstance(resp, dict)` before `'usage' in resp` - Line 64: Added `isinstance(resp, dict)` before `'meta' in resp` ### Testing - [x] Code compiles without errors - [x] Follows existing code style and conventions - [x] Change is minimal and focused on the specific issue ### Additional Notes This fix ensures robust handling of various response types from LLM providers, particularly Gemini, w --------- Signed-off-by: Zhang Zhefang <zhangzhefang@example.com>
2025-11-20 10:04:03 +08:00
"""
Extract token count from LLM response in various formats.
Handles None responses and different response structures from various LLM providers.
Returns 0 if token count cannot be determined.
"""
if resp is None:
return 0
try:
if hasattr(resp, "usage") and hasattr(resp.usage, "total_tokens"):
return resp.usage.total_tokens
except Exception:
pass
try:
if hasattr(resp, "usage_metadata") and hasattr(resp.usage_metadata, "total_tokens"):
return resp.usage_metadata.total_tokens
except Exception:
pass
try:
if hasattr(resp, "meta") and hasattr(resp.meta, "billed_units") and hasattr(resp.meta.billed_units, "input_tokens"):
return resp.meta.billed_units.input_tokens
except Exception:
pass
feat(agent): report accurate aggregated token usage and propagate session/user + input/output to Langfuse for agent runs (#16420) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): ## Summary Agent (Canvas) runs previously did not surface token usage in the SSE stream, and RAGFlow's own Langfuse generations for agent runs were missing the prompt/completion split and the session/user correlation. This made it impossible for an external caller (or Langfuse) to reconcile an agent turn's cost with the upstream provider (e.g. OpenRouter), because a single turn can issue several distinct LLM calls (query rewriting / cross-language translation, multi-round tool reasoning, nested sub-agents, and the final answer). This PR introduces a per-run token usage sink so that **every** LLM call in a run is aggregated and reported once, and enriches Langfuse generations with the prompt/completion split plus session/user attributes. ## What changes ### 1. Per-run token usage sink (`common/token_utils.py`) - Adds two `contextvars`: `token_usage_sink` (a mutable per-run accumulator) and `langfuse_run_attrs` (session_id/user_id for the run). - Adds `record_run_token_usage(...)` (thread-safe via a lock, because `thread_pool_exec` copies the context into worker threads that share the sink dict) and `usage_from_response(...)` which extracts a `{prompt_tokens, completion_tokens, total_tokens}` split from OpenAI/OpenRouter-style responses. ### 2. Provider layer captures the prompt/completion split (`rag/llm/chat_model.py`) - `LiteLLMBase` and `Base` now store `self.last_usage` (prompt/completion/total) for the most recent chat call, in both the plain and tool-calling paths. - Streaming requests set `stream_options.include_usage = True` (LiteLLM path) so the authoritative usage arrives on the final chunk; this is read even on the usage-only chunk that carries no `choices`. - Fixes a multi-round accounting bug in `*_with_tools`: token totals were **overwritten** by each round (`total_tokens = tol`) instead of accumulated, undercounting multi-round tool conversations. Each round is now committed to a running aggregate. ### 3. LLMBundle reports usage once, per call (`api/db/services/llm_service.py`) - New `_report_usage(total_tokens)` records the call's usage into the active run sink and returns the prompt/completion/total split for Langfuse. The split is only used when it is consistent with the authoritative total; otherwise only the total is reported. - All three chat entry points (`async_chat`, `async_chat_streamly`, `async_chat_streamly_delta`) now emit `usage_details` with `input`/`output`/`total` instead of total-only. - `_start_langfuse_observation` now applies `session_id`/`user_id` from the per-run context (`langfuse_run_attrs`) so agent-run generations are correctly grouped, even though agent LLMBundles are constructed without those attributes. ### 4. Canvas installs the sink and emits the aggregate (`agent/canvas.py`) - `Canvas.run()` installs a fresh `token_usage_sink` and `langfuse_run_attrs` (from `user_id`/`session_id`) at the start of every turn. - `message_end` now includes an aggregated `usage` object: `{prompt_tokens, completion_tokens, total_tokens, calls}` covering all LLM calls in the run. ### 5. Pass session id into the run (`api/db/services/canvas_service.py`) - `completion()` forwards `session_id` to `Canvas.run()` for Langfuse session correlation. ## Why a context variable LLM calls in an agent run originate from many places that each build their own `LLMBundle` (e.g. `cross_languages`/`keyword_extraction` helpers, the Agent component, and nested sub-agents invoked as tools). A run-scoped context variable is the only non-invasive chokepoint that captures all of them exactly once, including nested agents (which run in the same async context) and thread-pool tools (the executor copies the context). ## Behavior / compatibility - No public API or wire-format removal: `message_end` gains an additional optional `usage` field; existing consumers are unaffected. - When a provider does not return authoritative usage, behavior falls back to the previous token estimate (total only, no split). - Non-agent flows (Dataflow `Pipeline`, sync `Graph.run`) are untouched. ## Testing - [x] Simple agent answer: `message_end.usage.total_tokens` matches provider usage. - [x] Agent with cross-language retrieval: aggregate equals the sum of both provider calls. - [x] Tool-calling agent (multi-round): total accumulates across rounds. - [x] Nested agent (agent-as-tool): sub-agent tokens included in the parent run total. - [x] Langfuse: agent generations show input/output split and are grouped by session/user. --------- Co-authored-by: yzc <yuzhichang@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-02 04:35:28 +03:00
if isinstance(resp, dict) and "usage" in resp and "total_tokens" in resp["usage"]:
try:
return resp["usage"]["total_tokens"]
except Exception:
pass
feat(agent): report accurate aggregated token usage and propagate session/user + input/output to Langfuse for agent runs (#16420) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): ## Summary Agent (Canvas) runs previously did not surface token usage in the SSE stream, and RAGFlow's own Langfuse generations for agent runs were missing the prompt/completion split and the session/user correlation. This made it impossible for an external caller (or Langfuse) to reconcile an agent turn's cost with the upstream provider (e.g. OpenRouter), because a single turn can issue several distinct LLM calls (query rewriting / cross-language translation, multi-round tool reasoning, nested sub-agents, and the final answer). This PR introduces a per-run token usage sink so that **every** LLM call in a run is aggregated and reported once, and enriches Langfuse generations with the prompt/completion split plus session/user attributes. ## What changes ### 1. Per-run token usage sink (`common/token_utils.py`) - Adds two `contextvars`: `token_usage_sink` (a mutable per-run accumulator) and `langfuse_run_attrs` (session_id/user_id for the run). - Adds `record_run_token_usage(...)` (thread-safe via a lock, because `thread_pool_exec` copies the context into worker threads that share the sink dict) and `usage_from_response(...)` which extracts a `{prompt_tokens, completion_tokens, total_tokens}` split from OpenAI/OpenRouter-style responses. ### 2. Provider layer captures the prompt/completion split (`rag/llm/chat_model.py`) - `LiteLLMBase` and `Base` now store `self.last_usage` (prompt/completion/total) for the most recent chat call, in both the plain and tool-calling paths. - Streaming requests set `stream_options.include_usage = True` (LiteLLM path) so the authoritative usage arrives on the final chunk; this is read even on the usage-only chunk that carries no `choices`. - Fixes a multi-round accounting bug in `*_with_tools`: token totals were **overwritten** by each round (`total_tokens = tol`) instead of accumulated, undercounting multi-round tool conversations. Each round is now committed to a running aggregate. ### 3. LLMBundle reports usage once, per call (`api/db/services/llm_service.py`) - New `_report_usage(total_tokens)` records the call's usage into the active run sink and returns the prompt/completion/total split for Langfuse. The split is only used when it is consistent with the authoritative total; otherwise only the total is reported. - All three chat entry points (`async_chat`, `async_chat_streamly`, `async_chat_streamly_delta`) now emit `usage_details` with `input`/`output`/`total` instead of total-only. - `_start_langfuse_observation` now applies `session_id`/`user_id` from the per-run context (`langfuse_run_attrs`) so agent-run generations are correctly grouped, even though agent LLMBundles are constructed without those attributes. ### 4. Canvas installs the sink and emits the aggregate (`agent/canvas.py`) - `Canvas.run()` installs a fresh `token_usage_sink` and `langfuse_run_attrs` (from `user_id`/`session_id`) at the start of every turn. - `message_end` now includes an aggregated `usage` object: `{prompt_tokens, completion_tokens, total_tokens, calls}` covering all LLM calls in the run. ### 5. Pass session id into the run (`api/db/services/canvas_service.py`) - `completion()` forwards `session_id` to `Canvas.run()` for Langfuse session correlation. ## Why a context variable LLM calls in an agent run originate from many places that each build their own `LLMBundle` (e.g. `cross_languages`/`keyword_extraction` helpers, the Agent component, and nested sub-agents invoked as tools). A run-scoped context variable is the only non-invasive chokepoint that captures all of them exactly once, including nested agents (which run in the same async context) and thread-pool tools (the executor copies the context). ## Behavior / compatibility - No public API or wire-format removal: `message_end` gains an additional optional `usage` field; existing consumers are unaffected. - When a provider does not return authoritative usage, behavior falls back to the previous token estimate (total only, no split). - Non-agent flows (Dataflow `Pipeline`, sync `Graph.run`) are untouched. ## Testing - [x] Simple agent answer: `message_end.usage.total_tokens` matches provider usage. - [x] Agent with cross-language retrieval: aggregate equals the sum of both provider calls. - [x] Tool-calling agent (multi-round): total accumulates across rounds. - [x] Nested agent (agent-as-tool): sub-agent tokens included in the parent run total. - [x] Langfuse: agent generations show input/output split and are grouped by session/user. --------- Co-authored-by: yzc <yuzhichang@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-02 04:35:28 +03:00
if isinstance(resp, dict) and "usage" in resp and "input_tokens" in resp["usage"] and "output_tokens" in resp["usage"]:
try:
return resp["usage"]["input_tokens"] + resp["usage"]["output_tokens"]
except Exception:
pass
feat(agent): report accurate aggregated token usage and propagate session/user + input/output to Langfuse for agent runs (#16420) ### What problem does this PR solve? _Briefly describe what this PR aims to solve. Include background context that will help reviewers understand the purpose of the PR._ ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Other (please describe): ## Summary Agent (Canvas) runs previously did not surface token usage in the SSE stream, and RAGFlow's own Langfuse generations for agent runs were missing the prompt/completion split and the session/user correlation. This made it impossible for an external caller (or Langfuse) to reconcile an agent turn's cost with the upstream provider (e.g. OpenRouter), because a single turn can issue several distinct LLM calls (query rewriting / cross-language translation, multi-round tool reasoning, nested sub-agents, and the final answer). This PR introduces a per-run token usage sink so that **every** LLM call in a run is aggregated and reported once, and enriches Langfuse generations with the prompt/completion split plus session/user attributes. ## What changes ### 1. Per-run token usage sink (`common/token_utils.py`) - Adds two `contextvars`: `token_usage_sink` (a mutable per-run accumulator) and `langfuse_run_attrs` (session_id/user_id for the run). - Adds `record_run_token_usage(...)` (thread-safe via a lock, because `thread_pool_exec` copies the context into worker threads that share the sink dict) and `usage_from_response(...)` which extracts a `{prompt_tokens, completion_tokens, total_tokens}` split from OpenAI/OpenRouter-style responses. ### 2. Provider layer captures the prompt/completion split (`rag/llm/chat_model.py`) - `LiteLLMBase` and `Base` now store `self.last_usage` (prompt/completion/total) for the most recent chat call, in both the plain and tool-calling paths. - Streaming requests set `stream_options.include_usage = True` (LiteLLM path) so the authoritative usage arrives on the final chunk; this is read even on the usage-only chunk that carries no `choices`. - Fixes a multi-round accounting bug in `*_with_tools`: token totals were **overwritten** by each round (`total_tokens = tol`) instead of accumulated, undercounting multi-round tool conversations. Each round is now committed to a running aggregate. ### 3. LLMBundle reports usage once, per call (`api/db/services/llm_service.py`) - New `_report_usage(total_tokens)` records the call's usage into the active run sink and returns the prompt/completion/total split for Langfuse. The split is only used when it is consistent with the authoritative total; otherwise only the total is reported. - All three chat entry points (`async_chat`, `async_chat_streamly`, `async_chat_streamly_delta`) now emit `usage_details` with `input`/`output`/`total` instead of total-only. - `_start_langfuse_observation` now applies `session_id`/`user_id` from the per-run context (`langfuse_run_attrs`) so agent-run generations are correctly grouped, even though agent LLMBundles are constructed without those attributes. ### 4. Canvas installs the sink and emits the aggregate (`agent/canvas.py`) - `Canvas.run()` installs a fresh `token_usage_sink` and `langfuse_run_attrs` (from `user_id`/`session_id`) at the start of every turn. - `message_end` now includes an aggregated `usage` object: `{prompt_tokens, completion_tokens, total_tokens, calls}` covering all LLM calls in the run. ### 5. Pass session id into the run (`api/db/services/canvas_service.py`) - `completion()` forwards `session_id` to `Canvas.run()` for Langfuse session correlation. ## Why a context variable LLM calls in an agent run originate from many places that each build their own `LLMBundle` (e.g. `cross_languages`/`keyword_extraction` helpers, the Agent component, and nested sub-agents invoked as tools). A run-scoped context variable is the only non-invasive chokepoint that captures all of them exactly once, including nested agents (which run in the same async context) and thread-pool tools (the executor copies the context). ## Behavior / compatibility - No public API or wire-format removal: `message_end` gains an additional optional `usage` field; existing consumers are unaffected. - When a provider does not return authoritative usage, behavior falls back to the previous token estimate (total only, no split). - Non-agent flows (Dataflow `Pipeline`, sync `Graph.run`) are untouched. ## Testing - [x] Simple agent answer: `message_end.usage.total_tokens` matches provider usage. - [x] Agent with cross-language retrieval: aggregate equals the sum of both provider calls. - [x] Tool-calling agent (multi-round): total accumulates across rounds. - [x] Nested agent (agent-as-tool): sub-agent tokens included in the parent run total. - [x] Langfuse: agent generations show input/output split and are grouped by session/user. --------- Co-authored-by: yzc <yuzhichang@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
2026-07-02 04:35:28 +03:00
if isinstance(resp, dict) and "meta" in resp and "tokens" in resp["meta"] and "input_tokens" in resp["meta"]["tokens"] and "output_tokens" in resp["meta"]["tokens"]:
try:
return resp["meta"]["tokens"]["input_tokens"] + resp["meta"]["tokens"]["output_tokens"]
except Exception:
pass
return 0
def truncate(string: str, max_len: int) -> str:
"""Returns truncated text if the length of text exceed max_len."""
return encoder.decode(encoder.encode(string)[:max_len])