ragflow/docs/develop/agent-go-port-design.md
Zhichang Yu 3fa15c0e2f feat(agent): Go port — canvas engine, 22 components, DSL v2, 13 endpoints (#15952)
Ports the agent canvas subsystem from Python to Go.

## What's included

### Canvas Engine (Phase 0/1)
- State engine, scheduler, variable resolver, Redis checkpoint store,
cancel protocol
- **209 tests** across canvas / component / io packages

### 22 Components (P0–P4)
| Tier | Components |
|---|---|
| P0 T1+T2+T3 | LLM, Agent, ExitLoop, Switch, Categorize, Begin, Message, Invoke |
| P1 T3 | VariableAggregator, VariableAssigner, StringTransform, ListOperations, DataOperations |
| P2 T3 | Iteration, IterationItem, Loop, LoopItem |
| P3 T3 | UserFillUp, Fillup |
| P4 T5 | Browser, ExcelProcessor, DocsGenerator |

### DSL v2 Schema (Phase 2.5)
- Typed v2 in-memory model with v1-to-v2 auto-detect converter
- v1 legacy field stripping per plan §2.11.7

### HTTP Endpoints & Bug Fixes (Plans PR1–PR3)
- **DELETE SQL bug fix**: gorm v2 `Where("id = ?", id).Delete(...)`
pattern
- **CreateAgent validation**: title/DSL required, duplicate check, 103
envelope
- **13 new endpoints**: templates, prompts, tags, sessions CRUD,
chat/completions (SSE + non-stream stubs), rerun, test_db_connection,
logs, webhook/logs
- **756 Go unit tests** (745 → 756, +18)
- **17 → 0 Python integration test failures** (test_agents.py +
test_session_management/)

### Tools
21 eino tools: HTTPHelper, search tools, financial/data tools, mandatory
stubs

### Infrastructure
OTel observability, NATS message queue, DeepDoc gRPC client, SSRF
guards, IDOR mitigation
2026-06-12 22:58:28 +08:00

# Agent Canvas Go Port — Design Document
> **Status:** Core functionality for Phases 1 / 2.5 / 3 / 4 / 5 / 5.5 has landed; Phase 6 (canary rollout) and Phase 7 (cleanup) have not started
> **Last cross-checked against code:** 2026-06-11 (commit `aa270bed7`)
> **Source of truth:** `internal/agent/` (canvas, component, tool, runtime, workflowx, dsl) + `internal/observability/otel/`
> **Supersedes:** `.claude/plans/agent-go-port.md`, `.claude/plans/eino-workflow-loop.md`, `.claude/plans/eino-workflow-parallel.md`, `.claude/plans/fluffy-strolling-bear.md`, `.claude/plans/refactor-canvas-loop.md`
This document consolidates the five plan files in `.claude/plans/` into a single design-of-record. It describes the **current** state (present tense), verified against the code, with a final section that calls out where reality diverged from the original plans.
---
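## 1. Overview
### 1.1 Goal
Port RAGFlow's Agent Canvas (the DSL executor that orchestrates 22 components and 21 tools) from Python to Go. The Python side lives in `agent/canvas.py` (`Graph` / `Canvas`), `agent/component/base.py` (`ComponentBase` / `ComponentParamBase`), and `agent/tools/`. The Go side is an independent implementation under `internal/agent/` and stays compatible with the Python side through a shared DSL JSON schema (the bidirectional v1↔v2 converters live in `internal/agent/dsl/`).
### 1.2 Core architecture decisions
**State + Workflow hybrid**: eino's `compose.Workflow` provides a declarative topology (nodes + exec edges) plus concurrent scheduling; `compose.WithGenLocalState` together with `WithStatePreHandler/WithStatePostHandler` provides the "state variable" capability that lets any node read any other node's output. State solves arbitrary `{{cpn_id@param}}` cross-references; Workflow solves execution topology, cancel, and checkpoint.
**5-tier porting strategy**: T1 (reuse eino built-ins directly) → T2 (thin wrapper) → T3 (Lambda + State) → T4 (nested Workflow subgraph) → T5 (heavy I/O + third-party libs). Rule of thumb: when functionality is equivalent, prefer the eino built-in; copying the Python side's black magic (`_feeded_deprecated_params`, the partial hack, the `thread_pool_exec` async disguise, and so on) is forbidden.
**Checkpoints live in Redis**: eino's `compose.CheckPointStore` is a pure KV interface, so Redis String + EXPIRE is a natural fit. Business metadata (status / canvas_id / parent_run_id) goes into a separate Redis Hash that is **controlled explicitly by the application layer** and does not rely on eino writing it automatically.
**Observability via OpenTelemetry**: the §2.10 v1 "Redis Stream + MySQL dual write" design is dropped in favor of an OTLP HTTP exporter plus an eino `callbacks.Handler` that injects spans. Rationale: it is the de facto industry standard, it interoperates with Python langfuse (OTel-based), and it needs no new tables.
**Zero tolerance for AGPL-3**: every T5 DOCX library surveyed was either AGPL-3 or unmaintained, so the port ships a **self-implemented OOXML writer** (stdlib `archive/zip` + `text/template`); PDF uses `signintech/gopdf` (MIT), Excel uses `xuri/excelize/v2` (BSD-3), and Markdown uses `yuin/goldmark` (MIT).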
---
## 2. Module Layout
```
internal/agent/
├── canvas/ # canvas executor: eino compile, state scheduling, checkpoint, cancel, stream
│ ├── canvas.go # Canvas struct, BuildWorkflow, Run/Stream
│ ├── state.go # CanvasState, Outputs/Sys/Env/Path/History
│ ├── state_export.go # WithState / GetStateFromContext (thin re-export of the runtime package, used by tests)
│ ├── variable.go # resolves {{cpn_id@param}} / sys.x / env.x
│ ├── scheduler.go # State pre/post handlers + node lambdas
│ ├── node_body.go # body of a single node lambda: state in/out + component call
│ ├── loop_subgraph.go # Loop macro expansion: buildSubWorkflow + translateLoopCondition
│ ├── cycle_wrap.go # cycle detection + back-edge cutting
│ ├── cancel.go # Redis cancel protocol (watchCancel goroutine)
│ ├── stream.go # SSE channel
│ ├── compile.go # eino compile + WithCheckPointStore + WithSerializer
│ ├── checkpoint_store.go # RedisCheckPointStore (Get/Set/Delete)
│ ├── run_tracker.go # RunTracker (Start/MarkSucceeded/MarkFailed/MarkCancelled/AttachCheckpoint)
│ └── state_serializer.go # CanvasStateSerializer (encoding/json; the eino Serializer signature takes no ctx)
├── component/ # 19 components + 5 helpers
│ ├── base.go # Component interface + ParamError + ErrNotImplemented
│ ├── registry.go # name → factory mapping
│ ├── runtime_wire.go # bridge between components and the runtime package
│ ├── io_init.go # T5 component initialization
│ ├── v1_stubs.go # v1 DSL compat stubs
│ ├── agent.go # T1: react.NewAgent
│ ├── llm.go # T1: thin wrapper over EinoChatModel
│ ├── switch.go # T2: NewGraphMultiBranch
│ ├── begin.go / message.go / categorize.go / invoke.go / browser.go
│ ├── data_operations.go / list_operations.go / string_transform.go
│ ├── variable_aggregator.go / variable_assigner.go
│ ├── fillup.go / userfillup.go
│ ├── loop.go # T4: no-op marker; the real work is taken over by loop_subgraph
│ ├── parallel.go # T4: wraps workflowx.AddParallelNode
│ ├── docs_generator.go / excel_processor.go # T5
├── tool/ # 21 tools (all implement eino tool.InvokableTool)
│ ├── registry.go # BuildAll / BuildByName (supports aliases: execute_sql/exesql, retrieval/search_my_dateset)
│ ├── http_helper.go # shared HTTP client (context + retry)
│ ├── ssrf.go # SSRF protection
│ ├── akshare.go / arxiv.go / code_exec.go / crawler.go / deepl.go
│ ├── duckduckgo.go / email.go / exesql.go / github.go / google.go
│ ├── google_scholar.go / jin10.go / pubmed.go / qweather.go
│ ├── retrieval.go / searxng.go / tavily.go / tushare.go
│ ├── wencai.go / wikipedia.go / yahoo_finance.go
├── runtime/ # runtime contract shared by canvas + component (no cycle)
│ ├── component.go # Component interface (extracted from component/base.go)
│ ├── context.go # GetStateFromContext / withState
│ ├── state.go # CanvasState + NewCanvasState + GetVar/SetVar/ReadVars
│ ├── template.go # ResolveTemplate (extracted from canvas/variable.go)
│ ├── selector.go # component selector helpers
│ └── metrics.go # runtime metrics
├── workflowx/ # eino extensions (non-invasive, external helpers)
│ ├── loop.go # AddLoopNode[T]: generic do-while loop node
│ ├── parallel.go # AddParallelNode[I,O]: generic bounded-concurrency node
│ └── *_test.go # unit + integration tests (miniredis-style in-memory store)
└── dsl/ # DSL v2 schema + bidirectional v1↔v2 converters
├── v2.go # Go-native strongly typed schema (version=2, no _feeded_deprecated_params decoration)
├── loader.go # auto-detects v1/v2 and outputs a unified v2 in-memory model
├── converter_v1_to_v2.go
└── converter_v2_to_v1.go
internal/observability/otel/
├── provider.go # TracerProvider factory (reads OTEL_EXPORTER_OTLP_ENDPOINT; returns noop when unset)
├── handler.go # eino callbacks.Handler → OTel span
└── handler_test.go # tracetest.SpanRecorder unit tests
```
**Actual file counts** (deviations from the plan are covered in §14):
- Components: **19** (the plan said 22, later 21); see §14.1 for the deviation notes
- Tools: **21** (plan: 21 ✓)
- Test files: 35+ (including loop_semantics_test.go, dsl_examples_e2e_test.go, cycle_wrap_test, and others)
---
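## 3. Architecture
### 3.1 State + Workflow hybrid
eino's `compose.Workflow` only supports DAGs on its own (data moves between nodes through the outputs of declared predecessors); it has no ready-made API for "any node reads any node's output". The Python side of RAGFlow implements arbitrary `{{cpn_id@param}}` cross-references with `self._canvas.get_variable_value("cpn_id@param")`.
**Go-side approach**:
1. **State carries variables**: each canvas run creates a `*CanvasState` and attaches it to `context.Value`. All nodes read and write it through `runtime.GetStateFromContext(ctx)`.
2. **State pre-handler**: `g.AddLambdaNode(...)` attaches `compose.WithStatePreHandler[map[string]any, *runtime.CanvasState](canvasPre)`, which extracts the node's input from State.
3. **State post-handler**: `compose.WithStatePostHandler` is attached to write the node's output back to State.
4. **Workflow carries topology**: nodes get exec edges according to `downstream` / `upstream`; **data flows through State, not through edges**. eino's static topology analysis still sees the exec edges, so scheduling correctness is preserved.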
```go
// internal/agent/canvas/scheduler.go: how a node is attached
node := wf.AddLambdaNode(cpnID, nodeBody,
	compose.WithStatePreHandler[map[string]any, *runtime.CanvasState](canvasPre),
	compose.WithStatePostHandler[map[string]any, *runtime.CanvasState](canvasPost),
)
for _, upID := range comp.Upstream {
	node.AddInput(upID) // exec edge
}
```
**Key correction** (vs the §2.6 v1 plan): `WithStatePreHandler/WithStatePostHandler` are `GraphAddNodeOpt` (node options), **not** `GraphCompileOption` (compile options). Passing them to `g.Compile(...)` fails to compile. The actual eino signatures:
- `compose.NewGraph[I,O](opts ...NewGraphOption)`: factory options, including `WithGenLocalState`
- `g.AddNode(name, lambda, opts ...GraphAddNodeOpt)`: node options, including `WithStatePreHandler/WithStatePostHandler`
- `g.Compile(ctx, opts ...GraphCompileOption)`: compile options, including `WithCheckPointStore/WithSerializer/WithInterruptBeforeNodes/WithInterruptAfterNodes`
### 3.2 The `runtime` package: removing the `canvas <-> component` cycle
**Problem**: many files in `component/` (Begin/Message/Switch/Browser/...) need to call `canvas.CanvasState` / `canvas.GetStateFromContext` / `canvas.ResolveTemplate` / `canvas.SetDefaultFactory`; at the same time, `canvas` depends indirectly on the concrete `component` implementations through `ComponentFactory`. Forcing a direct `canvas -> component` dependency creates a Go import cycle.
**Solution** (from `fluffy-strolling-bear.md`, landed): extract the "shared runtime contract" into `internal/agent/runtime/`. **Both canvas and component depend on runtime, but not on each other**.
| Extracted to runtime | Stays in canvas | Stays in component |
|---------------|-------------|----------------|
| `Component` interface | DSL graph types (`Canvas`, `CanvasComponent`, `CanvasComponentObj`) | component registry + factory |
| `CanvasState` + `GetVar/SetVar/ReadVars` | Topology construction (`BuildWorkflow`, `buildLoopExpansion`, scheduler wiring) | Concrete component implementations |
| `GetStateFromContext` / `withState` / `WithState` | checkpoint / workflow compile orchestration | `NewBeginComponent`, `NewMessageComponent`, ... |
| `ResolveTemplate` + pure runtime template helpers | Loop macro expansion logic | |
| `ParamError`, `ErrNotImplemented` | | |
**Thin re-exports in `state_export.go`**: switching test code from `canvas.WithState` to `runtime.WithState` is a mechanical replacement. To reduce churn, `canvas/state_export.go` provides thin aliases (`type CanvasState = runtime.CanvasState` and similar), but **production code no longer imports `canvas` to obtain state**.
### 3.3 Scheduling model
```go
// internal/agent/canvas/canvas.go:BuildWorkflow
func BuildWorkflow(ctx context.Context, c *Canvas, store compose.CheckPointStore, ser compose.Serializer) (compose.Runnable[map[string]any, map[string]any], error) {
	wf := compose.NewWorkflow[map[string]any, map[string]any]()
	for cpnID, comp := range c.Components {
		// 1. Add the node (with state pre/post handlers).
		node := wf.AddLambdaNode(cpnID, nodeBody,
			compose.WithStatePreHandler[map[string]any, *runtime.CanvasState](canvasPre),
			compose.WithStatePostHandler[map[string]any, *runtime.CanvasState](canvasPost),
		)
		// 2. Add exec edges.
		for _, upID := range comp.Upstream {
			node.AddInput(upID)
		}
		// 3. Error routing.
		if comp.ExceptionTo != "" {
			node.AddInputWithOptions(
				buildExceptionDummy(comp),
				compose.WithNoDirectDependency(),
				compose.WithExceptionBranch(/* ... */),
			)
		}
	}
	// 4. Compile (compile-time options only); Compile returns a Runnable.
	return wf.Compile(ctx,
		compose.WithCheckPointStore(store),
		compose.WithSerializer(ser),
	)
}
```
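**`canvasPre` / `canvasPost`**: the State pre-handler extracts the node's input parameters from `CanvasState.Outputs[cpn]` (reusing the `{{cpn_id@param}}` regex parsing); the post-handler writes the node's outputs back to `CanvasState.Outputs[cpn_id]`. The eino topology contains only exec edges; data flow goes through State.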
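The reference resolution described above can be sketched in plain Go. This is a minimal illustration, not the code in `runtime/template.go`: the regular expression, the `resolveTemplate` name, and the `outputs` map shape are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// refPattern matches {{cpn_id@param}} references. The pattern used by the
// real resolver is not shown in this document; this one is an assumption.
var refPattern = regexp.MustCompile(`\{\{\s*([A-Za-z0-9_:\-]+)@([A-Za-z0-9_]+)\s*\}\}`)

// resolveTemplate replaces every {{cpn_id@param}} with the value stored in
// outputs[cpn_id][param]. Unknown references are left untouched.
func resolveTemplate(tmpl string, outputs map[string]map[string]any) string {
	return refPattern.ReplaceAllStringFunc(tmpl, func(m string) string {
		sub := refPattern.FindStringSubmatch(m)
		if params, ok := outputs[sub[1]]; ok {
			if v, ok := params[sub[2]]; ok {
				return fmt.Sprint(v)
			}
		}
		return m
	})
}

func main() {
	outputs := map[string]map[string]any{"llm_1": {"content": "hello"}}
	fmt.Println(resolveTemplate("answer: {{llm_1@content}}", outputs))
}
```

Leaving unknown references in place (rather than substituting an empty string) is a design choice of this sketch; it makes missing upstream outputs visible in the rendered text.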
---
## 4. Component Library
### 4.1 5-tier porting strategy (**landed**)
| Tier | Meaning | Acceptance |
|------|------|------|
| **T1** | Use existing eino types/interfaces directly, zero code | Covered by eino unit tests |
| **T2** | Thin wrapper (1 struct + factory) aligned with the Python behavior parameters | Cross eino/RAGFlow boundary test + 1 e2e |
| **T3** | `compose.Lambda` + `StatePre/PostHandler` | 1 unit test + 1 e2e |
| **T4** | Nested `compose.Workflow` + `getState[CanvasState](ctx)` | Subgraph unit tests + full e2e |
| **T5** | Heavy I/O + third-party libs | Unit tests + e2e + failure injection |
**Rule**: the preference order is T1 > T2 > T3 > T4 > T5, and **skipping a tier is forbidden** unless eino **genuinely has no matching** abstraction.
### 4.2 Component status
**19 .go files** (actual; the plan said 22, later 21):
| Component | Python behavior | Tier | Go implementation |
|-----------|------------|------|---------|
| **LLM** | `LLMBundle` single-turn chat + JSON output + cite + stream | T1 | `EinoChatModel` thin wrapper over `internal/entity/models/<provider>.go`; implements `model.ToolCallingChatModel` (including a concurrency-safe `WithTools`) |
| **Agent** | ReAct + tool/MCP + multi-turn stream | T1 | `react.NewAgent` + `compose.ToolsNodeConfig{Tools: tools}` + all 21 tools registered; citation middleware and tool artifact collection are future increments (**not implemented yet**, see §14) |
| **Switch** | Multiple conditions (and/or) → multiple downstream + ELSE | T2 | `compose.NewGraphMultiBranch` routing |
| **Categorize** | LLM classification + routing | T3 | Lambda calling the LLM + `compose.NewGraphMultiBranch` |
| **Begin** | DSL entry point + injected inputs + file inputs | T3 | Lambda + `StatePreHandler`; files go through `internal/service/file_service.go` |
| **UserFillUp / Fillup** | Jinja2 + file inputs | T3 | `text/template` replaces Jinja2 |
| **Message** | Final output (jinja2 + stream + downloads + filegen) | T3 | Lambda + `schema.StreamReader` + `text/template` + MinIO |
| **Invoke** | HTTP client + HTML sanitizing + JSON | T3 | `net/http` + `golang.org/x/net/html` |
| **Browser** | LLM + HTTP + file download + MinIO | T3 | Reuses Invoke + LLM + storage |
| **DataOperations** | 7 kinds of dict operations | T3 | Lambda + `encoding/json` + `go/ast` |
| **ListOperations** | 6 kinds of slice operations | T3 | Lambda + `slices` (Go 1.21+ stdlib) |
| **StringTransform** | split/merge + Jinja2 | T3 | Lambda + `strings.Split` + `text/template` |
| **VariableAggregator** | Multiple groups, first-non-empty | T3 | Lambda + State reads |
| **VariableAssigner** | 12 operators that mutate State in place | T3 | Lambda + State writes |
| **Loop** | Conditional loop + `loop_variables` initialization + termination evaluation | T4 | **`compose.NewWorkflow` + `workflowx.AddLoopNode`** (loop.go itself becomes a no-op marker; the real work is taken over by the macro expansion in `canvas/loop_subgraph.go`) |
| **Parallel** | Parallel processing of arrays | T4 | Wraps `workflowx.AddParallelNode` (see §6) |
| **DocsGenerator** | pdf/docx/txt/md/html generation | T5 | `signintech/gopdf` (PDF) + self-implemented OOXML writer (DOCX) + `yuin/goldmark` (MD) |
| **ExcelProcessor** | pandas read/merge/convert for Excel | T5 | `xuri/excelize/v2` (BSD-3) |
### 4.3 Python-side "legacy" that is not ported
| Python side | Why it is not ported |
|----------|-----------|
| The three-layer `_feeded_deprecated_params` / `_deprecated_params` / `_user_feeded_params` decoration | Removed in DSL v2; the Go `ComponentParamBase` does not introduce it |
| `ComponentParamBase.validate()` + the 96 `param_validation/*.json` files | Replaced by Go struct tags + `go-playground/validator/v10` |
| `ComponentBase.thread_limiter = asyncio.Semaphore(...)` | Go `errgroup.SetLimit(MAX_CONCURRENT_CHATS)` (`golang.org/x/sync/errgroup`, not part of the standard library) |
| `partial` streaming hack | eino `schema.StreamReader` streams natively |
| `thread_pool_exec(self._invoke, **kwargs)` async disguise | Go uses goroutines throughout |
| The dual track of `set_output("_ERROR", ...)` + `set_exception_default_value()` | A single Go `error` return + the eino `OnError` callback |
| `ExitLoop` no-op node | DSL v1 compat is absorbed at the canvas layer via `legacyNoOpNames`; **no component is registered** |
| `LoopItem` component | The LoopItem role is replaced by the internal machinery of `workflowx.AddLoopNode`; **no component is registered** |
| `Iteration` / `IterationItem` components | The IterationItem role is merged into the single-node `Loop` mode (**Iteration + IterationItem also go through the same `workflowx.AddLoopNode` path**, but the loop terminates on "traversal finished" rather than "condition met") |
### 4.4 Unified tool implementation pattern
```go
// internal/agent/tool/registry.go
type Tool interface {
	einotool.InvokableTool // eino protocol: Info() / InvokableRun(ctx, args, opts)
}
func BuildAll(names []string, params map[string]map[string]any) ([]einotool.BaseTool, error)
func BuildByName(name string, params map[string]any) (einotool.BaseTool, error)
```
**Alias consistency** (covered by `TestToolRegistry_SchemasAreComplete`):
- both `execute_sql` and `exesql` surface the canonical `Info().Name == "execute_sql"`
- both `retrieval` and `search_my_dateset` surface the canonical `Info().Name == "search_my_dateset"`
**The 21-tool table** (matches the plan; aliases do not count as new tools):
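- akshare, arxiv, code_exec, crawler, deepl, duckduckgo, email, exesql(=execute_sql), github, google, google_scholar, jin10, pubmed, qweather, retrieval(=search_my_dateset), searxng, tavily, tushare, wencai, wikipedia, yahoo_finance = **21 unique** tools
**Common tool patterns**: HTTP tools go through `http_helper.go` (context + retry + simple exponential backoff). ExeSQL uses stdlib `database/sql` plus the individual drivers and does **not reuse** the `internal/dao` GORM layer: the DAO is RAGFlow's metadata database layer and is entirely independent of the external databases that ExeSQL users target. CodeExec calls the existing Python sandbox over gRPC (status quo kept; **the sandbox is not rewritten**). Retrieval imports `internal/service/nlp/retrieval.go` in-process (the Dealer backend is already ported to Go); `use_kg=True` is not supported for now.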
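The alias rule above (several accepted names, one canonical `Info().Name`) can be illustrated with a small lookup table. This is a hedged sketch only: `canonical` and `canonicalName` are hypothetical names, and the real registry builds eino tools rather than returning strings.

```go
package main

import "fmt"

// canonical maps every accepted tool name (aliases included) to the name
// that Info().Name should report. Only the two alias pairs named in the text
// are listed here; the real registry covers all 21 tools.
var canonical = map[string]string{
	"execute_sql":       "execute_sql",
	"exesql":            "execute_sql",
	"retrieval":         "search_my_dateset",
	"search_my_dateset": "search_my_dateset",
}

// canonicalName resolves a DSL tool name to its canonical name.
func canonicalName(name string) (string, error) {
	c, ok := canonical[name]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", name)
	}
	return c, nil
}

func main() {
	for _, n := range []string{"exesql", "retrieval"} {
		c, _ := canonicalName(n)
		fmt.Println(n, "->", c)
	}
}
```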
---
## 5. DSL v2
### 5.1 v2 schema (strongly typed, decorations removed)
```go
// internal/agent/dsl/v2.go (actual)
type Canvas struct {
	Version    int                  `json:"version"` // always 2
	Components map[string]Component `json:"components"`
}
type Component struct {
	ID         string         `json:"id"`
	Name       string         `json:"name"` // e.g. "Retrieval"
	Downstream []string       `json:"downstream"`
	Params     map[string]any `json:"params"`
	Outputs    map[string]any `json:"outputs,omitempty"` // filled at runtime; absent when the DSL is loaded
}
```
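**Decorations removed**: the nested v1 `obj`, the three `_feeded_deprecated_params` / `_deprecated_params` / `_user_feeded_params` sets, and `custom_header`.
**Compared with the original v2 design in plan §4.6**: the plan also included `Path` / `History` / `Retrieval` / `Globals` / `Metadata` (with author/tags/created_at) fields; **all of these were cut during implementation**. State information (`Path` / `History` / `Retrieval` / `Globals`) moved into the **runtime `CanvasState`** (`internal/agent/runtime/state.go:54-66`): the DSL describes topology only, and the runtime fills state in through the State pre/post handlers. This is the cleaner design, because it avoids the deserialization traps that come with a DSL schema carrying runtime state.
**Decision on the `Metadata` field** (**Q4, closed 2026-06-11**): the v2 schema carries no canvas-level metadata (author/tags/created_at). Metadata uses fields the RAGFlow backend already has: `user_canvas.title` / `user_canvas.description` (`internal/entity/canvas.go:25, 28`). The business tables already store this information, so it need not be repeated in the DSL JSON. **If tags, author, or similar metadata are needed later**, the recommendation is to add `user_canvas.tags` / `user_canvas.author_id` columns rather than change the DSL schema. See §14.8 Q4 for details.
**Kept**: the `{{cpn_id@param}}` / `sys.x` / `env.x` syntax (resolved at runtime through `runtime.GetVar`); the `sys` / `env` namespaces are held in `CanvasState.Sys/Env` (not in the DSL).
### 5.2 Bidirectional v1 ↔ v2 converters
**v1 → v2** (`internal/agent/dsl/converter_v1_to_v2.go`): mandatory in Phase 2.5. It acts as the input adapter for Phase 2 components, so individual components do not each have to handle the v1 decoration fields.
**v2 → v1** (`internal/agent/dsl/converter_v2_to_v1.go`, Phase 5.5, ~270 lines):
Behavioral contract:
- Validate the input canvas (nil / empty / invalid → error)
- Iterate components in a **deterministic order**: keys with the `begin_…` prefix come first, the rest follow in lexicographic order. A custom `MarshalJSON` on `v1Envelope` enforces this (Go's default map encoder sorts by key text, which would scramble the order)
- **Key restoration**: v2 id `<name>_<UUID>` → v1 key `<Name>:<UUID>`
- Split at the first `_` from the left (`switch_abc_def` → `Switch:abc_def`)
- Capitalize the first letter of the name half (best-effort PascalCase)
- **Empty uuid half** (trailing `_`, produced from the colon-less legacy v1 `begin` key): **no colon is added** (`Begin`, not `Begin:`), so that `v1ToV2` can re-parse it through its colon-less branch. This is the only place that departs from the §5 spec example, and round-trip closure requires it
- **Case is lossy**: the UUID half is lowercased upstream in `v1ToV2`, and all-caps names come back with only the first letter capitalized (`LLM:abc` → `llm_abc` → `Llm:abc`). The structural invariant `v1ToV2(v2ToV1(v1ToV2(x))) == v1ToV2(x)` still holds
- Build the v1 entry shape: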
```json
{
"downstream": ["<v1 keys>"],
"obj": {
"component_name": "<name>",
"params": {…},
"downstream": ["<v1 keys>"]
}
}
```
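- An empty `downstream` is emitted as `[]` (not `null`), and empty `params` as `{}` (not `null`)
- **Never emit** the three legacy fields (`_deprecated_params` / `_feeded_deprecated_params` / `_user_feeded_params`): v2 does not carry them, and emitting them again would reintroduce bugs that were already removed
- Format the output with `json.Indent` using 2-space indentation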
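The key-restoration rules in the contract above are small enough to sketch directly. The function below is an illustration under the stated rules, not the converter's actual code; `restoreV1Key` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// restoreV1Key turns a v2 id "<name>_<uuid>" back into a v1 key "<Name>:<uuid>".
func restoreV1Key(id string) string {
	// Split at the first underscore from the left.
	name, uuid, _ := strings.Cut(id, "_")
	if name != "" {
		// Best-effort PascalCase: capitalize only the first letter.
		r := []rune(name)
		r[0] = unicode.ToUpper(r[0])
		name = string(r)
	}
	if uuid == "" {
		// Empty uuid half: emit no colon so the key can be re-parsed.
		return name
	}
	return name + ":" + uuid
}

func main() {
	for _, id := range []string{"begin_abc", "begin_", "switch_abc_def", "llm_abc"} {
		fmt.Printf("%s -> %s\n", id, restoreV1Key(id))
	}
}
```

The four inputs mirror the cases listed for `TestV2ToV1_KeyRestore` and the lossy-case example (`llm_abc` comes back as `Llm:abc`, not `LLM:abc`).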
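**v2→v1 test coverage** (12 tests, all passing):
| Test | What it covers |
|------|--------|
| `TestV2ToV1_WebSearchAssistant` | Full v1→v2→v1→v2 round-trip of a 30 KB real template |
| `TestV2ToV1_CustomerFeedback` | Same, for customer_feedback_dispatcher.json |
| `TestV2ToV1_IngestionPipeline` | Same, for ingestion_pipeline_general.json |
| `TestV2ToV1_EmptyDownstream` | Single component → `"downstream": []` (not null) |
| `TestV2ToV1_NilParams` | Two components → two `"params": {}` (not null) |
| `TestV2ToV1_NoLegacyFields` | Fully populated input; the output contains zero legacy substrings |
| `TestV2ToV1_DeterministicOrder` | Two calls (with a map mutation in between) → byte-identical output |
| `TestV2ToV1_KeyRestore` | `begin_abc`→`Begin:abc`, `begin_`→`Begin` (no colon), `switch_abc_def`→`Switch:abc_def` |
| `TestV2ToV1_NilCanvas` | nil → error, no panic |
| `TestV2ToV1_EmptyComponents` | Empty map → error |
| `TestV2ToV1_BeginFirst` | Begin is the first key in the output JSON (ahead of Alpha/Zeta) |
| `TestV2ToV1_ParamOrderStable` | Round-trip of nested map/slice/scalar params |
| `TestV2ToV1_AcceptanceFixture_Smoke` | e2e: v1ToV2 → v2ToV1 → LoadV1 → v1ToV2 without errors |
Total tests in the DSL package: 42 (30 + 12).
**Known limitations** (documented in code comments; not bugs):
| Limitation | Cause | Impact | Mitigation |
|------|------|------|------|
| v1 key case is lossy (`LLM:abc` → `Llm:abc`) | The forward `v1ToV2` path lowercases both halves | Cosmetic: v1 key strings are not preserved byte for byte | Compare via v2 (the canonical form) |
| v1 output omits `upstream` | Not specified in Plan §5; the Python reader computes it from `downstream` | None if the Python reader tolerates its absence | Add it if the §2.2 run-book shows it is needed |
| `Begin` key is emitted without a colon (`Begin`, not `Begin:`) | Required for the `v1ToV2` round-trip; the spec example `Begin:` cannot be re-parsed | None; `Begin` and `Begin:abc` are both valid v1 | If the spec is updated, mark the example as illustrative only |
| Non-deterministic map iteration is worked around with a custom `MarshalJSON` | Go `map[string]X` is unordered | None: the custom serializer guarantees the order | The custom serializer can be removed only once Go supports ordered maps |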
### 5.3 Round-trip closure invariant
The following invariant holds for the three real templates:
```
v1 (template) ──v1ToV2──> v2_a ──v2ToV1──> v1' ──v1ToV2──> v2_b
                                                             └─ same component ID set,
                                                                same downstream refs,
                                                                same params (canonical JSON)
                                                                as v2_a
```
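This is the strongest deterministic invariant that can be verified in a pure Go environment. A Python reader fed `v1'` will compute the same `v2_b` (the closure property above guarantees it) and therefore derive the same execution graph.
**Acceptance** (Phase 5.5): round-trip 100 v1 samples (fields unchanged across v1→v2→v1→v2), and verify end to end that DSL written by v2 loads in the old Python reader. **Data source constraints**: the preferred source is the fixed staging replay set maintained by InfiniFlow SRE (≥200 entries covering P0-P4); falling back to sampling the production DB requires DPO + DBA sign-off, a quarterly cap of 100 entries, and ledger registration; **production DSL that is not anonymized or not registered must not enter the test chain**.
**Running locally**: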
```bash
cd internal/agent/dsl
go test -count=1 -run TestV2ToV1 -v # 12 tests, ~1s
go test -count=1 . # all 42 dsl tests
go vet ./...
gofmt -l . # expect no files listed
```
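### 5.4 Staging acceptance gates (prerequisite for Phase 6)
Neither of the following two items **can be run in the dev environment**; both must be driven by the SRE team in staging. Phase 6 (canary rollout) **must not start until both pass**.
**Gate 1: replay of the 100-sample staging corpus**
Blocker: `staging_canvas_snapshot_2026q2.json` (100 v1 DSL entries) is maintained by InfiniFlow SRE and is not available in dev. Current substitute: 10 real `agent/templates/*.json` templates (shared with Phase 2.5).
Staging run-book:
1. Pull the corpus from the SRE staging object store (path TBD; contact `@ragflow-sre`)
2. Place it in a local directory
3. Run `go test -count=1 -run TestV2ToV1_StagingCorpus -tags=staging` (the `staging` build tag keeps CI from running it by default)
4. Expected: 100/100 entries round-trip to structurally equivalent output
5. On failure: record the entry ID plus the first 200 characters of the input, and file a `phase-5.5-corpus-fail` issue
**Gate 2: Python reader compatibility test**
Blocker: the dev environment has no Python canvas runtime. The test must verify that v1 DSL emitted by Go can be loaded by the old Python reader.
Staging run-book:
1. Build a tiny Go binary (or a `go test` entry point) that reads a v1 template → `v1ToV2` → `v2ToV1` → writes v1 JSON to stdout
2. Pipe it into the Python reader: `go run ./cmd/v2-to-v1 < web_search_assistant.json | python -m agent.canvas.load_dsl -`
3. Expected: the nodes and edges of the `Graph` returned by the Python reader match the input (the cosmetic loss from v1 key case restoration is allowed)
4. If the Python reader errors: record the traceback and file a `phase-5.5-python-fail` issue. Fields most likely to cause trouble, in order of suspicion: `upstream` (omitted; Python should compute it from `downstream`), the `obj.params` shape (kept as-is), and whether the `Begin` key has a colon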
---
## 6. workflowx Extensions
`internal/agent/workflowx/` provides **non-invasive eino extensions**: it does not modify eino source and adds no methods to `compose.Workflow`; it only provides external helpers.
### 6.1 AddLoopNode[T]: generic loop node
**API**:
func AddLoopNode[T any](
ctx context.Context,
wf *compose.Workflow[T, T],
key string,
sub *compose.Workflow[T, T],
shouldQuit LoopCondition[T],
opts ...LoopOption,
) (*compose.WorkflowNode, error)
```
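**Execution model** (do-while semantics):
1. Receive `current`
2. Run the sub-workflow once to get `next`
3. Call `shouldQuit(ctx, iteration, current, next)`; `iteration` starts at 1
4. If the quit condition is satisfied → return `next`; otherwise set `current = next` and continue
5. The body always runs at least once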
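**Implementation notes**:
- `compose.AnyLambda[T, T, struct{}](...)` wraps both the invoke and stream paths
- `WithLoopMaxIterations(n)` is strongly recommended (guards against accidental infinite loops)
- `WithLoopStream(mode)`: `LoopStreamFinalOnly` (default) / `LoopStreamEveryIteration`
- Error handling: `ErrLoopMaxIterationsExceeded` / `ErrLoopSubGraphInterrupted` / `ErrLoopResumeStateInvalid` / `ErrLoopQuitConditionFailed`
- The nested sub-workflow runs as a `compose.Runnable[T,T]`, and sub-checkpoints go through a loop-owned bridge store; **the caller is not required to configure a separate child store**
**Checkpoint/Resume contract** (P0 acceptance):
- Invoke path: a nested interrupt propagates upward through `compose.CompositeInterrupt`; resume continues from the interrupted iteration (it does not restart from the beginning)
- Stream path: follows an **iteration-granular** resume contract. Iterations already fully emitted downstream are not replayed; the interrupted iteration may be replayed in full (**chunk-granular resume is not promised**, because the public eino API does not support it)
- Stable child checkpoint IDs come from `WithLoopCheckpointIDBuilder(nodeKey, iteration)`; the default namespace is `workflowx-loop:<nodeKey>:<iteration>`
**How Loop is used in the canvas** (`refactor-canvas-loop.md`, landed):
- On the Go side `Loop` is a **single node**: it has a registry entry and a factory, but `LoopComponent.Invoke` is a no-op (the real work is taken over by the macro expansion in `canvas/loop_subgraph.go`)
- When `BuildWorkflow` sees a cpn named `Loop`, it calls `expandLoopSubgraph` to collect the downstream nodes, builds a sub-`compose.Workflow[map[string]any, map[string]any]`, calls `workflowx.AddLoopNode` to insert the result into the outer graph as a single node, and removes the Loop and its descendants from the outer graph's node map
- `LoopItem` / `ExitLoop` **have been removed** (v1 compat is absorbed at the canvas layer via `legacyNoOpNames`)
### 6.2 AddParallelNode[I, O]: generic concurrency node
**API**: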
```go
func AddParallelNode[I, O any](
ctx context.Context,
wf *compose.Workflow[[]I, []O],
key string,
sub Compilable[I, O],
opts ...ParallelOption,
) (*compose.WorkflowNode, error)
```
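**Implementation notes**:
- The outer node is invoke-only; the inner sub workflow may be stream-capable (eino's runnable compatibility rules take over stream forwarding)
- `WithParallelMaxConcurrency(n int)`: 0 / 1 = sequential execution (runs on the main goroutine and does **not** start worker goroutines); > 1 = semaphore-bounded concurrency (the first item runs on the main goroutine, later items on worker goroutines)
- **Order-preservation invariant**: `outputs[i]` always corresponds to `inputs[i]`. On the concurrent path each goroutine captures `idx` in its closure and writes to the preallocated `outputs[idx]`, independent of completion order
- Error handling: `ErrParallelCompileFailed` / `ErrParallelResumeStateInvalid`; per-item errors are wrapped with `fmt.Errorf("item %d: %w", idx, err)`
- Nested interrupts are accumulated into `compose.CompositeInterrupt(ctx, nil, state, interruptErrs...)`
- Resume invariant: `CompletedResults ∪ InterruptedIndices = 0..TotalCount-1` (the partition is complete), and `InterruptedIndices` is the complement (not only the indices that explicitly returned an interrupt; in concurrent runs, items that did not durably complete count as well)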
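**Reference model**: this extension is modeled on the batch node in `cloudwego/eino-examples/compose/batch/batch/node.go`. The difference is that the reference is a registered Component, whereas this extension is a free helper that does not depend on the component registry (so non-DSL callers can use it too).
**How Parallel is used in the canvas** (`component/parallel.go`):
- The `Parallel` component is a thin T4 wrapper: registration passes `agenttool.BuildByName("parallel", params)` (note: in practice this is the `ParallelComponent` in `internal/agent/component/parallel.go`, which does not go through the tool registry), and internally it uses `workflowx.AddParallelNode` to insert the sub-workflow into the outer graph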
---
## 7. Checkpoint + Run Tracker / Persistence
### 7.1 Two-key design
**Key 1: `agent:cp:{check_point_id}`**, storage for the eino payload
- Type: String (stores the raw `[]byte` directly, **without JSON**; the eino Serializer already handles serialization)
- TTL: 30 days (set once at Set time with `EXPIRE 30*24*3600`)
- eino's `CheckPointStore` is a **pure KV interface** (`internal/core/interrupt.go:27`): `Get(ctx, id) ([]byte, bool, error)` / `Set(ctx, id, []byte) error`
- eino does **not** automatically write business fields such as status / canvas_id / tenant_id / run_id / parent_id / expires_at
**Key 2: `agent:run:{run_id}`**, storage for business metadata (Redis Hash)
| Field | Type | Meaning |
|------|------|------|
| `canvas_id` | string | `user_canvas.id` |
| `tenant_id` | string | |
| `checkpoint_id` | string | Latest checkpoint of the current run (points at key 1) |
| `parent_run_id` | string | Source run of resume_from (the resume chain); may be empty |
| `status` | int (0/1/2/3) | 0=running 1=succeeded 2=failed 3=cancelled |
| `failure_reason` | string | Failure reason (err.Error()) |
| `cancel_requested` | int (0/1) | 1 = a user/admin has requested cancel |
| `started_at` | int (epoch ms) | |
| `finished_at` | int (epoch ms) | Filled in on exit |
- TTL: 30 days (kept in sync with key 1; `EXPIRE 30*24*3600` at Set time)
- `RunTracker.Start/MarkSucceeded/MarkFailed/MarkCancelled/AttachCheckpoint` are called explicitly
- **Does not rely on eino writing automatically**: the `status=failed` that follows a cancel or failure is written by the application layer itself
### 7.2 Four eino payload write triggers (writes to `agent:cp:*`)
| # | Trigger | eino source | Purpose |
|---|--------|-----------|------|
| **W1** | A node explicitly calls `compose.Interrupt(ctx, info)` / `StatefulInterrupt(ctx, info, state)` | `compose/interrupt.go:110, 130` | human-in-the-loop, external API callbacks, rate-limit pauses |
| **W2** | Compile-time interception points `compose.WithInterruptBeforeNodes([]string)` / `WithInterruptAfterNodes([]string)` | `compose/interrupt.go:31, 37` | On hit, **persist + terminate the run** (shares the `handleInterrupt` path with W1); **0 enabled by default** |
| **W3** | Sub-graph interrupt propagating upward | `subGraphInterruptError` (`compose/interrupt.go:340`) | When a nested subgraph / ToolsNode / agentic node raises an interrupt, the parent graph persists in step |
| **W4** | Run exit | `WithCheckPointID` + `WithWriteToCheckPointID` | Final persist when the run exits; **every W4 must also call `RunTracker.AttachCheckpoint(runID, cpID)`** |
### 7.3 Four business metadata writes + one resume trigger
| # | Trigger | Write function |
|---|--------|---------|
| **B1** | Canvas run starts | `RunTracker.Start(runID, canvasID, tenantID, parentRunID)` |
| **B2** | Run completes normally | `RunTracker.MarkSucceeded(runID)` |
| **B3** | Run fails | `RunTracker.MarkFailed(runID, err.Error())` |
| **B4** | Run is cancelled | `RunTracker.MarkCancelled(runID)` |
| **R1** | HTTP `POST /run?resume_from=run_xxx` | handler: `HGetAll("agent:run:run_xxx")` → `checkpoint_id` → `WithCheckPointID(cpID)` + `WithWriteToCheckPointID(newCP)` + `RunTracker.Start(newRunID, canvas, tenant, "run_xxx")` |
### 7.4 Serializer signature correction
The actual signature of eino's `compose.Serializer` (`compose/checkpoint.go:53-56`) **takes no `context.Context`**:
```go
type Serializer interface {
Marshal(v any) ([]byte, error)
Unmarshal(data []byte, v any) error
}
```
**CanvasStateSerializer** (`internal/agent/canvas/state_serializer.go`):
```go
type CanvasStateSerializer struct{}
func (CanvasStateSerializer) Marshal(v any) ([]byte, error) { return json.Marshal(v) }
func (CanvasStateSerializer) Unmarshal(b []byte, v any) error { return json.Unmarshal(b, v) }
```
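### 7.5 Cancel protocol (two-stage)
**Why two stages**: the `interrupt` returned by eino's `compose.WithGraphInterrupt` is a **Go function reference** and can only be called **within the same process**. When the admin/UI sends a cancel signal from a different HTTP handler, it has to travel over a cross-process channel, which is exactly what the Python side's Redis `{task_id}-cancel` protocol solves. The two mechanisms cooperate; neither replaces the other.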
```go
// internal/agent/canvas/cancel.go
func Run(ctx context.Context, taskID string, compiled compose.Runnable[...]) error {
	einoCtx, interrupt := compose.WithGraphInterrupt(ctx)
	// stopCh tells the watcher goroutine to exit once the run returns.
	stopCh := make(chan struct{})
	defer close(stopCh)
	go watchCancel(taskID, stopCh, func() {
		interrupt(compose.WithGraphInterruptTimeout(30 * time.Second))
	})
	// input is the canvas run input; its construction is elided here.
	_, err := compiled.Invoke(einoCtx, input,
		compose.WithCheckPointID(genID(taskID)),
		compose.WithWriteToCheckPointID(genID(taskID)),
	)
	return err
}

func watchCancel(taskID string, stopCh <-chan struct{}, onCancel func()) {
	ticker := time.NewTicker(500 * time.Millisecond) // poll every 500ms
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			v, _ := redis.Get(context.Background(), fmt.Sprintf("%s-cancel", taskID))
			if v != "" {
				onCancel()
				return
			}
		}
	}
}
```
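**Python compatibility**: the `{task_id}-cancel` Redis key naming is **identical** to the protocol in the Python side's task_service.py, so both in-process and cross-process cancels are recognized.
**Polling vs Pub/Sub**: the default is 500ms polling (p99 ≤ 500ms); Pub/Sub would be < 10ms but is incompatible with the Python protocol. Phase 2 may move to a dual channel depending on user feedback (polling for compatibility + Pub/Sub for speed), controlled by the `feature/cancel-pubsub` flag.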
---
## 8. OpenTelemetry Observability
### 8.1 Overall design
```
Canvas run goroutine (Go)
eino Graph Engine
↓ (OnStart / OnEnd / OnError auto-injected)
callbacks.Handler (application implementation)
├─ OTelHandler (added by this plan)
│ └─ start span → set attributes → end span
│ └─ otlphttpexporter → OTel Collector (external)
│ ├─ Jaeger / Tempo (trace UI)
│ ├─ Langfuse (LLM-specific)
│ └─ Prometheus / Grafana
└─ SSEHandler (business event stream) → admin UI
```
### 8.2 Separate channels
| Channel | Purpose | Protocol | Consumer |
|------|------|------|--------|
| **SSE** | Business events ("node start/end/message") | `text/event-stream` HTTP | admin UI |
| **OTel span** | System observability (node latency/errors/tokens) | OTLP HTTP | Ops/APM |
| **OTel logs** (Phase 8+) | Structured logs | OTLP | Ops/troubleshooting |
### 8.3 eino callback → OTel mapping
| eino hook | OTel action | Span attribute |
|-----------|-----------|----------------|
| `OnStart(ctx, info, input)` | `tracer.Start(ctx, info.Name)` → stored in `ctx` | `eino.component.name`, `eino.component.type`, `eino.input.size` |
| `OnEnd(ctx, info, output)` | `span.End()` | `eino.output.size` |
| `OnError(ctx, info, err)` | `span.RecordError(err)` + `span.SetStatus(codes.Error, ...)` | `eino.error.message` |
| `OnStartWithStreamInput` | Same as OnStart; span event `eino.stream.input.start` | `eino.stream.input.size` |
| `OnEndWithStreamOutput` | `span.End()`; span event `eino.stream.output.end` | `eino.stream.output.size` |
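**Duration**: `OnStart` stores `startTime := time.Now()` in `ctx` (following the pattern in eino `callbacks/doc.go:99-102`), and `OnEnd` computes `time.Since(startTime)` before ending the span. The OTel SDK itself derives span duration from the span's start and end timestamps.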
**Node name source**: `RunInfo.Name` comes from `compose.WithNodeName(name)`. When the canvas DSL is loaded, every cpn's node name is set to its `cpn_id`, so the span name = `cpn_id`.
### 8.4 Startup configuration
```bash
# Required (if unset → no-op handler; business logic is unaffected)
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
export OTEL_SERVICE_NAME="ragflow-agent"
export OTEL_RESOURCE_ATTRIBUTES="service.namespace=ragflow,deployment.environment=production"
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.1" # 10% sampling
```
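**Degradation**: if `OTEL_EXPORTER_OTLP_ENDPOINT` is not configured, the handler degrades to a noop (`otel.SetTracerProvider(noop.NewTracerProvider())`), **raises no error**, and does not affect business logic. If the OTel collector is unreachable, the batch processor retries internally and then drops (`OTEL_BSP_EXPORT_TIMEOUT` defaults to 30s); the handler never blocks a run.
### 8.5 Cross-language tracing
- Go → deepdoc Python HTTP calls: wrap the HTTP client with `otelhttp.NewTransport(...)` so that the W3C `traceparent` header is propagated
- Python RAGFlow OTel (implemented indirectly through the langfuse SDK): interoperates with the Go side's OTLP (same OTel collector, same `service.namespace=ragflow`)
- Correlation rule: every canvas run generates `trace_id = run_id`; requests sent to deepdoc / Python carry the `traceparent` header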
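### 8.6 Comparison with the §2.10 v1 approach
| Dimension | v1 (abandoned) | v2 (adopted) |
|------|-----------|-----------|
| Storage | Self-managed MySQL `agent_run_log` table | External OTel collector, no new table |
| Real-time push | Redis Stream XREAD consumer | OTel OTLP HTTP → collector |
| Cross-language | ❌ Standalone MySQL table | ✅ OTLP industry standard |
| Langfuse integration | ❌ Separate silos | ✅ Same OTel pipeline |
| Startup cost | Requires table + indexes + archival policy | Environment variables only |
| Alignment with Python side | Diverges | Aligned (langfuse OTel) |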
### 8.7 Python↔Go OTel interoperability verification
**Goal**: Go canvas (eino + OTLP/HTTP) and Python canvas (langfuse SDK, OTel-bridged) appear under the same `service.namespace=ragflow` label, so Jaeger / Langfuse can trace across languages.
**Pass criteria** (6 items, all required):
1. The collector receives traces from both Python and Go within 5 minutes
2. Spans from both sides carry the `service.namespace=ragflow` resource attribute
3. A single Jaeger `service.namespace=ragflow` filter returns traces from both sides
4. Langfuse shows two independent traces under the same project
5. Go spans follow OTel semantic conventions (`eino.component.name`, `eino.component.type`)
6. Python spans carry the `langfuse.*` namespace
**Key env vars**:
| Var | Purpose | Example value |
|-----|------|--------|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector address | `http://otel-collector:4318` |
| `OTEL_SERVICE_NAME` | Go service name | `ragflow-agent` |
| `OTEL_RESOURCE_ATTRIBUTES` | Must include `service.namespace=ragflow` | `service.namespace=ragflow,deployment.environment=prod` |
| `OTEL_TRACES_SAMPLER` | Sampling strategy | `parentbased_traceidratio` |
**Collector fallback**: the `resource/propagate` processor automatically inserts `ragflow` for spans that lack `service.namespace`, so the Jaeger filter can always group them.
**Common failures**:
| Symptom | Cause | Fix |
|------|------|------|
| Collector receives 0 spans | Firewall / wrong port | Probe with `curl -X POST http://localhost:4318/v1/traces` |
| `service.namespace` is empty | env var not passed to the child process | Set it in the parent shell and re-export |
| Go spans missing | `OTEL_EXPORTER_OTLP_ENDPOINT` not set | Set the variable; the Go handler is a no-op while it is unset |
| Python spans not in Jaeger | langfuse SDK only sends to its own backend | Set `OTEL_EXPORTER_OTLP_ENDPOINT` (langfuse ≥ 2.x honors the OTLP env vars) |
---
## 9. Multi-version Agents
**The Go side supports multiple coexisting versions** (**never overwritten**), unlike Python v1, which overwrites `user_canvas.dsl` on every publish.
**Current schema** (MySQL):
- `user_canvas.id`: 32-character UUID
- `user_canvas.dsl`: the current "draft" or "latest published" DSL
- `user_canvas.release`: bool
- `user_canvas_version.id`: 32-character UUID, **one per version, never updated**
- `user_canvas_version.user_canvas_id`: foreign key
- `user_canvas_version.dsl`: full DSL snapshot
- Index: `user_canvas_version(user_canvas_id)`
| Scenario | Behavior |
|------|------|
| Editor saves a draft | `UPDATE user_canvas SET dsl=? WHERE id=?`; **no version is created** |
| Click "Publish" | `INSERT user_canvas_version(...)` adds a new row; `UPDATE user_canvas SET release=true, dsl=?, update_at=NOW()` |
| Run without a version | Fetch the **latest** `user_canvas_version` (`create_time DESC LIMIT 1`) |
| Run `?version=v_xxx` | Fetch the **specified** `user_canvas_version` |
| Run `?version=draft` | Fetch `user_canvas.dsl` (unpublished editor state) |
| Delete a version | `DELETE FROM user_canvas_version WHERE id=?`; **other versions are unaffected** |
| Delete the whole agent | Cascade-delete all versions |
**Retention policy**:
- **Old versions are never deleted automatically**; users/admins delete them explicitly
- **No limit on the number of versions**; table space is not the bottleneck
- **Optional** `agents_max_versions` setting (disabled by default)
**API**:
- `GET /api/v1/agents/{id}/versions`: list
- `POST /api/v1/agents/{id}/versions`: explicit publish
- `DELETE /api/v1/agents/{id}/versions/{version_id}`: delete
- `GET /api/v1/agents/{id}/versions/{version_id}`: detail
- `POST /api/v1/agents/{id}/run?version=xxx`: run a specific version (default = latest)
**Python compatibility**: `user_canvas.dsl` is retained (draft / latest-published copy), so legacy frontend endpoints can still read it; new publishes on the Go side always insert a new row and **do not break** existing Python data.
---
## 10. Third-party Libraries (License Gate)
### 10.1 Decision summary
| Purpose | Choice | License | Notes |
|------|-----|---------|------|
| **PDF generation** | `signintech/gopdf` | MIT | Primary choice; TTF font registration, CJK, and header/footer built in |
| **PDF fallback** | `go-pdf/fpdf` (codeberg.org fork) | MIT | GitHub main repository archived on 2025-03-04 |
| ~~PDF unipdf~~ | ~~`unidoc/unipdf`~~ | ~~AGPL-3 + commercial~~ | ❌ Excluded (viral copyleft) |
| **DOCX generation** | **In-house** OOXML writer | - | Go stdlib `archive/zip` + `text/template` + `//go:embed` |
| ~~DOCX unioffice~~ | ~~`unidoc/unioffice`~~ | ~~AGPL-3 + commercial~~ | ❌ Excluded (viral copyleft) |
| ~~DOCX fumiama-go-docx~~ | ~~`fumiama/go-docx`~~ | ~~AGPL-3~~ | ❌ Excluded (viral copyleft) |
| **Excel read/write** | `xuri/excelize/v2` | BSD-3 | No license risk; the standard choice |
| **Markdown parsing** | `yuin/goldmark` | MIT | CommonMark compliant |
| **HTML parsing** | `golang.org/x/net/html` | BSD-3 | stdlib-adjacent (`golang.org/x`) |
| **OpenTelemetry SDK** | `go.opentelemetry.io/otel` v1.44.0 | Apache-2.0 | Includes sdk + otlptrace/otlptracehttp + semconv |
| **MySQL driver** | `go-sql-driver/mysql` | MPL-2.0 | ExeSQL goes through stdlib `database/sql` |
| **PG driver** | `lib/pq` | MIT | ExeSQL goes through stdlib `database/sql` |
| **MSSQL driver** | `denisenkom/go-mssqldb` | BSD-3 | ExeSQL goes through stdlib `database/sql` |
| **HTTP retry** | In-house exponential backoff | - | Helper shared by 17+ HTTP tools |
| **Test SQL mock** | `DATA-DOG/go-sqlmock` | MIT | ExeSQL injection tests |
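### 10.2 Key arguments
**Zero tolerance for AGPL-3**: RAGFlow is Apache-2.0; AGPL-3's copyleft would force the entire RAGFlow Go binary to become AGPL-3. All AGPL-3 candidates (unipdf / unioffice / fumiama-go-docx / baliance-gooxml) are **excluded**.
**DOCX must be implemented in-house** (result of an exhaustive survey):
- AGPL-3 camp: unioffice (dual-licensed with a commercial option), fumiama/go-docx (active but copyleft), baliance/gooxml (stagnant and copyleft)
- MIT/Apache camp: tealeg (stagnant), lytdev (incomplete features), legion-zver (license unclear)
**Feasibility of the in-house implementation**:
- DOCX = ZIP container + XML parts (`document.xml` / `header*.xml` / `footer*.xml` / `styles.xml` / `[Content_Types].xml` / `_rels/*.rels`)
- The Go stdlib `archive/zip` is enough to produce the container
- **No 1:1 `encoding/xml` struct mapping** (OOXML has ≈ 500+ elements, which would balloon to 5K+ LoC); instead use a **hybrid of `//go:embed` static baselines and `text/template` dynamic rendering**
  - Fixed parts (`[Content_Types].xml` / `_rels/.rels`) → `//go:embed` into package-level `[]byte` variables (Go has no `const []byte`)
  - Dynamic parts (`document.xml` / `header1.xml` / `footer1.xml` / `styles.xml`) → `text/template`
  - `funcMap["xml"]` goes through `template.HTML` + `escapeXMLAttr` (prevents `&`/`<`/`>` in user content from breaking the XML structure)
- **Code size** ≈ 350 lines of core + 200 lines of templates = 550 lines (2.7× smaller than the "1.5K LoC struct mapping" estimate)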
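**Compared with the Python side's pypandoc + xelatex approach**:
- Advantage: avoids external binary dependencies (pandoc + TeX Live ≈ 800MB of image bloat)
- Cost: an in-house implementation, estimated at 1.5K LoC and 0.55K LoC in practice
**Golden Master snapshot tests** (guard against XML structure regressions):
- 10+ standard cases: minimal / full (with watermark + page#) / cjk / nested_table / list_numbering / heading_levels / page_break / section_break / multi_header / long_text / special_chars / empty_doc
- Generate DOCX → `unzip` → pretty-print → `cmp.Diff` against `testdata/golden_*.xml`
- `UPDATE_GOLDEN=1` triggers a golden rewrite
- Manual Word compatibility check: the file opens in headless LibreOffice without a "file is corrupted" prompt; this is part of the completion checklist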
### 10.3 Full license audit (14 candidate libraries)
> Audit time: Phase 0. Rule: AGPL-3 / SSPL / Commons Clause / BUSL → **always rejected** (viral, incompatible with Apache-2.0).
| # | Library | License | Decision | Justification |
|---|---------|---------|----------|---------------|
| 1 | `unidoc/unipdf` | AGPL-3.0 | ❌ DENIED | AGPL-3 §13 viral |
| 2 | `unidoc/unioffice` | AGPL-3.0 | ❌ DENIED | Same as above |
| 3 | `fumiama/go-docx` | AGPL-3.0 | ❌ DENIED | AGPL-3; replaced by the in-house OOXML writer |
| 4 | `baliance/gooxml` | AGPL-3.0 | ❌ DENIED | Dual-licensed, but the open-source option is still AGPL-3 |
| 5 | `tealeg/golang-docx` | BSD-3 | ⚠️ CONDITIONAL | Stagnant; not adopted |
| 6 | `legion-zver/go-docx-templates` | AGPL-3.0 | ❌ DENIED | AGPL-3 |
| 7 | `lytdev/go-docxlib` | AGPL-3.0 | ❌ DENIED | AGPL-3 + low activity |
| 8 | `signintech/gopdf` | MIT | ✅ APPROVED | Primary PDF choice |
| 9 | `go-pdf/fpdf` | MIT | ✅ APPROVED | PDF fallback (replaces the archived `gofpdf`) |
| 10 | `jung-kurt/gofpdf` | MIT (archived) | ❌ DENIED | Upstream archived; no security patches |
| 11 | `pdfcpu/pdfcpu` | Apache-2.0 | ✅ APPROVED | PDF read/inspect/merge |
| 12 | `ledongthuc/pdf` | BSD-2 | ⚠️ CONDITIONAL | Prefer `pdfcpu` |
| 13 | `xuri/excelize/v2` | BSD-3 | ✅ APPROVED | Primary Excel choice; de facto standard in the Go ecosystem |
| 14 | `yuin/goldmark` | MIT | ✅ APPROVED | Markdown→HTML |
**AGPL-3 pre-screening rules** (for future dependencies):
- README header contains "AGPL" or "Affero" → reject immediately
- First line of the LICENSE file contains "Affero General Public License" → reject
- GitHub license badge shows AGPL-3.0 / SSPL-1.0 → reject
- `go-licenses check` in CI hits AGPL → build fails
**Re-verification triggers**: upstream changes its license, a new major version is relicensed, a dependency is archived, or a new CVE has no patch.
---
## 11. HTTP API
| Method | Path | Handler | Description |
|--------|------|---------|------|
| `GET` | `/api/v1/agents` | `ListAgents` | Already exists (commit `0a7662cf3`) |
| `POST` | `/api/v1/agents` | `CreateAgent` | New |
| `GET` | `/api/v1/agents/{id}` | `GetAgent` | Automatic v1/v2 conversion; returns the draft DSL |
| `PATCH`| `/api/v1/agents/{id}` | `UpdateAgent` | Updates the draft; **does not create a version** |
| `DELETE`| `/api/v1/agents/{id}` | `DeleteAgent` | Cascade-deletes all versions |
| `POST` | `/api/v1/agents/{id}/run` | `RunAgent` | Synchronous; `?version=v_xxx` (default = latest), `?version=draft` = draft |
| `POST` | `/api/v1/agents/{id}/stream` | `StreamAgent` | SSE; `?version=` as above |
| `POST` | `/api/v1/agents/{id}/cancel` | `CancelAgent` | Writes the Redis cancel key |
| `GET` | `/api/v1/agents/{id}/versions` | `ListVersions` | Lists versions |
| `POST` | `/api/v1/agents/{id}/versions` | `PublishVersion` | Publishes a new version; **never overwrites** |
| `GET` | `/api/v1/agents/{id}/versions/{vid}` | `GetVersion` | Version detail |
| `DELETE`| `/api/v1/agents/{id}/versions/{vid}` | `DeleteVersion` | Deletes the specified version |
**SSE event payload** (matches Python `agent_api.py`):
```json
{"event": "node_start"|"node_finish"|"message"|"error", "task_id": "...", "component": "cpn_id", "data": {...}}
```
---
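## 12. Acceptance Criteria
| Category | Criterion |
|------|------|
| **Functionality** | 19 components × ≥3 unit tests = ≥57 component unit tests; 21 tools × ≥2 unit tests = ≥42 tool unit tests |
| **eino reuse** | T1 components (LLM/Agent) regression-run eino's own `react_test.go` / `chatmodel_test.go` / `compose_test.go` without degradation |
| **Functionality** | 100 v1 DSL samples → v2 → scheduled execution; results match the Python side |
| **Functionality** | `{{cpn_id@param}}` lets any node read any node, `globals` read/write, and `sys.x` / `env.x` resolution, all covered by unit tests |
| **Functionality** | SSE event sequence matches Python `agent_api.py` (node_start / node_finish / message / error) |
| **Concurrency** | 100 concurrent canvas runs, single tenant: P99 startup latency < 200ms (excluding component execution) |
| **Concurrency** | Scheduler overhead: scheduling a 100-node DAG takes < 50ms |
| **Concurrency (State mutex hard gate)** | `BenchmarkStateMutex` at 100 nodes / 1000 concurrent: `ns/op < 500µs` (Phase 2 is blocked until this passes; the fallback is a sharded RWMutex) |
| **Reliability** | Redis cancel protocol: nodes stop within 5s of cancel; p99 ≤ 500ms with 500ms polling |
| **Reliability** | Stream interruption (client disconnect) → node exits within 30s |
| **Compatibility** | v1 DSL loads with zero modification (≥99% of samples); failing samples produce explicit errors |
| **Compatibility** | v2 → v1 output can still be loaded by the old Python reader |
| **Observability** | OTel handler P99 overhead < 2% (100 nodes); no-op when no endpoint is configured, with a P99 startup latency change < 1ms |
| **checkpoint** | Redis `RedisCheckPointStore` Get/Set/Delete passes the eino integration tests; after a cancel, the resume_from path does not re-execute nodes that already completed |
| **checkpoint** | The 30-day TTL is guaranteed natively by Redis `EXPIRE` |
| **Code quality** | 100% godoc comments on public APIs (enforced by golangci-lint revive); 100% comments on complex algorithms / state machines / concurrency primitives (karpathy principle); ≥80% test coverage on `internal/agent/canvas` |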
---
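## 13. Risks and Mitigations
| Risk | Severity | Mitigation |
|------|--------|------|
| **eino State mutex contention under high concurrency** | Medium | Benchmark at the end of Phase 1; if scheduling overhead exceeds 5%, introduce a sharded mutex (hash by `cpn_id`, N = `min(NumCPU*4, 64)`) |
| **100% v1 DSL compatibility is impossible** (Python decoration fields) | Medium | Incompatible legacy DSL goes through an "auto-convert + notify" path; fields are never dropped silently |
| **Component interface signatures diverge from Python** | Medium | Keep signatures identical → conversion code is a 1:1 replica → behavior matches |
| **Tool external HTTP failures** | Medium | Reuse the retry in `http_helper.go`; mock tests cover 5xx / timeout / DNS |
| **Python task_executor protocol out of sync** | Low | `internal/proto/ingestion.proto` is deprecated; Python task_executor registration/heartbeat still goes through Redis |
| **Frontend DSL editor only understands v1** | Medium | Phase 5 keeps the ability to write v1; the frontend v2 editor is scheduled as a separate project |
| **Test environment has no LLM key** | Low | All LLM component tests use a mock provider driver (`internal/entity/models/dummy.go` pattern) |
| **deepdoc remains in Python, requiring cross-language tracing** | Medium | Cross-language deepdoc calls go over HTTP; tracing is stitched together by the OpenTelemetry propagator |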
---
## 14. Plan vs Reality
This section captures the deviations between the original plans and the code as it stands on 2026-06-11.
### 14.1 Component count: planned 22 → 21 → **actual 19**
| Plan source | Description | Actual |
|---------|------|------|
| §2.11.3 rows 11-13 | `Iteration` / `IterationItem` / `Loop` / `LoopItem` = 4 independent components | Only `Loop` (`component/loop.go`); the other 3 are **not registered as components** and are absorbed by the macro expansion in `canvas/loop_subgraph.go` as the subgraph of a single `Loop` node |
| §2.11.3 row 13 | `ExitLoop` no-op component | **Not registered as a component**; absorbed at the canvas layer by `legacyNoOpNames` (DSL v1 compat) |
| §2.11.3 row 8 | `Agent` via T1, with in-house citation middleware + tool artifact collection | `Agent` is implemented (T1 + `react.NewAgent` + 22 tool registrations); **citation middleware and tool artifact collection are not implemented** (see §14.4) |
Actual `.go` file list (19 component .go files):
```
agent.go, begin.go, browser.go, categorize.go, data_operations.go,
docs_generator.go, excel_processor.go, fillup.go, invoke.go,
list_operations.go, llm.go, loop.go, message.go, parallel.go,
string_transform.go, switch.go, userfillup.go, variable_aggregator.go,
variable_assigner.go
```
Plus 5 helpers: `base.go, registry.go, runtime_wire.go, io_init.go, v1_stubs.go`.
### 14.2 T5 path: planned `component/io/` subdirectory → actually the root directory
| Plan source | Description | Actual |
|---------|------|------|
| §4.1 directory tree | `internal/agent/component/io/{docs_generator.go, excel_processor.go, docx_writer.go, pdf_writer.go, md_ast.go, ...}` | `docs_generator.go` / `excel_processor.go` live in the `internal/agent/component/` root; `docx_writer.go` / `pdf_writer.go` / `md_ast.go` were **not split out** (possibly inlined in docs_generator.go) |
| §2.11.5.3 | `docx_writer.go` ≈ 350 lines of core + 5 .tmpl files | The in-house OOXML writer exists; the template / file structure needs further verification |
### 14.3 Dual-write vs OpenTelemetry: fully switched
The early version of `agent-go-port.md §2.10` specified "Redis Stream + MySQL dual-write"; the 2026-06-03 decision switched to OTel. The `internal/observability/otel/` trio (provider.go / handler.go / handler_test.go) has landed in the current code, and the MySQL `agent_run_log` table was **never created**.
### 14.4 Agent component: one P0 gap
> **✅ Closed 2026-06-11** (commit pending): both middlewares have landed; see `toolArtifactCapture` / `maybeAppendCitation` in `component/agent.go`.
`component/agent.go` uses T1: `react.NewAgent` + 22 tool registrations. The two **in-house middlewares** mentioned in plan §2.11.6 D2 are currently implemented as follows:
- **Tool artifact collection**: an eino `ToolCallbackHandler` is attached via `react.NewAgent(... compose.WithCallbacks(cb))`. `OnStart` captures `ArgumentsInJSON`; `OnEnd` captures `CallbackOutput.Response`. The capture travels through `context.WithValue` (`toolArtifactKey`): it is installed at the `AgentComponent.Invoke` entry, written by the callbacks inside the runner, and read at the runner exit. The **runner signature is unchanged**, and the test seam `withAgentRunner` can still seed artifacts
- **Citation middleware**: `maybeAppendCitation(ctx, chatModel, msg)` is called after ReAct finishes. Logic:
  1. `runtime.GetStateFromContext[*CanvasState](ctx)` fetches the state (no state → no-op)
  2. `state.Retrieval["chunks"]` is empty / nil / an empty slice → no-op (**avoids a pointless LLM call**)
  3. Otherwise `chatCompleter.Generate(...)` issues one follow-up LLM call; the prompt template asks the model to add `[n]` citation markers to the original text
  4. Both the failure and no-op paths leave `msg.Content` unchanged (best-effort polish)
- Before this change, the `AgentOutput.Artifacts` field at `component/agent.go:51` **always returned an empty slice** (`"artifacts": []map[string]any{}`); it is now populated with real content via `artifactsToMaps(readToolArtifacts(ctx))`.
**Test coverage** (`agent_test.go`):
- `TestAgent_ReadsArtifactsFromContext`: verifies that the test seam can seed the capture; the Invoke output contains 2 artifacts (one OnStart args + one OnEnd response)
- `TestAgent_ArtifactsEmptyWhenRunnerSeedsNothing`: verifies that an empty slice, not nil, is returned when nothing is seeded (stable schema)
- `TestAgent_MaybeAppendCitation_NoState`: no state → the LLM is not called
- `TestAgent_MaybeAppendCitation_EmptyChunks`: empty chunks → the LLM is not called (avoids waste)
- `TestAgent_MaybeAppendCitation_AppendsTail`: normal path; content is concatenated as `original + "\n\n" + cited`
### 14.5 ExeSQL decision landed per the 2026-06-11 review
The 2026-06-11 changelog of `agent-go-port.md` records that ExeSQL uses stdlib `database/sql` plus per-database drivers and does **not reuse** the `internal/dao` GORM layer. The current `component/tool/exesql.go` follows this approach (centralized assembly in `exesqlDriverAndDSN` + `exesqlDialer` injection + `DATA-DOG/go-sqlmock` tests). ✅
### 14.6 workflowx extension: fully implemented
The `AddLoopNode[T]` / `AddParallelNode[I,O]` described in `eino-workflow-loop.md` and `eino-workflow-parallel.md` have landed in `internal/agent/workflowx/`, along with `loop_test.go` / `loop_integration_test.go` / `parallel_test.go` / `parallel_integration_test.go` (**including a miniredis-style in-memory checkpoint store that simulates the real eino integration path**).
### 14.7 runtime package: extracted from both canvas and component
The "extract the shared runtime contract into `internal/agent/runtime/`" step described in `fluffy-strolling-bear.md` has landed as 6 files: `component.go` / `context.go` / `metrics.go` / `selector.go` / `state.go` / `template.go`. `canvas/state_export.go` keeps thin aliases for tests; production code does not depend on them. ✅
### 14.8 Open Questions
| ID | Question | Status |
|----|------|------|
| Q1 | Go-ization strategy for Retrieval + GraphRAG | ✅ Closed (strategy A: Go Retrieval shell + in-process direct Dealer call; `use_kg=True` returns a configuration error) |
| Q2 | Checkpoint persistence | ✅ Closed (Redis, 30d TTL, dual key) |
| Q3 | Cross-language call strategy + observability | ✅ Closed (deepdoc over HTTP; OTel integrated) |
| Q4 | DSL v2 metadata (author/tags/created_at) | ✅ Closed (**not added to the v2 schema**; metadata lives in backend fields such as `user_canvas.title/description`) |
| Q5 | Tenant LLM default model injection | ✅ Closed (`service.ModelProviderService.GetChatModel` + `entity/models.NewChatModel` + eino `model.ChatModel`) |
| Q6 | Streaming WebSocket support | ⏸️ **pending demand**: SSE only for now; not implemented until a user/product requirement triggers it |
| Q7 | Component hot reload | ✅ Closed (not supported; follows Python v1 behavior) |
| Q8 | Go-ization of the Retrieval tool | ✅ Closed (strategy A; 0 gRPC) |
| Q9 | v1.1 cgo-embedded CPython calling KGSearch | ⏸️ Not planned for now |
| Q11 | T5 cgo bindings | ✅ Closed (no cgo; pure Go libs / in-house implementation) |
### 14.9 Planned phases vs code delivered
| Phase | Planned duration | Status |
|-------|---------|---------|
| Phase 0: preparation (interface inventory, license gate, deepdoc endpoint research) | 1 week | ✅ All deliverables produced (`docs/agent-port/*.md` × 5) |
| Phase 0.5: Deepdoc client type contract | 0.5 day | ✅ `internal/deepdoc/{client,dla,ocr,tsr}.go` + 24 unit tests (HTTP/multipart/retry/4xx-5xx/ctx-cancel all covered) |
| Phase 1: canvas skeleton | 2.5 weeks | ✅ `canvas/{state, variable, scheduler, cancel, stream, checkpoint_store, run_tracker, state_serializer, compile}.go` all in place |
| Phase 2: component library | 4.5-7 weeks | ✅ 19 components + 5 tiers all implemented (P0-P4 delivered in mixed order) |
| Phase 2.5: DSL v2 + v1→v2 | 1.5 weeks | ✅ `internal/agent/dsl/{v2.go, loader.go, converter_v1_to_v2.go}` |
| Phase 3: tool library | 2.5-3.5 weeks | ✅ 21 tools + `BuildAll`/`BuildByName` registry |
| Phase 5: HTTP/RPC | 1.5-2.5 weeks | ✅ 12 endpoints + 3 version endpoints |
| Phase 5.5: DSL v2 write compatibility | 1 week | ✅ `converter_v2_to_v1.go` |
| Phase 6: canary rollout | 1-2 weeks | ❌ **Not started**: the `tenant_canvas_runtime_mode` config table is not implemented; Python `agent_api.py` remains the primary path |
| Phase 7: cleanup | 1 week | ❌ **Not started**: the Python side is not marked `@deprecated`; lines 314-316 of `docs/go-python-implementation-status.md` have not been updated to "Go-ized" |
### 14.10 Phase 6: Per-Tenant Runtime Selector (infrastructure delivered)
**Delivered on the Go side**:
| File | Purpose |
|------|---------|
| `internal/agent/runtime/selector.go` | Per-tenant runtime mode selector; reads `tenant_canvas_runtime:{tenantID}` from Redis and falls back to `RAGFLOW_CANVAS_DEFAULT_RUNTIME` (default `python`) |
| `internal/agent/runtime/metrics.go` | Prometheus counter `ragflow_canvas_runs_total{runtime,outcome}` + histogram `ragflow_canvas_run_duration_seconds{runtime}` |
| `internal/handler/admin_runtime.go` | `POST /api/v1/admin/canvas-runtime/:tenant_id`: flips the tenant override |
| `internal/router/admin_routes.go` | `RegisterAdminRuntimeRoutes` helper |
**Operating contract**:
- Default behavior: `RAGFLOW_CANVAS_DEFAULT_RUNTIME=python` → all tenants use Python
- Promote a tenant: `curl -X POST .../admin/canvas-runtime/tenant_42 -d '{"runtime":"go"}'`
- Roll back: same call with `{"runtime":"python"}`
- Overrides are stored in Redis without a TTL (they stay in effect until explicitly overwritten)
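The selection order described above (tenant override, then environment default, then `python`) can be sketched without a Redis client. `resolveRuntime`, `lookup`, and `fromMap` are hypothetical names for illustration; ignoring values other than `go` / `python` is an assumption, since the document does not say how invalid values are handled.

```go
package main

import "fmt"

const defaultRuntimeEnv = "RAGFLOW_CANVAS_DEFAULT_RUNTIME"

// lookup abstracts a key-value read. In production the Redis GET and
// os.LookupEnv would be plugged in; maps are used here to stay self-contained.
type lookup func(key string) (string, bool)

func validRuntime(v string) bool { return v == "go" || v == "python" }

// resolveRuntime picks the runtime for one tenant:
// 1. per-tenant override at tenant_canvas_runtime:{tenantID}
// 2. RAGFLOW_CANVAS_DEFAULT_RUNTIME
// 3. "python"
// Assumption: unrecognized values are skipped rather than treated as errors.
func resolveRuntime(tenantID string, redisGet, envGet lookup) string {
	if v, ok := redisGet("tenant_canvas_runtime:" + tenantID); ok && validRuntime(v) {
		return v
	}
	if v, ok := envGet(defaultRuntimeEnv); ok && validRuntime(v) {
		return v
	}
	return "python"
}

// fromMap adapts a map to the lookup signature.
func fromMap(m map[string]string) lookup {
	return func(k string) (string, bool) {
		v, ok := m[k]
		return v, ok
	}
}

func main() {
	redis := fromMap(map[string]string{"tenant_canvas_runtime:tenant_42": "go"})
	env := fromMap(map[string]string{})
	fmt.Println(resolveRuntime("tenant_42", redis, env)) // go
	fmt.Println(resolveRuntime("tenant_7", redis, env))  // python
}
```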
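**Staging canary run-book**:
1. Deploy the Go Canvas service (no user traffic)
2. Verify that the default is `python` and the Go service is idle
3. Promote 100 tenants to Go
4. Run the standard load: 1000 runs/tenant × 30 minutes
5. Observe: `rate(ragflow_canvas_runs_total{runtime="go"}[5m])` differs from the Python rate by < 1%; p99 < 2s
6. Rollback drill: switch 1 tenant back to Python, < 5s p99
7. SLO met for 24h → proceed to Phase 7
**Preconditions for starting Phase 7** (verified by the staging canary):
- 100 tenants × 1000 runs, success-rate parity ≤ 1%
- Go p99 latency < 2s sustained for 24h
- Rollback drill p99 < 5s sustained for 24h
- Admin endpoint auth gap closed
### 14.11 Phase 7: Python `agent_api.py` deprecation (Go side delivered, Python side blocked)
**Delivered on the Go side**:
- Hybrid routing default flipped to 100% Go
- Per-tenant override retained as a fallback window
- Status document updated to "Go-ized"
**Python-side to-dos** (owned by the Python team; the Go side has no authority to touch them):
1. Add a `@deprecated` docstring + `DeprecationWarning` to `api/apps/agent_app.py`
2. Add a compatibility proxy shim: `/api/v1/agents/*` → proxy to the Go service (`RAGFLOW_GO_CANVAS_URL`), falling back to Python when Go is unreachable
3. Removal timeline: 1 release (~3 months) after the Phase 7 release, if 0 active tenants use Python for 7 consecutive days → delete the deprecated module
**Safe-deletion acceptance gate**: the PromQL query `ragflow_canvas_runs_total{runtime="python"}` stays at 0 for 7 consecutive days; no Redis `tenant_canvas_runtime:*` key holds the value `"python"`; no support tickets on the Python canvas path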
**Rollback**: for a single tenant, `POST .../admin/canvas-runtime/<id> -d '{"runtime":"python"}'`; for a cluster-wide rollback, set `RAGFLOW_CANVAS_DEFAULT_RUNTIME=python` and restart the Go service.
---
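## 15. Future Work
1. **DSL v3**: typed expressions (compile-time validation of `{{cpn_id@param}}`)
2. **eino ecosystem alignment**: `AddAgenticModelNode` replaces the LLM component; `AddRetrieverNode` replaces the Retrieval component
3. **Go-ization of the GraphRAG component** (scheduled as a separate project)
4. **WebSocket streaming support** (Q6, pending demand)
5. **Checkpoint enhancements**: reuse across canvas runs; incremental checkpoints (write only the diff channel)
6. **Phase 6 canary + Phase 7 cleanup**: move Python `agent_api.py` traffic to Go
7. **If product/UI needs canvas-level tags/author**: add `tags` / `author_id` columns to the `user_canvas` table (do **not** change the v2 DSL schema; see the Q4 decision)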
---
## Appendix A · Key Files
Grouped by "design point affected when this file changes":
| Design point | Key files |
|--------|---------|
| **State pattern** | `internal/agent/canvas/{state.go, scheduler.go}` + `internal/agent/runtime/{state.go, context.go}` |
| **runtime extraction** | `internal/agent/runtime/*.go` (6 files) + `internal/agent/canvas/state_export.go` |
| **Loop macro expansion** | `internal/agent/canvas/loop_subgraph.go` + `internal/agent/component/loop.go` (no-op marker) |
| **Parallel** | `internal/agent/component/parallel.go` + `internal/agent/workflowx/parallel.go` |
| **Generic Loop node** | `internal/agent/workflowx/loop.go` + `loop_test.go` + `loop_{integration,options}_test.go` |
| **Checkpoint** | `internal/agent/canvas/{checkpoint_store.go, run_tracker.go, state_serializer.go, compile.go}` |
| **Cancel protocol** | `internal/agent/canvas/cancel.go` |
| **OTel** | `internal/observability/otel/{provider.go, handler.go, handler_test.go}` |
| **DSL v2** | `internal/agent/dsl/{v2.go, loader.go, converter_*.go}` |
| **Tool registry** | `internal/agent/tool/registry.go` + `http_helper.go` + `ssrf.go` |
| **Component 5-tier** | `internal/agent/component/{base.go, registry.go, runtime_wire.go}` + 19 component .go |
## Appendix B · Test Coverage
| Package | Test files | Coverage |
|----|-----------|--------|
| `internal/agent/canvas` | 14 | `canvas_test.go, scheduler_test.go, state_test.go, variable_test.go, state_bench_test.go, state_serializer_test.go, checkpoint_store_test.go, run_tracker_test.go, cancel_test.go, stream_test.go, loop_subgraph_test.go, loop_semantics_test.go, dsl_examples_e2e_test.go, cycle_wrap_test.go` |
| `internal/agent/component` | 16+ | Per-component `_test.go` + `verify_p1_test.go` (batch regression) |
| `internal/agent/tool` | 21+ | Per-tool `_test.go` + `registry_test.go` (schema sweep + alias consistency) |
| `internal/agent/runtime` | 2 | `metrics_test.go, selector_test.go` |
| `internal/agent/workflowx` | 8 | `loop_test.go, loop_options_test.go, loop_integration_test.go, loop_example_test.go, parallel_test.go, parallel_options_test.go, parallel_integration_test.go, parallel_helpers_test.go` |
| `internal/agent/dsl` | 4 | `loader_test.go, converter_v1_to_v2_test.go, converter_v2_to_v1_test.go, v1_examples_test.go` (42 tests, including 12 v2→v1 + round-trip) |
| `internal/observability/otel` | 1 | `handler_test.go` (tracetest.SpanRecorder) |
---
## Appendix C · Deepdoc Service Endpoints (DLA/OCR/TSR)
> Phase 0 research deliverable. Documents the wire contract for the deepdoc vision stack (DLA remote HTTP, OCR/TSR local ONNX only).
### C.1 Endpoint summary
| Endpoint | URL | Status | Go port need |
|----------|-----|--------|--------------|
| DLA (Document Layout Analysis) | `POST {DEEPDOC_URL}/predict` | Remote HTTP (via `dla_cli.py`, fork only) | Go client with 3-retry + 18s timeout |
| OCR | **No remote endpoint** | Local ONNX only (`deepdoc/vision/ocr.py`) | None — `ErrNotImplemented` stub |
| TSR (Table Structure Recognition) | **No remote endpoint** | Local ONNX only | None — `ErrNotImplemented` stub |
Single toggle: `DEEPDOC_URL` (preferred) or `TENSORRT_DLA_SVR` (legacy). When unset, LayoutRecognizer loads local ONNX.
### C.2 DLA HTTP contract
- **Method**: `POST {DEEPDOC_URL}/predict`
- **Body**: `multipart/form-data`, field name `request`, raw JPEG bytes
- **Response**: `{"bboxes": [[left, top, right, bottom, score, type_idx], ...]}`
- **Timeout**: 18s per request; **3 retries** per image with `Session` rebuild
- **Failure sentinel**: empty list `[]` for that image
#### DLA class taxonomy (10 classes)
| idx | Class | idx | Class |
|----:|-------|----:|-------|
| 0 | title | 5 | Table |
| 1 | Text | 6 | Table caption |
| 2 | Reference | 7 | Table caption (dup) |
| 3 | Figure | 8 | Equation |
| 4 | Figure caption | 9 | Figure caption (dup) |
> Note duplicates at idx 4/6/7/9. Go port must use same array ordering and lowercase normalization — renumbering is a wire-format break.
### C.3 Go client placeholder (`internal/deepdoc/client.go`)
Phase 0 delivers a typed Go client whose methods only return `ErrNotImplemented`. Phase 2 P3 fills in `DLA(ctx, images [][]byte) ([]DLAResult, error)`:
- Build multipart body with `mime/multipart`, field `request`, `Content-Type: image/jpeg`
- POST to `baseURL + "/predict"`
- Decode `{bboxes: [[l,t,r,b,score,ty], ...]}`, map `ty` through `DLA_CLASSES`
- 3-retry + 18s timeout with `http.Client.Timeout`
- Wrap transport with `otelhttp.NewTransport` for trace propagation
### C.4 Environment variables
```
DEEPDOC_URL # preferred; full URL e.g. http://deepdoc:11234
TENSORRT_DLA_SVR # legacy alias; honored as fallback
```
### C.5 LayoutRecognizer consumers
The single Python module calling into DLA HTTP is `deepdoc/vision/layout_recognizer.py`, consumed by:
- Resume parser (`rag/app/resume.py`)
- Table recognizer (`deepdoc/vision/t_recognizer.py`)
---
## Appendix D · DSL v1 Corner Cases Inventory
> Phase 0 deliverable. Canonical v1 DSL schema + 15 corner-case categories anchored on `agent/canvas.py:43-95` and `agent/component/base.py:368-369`.
### D.1 Top-level DSL shape
```json
{
"components": {
"<cpn_id>": {
"obj": {"component_name": "Retrieval", "params": {...}},
"downstream": ["generate_0"],
"upstream": ["answer_0"]
}
},
"path": ["begin"],
"history": [],
"retrieval": {"chunks": [], "doc_aggs": []},
"globals": {"sys.query": "", "sys.user_id": "...", "sys.conversation_turns": 0,
"sys.files": [], "sys.history": [], "sys.date": "..."},
"variables": {},
"memory": []
}
```
### D.2 Variable reference syntax
Two regexes:
```
variable_ref_patt = r"\{* *\{([a-zA-Z:0-9]+@[A-Za-z0-9_.-]+|sys\.[A-Za-z0-9_.]+|env\.[A-Za-z0-9_.]+)\} *\}*"
iteration_alias_patt = r"\{* *\{(item|index|result)\} *\}*"
```
Key behaviors the Go port must mirror:
- **Brace tolerance**: `{{var}}`, `{{ var }}`, `{{{var}}}` are all valid
- **`sys.*`/`env.*`**: namespace-only (no `@`), read from `State` flat namespace
- **`cpn_id@param.nested.path`**: dot-path traversal with `json.loads` on strings, `dict.get`, `list[int]` index, `getattr` fallback
- **`set_variable_value`**: auto-creates missing dict keys in the path
- **`functools.partial`**: unwrapped during variable resolution (message streaming)
- **Empty `{{...}}`**: resolves to `""`, never crashes
- **`is_reff`**: returns `True` only if `cpn_id@param` resolves to a known component; otherwise treats as literal
### D.3 `custom_header` injection
`custom_header` is a **per-run HTTP header dict**, NOT a stored DSL field. The loader injects it at `canvas.py:102` before `param.update()`. Go port must:
1. Strip `custom_header` from stored DSL on read
2. Pass via Canvas run context, NOT via `ComponentParamBase`
3. Surface to relevant tool/component via State
### D.4 Three-set parameter decoration (REMOVED in v2)
Python stores 4 internal keys per param instance: the three sets `_feeded_deprecated_params`, `_deprecated_params`, and `_user_feeded_params`, plus the `_is_raw_conf` flag. The Go port's DSL v2 **drops all 4** on v1→v2 conversion. Unknown keys are silently absorbed (permissive `update()`).
### D.5 `path` linearization & runtime mutation
`path` is mutated at runtime by: `begin` append on empty, iteration/loop/categorize/switch/exitloop extensions, `userfillup` reordering, `exception_goto` extension, node popping for out-of-order dependencies. Go scheduler must replicate same `path` semantics including `idx = to` truncation at batch end.
### D.6 `exception_goto`
`exception_goto` is a **list** of cpn_ids (usually length 1). Empty list = no-op. `exception_method` is one of `None` / `"comment"` / implicit `"goto"` (by presence of non-empty `exception_goto`). Once triggered, no further downstream extension (short-circuit).
### D.7 Nested messages / streaming
- `<think>`/`</think>` tokens → separate SSE events with `start_to_think`/`end_to_think` flags
- TTS audio batched at 16 chars
- After streaming completes, full concatenated string written to `set_output("content", ...)` for downstream `{{Message@content}}` references
- `partials` queue buffers components whose `content` is a partial until it drains
### D.8 `userfillup` interactive pause
Can appear in `path` multiple times. On re-entry, `begin` is NOT re-invoked. `enable_tips=True` produces a `tips` field rendered by frontend. Go port must reorder path so `userfillup` nodes come first on every re-entry.
### D.9 `globals` / `sys.*` / `env.*` semantics
6 default keys: `sys.query`, `sys.user_id`, `sys.conversation_turns`, `sys.files`, `sys.history`, `sys.date`. `sys.date` refreshed at every `run()`. `sys.conversation_turns` defensively coerces `None` → `0` then `+= 1`. `env.*` reset path falls back to type-based default (`number→0`, `boolean→false`, `string→""`, etc.). `sys.history` auto-appended on every assistant turn (duplicate store with `history` list).
### D.10 Component-name case-insensitivity
All comparisons use `.lower()`. Stored cpn_ids may be any case. Go port must NOT key component map by case-sensitive `cpn_id` — raw id for display, lowercase for internal lookups.
### D.11 Template samples
25 JSON templates in `agent/templates/` (~1.1 MB total) covering all 22 components. Key samples:
- `web_search_assistant.json` (~30K): Agent + Retrieval + Message, variable refs with whitespace
- `customer_feedback_dispatcher.json` (~34K): Categorize + Switch + Message
- `deep_research.json` (~144K, largest): heavy Iteration + Loop, ~30 component instances
- `data_analysis_beginner_assistant.json` (~22K): `exception_goto` with real cpn_ids
- `market_seo_article_writer.json` (~62K): DocsGenerator with PDF output, multiple Iterations
---
## Appendix E · Component & Tool Interface Inventory
> Phase 0 deliverable. 22 components + 21 tools with class hierarchy, public methods, input/output schemas, and key dependencies.
### E.1 Component inventory (22)
| # | Component | File | `component_name` | Tier | Key behavior |
|---|-----------|------|-----------------|------|-------------|
| 1 | Begin | `begin.py` | `Begin` | T3 | Consumes `kwargs["inputs"]`, resolves file inputs via `FileService.get_files` |
| 2 | UserFillUp | `fillup.py` | `UserFillUp` | T3 | Renders `tips` with variable interpolation, resolves file inputs |
| 3 | Fillup | (alias) | `Fillup` | T3 | Thin alias of UserFillUp (disable `enable_tips`) |
| 4 | Message | `message.py` | `Message` | T3 | Assembles final response: jinja2 prompt + stream + TTS + filegen + memory save |
| 5 | LLM | `llm.py` | `LLM` | T1 | Sync + async paths; `chatModel.Generate` / `Stream`; structured JSON output |
| 6 | Categorize | `categorize.py` | `Categorize` | T3 | LLM one-shot classification → `_next` (routing list) + `category_name` |
| 7 | Switch | `switch.py` | `Switch` | T2 | Evaluates boolean conditions; `_next` = matching downstream(s) |
| 8 | Agent | `agent_with_tools.py` | `Agent` | T1 | ReAct loop with `LLMBundle` + tool binding + citations |
| 9 | Iteration | `iteration.py` | `Iteration` | T4 | Resolves `items_ref`, validates array, drives `IterationItem` children |
| 10 | IterationItem | `iterationitem.py` | `IterationItem` | T4 | Round-local outputs aggregated by parent |
| 11 | Loop | `loop.py` | `Loop` | T4 | Initializes `loop_variables`, drives `LoopItem` children |
| 12 | LoopItem | `loopitem.py` | `LoopItem` | T4 | Evaluates `loop_condition`; `end()` → `True` triggers exit |
| 13 | ExitLoop | `exit_loop.py` | `ExitLoop` | T1 (Passthrough) | No-op; parent Loop extends path |
| 14 | Invoke | `invoke.py` | `Invoke` | T3 | HTTP GET/POST/PUT/PATCH/DELETE + headers/proxy/timeout/HTML cleanup |
| 15 | Browser | `browser.py` | `Browser` | T3 | LLM-driven browsing: page fetch, click, type, screenshot, MinIO upload |
| 16 | DataOperations | `data_operations.py` | `DataOperations` | T3 | 7 ops: select_keys/literal_eval/combine/filter/append_or_update/remove/rename |
| 17 | ListOperations | `list_operations.py` | `ListOperations` | T3 | 6 ops: nth/head/tail/filter/sort/drop_duplicates |
| 18 | StringTransform | `string_transform.py` | `StringTransform` | T3 | split/merge/jinja2 template ops |
| 19 | VariableAggregator | `variable_aggregator.py` | `VariableAggregator` | T3 | Returns first non-empty in each variable group |
| 20 | VariableAssigner | `variable_assigner.py` | `VariableAssigner` | T3 | 12 ops: overwrite/clear/set/append/extend/remove_first/last/`+=`/`-=`/`*=`/`//=` |
| 21 | DocsGenerator | `docs_generator.py` | `DocGenerator` | T5 | MD → PDF/DOCX/TXT/MD/HTML; header/footer/watermark/page# |
| 22 | ExcelProcessor | `excel_processor.py` | `ExcelProcessor` | T5 | Excel read/write/merge/convert via `pandas` + `openpyxl` |
### E.2 Tool inventory (21)
All tools extend `ToolBase` (`agent/tools/base.py:141`), expose `get_meta()` (OpenAI function-call schema), `_invoke`/`_invoke_async`, and `thoughts()`.
| # | Tool | `component_name` | Behavior |
|---|------|-----------------|----------|
| 1 | AkShare | `AkShare` | Chinese financial data (HTTP) |
| 2 | ArXiv | `ArXiv` | `export.arxiv.org/api/query` search |
| 3 | CodeExec | `CodeExec` | gRPC client to Python sandbox (kept as-is) |
| 4 | Crawler | `Crawler` | Generic HTML scraper (httpx + selectolax/BeautifulSoup) |
| 5 | DeepL | `DeepL` | DeepL Translate API (HTTP) |
| 6 | DuckDuckGo | `DuckDuckGo` | `html.duckduckgo.com/html` search |
| 7 | Email | `Email` | SMTP send via `smtplib` |
| 8 | ExeSQL | `ExeSQL` | MySQL/PG/MSSQL query via `database/sql` |
| 9 | GitHub | `GitHub` | GitHub REST API search |
| 10 | Google | `Google` | SerpAPI / Google CSE search |
| 11 | GoogleScholar | `GoogleScholar` | Scholar via SerpAPI |
| 12 | Jin10 | `Jin10` | Chinese financial news feed (HTTP) |
| 13 | PubMed | `PubMed` | NCBI E-utilities |
| 14 | QWeather | `QWeather` | HeFeng weather API |
| 15 | Retrieval | `Retrieval` | Dealer backend (Go-ized, in-process call) |
| 16 | SearXNG | `SearXNG` | Meta-search |
| 17 | TavilySearch | `TavilySearch` | Tavily search API |
| 18 | TavilyExtract | `TavilyExtract` | Tavily extract API |
| 19 | TuShare | `TuShare` | Tushare Chinese financial data |
| 20 | WenCai | `WenCai` | Tonghuashun (同花顺) iWencai (问财) stock Q&A |
| 21 | Wikipedia | `Wikipedia` | Wikipedia REST API |
| 22 | YahooFinance | `YahooFinance` | Yahoo Finance unofficial API |
### E.3 ComponentBase cross-cutting surface
Every `Component` exposes 21 methods: `invoke`/`invoke_async`/`_invoke`/`output`/`set_output`/`error`/`reset`/`get_input`/`get_input_values`/`get_input_elements_from_text`/`get_input_elements`/`set_input_value`/`get_input_value`/`get_param`/`get_upstream`/`get_downstream`/`get_parent`/`is_canceled`/`check_if_canceled`/`exception_handler`/`thoughts`.
### E.4 ToolBase cross-cutting surface
`ToolParamBase(ComponentParamBase)` wraps `inputs` from `meta["parameters"]`; `get_meta()` returns OpenAI function-call schema. `ToolBase(ComponentBase)` wraps `_invoke`/`_invoke_async` in `check_if_canceled` + records `_ERROR` + `_elapsed_time`. `LLMToolPluginCallSession` dispatches `tool_call_async(name, args)` to the right tool (or `MCPToolBinding`/`MCPToolCallSession`).