ragflow/test/unit_test/agent/test_pipeline_chunker.py

145 lines
5.5 KiB
Python
feat(agent): add Pipeline chunker component for pre-chunking workflows (#14773) (#15068) ### What problem does this PR solve? Closes #14773. Today, Pipeline (`rag/flow/`) chunking strategies only run as part of a dataset ingestion that always embeds and indexes the result. There is no way to drive Pipeline-style chunking from an Agent workflow without paying that vectorization/persistence cost. This PR adds a single new Agent component, `PipelineChunker`, that: - Takes one or more file references (from `Begin` / `UserFillUp` uploads) as input. - Runs the existing `rag.app.*` chunking strategies (`naive`, `paper`, `qa`, `manual`, `book`, `presentation`, `laws`, `table`, `one`, `email`, `picture`, `audio`, `resume`, `tag`) against each file. - Emits the resulting chunks as `chunks: list[str]` and `chunks_full: list[dict]` for downstream Agent nodes. - Performs **no embedding and no persistence** — chunks live only in canvas variables for the duration of the run, exactly as requested in the issue. The component is auto-discovered by `agent/component/__init__.py`; no registry edits required. Chunker functions are imported lazily so the component itself does not pull `deepdoc` / OCR / VLM at component-discovery time. File resolution mirrors the existing `ExcelProcessor` convention. Out of scope for this PR (potential follow-ups): - Vectorization / KB persistence (explicit ask in the issue). - Frontend canvas UI for the new component. - Bridging to the newer Pydantic-based `rag/flow/chunker/TokenChunker` (consumes a parser node's structured output rather than a raw file — a separate, larger feature). 
### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --- ## Files changed - `agent/component/pipeline_chunker.py` — new component (~180 lines) - `test/unit_test/agent/test_pipeline_chunker.py` — unit tests (~120 lines) ## Test plan - [x] `ruff check` on changed files — clean. - [x] `ruff format` applied to the new component file. - [x] `python -m py_compile` on both new files — both compile. - [x] New unit test file carries `pytestmark = pytest.mark.p2` so it runs under marker-filtered CI. - [x] Every new function, method, and class has a docstring (CodeRabbit 80% docstring-coverage gate). - [x] `python -m pytest test/unit_test/agent/test_pipeline_chunker.py -x -q` — **7 passed in 1.95s** locally. Tests stub `api.db.services.file_service` and `rag.app.*` so they exercise the parameter validation and parser-id lookup table without requiring the full backend / model stack. ## Manual integration plan (post-merge) 1. Drop the component into an Agent canvas after a `Begin` node with a file input. 2. Set `parser_id = "naive"` (or any other strategy) and reference the file input in `inputs`. 3. Wire the `chunks` output into a downstream `LLM` / `Message` / `Iteration` node — chunks are available as plain text without any embedding or KB write. Co-authored-by: John Baillie <johnbaillie2007@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-06-27 20:52:58 -07:00
#
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""Unit tests for the PipelineChunker agent component (#14773).
These tests cover only the pieces that don't require a live Canvas/Graph:
parameter validation and the parser-id -> module lookup table. Full
end-to-end behavior is intentionally left to higher-level integration tests.
"""
from __future__ import annotations
import sys
from importlib import import_module, reload
from unittest.mock import MagicMock
import pytest
pytestmark = pytest.mark.p2
# The component pulls in api.db.services.file_service (-> quart_auth, peewee,
# the entire backend stack) and rag.app.* (-> deepdoc, OCR, xgboost,
# transformers). None of that is exercised by these unit tests, so replace
# the heavy modules with stubs to keep the tests runnable without the full
# runtime environment. We record the prior sys.modules entry for every key we
# install and restore it when the module-scoped fixture finalizes, so the
# stubs don't leak into other test files.
@pytest.fixture(scope="module")
def pipeline_chunker_module():
    """Import pipeline_chunker with rag.app parser modules stubbed locally."""
    stubbed_names = [
        "api.db.services.file_service",
        "deepdoc.vision.ocr",
        "deepdoc.parser.figure_parser",
        "rag.app.picture",
        "rag.app.audio",
        "rag.app.resume",
        "rag.app.naive",
        "rag.app.paper",
        "rag.app.book",
        "rag.app.presentation",
        "rag.app.manual",
        "rag.app.laws",
        "rag.app.qa",
        "rag.app.table",
        "rag.app.one",
        "rag.app.email",
        "rag.app.tag",
    ]
    original_modules = {name: sys.modules.get(name) for name in stubbed_names}
    file_service_stub = MagicMock()
    file_service_stub.FileService = MagicMock()
    try:
        sys.modules["api.db.services.file_service"] = file_service_stub
        for name in stubbed_names[1:]:
            stub = MagicMock()
            stub.chunk = MagicMock(return_value=[{"content_with_weight": "stub"}])
            sys.modules[name] = stub
        module = import_module("agent.component.pipeline_chunker")
        module = reload(module)
        yield module
    finally:
        for name, original in original_modules.items():
            if original is None:
                sys.modules.pop(name, None)
            else:
                sys.modules[name] = original
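# Illustrative sketch only (not part of the test suite): the save / stub /
# restore pattern used by the fixture above can be isolated as a small context
# manager. The helper name `stubbed_modules` and the module name used in the
# usage note are hypothetical; only `sys.modules`, `contextlib.contextmanager`
# and `unittest.mock.MagicMock` are real standard-library APIs.

```python
import sys
from contextlib import contextmanager
from unittest.mock import MagicMock


@contextmanager
def stubbed_modules(names):
    """Temporarily replace the named sys.modules entries with MagicMock stubs."""
    # Remember what was there before (None means "not imported yet").
    saved = {name: sys.modules.get(name) for name in names}
    try:
        for name in names:
            sys.modules[name] = MagicMock()
        # Hand the installed stubs to the caller for configuration/inspection.
        yield {name: sys.modules[name] for name in names}
    finally:
        # Restore the prior state so the stubs never leak to other tests.
        for name, original in saved.items():
            if original is None:
                sys.modules.pop(name, None)
            else:
                sys.modules[name] = original
```

# Usage, under the same assumptions: inside
# `with stubbed_modules(["hypothetical_heavy_pkg"]) as stubs:` an
# `import hypothetical_heavy_pkg` resolves to the stub, and once the block
# exits the name is absent from `sys.modules` again.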
class TestPipelineChunkerParam:
    """Validate parameter parsing and the strategy whitelist."""
    def test_default_param_validates(self, pipeline_chunker_module):
        """A freshly constructed param object should pass ``check()``."""
        p = pipeline_chunker_module.PipelineChunkerParam()
        assert p.check() is True
    def test_accepts_each_known_parser(self, pipeline_chunker_module):
        """Every parser id in the lookup table must validate."""
        for parser_id in pipeline_chunker_module._PARSER_MODULES:
            p = pipeline_chunker_module.PipelineChunkerParam()
            p.parser_id = parser_id
            assert p.check() is True

    def test_rejects_unknown_parser(self, pipeline_chunker_module):
        """Unknown parser ids must raise ``ValueError`` at validation time."""
        p = pipeline_chunker_module.PipelineChunkerParam()
        p.parser_id = "nonsense-parser"
        with pytest.raises(ValueError):
            p.check()

    def test_rejects_non_dict_parser_config(self, pipeline_chunker_module):
        """``parser_config`` must be a dict; anything else must raise."""
        p = pipeline_chunker_module.PipelineChunkerParam()
        p.parser_config = "not a dict"
        with pytest.raises(ValueError):
            p.check()

    def test_rejects_negative_pages(self, pipeline_chunker_module):
        """Negative page indices must raise ``ValueError``."""
        p = pipeline_chunker_module.PipelineChunkerParam()
        p.from_page = -1
        with pytest.raises(ValueError):
            p.check()

    def test_rejects_inverted_page_range(self, pipeline_chunker_module):
        """``from_page`` greater than ``to_page`` must raise ``ValueError``."""
        p = pipeline_chunker_module.PipelineChunkerParam()
        p.from_page = 10
        p.to_page = 5
        with pytest.raises(ValueError, match="from_page must be <= to_page"):
            p.check()


class TestLoadChunker:
    """Verify the lazy parser-id -> chunker callable resolver."""

    def test_load_chunker_returns_callable_for_each_known_parser(self, pipeline_chunker_module):
        """Every known parser id should resolve to a callable ``chunk`` function."""
        for parser_id in pipeline_chunker_module._PARSER_MODULES:
            chunker = pipeline_chunker_module._load_chunker(parser_id)
            assert callable(chunker)

    def test_load_chunker_raises_for_unknown_parser(self, pipeline_chunker_module):
feat(agent): add Pipeline chunker component for pre-chunking workflows (#14773) (#15068) ### What problem does this PR solve? Closes #14773. Today, Pipeline (`rag/flow/`) chunking strategies only run as part of a dataset ingestion that always embeds and indexes the result. There is no way to drive Pipeline-style chunking from an Agent workflow without paying that vectorization/persistence cost. This PR adds a single new Agent component, `PipelineChunker`, that: - Takes one or more file references (from `Begin` / `UserFillUp` uploads) as input. - Runs the existing `rag.app.*` chunking strategies (`naive`, `paper`, `qa`, `manual`, `book`, `presentation`, `laws`, `table`, `one`, `email`, `picture`, `audio`, `resume`, `tag`) against each file. - Emits the resulting chunks as `chunks: list[str]` and `chunks_full: list[dict]` for downstream Agent nodes. - Performs **no embedding and no persistence** — chunks live only in canvas variables for the duration of the run, exactly as requested in the issue. The component is auto-discovered by `agent/component/__init__.py`; no registry edits required. Chunker functions are imported lazily so the component itself does not pull `deepdoc` / OCR / VLM at component-discovery time. File resolution mirrors the existing `ExcelProcessor` convention. Out of scope for this PR (potential follow-ups): - Vectorization / KB persistence (explicit ask in the issue). - Frontend canvas UI for the new component. - Bridging to the newer Pydantic-based `rag/flow/chunker/TokenChunker` (consumes a parser node's structured output rather than a raw file — a separate, larger feature). 
### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --- ## Files changed - `agent/component/pipeline_chunker.py` — new component (~180 lines) - `test/unit_test/agent/test_pipeline_chunker.py` — unit tests (~120 lines) ## Test plan - [x] `ruff check` on changed files — clean. - [x] `ruff format` applied to the new component file. - [x] `python -m py_compile` on both new files — both compile. - [x] New unit test file carries `pytestmark = pytest.mark.p2` so it runs under marker-filtered CI. - [x] Every new function, method, and class has a docstring (CodeRabbit 80% docstring-coverage gate). - [x] `python -m pytest test/unit_test/agent/test_pipeline_chunker.py -x -q` — **7 passed in 1.95s** locally. Tests stub `api.db.services.file_service` and `rag.app.*` so they exercise the parameter validation and parser-id lookup table without requiring the full backend / model stack. ## Manual integration plan (post-merge) 1. Drop the component into an Agent canvas after a `Begin` node with a file input. 2. Set `parser_id = "naive"` (or any other strategy) and reference the file input in `inputs`. 3. Wire the `chunks` output into a downstream `LLM` / `Message` / `Iteration` node — chunks are available as plain text without any embedding or KB write. Co-authored-by: John Baillie <johnbaillie2007@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
    """Unknown parser ids should raise ``KeyError`` from the lookup."""
    with pytest.raises(KeyError):
        pipeline_chunker_module._load_chunker("not-a-real-parser")