mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-06-29 23:41:12 +08:00
### What problem does this PR solve?

Closes #14773.

Today, Pipeline (`rag/flow/`) chunking strategies only run as part of a dataset ingestion that always embeds and indexes the result. There is no way to drive Pipeline-style chunking from an Agent workflow without paying that vectorization/persistence cost.

This PR adds a single new Agent component, `PipelineChunker`, that:

- Takes one or more file references (from `Begin` / `UserFillUp` uploads) as input.
- Runs the existing `rag.app.*` chunking strategies (`naive`, `paper`, `qa`, `manual`, `book`, `presentation`, `laws`, `table`, `one`, `email`, `picture`, `audio`, `resume`, `tag`) against each file.
- Emits the resulting chunks as `chunks: list[str]` and `chunks_full: list[dict]` for downstream Agent nodes.
- Performs **no embedding and no persistence**: chunks live only in canvas variables for the duration of the run, exactly as requested in the issue.

The component is auto-discovered by `agent/component/__init__.py`; no registry edits required. Chunker functions are imported lazily so the component itself does not pull `deepdoc` / OCR / VLM at component-discovery time. File resolution mirrors the existing `ExcelProcessor` convention.

Out of scope for this PR (potential follow-ups):

- Vectorization / KB persistence (explicit ask in the issue).
- Frontend canvas UI for the new component.
- Bridging to the newer Pydantic-based `rag/flow/chunker/TokenChunker` (consumes a parser node's structured output rather than a raw file, which is a separate, larger feature).
### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---

## Files changed

- `agent/component/pipeline_chunker.py`: new component (~180 lines)
- `test/unit_test/agent/test_pipeline_chunker.py`: unit tests (~120 lines)

## Test plan

- [x] `ruff check` on changed files: clean.
- [x] `ruff format` applied to the new component file.
- [x] `python -m py_compile` on both new files: both compile.
- [x] New unit test file carries `pytestmark = pytest.mark.p2` so it runs under marker-filtered CI.
- [x] Every new function, method, and class has a docstring (CodeRabbit 80% docstring-coverage gate).
- [x] `python -m pytest test/unit_test/agent/test_pipeline_chunker.py -x -q`: **7 passed in 1.95s** locally.

Tests stub `api.db.services.file_service` and `rag.app.*` so they exercise the parameter validation and parser-id lookup table without requiring the full backend / model stack.

## Manual integration plan (post-merge)

1. Drop the component into an Agent canvas after a `Begin` node with a file input.
2. Set `parser_id = "naive"` (or any other strategy) and reference the file input in `inputs`.
3. Wire the `chunks` output into a downstream `LLM` / `Message` / `Iteration` node. Chunks are available as plain text without any embedding or KB write.

Co-authored-by: John Baillie <johnbaillie2007@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>