ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Files

Khaostica f57f3b4b3a feat(agent): add Pipeline chunker component for pre-chunking workflows (#14773 ) (#15068 )

### What problem does this PR solve?

Closes #14773.

Today, Pipeline (`rag/flow/`) chunking strategies only run as part of a
dataset ingestion that always embeds and indexes the result. There is no
way to drive Pipeline-style chunking from an Agent workflow without
paying that vectorization/persistence cost.

This PR adds a single new Agent component, `PipelineChunker`, that:

- Takes one or more file references (from `Begin` / `UserFillUp`
uploads) as input.
- Runs the existing `rag.app.*` chunking strategies (`naive`, `paper`,
`qa`, `manual`, `book`, `presentation`, `laws`, `table`, `one`, `email`,
`picture`, `audio`, `resume`, `tag`) against each file.
- Emits the resulting chunks as `chunks: list[str]` and `chunks_full:
list[dict]` for downstream Agent nodes.
- Performs **no embedding and no persistence** — chunks live only in
canvas variables for the duration of the run, exactly as requested in
the issue.

The component is auto-discovered by `agent/component/__init__.py`; no
registry edits required. Chunker functions are imported lazily so the
component itself does not pull `deepdoc` / OCR / VLM at
component-discovery time. File resolution mirrors the existing
`ExcelProcessor` convention.

Out of scope for this PR (potential follow-ups):

- Vectorization / KB persistence (explicit ask in the issue).
- Frontend canvas UI for the new component.
- Bridging to the newer Pydantic-based `rag/flow/chunker/TokenChunker`
(consumes a parser node's structured output rather than a raw file — a
separate, larger feature).

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

---

## Files changed

- `agent/component/pipeline_chunker.py` — new component (~180 lines)
- `test/unit_test/agent/test_pipeline_chunker.py` — unit tests (~120
lines)

## Test plan

- [x] `ruff check` on changed files — clean.
- [x] `ruff format` applied to the new component file.
- [x] `python -m py_compile` on both new files — both compile.
- [x] New unit test file carries `pytestmark = pytest.mark.p2` so it
runs under marker-filtered CI.
- [x] Every new function, method, and class has a docstring (CodeRabbit
80% docstring-coverage gate).
- [x] `python -m pytest test/unit_test/agent/test_pipeline_chunker.py -x
-q` — **7 passed in 1.95s** locally. Tests stub
`api.db.services.file_service` and `rag.app.*` so they exercise the
parameter validation and parser-id lookup table without requiring the
full backend / model stack.

## Manual integration plan (post-merge)

1. Drop the component into an Agent canvas after a `Begin` node with a
file input.
2. Set `parser_id = "naive"` (or any other strategy) and reference the
file input in `inputs`.
3. Wire the `chunks` output into a downstream `LLM` / `Message` /
`Iteration` node — chunks are available as plain text without any
embedding or KB write.

Co-authored-by: John Baillie <johnbaillie2007@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>

2026-06-29 09:45:16 +08:00

benchmark

Fix: replace tenant_llm apis (#16131 )

2026-06-18 16:38:32 +08:00

fixtures/mineru

fix(mineru): skip page chrome blocks to prevent duplicate chunks (#15387 )

2026-06-01 20:15:04 +08:00

playwright

Feature: Allow page_size max value 100 (#15292 )

2026-05-28 11:13:01 +08:00

testcases

Harden closed-advisory fixes (#16409 )

2026-06-29 09:45:16 +08:00

unit_test

feat(agent): add Pipeline chunker component for pre-chunking workflows (#14773 ) (#15068 )

2026-06-29 09:45:16 +08:00

__init__.py

Feat: UI testing automation with playwright (#12749 )

2026-03-02 13:04:08 +08:00

README.md

Docs: Update version references to v0.26.2 in READMEs and docs (#16387 )

2026-06-29 09:45:16 +08:00

README.md

(1). Deploy RAGFlow services and images

https://ragflow.io/docs/build_docker_image

(2). Configure the required environment for testing

Install Python dependencies (including test dependencies):

uv sync --python 3.13 --only-group test --no-default-groups --frozen

Activate the environment:

source .venv/bin/activate

Install SDK:

uv pip install sdk/python

Modify the .env file: Add the following code:

COMPOSE_PROFILES=${COMPOSE_PROFILES},tei-cpu
TEI_MODEL=BAAI/bge-small-en-v1.5
RAGFLOW_IMAGE=infiniflow/ragflow:v0.26.2 #Replace with the image you are using

Start the container（wait two minutes）:

docker compose -f docker/docker-compose.yml up -d

(3). Test Elasticsearch

a) Run sdk tests against Elasticsearch:

export HTTP_API_TEST_LEVEL=p2
export HOST_ADDRESS=http://127.0.0.1:9380  # Ensure that this port is the API port mapped to your localhost
pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_sdk_api

b) Run http api tests against Elasticsearch:

pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_http_api

(4). Test Infinity

Modify the .env file:

DOC_ENGINE=${DOC_ENGINE:-infinity}

Start the container:

docker compose -f docker/docker-compose.yml down -v 
docker compose -f docker/docker-compose.yml up -d

a) Run sdk tests against Infinity:

DOC_ENGINE=infinity pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_sdk_api

b) Run http api tests against Infinity:

DOC_ENGINE=infinity pytest -s --tb=short --level=${HTTP_API_TEST_LEVEL} test/testcases/test_http_api

README.md Unescape Escape

(1). Deploy RAGFlow services and images

(2). Configure the required environment for testing

(3). Test Elasticsearch

(4). Test Infinity

README.md