mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-06-29 15:31:05 +08:00
## Summary
Decomposes the monolithic `task_executor.py` (1945 lines) into a 6-layer
architecture with clear separation of concerns. The refactored code is
functionally equivalent to the original, verified through 400 passing
tests and a production-vs-dry-run comparison framework.
## Architecture
```
entry (task_manager)
└─ orchestration (task_handler)
├─ services (chunk_service, embedding_service, dataflow_service, raptor_service, post_processor)
│ └─ utilities (chunk_builder, chunk_post_processor, embedding_utils)
└─ infrastructure (task_context, recording_context, interceptor)
```
Key design decisions:
- **TaskContext**: typed facade over the raw task dict; injects rate
limiters and callbacks via composition
- **RecordingContext + Comparator**: enables side-by-side production vs
dry-run execution for safe migration
- **NullRecordingContext**: zero-allocation no-op for production, uses
`__slots__`
- **WriteOperationInterceptor**: FIFO replay of the previous run's
function return values in comparison mode
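The recording and replay pieces can be illustrated with a minimal sketch. The class names follow the list above, but every method name and signature here (`record`, `record_return`, `replay`) is an assumption made for illustration, not the code in this change.

```python
from collections import deque
from typing import Any, Deque, Dict


class RecordingContext:
    """Collects named intermediate values and function return values."""

    def __init__(self) -> None:
        self.values: Dict[str, Any] = {}
        self.returns: Dict[str, Deque[Any]] = {}

    def record(self, key: str, value: Any) -> None:
        self.values[key] = value

    def record_return(self, func_name: str, value: Any) -> None:
        self.returns.setdefault(func_name, deque()).append(value)


class NullRecordingContext:
    """No-op stand-in: empty __slots__ means instances carry no __dict__."""

    __slots__ = ()

    def record(self, key: str, value: Any) -> None:
        pass

    def record_return(self, func_name: str, value: Any) -> None:
        pass


class WriteOperationInterceptor:
    """Replays recorded return values in FIFO order instead of writing."""

    def __init__(self, recorded: Dict[str, Deque[Any]]) -> None:
        # Copy the queues so replay does not consume the original recording.
        self._queues = {name: deque(q) for name, q in recorded.items()}

    def replay(self, func_name: str) -> Any:
        queue = self._queues.get(func_name)
        if not queue:
            raise LookupError(f"no recorded return left for {func_name!r}")
        return queue.popleft()


rec = RecordingContext()
rec.record_return("insert_chunks", True)   # hypothetical write operation
rec.record_return("insert_chunks", False)

interceptor = WriteOperationInterceptor(rec.returns)
first = interceptor.replay("insert_chunks")
second = interceptor.replay("insert_chunks")
```

Replaying from a per-function queue keeps the dry run from touching storage while still feeding it the same return values, in the same order, that the recorded run observed.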
## Migration Strategy
The original `handle_task()` in `task_executor.py` uses a 3-way switch
via `TE_RUN_MODE`:
- `TE_RUN_MODE=0` (default) → runs refactored code
- `TE_RUN_MODE=1` → runs both original + refactored, compares all
intermediate results
- `TE_RUN_MODE=2` → runs original code (fallback)
The comparison mode (`TE_RUN_MODE=1`) records ~40 intermediate values
(chunks, vectors, token counts, func return values) from the production
run and replays them during dry-run, then uses `ContextComparator` to
report mismatches.
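A rough sketch of the three-way switch follows. Only `TE_RUN_MODE` and its three values come from the description above. The helper names, the idea that each code path returns a mapping of intermediate values, and the choice of the original path as the "production" run are assumptions for illustration.

```python
import os
from enum import IntEnum
from typing import Any, Callable, List, Mapping


class RunMode(IntEnum):
    REFACTORED = 0  # default: refactored code only
    COMPARE = 1     # run both paths and compare intermediate results
    ORIGINAL = 2    # fallback: original code only


def read_run_mode(env: Mapping[str, str] = os.environ) -> RunMode:
    """Parse TE_RUN_MODE, defaulting to 0 when it is unset."""
    return RunMode(int(env.get("TE_RUN_MODE", "0")))


def compare_contexts(prod: Mapping[str, Any], dry: Mapping[str, Any]) -> List[str]:
    """Return the sorted keys whose values differ or exist on one side only."""
    missing = object()
    return sorted(
        key for key in prod.keys() | dry.keys()
        if prod.get(key, missing) != dry.get(key, missing)
    )


Handler = Callable[[dict], Mapping[str, Any]]


def handle_task(task: dict, original: Handler, refactored: Handler,
                mode: RunMode) -> Mapping[str, Any]:
    if mode is RunMode.REFACTORED:
        return refactored(task)
    if mode is RunMode.ORIGINAL:
        return original(task)
    produced = original(task)    # assumed to be the production run
    dry_run = refactored(task)   # assumed to be the dry run
    for key in compare_contexts(produced, dry_run):
        print(f"mismatch in {key!r}: {produced.get(key)!r} != {dry_run.get(key)!r}")
    return produced
```

In comparison mode the sketch returns the production result and only reports differences, so a mismatch never changes what the task produces.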
## Functional Equivalence Fixes
All divergences between original and refactored code were identified and
fixed:
- Timeout decorators (handle/build_chunks/raptor/embedding)
- NullRecordingContext leak in finally block causing RuntimeError
- MinIO None-binary check with proper FileNotFoundError
- Dataflow dispatch after embedding binding + init_kb
- Memory task missing return after processing
- RAPTOR checkpoint progress reporting
- Tag cache (get_tags_from_cache/set_tags_to_cache) restoration
- dataflow_id correction in _load_dsl
- Language default Chinese, dead code guard removal
- embed_chunks made async with proper thread_pool_exec
- Full GraphRAG default configuration (10 parameters)
- Hardcoded q_768_vec fallback removal in RAPTOR
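Two of the fixes above (the None-binary check and the async embedding call with a timeout) can be sketched with the standard library alone. This is an illustrative sketch, not the project's code: `fetch_binary` and `encode_batch` are hypothetical helpers, a plain dict stands in for object storage, and `asyncio.to_thread` stands in for the project's own `thread_pool_exec`, whose signature is not shown here.

```python
import asyncio
from typing import Dict, List, Optional


def fetch_binary(storage: Dict[str, bytes], key: str) -> bytes:
    """Raise FileNotFoundError when storage yields None instead of bytes."""
    binary: Optional[bytes] = storage.get(key)
    if binary is None:
        raise FileNotFoundError(f"no binary found for {key!r}")
    return binary


def encode_batch(texts: List[str]) -> List[List[float]]:
    """Blocking placeholder for an embedding model call (hypothetical)."""
    return [[float(len(text))] for text in texts]


async def embed_chunks(texts: List[str], batch_size: int = 2,
                       timeout: float = 5.0) -> List[List[float]]:
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Run the blocking call in a worker thread so the event loop stays
        # responsive, and bound each batch with a timeout.
        vectors.extend(
            await asyncio.wait_for(asyncio.to_thread(encode_batch, batch), timeout)
        )
    return vectors


result = asyncio.run(embed_chunks(["a", "bb", "ccc"]))
```

Failing fast with `FileNotFoundError` gives callers a specific exception to handle, rather than an attribute or type error later when the missing binary is first used.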
## Test Changes
- 20 new tests covering table parser manual mode, tag cache, embedding
edge cases, RAPTOR checkpoint, dataflow_id correction, storage binary
None, cancel cleanup, metadata=None boundary
- Unified `make_task_context`/`make_task_dict` factories eliminated 10+
duplicated helpers
- DataflowService tests migrated from internal method mocks to IO
boundary mocks (real orchestration code executes)
- Parametrized duplicate build_chunks post-processor tests
- 7 raptor tests modernized to @pytest.mark.asyncio
- Mock count per test reduced through boundary-level mocking strategy
**Test count: 400 passing, 0 warnings, 0 skips**
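The shared-factory idea might look like the sketch below. The factory names match the bullet above, but the default fields and the `FakeTaskContext` wrapper are assumptions for illustration; the real helpers live in `conftest.py` and may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict


def make_task_dict(**overrides: Any) -> Dict[str, Any]:
    """Build a task dict with defaults; tests override only what they need."""
    task: Dict[str, Any] = {
        "id": "task-1",          # hypothetical default fields
        "doc_id": "doc-1",
        "parser_id": "naive",
        "language": "Chinese",
    }
    task.update(overrides)
    return task


@dataclass(frozen=True)
class FakeTaskContext:
    """Tiny typed facade over the raw dict, standing in for TaskContext."""

    raw: Dict[str, Any]

    @property
    def doc_id(self) -> str:
        return self.raw["doc_id"]

    @property
    def language(self) -> str:
        return self.raw.get("language", "Chinese")


def make_task_context(**overrides: Any) -> FakeTaskContext:
    return FakeTaskContext(make_task_dict(**overrides))


ctx = make_task_context(doc_id="doc-42")
```

Each test then states only the fields it cares about, which is what allows one pair of factories to replace many near-identical helpers.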
## Files Changed
| File | Change |
|------|--------|
| `rag/svr/task_executor.py` | +1 line (NullRecordingContext fix) |
| `rag/svr/task_executor_refactor/task_handler.py` | Orchestration layer, 8 logic fixes |
| `rag/svr/task_executor_refactor/chunk_service.py` | +timeout + None-check |
| `rag/svr/task_executor_refactor/embedding_service.py` | sync→async rewrite |
| `rag/svr/task_executor_refactor/dataflow_service.py` | dataflow_id fix + timeout |
| `rag/svr/task_executor_refactor/raptor_service.py` | checkpoint fix + assert |
| `rag/svr/task_executor_refactor/chunk_post_processor.py` | tag cache restore |
| `rag/svr/task_executor_refactor/task_context.py` | language default fix |
| `test/.../conftest.py` | +294 lines shared helpers |
| `test/.../*.py` | 15 test files refactored, 20 new tests |
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It's a full-stack application with:
- Python backend (Quart-based async API server — Quart is the async reimplementation of Flask)
- React/TypeScript frontend (built with vitejs)
- Background task executor workers (separate Python processes, Redis-queue-driven)
- Peewee ORM for database models (not SQLAlchemy)
- Multiple data stores (MySQL/PostgreSQL, Elasticsearch/Infinity/OpenSearch/OceanBase, Redis, MinIO)
Architecture
Runtime Architecture
RAGFlow runs as two separate Python process types, orchestrated by docker/launch_backend_service.sh:
- API Server (`api/ragflow_server.py`): Quart-based async HTTP server
- Task Executors (`rag/svr/task_executor.py`): Background workers processing documents from Redis streams. Multiple instances run in parallel (controlled by the `WS` env var). Each consumes from priority-ordered Redis streams (`te.1.common`, `te.0.common`), using consumer groups for load distribution.
Key consequence: task executors import a different code surface than the API server, so always check which process a module is meant for.
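Priority-ordered consumption can be pictured with an in-memory stand-in for the Redis streams. The sketch assumes the streams are polled in the order listed above (`te.1.common` before `te.0.common`). It uses plain deques instead of Redis, and `next_task` is a hypothetical helper, not the executor's code.

```python
from collections import deque
from typing import Deque, Dict, Optional, Sequence, Tuple


def next_task(streams: Dict[str, Deque[str]],
              priority_order: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return (stream, message) from the first non-empty stream, else None."""
    for name in priority_order:
        queue = streams.get(name)
        if queue:
            return name, queue.popleft()
    return None


streams = {
    "te.1.common": deque(["urgent-a"]),
    "te.0.common": deque(["normal-a", "normal-b"]),
}
order = ["te.1.common", "te.0.common"]  # assumed polling order

consumed = []
while (item := next_task(streams, order)) is not None:
    consumed.append(item)
```

A worker that always drains the higher-priority stream first will delay lower-priority work while urgent tasks keep arriving, which is the usual trade-off of strict priority polling.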
Backend API (/api/)
- App factory (`api/apps/__init__.py`): creates the Quart app, configures auth (`login_required` decorator, JWT + API token + session fallback), and dynamically discovers/registers blueprints
- Coexisting API patterns:
  - RESTful APIs in `api/apps/restful_apis/`: newer pattern with Pydantic request validation, service layer in `api/apps/services/`, routes registered under `/api/v1`
  - Legacy APIs in `api/apps/*_app.py`: older pattern using `@validate_request()`, routes registered under `/v1/<page_name>`
  - SDK APIs in `api/apps/sdk/`: registered under `/v1/`
- Services: `api/db/services/` holds business logic wrapping Peewee model operations; `api/apps/services/` is the service layer for the RESTful APIs
- Models (`api/db/db_models.py`): Peewee ORM models with pooled MySQL/PostgreSQL connections, custom `JSONField`/`ListField` types, retry logic on connection loss
Core Processing (/rag/)
- Document ingestion pipeline (`rag/flow/pipeline.py`): `Pipeline` (extends `agent.canvas.Graph`) orchestrates the ingestion DAG. Components: File (fetches binary from storage), Parser (dispatches to `deepdoc.parser` based on file type), TokenChunker/TitleChunker (splits into chunks), Tokenizer (computes full-text tokens + embedding vectors), Extractor (LLM-based extraction). Data flows via Pydantic `*FromUpstream` schemas.
- Document parsing (`deepdoc/`): PDF parsing (vision-based OCR, layout analysis, table structure recognition) and format-specific parsers (DOCX, XLSX, PPT, Markdown, HTML, images). All parsers normalize to a common structure (list of bbox dicts for PDFs, `{text, doc_type_kwd}` for others).
- LLM Integration (`rag/llm/`): factory pattern with runtime class discovery. `chat_model.py` (30+ providers via OpenAI SDK and LiteLLM wrappers), `embedding_model.py`, `rerank_model.py`, `cv_model.py` (image-to-text), `sequence2txt_model.py` (ASR), `tts_model.py`. Use `LLMBundle` (from `api.db.services.llm_service`) as the unified interface.
- Graph RAG (`rag/graphrag/`): multi-phase pipeline: per-document subgraph extraction (LLM or spaCy NER), Leiden community detection, entity resolution, community summarization. Entities/relations/reports are indexed as chunks alongside regular text chunks, differentiated by `knowledge_graph_kwd`.
- Search (`rag/nlp/search.py`): the `Dealer` class combines vector similarity + BM25 + re-ranking. `KGSearch` extends it for graph-aware retrieval (entity resolution, n-hop enrichment).
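The "factory pattern with runtime class discovery" mentioned for the LLM integration can be sketched generically. This is one common way to build such a registry, using `__init_subclass__`. The class names, the `factory_name` attribute, and `create_chat_model` are all hypothetical and are not taken from `rag/llm/`.

```python
from typing import Dict, Type


class ChatBase:
    """Base class; subclasses register themselves under factory_name."""

    registry: Dict[str, Type["ChatBase"]] = {}
    factory_name: str = ""

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if cls.factory_name:
            ChatBase.registry[cls.factory_name] = cls

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def chat(self, prompt: str) -> str:
        raise NotImplementedError


class EchoChat(ChatBase):
    factory_name = "Echo"

    def chat(self, prompt: str) -> str:
        return f"[{self.model_name}] {prompt}"


class UpperChat(ChatBase):
    factory_name = "Upper"

    def chat(self, prompt: str) -> str:
        return prompt.upper()


def create_chat_model(factory: str, model_name: str) -> ChatBase:
    """Look up the provider class by name and instantiate it."""
    try:
        cls = ChatBase.registry[factory]
    except KeyError:
        raise ValueError(f"unknown factory {factory!r}") from None
    return cls(model_name)


reply = create_chat_model("Echo", "demo").chat("hi")
```

With a registry like this, adding a provider means adding a class, and callers keep selecting providers by name without a central `if`/`elif` chain.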
Agent System (/agent/)
- Execution engine (`agent/canvas.py`): `Canvas` (extends `Graph`) executes the DAG. Components are run in topological order via `_run_batch`, each receiving upstream outputs as kwargs. Control-flow components (`Categorize`, `Switch`, `Iteration`, `Loop`) dynamically modify the execution path.
- Component base (`agent/component/base.py`): `ComponentBase` with `invoke(**kwargs)` / `invoke_async(**kwargs)` lifecycle. Variable references (`{component_id@output_var}` or `{sys.query}`) are resolved from the canvas graph at runtime.
- Components: Modular workflow components in `agent/component/`: Begin, LLM, Agent (tool-calling LLM), Categorize, Switch, Iteration, Loop, Message, Invoke (HTTP), and data manipulation nodes. Auto-discovered by `__init__.py`.
- Templates: Pre-built agent workflows as JSON DSL files in `agent/templates/`. Each contains a complete `components` DAG, `path`, and `globals`.
- Tools (`agent/tools/`): Retrieval, web search (DuckDuckGo, Google, Tavily, SearXNG), academic search (ArXiv, PubMed, Google Scholar, Wikipedia), code execution, SQL execution, email, GitHub, finance data, translation, weather. Tools implement `ToolBase` (extends `ComponentBase`) and produce OpenAI-compatible function descriptors.
- Plugins (`agent/plugin/`): plugin system using `pluginlib` for loading external LLM tool plugins from `embedded_plugins/`.
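Topological execution and variable-reference resolution can be sketched with the standard library. The example below is a simplified illustration, not the `Canvas` implementation: it uses `graphlib.TopologicalSorter`, assumes component IDs are plain word characters, handles only the `{component_id@output_var}` form, and omits control-flow components.

```python
import re
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List

# Matches references of the form {component_id@output_var} (simplified IDs).
REF = re.compile(r"\{(\w+)@(\w+)\}")


def resolve(template: str, outputs: Dict[str, Dict[str, Any]]) -> str:
    """Substitute each reference with the named component output."""
    return REF.sub(lambda m: str(outputs[m.group(1)][m.group(2)]), template)


def run_dag(components: Dict[str, Callable[..., Dict[str, Any]]],
            upstream: Dict[str, List[str]]) -> Dict[str, Dict[str, Any]]:
    """Run components so that every upstream finishes before its dependents."""
    outputs: Dict[str, Dict[str, Any]] = {}
    for cid in TopologicalSorter(upstream).static_order():
        # Each component receives its upstream outputs as keyword arguments.
        kwargs = {dep: outputs[dep] for dep in upstream.get(cid, ())}
        outputs[cid] = components[cid](**kwargs)
    return outputs


components = {
    "begin": lambda: {"query": "hello"},
    "llm": lambda begin: {"content": begin["query"].upper()},
    "message": lambda llm: {"text": "answer: " + llm["content"]},
}
upstream = {"begin": [], "llm": ["begin"], "message": ["llm"]}

outputs = run_dag(components, upstream)
rendered = resolve("LLM said {llm@content}", outputs)
```

`TopologicalSorter` takes a mapping from each node to its predecessors, so the `upstream` dict doubles as both the ordering constraint and the list of kwargs to pass.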
Frontend (/web/)
- React/TypeScript with vitejs framework
- shadcn/ui components (Radix UI primitives + Tailwind CSS)
- `@tanstack/react-query` for server state (cache keys, mutations, invalidation)
- Zustand for local state (primarily agent canvas graph store)
- `react-router` v7 with lazy-loaded pages
- `react-i18next` for i18n (17 languages)
- Axios for HTTP with a layered pattern: endpoint definitions (`utils/api.ts`) → HTTP client (`utils/next-request.ts`) → service layer (`services/`) → query hooks (`hooks/use-*-request.ts`) → components
- `@xyflow/react` for the agent workflow canvas
- `react-hook-form` + `zod` for form validation
- Two API proxy prefixes: `webAPI = '/v1'` (legacy) and `restAPIv1 = '/api/v1'` (RESTful)
Common Development Commands
Backend Development
# Install Python dependencies
uv sync --python 3.13 --all-extras
uv run python3 download_deps.py
pre-commit install
# Start dependent services
docker compose -f docker/docker-compose-base.yml up -d
# Run backend (requires services to be running)
source .venv/bin/activate
export PYTHONPATH=$(pwd)
bash docker/launch_backend_service.sh
# Run tests
uv run pytest
# Linting
ruff check
ruff format
Frontend Development
cd web
npm install
npm run dev # Development server
npm run build # Production build
npm run lint # ESLint
npm run test # Jest tests
Docker Development
# Full stack with Docker
cd docker
docker compose -f docker-compose.yml up -d
# Check server status
docker logs -f ragflow-server
# Rebuild images
docker build --platform linux/amd64 -f Dockerfile -t infiniflow/ragflow:nightly .
Key Configuration Files
- `docker/.env` - Environment variables for Docker deployment
- `docker/service_conf.yaml.template` - Backend service configuration
- `pyproject.toml` - Python dependencies and project configuration
- `web/package.json` - Frontend dependencies and scripts
Testing
- Python: pytest with markers (p1/p2/p3 priority levels)
- Frontend: Jest with React Testing Library
- API Tests: HTTP API and SDK tests in `test/` and `sdk/python/test/`
Database Engines
RAGFlow supports switching between Elasticsearch (default) and Infinity:
- Set `DOC_ENGINE=infinity` in `docker/.env` to use Infinity
- Requires container restart: `docker compose down -v && docker compose up -d`
Development Environment Requirements
- Python 3.10-3.13
- Node.js >=18.20.4
- Docker & Docker Compose
- uv package manager
- 16GB+ RAM, 50GB+ disk space
- Think before acting. Read existing files before writing code.
- Be concise in output but thorough in reasoning.
- Prefer editing over rewriting whole files.
- Do not re-read files you have already read.
- Test your code before declaring done.
- No sycophantic openers or closing fluff.
- Keep solutions simple and direct.
- User instructions always override this file.