ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-01 00:05:43 +08:00

Author	SHA1	Message	Date
Jack	b363146997	refactor: overhaul task executor with layered architecture and comprehensive test suite (#15471 ) ## Summary Decomposes the monolithic `task_executor.py` (1945 lines) into a 6-layer architecture with clear separation of concerns. The refactored code is functionally equivalent to the original, verified through 400 passing tests and a production-vs-dry-run comparison framework. ## Architecture ``` entry (task_manager) └─ orchestration (task_handler) ├─ services (chunk_service, embedding_service, dataflow_service, raptor_service, post_processor) │ └─ utilities (chunk_builder, chunk_post_processor, embedding_utils) └─ infrastructure (task_context, recording_context, interceptor) ``` Key design decisions: - TaskContext — typed facade over raw task dict, injects rate limiters + callbacks via composition - RecordingContext + Comparator — enables side-by-side production vs dry-run execution for safe migration - NullRecordingContext — zero-allocation no-op for production, uses `__slots__` - WriteOperationInterceptor — FIFO replay of previous runs function returns for comparison mode ## Migration Strategy The original `handle_task()` in `task_executor.py` uses a 3-way switch via `TE_RUN_MODE`: - `TE_RUN_MODE=0` (default) → runs refactored code - `TE_RUN_MODE=1` → runs both original + refactored, compares all intermediate results - `TE_RUN_MODE=2` → runs original code (fallback) The comparison mode (`TE_RUN_MODE=1`) records ~40 intermediate values (chunks, vectors, token counts, func return values) from the production run and replays them during dry-run, then uses `ContextComparator` to report mismatches. 
## Functional Equivalence Fixes All divergences between original and refactored code were identified and fixed: - Timeout decorators (handle/build_chunks/raptor/embedding) - NullRecordingContext leak in finally block causing RuntimeError - MinIO None-binary check with proper FileNotFoundError - Dataflow dispatch after embedding binding + init_kb - Memory task missing return after processing - RAPTOR checkpoint progress reporting - Tag cache (get_tags_from_cache/set_tags_to_cache) restoration - dataflow_id correction in _load_dsl - Language default Chinese, dead code guard removal - embed_chunks made async with proper thread_pool_exec - Full GraphRAG default configuration (10 parameters) - Hardcoded q_768_vec fallback removal in RAPTOR ## Test Changes - 20 new tests covering table parser manual mode, tag cache, embedding edge cases, RAPTOR checkpoint, dataflow_id correction, storage binary None, cancel cleanup, metadata=None boundary - Unified `make_task_context`/`make_task_dict` factories eliminated 10+ duplicated helpers - DataflowService tests migrated from internal method mocks to IO boundary mocks (real orchestration code executes) - Parametrized duplicate build_chunks post-processor tests - 7 raptor tests modernized to @pytest.mark.asyncio - Mock count per test reduced through boundary-level mocking strategy Test count: 400 passing, 0 warnings, 0 skips ## Files Changed \| File \| Change \| \|------\|--------\| \| `rag/svr/task_executor.py` \| +1 line (NullRecordingContext fix) \| \| `rag/svr/task_executor_refactor/task_handler.py` \| Orchestration layer, 8 logic fixes \| \| `rag/svr/task_executor_refactor/chunk_service.py` \| +timeout + None-check \| \| `rag/svr/task_executor_refactor/embedding_service.py` \| sync→async rewrite \| \| `rag/svr/task_executor_refactor/dataflow_service.py` \| dataflow_id fix + timeout \| \| `rag/svr/task_executor_refactor/raptor_service.py` \| checkpoint fix + assert \| \| `rag/svr/task_executor_refactor/chunk_post_processor.py` 
\| tag cache restore \| \| `rag/svr/task_executor_refactor/task_context.py` \| language default fix \| \| `test/.../conftest.py` \| +294 lines shared helpers \| \| `test/.../*.py` \| 15 test files refactored, 20 new tests \| --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 17:18:31 +08:00
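The three-way `TE_RUN_MODE` switch described in #15471 can be pictured with a minimal sketch. Only the variable name and the meaning of the values 0, 1, and 2 come from the commit message. The functions `run_refactored`, `run_original`, and `compare` are hypothetical stand-ins, and treating the original code as the "production" run in comparison mode is an assumption.

```python
import os


def run_refactored(task: dict) -> dict:
    # Hypothetical stand-in for the refactored pipeline.
    return {"chunks": len(task.get("text", "").split())}


def run_original(task: dict) -> dict:
    # Hypothetical stand-in for the original monolithic pipeline.
    return {"chunks": len(task.get("text", "").split())}


def compare(expected: dict, actual: dict) -> list[str]:
    # Return the keys whose values differ between the two runs.
    keys = sorted(set(expected) | set(actual))
    return [k for k in keys if expected.get(k) != actual.get(k)]


def handle_task(task: dict, env=os.environ) -> dict:
    # TE_RUN_MODE: 0 = refactored (default), 1 = run both and compare, 2 = original.
    mode = int(env.get("TE_RUN_MODE", "0"))
    if mode == 0:
        return run_refactored(task)
    if mode == 2:
        return run_original(task)
    if mode == 1:
        production = run_original(task)  # assumption: original is the production run
        dry_run = run_refactored(task)
        mismatches = compare(production, dry_run)
        if mismatches:
            print(f"mismatched keys: {mismatches}")
        return production
    raise ValueError(f"unsupported TE_RUN_MODE: {mode}")
```

Passing the environment as a parameter keeps the sketch easy to exercise without mutating the real process environment.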
Lynn	dc4b82523b	Feat: tenant llm provider (#14595 ) ### What problem does this PR solve? Python implementation of the Go-based model_provider API suite. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: bill <yibie_jingnian@163.com>	2026-05-29 17:39:41 +08:00
Jack	f0cb7a544b	Refactor: Task Executor (#15154 ) ### What problem does this PR solve? 1. Break the huge function into smaller pieces 2. Add unit tests for the smaller functions 3. Layered design: a. infra layer - task_context.py, recording_context.py, write_operation_interceptor.py, ... b. service layer - *_service.py c. business layer - task_handler.py 4. Default behavior: use the refactored version; switch to the original version by changing an environment variable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement --------- Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-05-27 21:54:17 +08:00
Wang Qi	619b971785	Fix: empty file with better message (#15232 ) Fix: empty file with better message	2026-05-26 12:28:53 +08:00
Wang Qi	a9ec78cb9c	Refactor: enhance retry and timeout (#14983 ) ### What problem does this PR solve? 1. Enhance retry and timeout, and adjust the default timeout 2. NER: spacy do not batch chunks 3. Extract _has_cancel_and_exit 4. Enhance log messages ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-05-22 13:16:39 +08:00
buua436	04bdb41909	Fix: guard missing task language (#15136 ) ### What problem does this PR solve? guard missing task language ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-22 11:46:38 +08:00
Wang Qi	c5a46fda44	Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a different event loop (#15100 ) Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a different event loop	2026-05-21 19:23:41 +08:00
Wang Qi	13b422037f	Refactor: enhance graphrag - part 2 (#14972 ) ### What problem does this PR solve? 1. Expose batch_chunk_token_size for configuration 2. Retrieve chunks when building the subgraph for each doc, rather than retrieving all docs' chunks at the beginning 3. Get all chunks for a document; the limit used to be hard-coded to 10000 4. Delete the unused method run_graphrag ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring Follow on: #14617	2026-05-18 16:10:21 +08:00
shawnxiao105-afk	8b6dd6a5c2	fix: guard whitespace-only chunks before embedding (#13938 ) ## Problem When parsing DOCX files with many tables, DeepDOC generates chunks containing only empty HTML table tags, such as: ```html <table><tr><td></td></tr><tr><td></td></tr><tr><td></td></tr><tr><td></td></tr></table> ``` After the regex cleanup at `task_executor.py:584`, this becomes `" "` (whitespace only). The guard at line 585 (`if not c`) only catches empty strings `""`, but whitespace strings are truthy in Python and pass through. When sent to Zhipu `embedding-3` API, it rejects them with error 1213: `未正常接收到prompt参数`. ## Root Cause ```python c = re.sub(r"</?(table\|td\|caption\|tr\|th)( [^<>]{0,12})?>", " ", c) if not c: # ← only catches "", not " " / "\n" / "\t" c = "None" ``` Verified with Zhipu `embedding-3`: \| Input \| Result \| \|---\|---\| \| `""` \| error 1213 \| \| `" "` \| error 1213 \| \| `"\n"` \| error 1213 \| \| `"None"` \| OK \| ## Fix ```diff - if not c: + if not c.strip(): c = "None" ``` ## Testing Reproduced with a 678KB DOCX file (166 tables, 270 chunks). Chunk #89 is the empty table above. After fix, `"None"` is sent instead and embedding succeeds. --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-05-13 11:47:50 +08:00
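The guard from #13938 can be tried in isolation. The sketch below reuses the regex and the `"None"` placeholder quoted in the commit message; the wrapper function `clean_for_embedding` is a hypothetical name introduced only for illustration.

```python
import re

# Regex quoted in the commit message: replaces table-related tags with a space.
TABLE_TAG = re.compile(r"</?(table|td|caption|tr|th)( [^<>]{0,12})?>")


def clean_for_embedding(text: str) -> str:
    # Hypothetical wrapper around the cleanup and guard shown in the commit.
    cleaned = TABLE_TAG.sub(" ", text)
    # strip() makes whitespace-only results falsy, unlike a bare `if not cleaned`.
    if not cleaned.strip():
        return "None"
    return cleaned
```

An empty table such as `<table><tr><td></td></tr></table>` collapses to spaces only, so the guard substitutes the placeholder instead of sending whitespace to the embedding API.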
Wang Qi	4374e07a29	Speed up start time (#14833 ) ### What problem does this PR solve? Speed up start time ### Type of change - [x] Refactoring	2026-05-12 17:00:45 +08:00
CaptainTimon	2717ee283f	feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679 ) ### What problem does this PR solve? Closes #14674. This PR improves RAPTOR configuration and tree construction while preserving the existing RAPTOR behavior as the default. RAPTOR currently builds summary layers with the original UMAP + GMM clustering path. This PR keeps that default path, and adds: - A hidden backend tree-builder option: - `tree_builder="raptor"`: default, existing RAPTOR behavior. - `tree_builder="psi"`: rank-aware Psi-style tree builder using original embedding-space cosine ranking. - A user-facing clustering method option for the default RAPTOR builder: - `clustering_method="gmm"`: existing default. - `clustering_method="ahc"`: agglomerative hierarchical clustering path. - A RAPTOR UI setting for `Clustering method` and `Max cluster`. ### What changed #### Backend - Added `tree_builder` support for RAPTOR/Psi. - Added `clustering_method` support for GMM/AHC. - Kept existing RAPTOR + GMM as the default. - Added Psi tree building from original-space cosine similarity. - Added bucketed Psi building controls for large inputs: - `raptor.ext.psi_exact_max_leaves` - `raptor.ext.psi_bucket_size` - Added method-aware RAPTOR summary metadata using existing `extra.raptor_method`. - Avoided adding a dedicated DB schema field for experimental method tracking. - Added cleanup/migration logic to avoid mixing stale RAPTOR summary trees. - Added defensive checks for Psi tree construction and summary failures. #### Frontend/UI - Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`. - Added/kept `Max cluster` in RAPTOR settings. - Enlarged max cluster UI limit to `1024`, matching backend validation. - Kept AHC editable even when a RAPTOR task has already finished. 
- Fixed the UI save payload so `clustering_method` and `tree_builder` are serialized through `parser_config.raptor.ext`, avoiding backend validation errors for extra top-level RAPTOR fields. Example saved RAPTOR config: ```json { "raptor": { "max_cluster": 317, "ext": { "clustering_method": "ahc", "tree_builder": "raptor" } } } ``` Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>	2026-05-12 09:42:31 +08:00
web-dev0521	cc207b5b05	Refactor: tidy up ThreadPoolExecutor lifecycle in file_service and task executor (#14668 ) ## Summary - Wrap the `ThreadPoolExecutor` instances in `FileService.parse_docs` and `FileService.get_files` with `with ... as exe:` blocks for deterministic cleanup - Replace the `concurrent.futures.ThreadPoolExecutor` in `do_handle_task` with `asyncio.create_task(asyncio.to_thread(build_TOC, ...))`, preserving the existing parallelism with chunk insertion while leveraging the surrounding async context - Drop the now-unused `import concurrent` and the `executor.shutdown(wait=False)` call in the `finally` block Closes #14622. No behavioral change, no public API change. Net diff: ~19 insertions / 25 deletions across two files. ## Test plan - [ ] `uv run ruff check api/db/services/file_service.py rag/svr/task_executor.py` passes - [ ] Upload a multi-file batch through the chat/file endpoint and confirm `FileService.parse_docs` still returns combined parsed text - [ ] Trigger `FileService.get_files` via the chat reference flow with a mix of image and non-image files; verify both `raw=True` and `raw=False` paths return correctly - [ ] Run a `naive`-parser document task with `toc_extraction: true` and confirm the TOC chunk is generated and inserted exactly as before - [ ] Run a `naive`-parser document task with `toc_extraction: false` and confirm the path with `toc_thread = None` is unaffected - [ ] Cancel a running task to exercise the `finally` block and confirm cleanup still works without the executor shutdown call --------- Co-authored-by: web-dev0521 <jasonpette1783@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-05-11 12:59:00 +08:00
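The `asyncio.create_task(asyncio.to_thread(...))` pattern named in #14668 can be sketched as follows. `build_toc`, `insert_chunks`, and `handle` are hypothetical stand-ins for the project's functions; only the pattern itself comes from the commit message.

```python
import asyncio
import time


def build_toc(chunks: list[str]) -> list[str]:
    # Hypothetical blocking stand-in for the TOC builder.
    time.sleep(0.05)
    return [c for c in chunks if c.startswith("#")]


async def insert_chunks(chunks: list[str]) -> int:
    # Hypothetical stand-in for asynchronous chunk insertion.
    await asyncio.sleep(0.05)
    return len(chunks)


async def handle(chunks: list[str], toc_extraction: bool):
    toc_task = None
    if toc_extraction:
        # Run the blocking builder in a worker thread, scheduled as a task
        # so it overlaps with chunk insertion below.
        toc_task = asyncio.create_task(asyncio.to_thread(build_toc, chunks))
    inserted = await insert_chunks(chunks)
    toc = await toc_task if toc_task is not None else None
    return inserted, toc
```

Because the task lives on the event loop, no explicit executor shutdown is needed in a `finally` block, which is the cleanup the commit removes.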
Qinsanz	d6660cf156	fix(keyword_extraction): accept Chinese commas/semicolons/newlines as keyword delimiters (#14540 ) ## What Widen the keyword delimiter in `rag/svr/task_executor.py`: both `build_chunks` (LLM `keyword_extraction` cache parsing) and `run_dataflow` (chunk-level `keywords` ingestion) now split on `, ， ; ；、 \r \n` instead of only ASCII comma. ## Why `rag/prompts/keyword_prompt.md` instructs the LLM: > The keywords are delimited by ENGLISH COMMA. In practice, Chinese-leaning models (Qwen / Tongyi-Qianwen, GLM, etc.) frequently ignore this instruction when the source content is Chinese and emit Chinese commas (`，`) instead. Result: `cached.split(",")` sees the full LLM output as a single keyword. Repro: `auto_keywords>=4` + Chinese docs + `qwen-plus@Tongyi-Qianwen`. We observed entries in `important_kwd` like `"功能介绍，配置说明，参数详解，问题排查"` — one bucket instead of four. ## Impact - Silent data-quality bug; no exception thrown. - BM25 `important_kwd^30` boost effectively stops firing — the indexed term is the whole list, never matches user query tokens. - Any downstream aggregating `important_kwd` (tagging, analytics, candidate-keyword review UIs) sees garbage. ## Compatibility - Pure widening of the splitter; ASCII-comma-only outputs continue to work identically. - No schema / API change. ## Test plan Manually verified against `qwen-plus@Tongyi-Qianwen` with `auto_keywords=10` on Chinese .txt files: - Before: `important_kwd` contains one element per chunk that is the full LLM string with `，`-separated phrases inside. - After: `important_kwd` contains N elements, one per phrase, as the LLM intended.	2026-05-11 12:05:24 +08:00
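The widened delimiter set from #14540 can be illustrated with a short sketch. The character class follows the delimiters listed in the commit message; the exact expression used in `task_executor.py` is not shown there, so the pattern and the helper name `split_keywords` are assumptions, as is the stripping of surrounding whitespace and empty entries.

```python
import re

# Assumed pattern: ASCII and Chinese commas/semicolons, the enumeration comma, CR and LF.
KEYWORD_DELIMITERS = re.compile(r"[,，;；、\r\n]+")


def split_keywords(raw: str) -> list[str]:
    # Hypothetical helper: split on any delimiter and drop empty entries.
    return [kw.strip() for kw in KEYWORD_DELIMITERS.split(raw) if kw.strip()]
```

With `cached.split(",")`, the string `"功能介绍，配置说明，参数详解，问题排查"` stays a single keyword; with the wider pattern it yields four.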
Ahmad Intisar	3c4d1da98f	Feature/table parser column roles (#13710 ) ### What problem does this PR solve? The table file parser (CSV/Excel) currently treats all columns identically — every column is both vectorized (embedded in chunk text) and stored as filterable metadata. There's no way for users to control which columns should be searchable by semantic meaning versus which should only be filterable attributes. For example, when ingesting a news articles CSV with columns like title, content, country, category, source, etc., the embedding includes metadata fields like country: Brazil and source: Reuters in the chunk text, which dilutes the semantic quality of the embedding without adding retrieval value. The RDBMS connector (MySQL/PostgreSQL) already supports content_columns / metadata_columns, but this capability was missing for file-based table ingestion. This PR adds column-level control (vectorize / metadata / both) for the table file parser, following RAGFlow's existing patterns. Backward compatible: Datasets without table_column_roles or with table_column_mode: auto behave exactly as before (all columns = both). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 10:06:04 +08:00
sapienza yoan	811e9826d0	perf: avoid O(n²) array growth in embedding accumulation (#14369 ) ### What problem does this PR solve? Both tokenizer (`rag/flow/tokenizer/tokenizer.py`) and `BuiltinEmbed.encode` (`rag/llm/embedding_model.py`) currently accumulate embedding batches via `np.concatenate` inside the per-batch loop. `np.concatenate` allocates a new array and copies all existing data on every call, so accumulating N batches is O(N²) in both time and peak memory. Replacing the incremental concatenate with a list-of-batches + a single `np.vstack` at the end gives O(N) total work. For tokenizer the title-vector broadcast `np.concatenate([vts[0]] * N)` is also replaced by `np.tile`, which does the same job with a single contiguous allocation instead of building a Python list of references. This is purely a CPU/memory optimisation — output shape and dtype are unchanged. Measured impact grows with document size: - 1k chunks (batch 512, 2 iters): ~negligible - 10k chunks (20 iters): ~10× speedup on this stage - 100k chunks (195 iters): ~100× speedup, and peak RAM drops from O(N) extra to near-zero ### Type of change - [x] Performance Improvement Co-authored-by: yoan sapienza <Yoan Sapienza yoan.sapienza@orange.fr Yoan Sapienza zappy@macbookpro.home>	2026-04-30 11:00:10 +08:00
Jack	872ff08304	Fix: add executor.shutdown (#14403 ) ### What problem does this PR solve? Add executor shutdown in finally clause to free resources. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 22:38:43 +08:00
Idriss Sbaaoui	4303be223f	Fix metadata parsing regression for upgraded v0.24 datasets (#14383 ) ### What problem does this PR solve? This PR fixes issue #14371 where file parsing failed after upgrading from v0.24.0 to v0.25.0, because metadata config could be a JSON Schema object but was handled like a list and later caused `KeyError: 'properties'`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 16:18:06 +08:00
yuch85	0d87cecae2	feat: persist PDF bookmark outline as document metadata (#13287 ) ## Summary PDF files often contain a bookmark/outline tree (table of contents built into the file by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples via pypdf, but they are used ephemerally during chunking (`manual` parser uses them for hierarchy detection) and then discarded. This PR persists the outline as `doc.meta_fields["outline"]` — a JSON array of `{"title": str, "depth": int}` objects — so downstream features can use the structural information. ### Why this matters - Complementary to `toc_extraction` — the existing `toc_extraction` feature uses LLM calls to generate a TOC and only works for the `naive` parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure. - Document navigation — frontends can render a clickable TOC from the outline - Entity extraction — the outline provides a structural map for identifying document sections and key topics - Search result context — knowing which section a chunk belongs to helps users evaluate relevance ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/app/naive.py` \| Attach `pdf_parser.outlines` as `__outline__` on first chunk dict \| ~7 \| \| `rag/app/manual.py` \| Same for the manual parser \| ~5 \| \| `rag/svr/task_executor.py` \| Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` \| ~12 \| ### Design decisions - Transient key pattern: The outline is passed from parser → task_executor via `__outline__` on the first chunk dict, then removed before indexing. This follows the same pattern as `metadata_obj` for LLM-generated metadata. - No schema changes: Uses the existing `meta_fields` JSON column on the document table. - Graceful degradation: If a PDF has no outline (common for scanned docs), nothing is stored. 
If persistence fails, it logs a warning and continues — parsing is not interrupted. ### Backward compatibility - Fully backward compatible — no existing fields, behavior, or schemas changed - PDFs without outlines are unaffected - Existing `meta_fields` data is preserved (merged, not overwritten) ## Test plan - [ ] Parse a PDF with bookmarks (e.g. any multi-chapter document), verify `meta_fields["outline"]` is populated - [ ] Parse a PDF without bookmarks, verify no errors and no outline key in meta_fields - [ ] Verify existing `meta_fields` data is preserved (not overwritten) when outline is added - [ ] Verify `manual` parser also persists outlines - [ ] Verify outline JSON structure: `[{"title": "Chapter 1", "depth": 0}, ...]` Related: #9921 (Deterministic Document Access Layer) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 11:57:06 +08:00
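The transient-key pattern from #13287 can be sketched as below. The `__outline__` key, the `(title, depth)` tuples, and the merged `outline` field come from the commit message; the helper names `extract_outline` and `merge_meta` are hypothetical, and the real persistence call (`DocMetadataService.update_document_metadata()`) is deliberately not reproduced here.

```python
def extract_outline(chunks: list[dict]) -> list[dict]:
    # Pop the transient key from the first chunk so it never reaches the index.
    if not chunks:
        return []
    raw = chunks[0].pop("__outline__", None) or []
    return [{"title": title, "depth": depth} for title, depth in raw]


def merge_meta(existing: dict, outline: list[dict]) -> dict:
    # Merge rather than overwrite; store nothing when the PDF has no outline.
    if not outline:
        return dict(existing)
    return {**existing, "outline": outline}
```

Popping the key, instead of reading it, is what keeps the parser-to-executor handoff from leaking into indexed chunk fields.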
yuch85	3ad3241ae0	feat: persist RAPTOR layer metadata on summary chunks (#13286 ) ## Summary RAPTOR's recursive clustering builds a `layers` list tracking `(start_idx, end_idx)` boundaries per level, but currently discards this information — only the flat `chunks` list is returned. This makes it impossible to distinguish leaf-level summaries from top-level ones. This PR: - Returns `(chunks, layers)` tuple from `raptor.py`'s `__call__` - Annotates each RAPTOR summary chunk with `raptor_layer_int` (1 = first summary level, 2 = summary-of-summaries, etc.) - Adds `raptor_layer_int` to `infinity_mapping.json` (Elasticsearch handles it via existing `_int` dynamic template) ### Why this matters Downstream features need to know which RAPTOR layer a summary belongs to: - Retrieving the top-level document summary* for entity extraction, search snippets, or document comparison - Filtering by abstraction level — users may want only high-level summaries or only leaf-level cluster summaries - RAPTOR recall quality — #10951 reports summaries not being recalled for definition queries; layer metadata enables targeted retrieval ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/raptor.py` \| Return `(chunks, layers)` tuple \| ~3 \| \| `rag/svr/task_executor.py` \| Build `chunk_layer` mapping, set `raptor_layer_int` \| ~12 \| \| `conf/infinity_mapping.json` \| Add `raptor_layer_int` integer field \| ~1 \| ### Backward compatibility - Additive only — no existing fields or behavior changed - Existing RAPTOR chunks continue to work (they'll have `raptor_layer_int = 0` by default) - New RAPTOR chunks get layer metadata automatically ## Test plan - [ ] Parse a document with RAPTOR enabled, verify `raptor_layer_int` is set on indexed chunks - [ ] Verify `raptor_layer_int` values increase with abstraction level (layer 1 < layer 2 < ...) 
- [ ] Verify existing RAPTOR deletion (`delete by raptor_kwd`) still works - [ ] Verify Infinity backend accepts the new field Fixes #7488 Related: #4104, #11191, #10951 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 10:20:46 +08:00
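The layer annotation from #13286 amounts to turning per-level index ranges into a per-chunk level. The sketch below assumes that `layers[0]` covers the original leaf chunks and that each `(start_idx, end_idx)` range is end-exclusive; neither detail is stated in the commit message, and `chunk_layers` is a hypothetical name.

```python
def chunk_layers(layers: list[tuple[int, int]]) -> dict[int, int]:
    # Map each chunk index to its RAPTOR level.
    # Assumptions: level 0 = leaf chunks, ranges are [start, end).
    mapping: dict[int, int] = {}
    for level, (start, end) in enumerate(layers):
        for idx in range(start, end):
            mapping[idx] = level
    return mapping
```

Under these assumptions, a summary chunk's `raptor_layer_int` would be the mapped level, with 1 for first-level summaries and higher values for summaries of summaries.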
Lynn	afdf0814d7	Fix: get metadata conf (#14250 ) ### What problem does this PR solve? Get metadata configuration from union of custom metadata and built_in_metadata. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-21 17:22:42 +08:00
Magicbook1108	19eedeec61	Fix: accept empty value as 0 chunk (#14220 ) ### What problem does this PR solve? Fix: accept empty value as 0 chunk ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-20 12:53:47 +08:00
Qi Wang	969ce3a79f	[Bug fix #14133 ] fix graph rag, raptor, mindmap log cannot show correctly in UI (#14136 ) ### What problem does this PR solve? Fix #14133: knowledge graph, RAPTOR, and mindmap logs are not shown correctly in the UI. [Screenshot: before fix] After fix: [Screenshots: after fix] ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-16 13:08:36 +08:00
Minal Mahala	f930389311	Refact: improve task resume mechanism for graphrag (#14096 ) ### What problem does this PR solve? Addresses review feedback on #14074 (Checkpoint mechanism for long-running workflow jobs, issue #12494). Changes based on @yuzhichang's review: 1. Renamed `checkpoint_service.py` → `task_checkpoint.py` as suggested. 2. Replaced Redis with direct docEngine queries as suggested — the subgraph already gets persisted to the doc store by `generate_subgraph()`, so we just query for it instead of maintaining a separate checkpoint in Redis. This is simpler, has no extra dependency, and uses a single source of truth. Changes based on CodeRabbit review: 3. Fixed `source_id` query format mismatch — subgraphs are stored with `source_id: [doc_id]` (list), but the original query used `source_id: doc_id` (string). Now follows the same pattern as `does_graph_contains()` in `rag/graphrag/utils.py`: filter by `knowledge_graph_kwd` only, then match `source_id` in Python. This avoids ambiguity across Elasticsearch / Infinity / OceanBase backends. ### Changes \| File \| Change \| \|---\|---\| \| `api/db/services/task_checkpoint.py` (new) \| `load_subgraph_from_store()` and `has_raptor_chunks()` — docEngine-based checkpoint queries \| \| `rag/graphrag/general/index.py` \| `build_one()` calls `load_subgraph_from_store()` before running LLM extraction \| \| `rag/svr/task_executor.py` \| RAPTOR per-doc loop calls `has_raptor_chunks()` before processing \| \| `test/unit_test/rag/graphrag/test_checkpoint_resume.py` (new) \| 10 unit tests covering subgraph loading, source_id filtering, edge cases \| ### How it works - GraphRAG: Before running expensive LLM entity/relation extraction for a doc, checks the doc store for an existing subgraph (saved by a previous interrupted run). If found, loads it directly and skips LLM calls. - RAPTOR: Before processing a doc, checks if RAPTOR chunks (`raptor_kwd="raptor"`) already exist for it. If yes, skips. 
### Testing - 10 new unit tests — all passing - Full existing suite: 617 passed ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-04-15 17:37:28 +08:00
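The `source_id` fix in #14096 filters by `knowledge_graph_kwd` in the store and then matches the document in Python. A minimal sketch of the second step is shown below; `find_subgraph` and the shape of `hits` are hypothetical, and accepting a bare string `source_id` is an added defensive assumption rather than something the commit describes.

```python
from typing import Optional


def find_subgraph(hits: list[dict], doc_id: str) -> Optional[dict]:
    # `hits` is assumed to be pre-filtered by knowledge_graph_kwd in the doc store.
    for hit in hits:
        source_id = hit.get("source_id", [])
        if isinstance(source_id, str):
            # Assumption: tolerate a bare string as well as the stored list form.
            source_id = [source_id]
        if doc_id in source_id:
            return hit
    return None
```

Matching in Python sidesteps differences in how each backend compares a list-valued field against a string.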
Zhichang Yu	a9ca4ea1a1	Disable flask and quart debug (#14042 ) ### What problem does this PR solve? Visit `http://127.0.0.1:9381/?__debugger__=yes&cmd=resource&f=debugger.js` will expose the flask code: ``` docReady(() => { if (!EVALEX_TRUSTED) { initPinBox(); } // if we are in console mode, show the console. if (CONSOLE_MODE && EVALEX) { createInteractiveConsole(); } const frames = document.querySelectorAll("div.traceback div.frame"); if (EVALEX) { addConsoleIconToFrames(frames); } addEventListenersToElements(document.querySelectorAll("div.detail"), "click", () => document.querySelector("div.traceback").scrollIntoView(false) ); addToggleFrameTraceback(frames); addToggleTraceTypesOnClick(document.querySelectorAll("h2.traceback")); addInfoPrompt(document.querySelectorAll("span.nojavascript")); wrapPlainTraceback(); }); function addToggleFrameTraceback(frames) { frames.forEach((frame) => { frame.addEventListener("click", () => { frame.getElementsByTagName("pre")[0].parentElement.classList.toggle("expanded"); }); }) } ``` ### Type of change - [x] Other (please describe): Fix security risk	2026-04-10 18:01:49 +08:00
Jin Hai	24fcd6bbc7	Update CI (#13774 ) ### What problem does this PR solve? CI isn't stable, try to fix it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-25 18:17:52 +08:00
Idriss Sbaaoui	249b78561b	Fix mismatch docnm_kwd in raptor chunks (#13451 ) ### What problem does this PR solve? issue #13393 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 14:24:33 +08:00
Lynn	62cb292635	Feat/tenant model (#13072 ) ### What problem does this PR solve? Add id for table tenant_llm and apply in LLMBundle. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:27:17 +08:00
Yao Wei	cf6fd6f115	fix: When using OceanBase as storage, the list_chunk sorting is abnormal. #13198 (#13208 ) Actual behavior: When using OceanBase as storage, the list_chunk sorting is abnormal. The following is the SQL statement: SELECT id, content_with_weight, important_kwd, question_kwd, img_id, available_int, position_int, doc_type_kwd, create_timestamp_flt, create_time, array_to_string(page_num_int, ',') AS page_num_int_sort, array_to_string(top_int, ',') AS top_int_sort FROM rag_store_284250730805059584 WHERE doc_id = '' AND kb_id IN ('') ORDER BY page_num_int_sort ASC, top_int_sort ASC, create_timestamp_flt DESC LIMIT 0, 20 [Screenshot] Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>	2026-02-25 13:36:18 +08:00
Magicbook1108	301ed76aa4	Fix: task cancel (#13034 ) ### What problem does this PR solve? Fix: task cancel #11745 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-06 14:48:24 +08:00
Magicbook1108	4b0d65f089	Fix: correct llm_id for graphrag (#13032 ) ### What problem does this PR solve? Fix: correct llm_id for graphrag #13030 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-06 14:05:32 +08:00
Kevin Hu	32c0161ff1	Refa: Clean the folders. (#12890 ) ### Type of change - [x] Refactoring	2026-01-29 14:23:26 +08:00
qinling0210	9a5208976c	Put document metadata in ES/Infinity (#12826 ) ### What problem does this PR solve? Put document metadata in ES/Infinity. Index name of meta data: ragflow_doc_meta_{tenant_id} ### Type of change - [x] Refactoring	2026-01-28 13:29:34 +08:00
Kevin Hu	3beb85efa0	Feat: enhance metadata arranging. (#12745 ) ### What problem does this PR solve? #11564 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-22 15:34:08 +08:00
Kevin Hu	927db0b373	Refa: asyncio.to_thread to ThreadPoolExecutor to break thread limitat… (#12716 ) ### Type of change - [x] Refactoring	2026-01-20 13:29:37 +08:00
E.G	f367189703	fix(raptor): handle missing vector fields gracefully (#12713 ) ## Summary This PR fixes a `KeyError` crash when running RAPTOR tasks on documents that don't have the expected vector field. ## Related Issue Fixes https://github.com/infiniflow/ragflow/issues/12675 ## Problem When running RAPTOR tasks, the code assumes all chunks have the vector field `q_<size>_vec` (e.g., `q_1024_vec`). However, chunks may not have this field if: 1. They were indexed with a different embedding model (different vector size) 2. The embedding step failed silently during initial parsing 3. The document was parsed before the current embedding model was configured This caused a crash: ``` KeyError: 'q_1024_vec' ``` ## Solution Added defensive validation in `run_raptor_for_kb()`: 1. Check for vector field existence before accessing it 2. Skip chunks that don't have the required vector field instead of crashing 3. Log warnings for skipped chunks with actionable guidance 4. Provide informative error messages suggesting users re-parse documents with the current embedding model 5. Handle both scopes (`file` and `kb` modes) ## Changes - `rag/svr/task_executor.py`: Added validation and error handling in `run_raptor_for_kb()` ## Testing 1. Create a knowledge base with an embedding model 2. Parse documents 3. Change the embedding model to one with a different vector size 4. Run RAPTOR task 5. Before: Crashes with `KeyError` 6. After: Gracefully skips incompatible chunks with informative warnings --- <!-- Gittensor Contribution Tag: @GlobalStar117 --> Co-authored-by: GlobalStar117 <GlobalStar117@users.noreply.github.com>	2026-01-20 12:24:20 +08:00
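The defensive check described in #12713 can be sketched as follows. The `q_<size>_vec` field naming comes from the commit message; `collect_vectors`, the `content` key, and the warning text are hypothetical and do not reproduce the actual code in `run_raptor_for_kb()`.

```python
import logging


def collect_vectors(chunks: list[dict], vector_size: int):
    # Keep only chunks that carry the expected vector field; count the rest.
    field = f"q_{vector_size}_vec"
    kept, skipped = [], 0
    for chunk in chunks:
        vec = chunk.get(field)  # .get avoids the KeyError from chunk[field]
        if vec is None:
            skipped += 1
            continue
        kept.append((chunk.get("content", ""), vec))
    if skipped:
        logging.warning(
            "Skipped %d chunk(s) without %s; re-parse with the current embedding model.",
            skipped,
            field,
        )
    return kept, skipped
```

Chunks embedded with a different vector size are skipped with a warning instead of aborting the whole RAPTOR task.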
qinling0210	b40d639fdb	Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541 ) ### What problem does this PR solve? 1) Create dataset using table parser for infinity 2) Answer questions in chat using SQL ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-19 19:35:14 +08:00
Yongteng Lei	68e5c86e9c	Fix: image not displaying thumbnails when using pipeline (#12574 ) ### What problem does this PR solve? Fix image not displaying thumbnails when using pipeline. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-13 12:54:13 +08:00
Jin Hai	a7dd3b7e9e	Add time cost when start servers (#12552 ) ### What problem does this PR solve? - API server - Ingestion server - Data sync server - Admin server ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-01-12 12:48:23 +08:00
Magicbook1108	011bbe9556	Feat: support context window for docx (#12455 ) ### What problem does this PR solve? Feat: support context window for docx #12303 Done: - [x] naive.py - [x] one.py TODO: - [ ] book.py - [ ] manual.py Fix: incorrect image position Fix: incorrect chunk type tag ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-01-07 15:08:17 +08:00
Liu An	606f4e6c9e	Refa: improve TOC building with better error handling (#12427 ) ### What problem does this PR solve? Refactor TOC building logic to use enumerate instead of while loop, add comprehensive error handling for missing/invalid chunk_id values, and improve logging with more specific error messages. The changes make the code more robust against malformed TOC data while maintaining the same functionality for valid inputs. ### Type of change - [x] Refactoring	2026-01-05 10:02:42 +08:00
OliverW	d6e006f086	Improve task executor heartbeat handling and cleanup (#12390 ) Improve task executor heartbeat handling and cleanup. ### What problem does this PR solve? - Reduce lock contention during executor cleanup: The cleanup lock is acquired only when removing expired executors, not during regular heartbeat reporting, reducing potential lock contention. - Optimize own heartbeat cleanup: Each executor removes its own expired heartbeat using `zremrangebyscore` instead of `zcount` + `zpopmin`, reducing Redis operations and improving efficiency. - Improve cleanup of other executors' heartbeats: Expired executors are detected by checking their latest heartbeat, and stale entries are removed safely. - Other improvements: IP address and PID are captured once at startup, and unnecessary global declarations are removed. ### Type of change - [x] Performance Improvement Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-01-04 11:24:05 +08:00
Kevin Hu	1a4a7d1705	Fix: apply kb configured llm issue. (#12354 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-31 12:40:28 +08:00
Kevin Hu	52f91c2388	Refine: image/table context. (#12336 ) ### What problem does this PR solve? #12303 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-30 20:24:27 +08:00
Lynn	4a6d37f0e8	Fix: use async task to save memory (#12308 ) ### What problem does this PR solve? Use async task to save memory. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2025-12-30 11:41:38 +08:00
Jin Hai	df3cbb9b9e	Refactor code (#12305 ) ### What problem does this PR solve? as title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-30 11:09:18 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Lynn	6e9691a419	Feat: message manage (#12196 ) ### What problem does this PR solve? Manage message and use in agent. Issue #4213 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 21:18:13 +08:00
Kevin Hu	8cbfb5aef6	Fix: toc no chunk found issue. (#12197 ) ### What problem does this PR solve? #12170 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 14:06:20 +08:00
Kevin Hu	ce08ee399b	Fix: metadata_obj issue. (#12146 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 11:54:09 +08:00
Kevin Hu	8197f9a873	Fix: table tag on chunks. (#12126 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 11:25:38 +08:00

267 Commits