ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-03 01:01:56 +08:00

Author	SHA1	Message	Date
sapienza yoan	811e9826d0	perf: avoid O(n²) array growth in embedding accumulation (#14369 ) ### What problem does this PR solve? Both tokenizer (`rag/flow/tokenizer/tokenizer.py`) and `BuiltinEmbed.encode` (`rag/llm/embedding_model.py`) currently accumulate embedding batches via `np.concatenate` inside the per-batch loop. `np.concatenate` allocates a new array and copies all existing data on every call, so accumulating N batches is O(N²) in both time and peak memory. Replacing the incremental concatenate with a list-of-batches + a single `np.vstack` at the end gives O(N) total work. For tokenizer the title-vector broadcast `np.concatenate([vts[0]] * N)` is also replaced by `np.tile`, which does the same job with a single contiguous allocation instead of building a Python list of references. This is purely a CPU/memory optimisation — output shape and dtype are unchanged. Measured impact grows with document size: - 1k chunks (batch 512, 2 iters): ~negligible - 10k chunks (20 iters): ~10× speedup on this stage - 100k chunks (195 iters): ~100× speedup, and peak RAM drops from O(N) extra to near-zero ### Type of change - [x] Performance Improvement Co-authored-by: yoan sapienza <Yoan Sapienza yoan.sapienza@orange.fr Yoan Sapienza zappy@macbookpro.home>	2026-04-30 11:00:10 +08:00
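The accumulation change described in commit 811e9826d0 can be sketched as follows. This is a minimal illustration, not the code from `tokenizer.py` or `embedding_model.py`: the function names, batch sizes, and random data are assumptions made for the example.

```python
import numpy as np


def accumulate_quadratic(batches):
    """Old pattern: every call re-copies everything accumulated so far."""
    out = np.empty((0, batches[0].shape[1]), dtype=batches[0].dtype)
    for batch in batches:
        out = np.concatenate((out, batch), axis=0)  # new allocation + full copy
    return out


def accumulate_linear(batches):
    """New pattern: collect references, then copy once at the end."""
    parts = []
    for batch in batches:
        parts.append(batch)  # O(1) append, no array copy
    return np.vstack(parts)


# Hypothetical data: 20 batches of 512 vectors with 8 dimensions each.
rng = np.random.default_rng(0)
batches = [rng.random((512, 8), dtype=np.float32) for _ in range(20)]

slow = accumulate_quadratic(batches)
fast = accumulate_linear(batches)

# Broadcasting one title vector to N rows: np.tile versus list multiplication.
title_vec = batches[0][:1]                  # shape (1, 8)
tiled = np.tile(title_vec, (3, 1))          # shape (3, 8), single allocation
repeated = np.concatenate([title_vec] * 3)  # same values via a Python list
```

Both accumulation functions return arrays with identical shape, dtype, and contents; only the number of intermediate copies differs.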
Jack	872ff08304	Fix: add executor.shutdown (#14403 ) ### What problem does this PR solve? Add executor shutdown in finally clause to free resources. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 22:38:43 +08:00
Idriss Sbaaoui	4303be223f	Fix metadata parsing regression for upgraded v0.24 datasets (#14383 ) ### What problem does this PR solve? This PR fixes issue #14371 where file parsing failed after upgrading from v0.24.0 to v0.25.0, because metadata config could be a JSON Schema object but was handled like a list and later caused `KeyError: 'properties'`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 16:18:06 +08:00
yuch85	0d87cecae2	feat: persist PDF bookmark outline as document metadata (#13287 ) ## Summary PDF files often contain a bookmark/outline tree (table of contents built into the file by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples via pypdf, but they are used ephemerally during chunking (`manual` parser uses them for hierarchy detection) and then discarded. This PR persists the outline as `doc.meta_fields["outline"]` — a JSON array of `{"title": str, "depth": int}` objects — so downstream features can use the structural information. ### Why this matters - Complementary to `toc_extraction` — the existing `toc_extraction` feature uses LLM calls to generate a TOC and only works for the `naive` parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure. - Document navigation — frontends can render a clickable TOC from the outline - Entity extraction — the outline provides a structural map for identifying document sections and key topics - Search result context — knowing which section a chunk belongs to helps users evaluate relevance ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/app/naive.py` \| Attach `pdf_parser.outlines` as `__outline__` on first chunk dict \| ~7 \| \| `rag/app/manual.py` \| Same for the manual parser \| ~5 \| \| `rag/svr/task_executor.py` \| Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` \| ~12 \| ### Design decisions - Transient key pattern: The outline is passed from parser → task_executor via `__outline__` on the first chunk dict, then removed before indexing. This follows the same pattern as `metadata_obj` for LLM-generated metadata. - No schema changes: Uses the existing `meta_fields` JSON column on the document table. - Graceful degradation: If a PDF has no outline (common for scanned docs), nothing is stored. 
If persistence fails, it logs a warning and continues — parsing is not interrupted. ### Backward compatibility - Fully backward compatible — no existing fields, behavior, or schemas changed - PDFs without outlines are unaffected - Existing `meta_fields` data is preserved (merged, not overwritten) ## Test plan - [ ] Parse a PDF with bookmarks (e.g. any multi-chapter document), verify `meta_fields["outline"]` is populated - [ ] Parse a PDF without bookmarks, verify no errors and no outline key in meta_fields - [ ] Verify existing `meta_fields` data is preserved (not overwritten) when outline is added - [ ] Verify `manual` parser also persists outlines - [ ] Verify outline JSON structure: `[{"title": "Chapter 1", "depth": 0}, ...]` Related: #9921 (Deterministic Document Access Layer) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 11:57:06 +08:00
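The transient-key pattern from commit 0d87cecae2 might look roughly like the sketch below. The helper names `attach_outline` and `extract_outline` are hypothetical, and a plain dict stands in for the document's `meta_fields`; the real change persists through `DocMetadataService.update_document_metadata()`, which is not reproduced here.

```python
def attach_outline(chunks, outlines):
    """Parser side: stash (title, depth) tuples on the first chunk dict."""
    if chunks and outlines:
        chunks[0]["__outline__"] = [
            {"title": title, "depth": depth} for title, depth in outlines
        ]
    return chunks


def extract_outline(chunks, meta_fields):
    """Executor side: pop the transient key and merge it into existing metadata."""
    outline = chunks[0].pop("__outline__", None) if chunks else None
    if outline:
        # Merge rather than overwrite, so existing keys are preserved.
        meta_fields = {**meta_fields, "outline": outline}
    return meta_fields


# Hypothetical chunks and outline for illustration.
chunks = attach_outline(
    [{"content": "Intro"}, {"content": "Body"}],
    [("Chapter 1", 0), ("Section 1.1", 1)],
)
meta = extract_outline(chunks, {"author": "example"})

# A PDF without bookmarks leaves the metadata untouched.
plain_meta = extract_outline(attach_outline([{"content": "Scan"}], []), {"author": "example"})
```

Popping the key before indexing keeps `__outline__` out of the stored chunk, which matches the "removed before indexing" behaviour the commit message describes.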
yuch85	3ad3241ae0	feat: persist RAPTOR layer metadata on summary chunks (#13286) ## Summary RAPTOR's recursive clustering builds a `layers` list tracking `(start_idx, end_idx)` boundaries per level, but currently discards this information; only the flat `chunks` list is returned. This makes it impossible to distinguish leaf-level summaries from top-level ones. This PR: - Returns `(chunks, layers)` tuple from `raptor.py`'s `__call__` - Annotates each RAPTOR summary chunk with `raptor_layer_int` (1 = first summary level, 2 = summary-of-summaries, etc.) - Adds `raptor_layer_int` to `infinity_mapping.json` (Elasticsearch handles it via existing `_int` dynamic template) ### Why this matters Downstream features need to know which RAPTOR layer a summary belongs to: - Retrieving the top-level document summary for entity extraction, search snippets, or document comparison - Filtering by abstraction level: users may want only high-level summaries or only leaf-level cluster summaries - RAPTOR recall quality: #10951 reports summaries not being recalled for definition queries; layer metadata enables targeted retrieval ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/raptor.py` \| Return `(chunks, layers)` tuple \| ~3 \| \| `rag/svr/task_executor.py` \| Build `chunk_layer` mapping, set `raptor_layer_int` \| ~12 \| \| `conf/infinity_mapping.json` \| Add `raptor_layer_int` integer field \| ~1 \| ### Backward compatibility - Additive only: no existing fields or behavior changed - Existing RAPTOR chunks continue to work (they'll have `raptor_layer_int = 0` by default) - New RAPTOR chunks get layer metadata automatically ## Test plan - [ ] Parse a document with RAPTOR enabled, verify `raptor_layer_int` is set on indexed chunks - [ ] Verify `raptor_layer_int` values increase with abstraction level (layer 1 < layer 2 < ...)
- [ ] Verify existing RAPTOR deletion (`delete by raptor_kwd`) still works - [ ] Verify Infinity backend accepts the new field Fixes #7488 Related: #4104, #11191, #10951 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 10:20:46 +08:00
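A sketch of how the `chunk_layer` mapping from commit 3ad3241ae0 could be derived from `layers`. It assumes that `layers[0]` spans the original leaf chunks and that each `(start_idx, end_idx)` pair is end-exclusive; the commit message does not confirm either detail, and the helper names are hypothetical.

```python
def build_chunk_layer(layers):
    """Map each chunk index to its layer number.

    Assumption: layers[0] covers the leaf chunks (layer 0) and every later
    (start_idx, end_idx) pair covers one summary level, end exclusive.
    """
    chunk_layer = {}
    for layer_no, (start, end) in enumerate(layers):
        for idx in range(start, end):
            chunk_layer[idx] = layer_no
    return chunk_layer


def annotate(chunks, layers):
    """Set raptor_layer_int on summary chunks only (layer >= 1)."""
    chunk_layer = build_chunk_layer(layers)
    for idx, chunk in enumerate(chunks):
        layer_no = chunk_layer.get(idx, 0)
        if layer_no >= 1:
            chunk["raptor_layer_int"] = layer_no
    return chunks


# Hypothetical run: 4 leaf chunks, 2 first-level summaries, 1 top-level summary.
layers = [(0, 4), (4, 6), (6, 7)]
chunks = annotate([{"text": f"c{i}"} for i in range(7)], layers)
```

With this layout the single chunk in the last layer is the top-level document summary, which is the case the commit message singles out for downstream retrieval.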
Lynn	afdf0814d7	Fix: get metadata conf (#14250 ) ### What problem does this PR solve? Get metadata configuration from union of custom metadata and built_in_metadata. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-21 17:22:42 +08:00
Magicbook1108	19eedeec61	Fix: accept empty value as 0 chunk (#14220 ) ### What problem does this PR solve? Fix: accept empty value as 0 chunk ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-20 12:53:47 +08:00
Qi Wang	969ce3a79f	[Bug fix #14133] fix graph rag, raptor, mindmap log cannot show correctly in UI (#14136) ### What problem does this PR solve? Fix #14133: knowledge graph, RAPTOR, and mindmap logs cannot be shown correctly in the UI. [Screenshot before the fix] After fix: [two screenshots] ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-16 13:08:36 +08:00
Minal Mahala	f930389311	Refact: improve task resume mechanism for graphrag (#14096 ) ### What problem does this PR solve? Addresses review feedback on #14074 (Checkpoint mechanism for long-running workflow jobs, issue #12494). Changes based on @yuzhichang's review: 1. Renamed `checkpoint_service.py` → `task_checkpoint.py` as suggested. 2. Replaced Redis with direct docEngine queries as suggested — the subgraph already gets persisted to the doc store by `generate_subgraph()`, so we just query for it instead of maintaining a separate checkpoint in Redis. This is simpler, has no extra dependency, and uses a single source of truth. Changes based on CodeRabbit review: 3. Fixed `source_id` query format mismatch — subgraphs are stored with `source_id: [doc_id]` (list), but the original query used `source_id: doc_id` (string). Now follows the same pattern as `does_graph_contains()` in `rag/graphrag/utils.py`: filter by `knowledge_graph_kwd` only, then match `source_id` in Python. This avoids ambiguity across Elasticsearch / Infinity / OceanBase backends. ### Changes \| File \| Change \| \|---\|---\| \| `api/db/services/task_checkpoint.py` (new) \| `load_subgraph_from_store()` and `has_raptor_chunks()` — docEngine-based checkpoint queries \| \| `rag/graphrag/general/index.py` \| `build_one()` calls `load_subgraph_from_store()` before running LLM extraction \| \| `rag/svr/task_executor.py` \| RAPTOR per-doc loop calls `has_raptor_chunks()` before processing \| \| `test/unit_test/rag/graphrag/test_checkpoint_resume.py` (new) \| 10 unit tests covering subgraph loading, source_id filtering, edge cases \| ### How it works - GraphRAG: Before running expensive LLM entity/relation extraction for a doc, checks the doc store for an existing subgraph (saved by a previous interrupted run). If found, loads it directly and skips LLM calls. - RAPTOR: Before processing a doc, checks if RAPTOR chunks (`raptor_kwd="raptor"`) already exist for it. If yes, skips. 
### Testing - 10 new unit tests — all passing - Full existing suite: 617 passed ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-04-15 17:37:28 +08:00
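The `source_id` fix in commit f930389311 (filter by `knowledge_graph_kwd` in the store, then match `source_id` in Python) can be illustrated with the sketch below. The function name `find_subgraph` and the row layout are assumptions; this is not the body of `load_subgraph_from_store()`.

```python
def find_subgraph(candidates, doc_id):
    """Return the first stored subgraph whose source_id contains doc_id.

    `candidates` stands in for rows already filtered by knowledge_graph_kwd.
    source_id is normally a list; a bare string is tolerated defensively.
    """
    for row in candidates:
        source_id = row.get("source_id", [])
        if isinstance(source_id, str):
            source_id = [source_id]
        if doc_id in source_id:
            return row
    return None


# Hypothetical rows as a doc store might return them.
rows = [
    {"knowledge_graph_kwd": "subgraph", "source_id": ["doc-a"], "content": "graph A"},
    {"knowledge_graph_kwd": "subgraph", "source_id": ["doc-b"], "content": "graph B"},
]
hit = find_subgraph(rows, "doc-b")
miss = find_subgraph(rows, "doc-c")
```

Matching in Python sidesteps differences in how each backend compares a string against a list-valued field, at the cost of fetching every subgraph row for the dataset.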
Zhichang Yu	a9ca4ea1a1	Disable flask and quart debug (#14042 ) ### What problem does this PR solve? Visit `http://127.0.0.1:9381/?__debugger__=yes&cmd=resource&f=debugger.js` will expose the flask code: ``` docReady(() => { if (!EVALEX_TRUSTED) { initPinBox(); } // if we are in console mode, show the console. if (CONSOLE_MODE && EVALEX) { createInteractiveConsole(); } const frames = document.querySelectorAll("div.traceback div.frame"); if (EVALEX) { addConsoleIconToFrames(frames); } addEventListenersToElements(document.querySelectorAll("div.detail"), "click", () => document.querySelector("div.traceback").scrollIntoView(false) ); addToggleFrameTraceback(frames); addToggleTraceTypesOnClick(document.querySelectorAll("h2.traceback")); addInfoPrompt(document.querySelectorAll("span.nojavascript")); wrapPlainTraceback(); }); function addToggleFrameTraceback(frames) { frames.forEach((frame) => { frame.addEventListener("click", () => { frame.getElementsByTagName("pre")[0].parentElement.classList.toggle("expanded"); }); }) } ``` ### Type of change - [x] Other (please describe): Fix security risk	2026-04-10 18:01:49 +08:00
Jin Hai	24fcd6bbc7	Update CI (#13774 ) ### What problem does this PR solve? CI isn't stable, try to fix it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-03-25 18:17:52 +08:00
Idriss Sbaaoui	249b78561b	Fix mismatch docnm_kwd in raptor chunks (#13451) ### What problem does this PR solve? Issue #13393 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-10 14:24:33 +08:00
Lynn	62cb292635	Feat/tenant model (#13072 ) ### What problem does this PR solve? Add id for table tenant_llm and apply in LLMBundle. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:27:17 +08:00
Yao Wei	cf6fd6f115	fix: When using OceanBase as storage, the list_chunk sorting is abnormal. #13198 (#13208) Actual behavior: when using OceanBase as storage, the list_chunk sorting is abnormal. The following is the SQL statement: `SELECT id, content_with_weight, important_kwd, question_kwd, img_id, available_int, position_int, doc_type_kwd, create_timestamp_flt, create_time, array_to_string(page_num_int, ',') AS page_num_int_sort, array_to_string(top_int, ',') AS top_int_sort FROM rag_store_284250730805059584 WHERE doc_id = '' AND kb_id IN ('') ORDER BY page_num_int_sort ASC, top_int_sort ASC, create_timestamp_flt DESC LIMIT 0, 20` Co-authored-by: Aron.Yao <yaowei@yaoweideMacBook-Pro.local>	2026-02-25 13:36:18 +08:00
Magicbook1108	301ed76aa4	Fix: task cancel (#13034 ) ### What problem does this PR solve? Fix: task cancel #11745 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-06 14:48:24 +08:00
Magicbook1108	4b0d65f089	Fix: correct llm_id for graphrag (#13032 ) ### What problem does this PR solve? Fix: correct llm_id for graphrag #13030 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-06 14:05:32 +08:00
Kevin Hu	32c0161ff1	Refa: Clean the folders. (#12890 ) ### Type of change - [x] Refactoring	2026-01-29 14:23:26 +08:00
qinling0210	9a5208976c	Put document metadata in ES/Infinity (#12826 ) ### What problem does this PR solve? Put document metadata in ES/Infinity. Index name of meta data: ragflow_doc_meta_{tenant_id} ### Type of change - [x] Refactoring	2026-01-28 13:29:34 +08:00
Kevin Hu	3beb85efa0	Feat: enhance metadata arranging. (#12745 ) ### What problem does this PR solve? #11564 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-22 15:34:08 +08:00
Kevin Hu	927db0b373	Refa: asyncio.to_thread to ThreadPoolExecutor to break thread limitat… (#12716 ) ### Type of change - [x] Refactoring	2026-01-20 13:29:37 +08:00
E.G	f367189703	fix(raptor): handle missing vector fields gracefully (#12713 ) ## Summary This PR fixes a `KeyError` crash when running RAPTOR tasks on documents that don't have the expected vector field. ## Related Issue Fixes https://github.com/infiniflow/ragflow/issues/12675 ## Problem When running RAPTOR tasks, the code assumes all chunks have the vector field `q_<size>_vec` (e.g., `q_1024_vec`). However, chunks may not have this field if: 1. They were indexed with a different embedding model (different vector size) 2. The embedding step failed silently during initial parsing 3. The document was parsed before the current embedding model was configured This caused a crash: ``` KeyError: 'q_1024_vec' ``` ## Solution Added defensive validation in `run_raptor_for_kb()`: 1. Check for vector field existence before accessing it 2. Skip chunks that don't have the required vector field instead of crashing 3. Log warnings for skipped chunks with actionable guidance 4. Provide informative error messages suggesting users re-parse documents with the current embedding model 5. Handle both scopes (`file` and `kb` modes) ## Changes - `rag/svr/task_executor.py`: Added validation and error handling in `run_raptor_for_kb()` ## Testing 1. Create a knowledge base with an embedding model 2. Parse documents 3. Change the embedding model to one with a different vector size 4. Run RAPTOR task 5. Before: Crashes with `KeyError` 6. After: Gracefully skips incompatible chunks with informative warnings --- <!-- Gittensor Contribution Tag: @GlobalStar117 --> Co-authored-by: GlobalStar117 <GlobalStar117@users.noreply.github.com>	2026-01-20 12:24:20 +08:00
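The defensive check described in commit f367189703 might be sketched like this. `collect_vectors` is a hypothetical helper, not the code in `run_raptor_for_kb()`; only the `q_<size>_vec` field naming and the skip-and-warn behaviour come from the commit message.

```python
import logging

logger = logging.getLogger(__name__)


def collect_vectors(chunks, vector_size):
    """Return (content, vector) pairs, skipping chunks without q_<size>_vec."""
    field = f"q_{vector_size}_vec"
    kept, skipped = [], 0
    for chunk in chunks:
        vec = chunk.get(field)  # .get() avoids the KeyError from chunk[field]
        if vec is None:
            skipped += 1
            continue
        kept.append((chunk.get("content_with_weight", ""), vec))
    if skipped:
        logger.warning(
            "Skipped %d chunk(s) missing %s; re-parse with the current embedding model.",
            skipped,
            field,
        )
    return kept, skipped


# Hypothetical chunks: one embedded at size 1024, one at size 768.
chunks = [
    {"content_with_weight": "a", "q_1024_vec": [0.1, 0.2]},
    {"content_with_weight": "b", "q_768_vec": [0.3, 0.4]},
]
kept, skipped = collect_vectors(chunks, 1024)
```

Returning the skip count alongside the kept pairs lets the caller decide whether to abort when nothing usable remains.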
qinling0210	b40d639fdb	Add dataset with table parser type for Infinity and answer question in chat using SQL (#12541 ) ### What problem does this PR solve? 1) Create dataset using table parser for infinity 2) Answer questions in chat using SQL ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-19 19:35:14 +08:00
Yongteng Lei	68e5c86e9c	Fix: image not displaying thumbnails when using pipeline (#12574 ) ### What problem does this PR solve? Fix image not displaying thumbnails when using pipeline. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-13 12:54:13 +08:00
Jin Hai	a7dd3b7e9e	Add time cost when start servers (#12552 ) ### What problem does this PR solve? - API server - Ingestion server - Data sync server - Admin server ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-01-12 12:48:23 +08:00
Magicbook1108	011bbe9556	Feat: support context window for docx (#12455 ) ### What problem does this PR solve? Feat: support context window for docx #12303 Done: - [x] naive.py - [x] one.py TODO: - [ ] book.py - [ ] manual.py Fix: incorrect image position Fix: incorrect chunk type tag ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-01-07 15:08:17 +08:00
Liu An	606f4e6c9e	Refa: improve TOC building with better error handling (#12427 ) ### What problem does this PR solve? Refactor TOC building logic to use enumerate instead of while loop, add comprehensive error handling for missing/invalid chunk_id values, and improve logging with more specific error messages. The changes make the code more robust against malformed TOC data while maintaining the same functionality for valid inputs. ### Type of change - [x] Refactoring	2026-01-05 10:02:42 +08:00
OliverW	d6e006f086	Improve task executor heartbeat handling and cleanup (#12390 ) Improve task executor heartbeat handling and cleanup. ### What problem does this PR solve? - Reduce lock contention during executor cleanup: The cleanup lock is acquired only when removing expired executors, not during regular heartbeat reporting, reducing potential lock contention. - Optimize own heartbeat cleanup: Each executor removes its own expired heartbeat using `zremrangebyscore` instead of `zcount` + `zpopmin`, reducing Redis operations and improving efficiency. - Improve cleanup of other executors' heartbeats: Expired executors are detected by checking their latest heartbeat, and stale entries are removed safely. - Other improvements: IP address and PID are captured once at startup, and unnecessary global declarations are removed. ### Type of change - [x] Performance Improvement Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2026-01-04 11:24:05 +08:00
Kevin Hu	1a4a7d1705	Fix: apply kb configured llm issue. (#12354 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-31 12:40:28 +08:00
Kevin Hu	52f91c2388	Refine: image/table context. (#12336 ) ### What problem does this PR solve? #12303 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-30 20:24:27 +08:00
Lynn	4a6d37f0e8	Fix: use async task to save memory (#12308 ) ### What problem does this PR solve? Use async task to save memory. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2025-12-30 11:41:38 +08:00
Jin Hai	df3cbb9b9e	Refactor code (#12305 ) ### What problem does this PR solve? as title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-30 11:09:18 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Lynn	6e9691a419	Feat: message manage (#12196 ) ### What problem does this PR solve? Manage message and use in agent. Issue #4213 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 21:18:13 +08:00
Kevin Hu	8cbfb5aef6	Fix: toc no chunk found issue. (#12197 ) ### What problem does this PR solve? #12170 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 14:06:20 +08:00
Kevin Hu	ce08ee399b	Fix: metadata_obj issue. (#12146 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 11:54:09 +08:00
Kevin Hu	8197f9a873	Fix: table tag on chunks. (#12126 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 11:25:38 +08:00
Kevin Hu	00bb6fbd28	Fix: metadata issue & graphrag speeding up. (#12113 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: Liu An <asiro@qq.com>	2025-12-23 15:57:27 +08:00
Magicbook1108	d5a44e913d	Fix: fix task cancel (#12093 ) ### What problem does this PR solve? Fix: fix task cancel ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-23 09:38:25 +08:00
Kevin Hu	bd76b8ff1a	Fix: Tika server upgrades. (#12073 ) ### What problem does this PR solve? #12037 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-23 09:35:52 +08:00
concertdictate	4dd8cdc38b	task executor issues (#12006 ) ### What problem does this PR solve? Fixes #8706 - `InfinityException: TOO_MANY_CONNECTIONS` when running multiple task executor workers ### Problem Description When running RAGFlow with 8-16 task executor workers, most workers fail to start properly. Checking logs revealed that workers were stuck/hanging during Infinity connection initialization - only 1-2 workers would successfully register in Redis while the rest remained blocked. ### Root Cause The Infinity SDK `ConnectionPool` pre-allocates all connections in `__init__`. With the default `max_size=32` and multiple workers (e.g., 16), this creates 16×32=512 connections immediately on startup, exceeding Infinity's default 128 connection limit. Workers hang while waiting for connections that can never be established. ### Changes 1. Prevent Infinity connection storm (`rag/utils/infinity_conn.py`, `rag/svr/task_executor.py`) - Reduced ConnectionPool `max_size` from 32 to 4 (sufficient since operations are synchronous) - Added staggered startup delay (2s per worker) to spread connection initialization 2. Handle None children_delimiter (`rag/app/naive.py`) - Use `or ""` to handle explicitly set None values from parser config 3. 
MinerU parser robustness (`deepdoc/parser/mineru_parser.py`) - Use `.get()` for optional output fields that may be missing - Fix DISCARDED block handling: change `pass` to `continue` to skip discarded blocks entirely ### Why `max_size=4` is sufficient \| Workers \| Pool Size \| Total Connections \| Infinity Limit \| \|---------\|-----------\|-------------------\|----------------\| \| 16 \| 32 \| 512 \| 128 ❌ \| \| 16 \| 4 \| 64 \| 128 ✅ \| \| 32 \| 4 \| 128 \| 128 ✅ \| - All RAGFlow operations are synchronous: `get_conn()` → operation → `release_conn()` - No parallel `docStoreConn` operations in the codebase - Maximum 1-2 concurrent connections needed per worker; 4 provides safety margin ### MinerU DISCARDED block bug When MinerU returns blocks with `type: "discarded"` (headers, footers, watermarks, page numbers, artifacts), the previous code used `pass` which left the `section` variable undefined, causing: - UnboundLocalError if DISCARDED is the first block - Duplicate content if DISCARDED follows another block (stale value from previous iteration) Root cause confirmed via MinerU source code: From [`mineru/utils/enum_class.py`](https://github.com/opendatalab/MinerU/blob/main/mineru/utils/enum_class.py#L14): ```python class BlockType: DISCARDED = 'discarded' # VLM 2.5+ also has: HEADER, FOOTER, PAGE_NUMBER, ASIDE_TEXT, PAGE_FOOTNOTE ``` Per [MinerU documentation](https://opendatalab.github.io/MinerU/reference/output_files/), discarded blocks contain content that should be filtered out for clean text extraction. Fix: Changed `pass` to `continue` to skip discarded blocks entirely. ### Testing - Verified all 16 workers now register successfully in Redis - All workers heartbeating correctly - Document parsing works as expected - MinerU parsing with DISCARDED blocks no longer crashes ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: user210 <user210@rt>	2025-12-18 10:03:30 +08:00
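Two points from commit 4dd8cdc38b lend themselves to a short sketch: the connection arithmetic behind `max_size=4`, and why `continue` is needed for discarded blocks. The function names and the block layout (`type` and `text` keys) are assumptions for illustration, not the actual `mineru_parser.py` code.

```python
def total_connections(workers, pool_size):
    """Connections opened at startup when every pool pre-allocates max_size."""
    return workers * pool_size


def sections_from_blocks(blocks):
    """Collect text sections, skipping discarded blocks entirely."""
    sections = []
    for block in blocks:
        kind = block.get("type")
        if kind == "text":
            section = block.get("text", "")
        elif kind == "discarded":
            # With `pass` here, `section` would be unbound on the first block
            # or would still hold the previous block's value (a duplicate).
            continue
        else:
            continue  # block types not modelled in this sketch
        sections.append(section)
    return sections


# Hypothetical parser output with headers and footers marked as discarded.
blocks = [
    {"type": "discarded", "text": "Page 1"},
    {"type": "text", "text": "Hello"},
    {"type": "discarded", "text": "Footer"},
    {"type": "text", "text": "World"},
]
sections = sections_from_blocks(blocks)
```

The arithmetic reproduces the table in the commit message: 16 workers with a pool of 32 exceed the 128-connection limit, while a pool of 4 stays at or under it for up to 32 workers.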
Kevin Hu	8e4d011b15	Fix: parent-children chunking method. (#11997 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-12-17 16:50:36 +08:00
Yongteng Lei	03f9be7cbb	Refa: only support MinerU-API now (#11977 ) ### What problem does this PR solve? Only support MinerU-API now, still need to complete frontend for pipeline to allow the configuration of MinerU options. ### Type of change - [x] Refactoring	2025-12-17 12:58:48 +08:00
Jin Hai	30019dab9f	Change knowledge base to dataset (#11976 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-17 10:03:33 +08:00
Kevin Hu	ea4a5cd665	Fix: tokenizer issue. (#11902 ) #11786 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 17:38:17 +08:00
buua436	65a5a56d95	Refa:replace trio with asyncio (#11831 ) ### What problem does this PR solve? change: replace trio with asyncio ### Type of change - [x] Refactoring	2025-12-09 19:23:14 +08:00
buua436	dd046be976	Fix: parent-child chunking method (#11810 ) ### What problem does this PR solve? change: parent-child chunking method ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-09 09:34:01 +08:00
buua436	9b8971a9de	Fix:toc in pipeline (#11785 ) ### What problem does this PR solve? change: Fix toc in pipeline ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-08 09:42:20 +08:00
hsparks-codes	4870d42949	feat: Auto-disable Raptor for structured data (Issue #11653 ) (#11676 ) ### What problem does this PR solve? Feature: This PR implements automatic Raptor disabling for structured data files to address issue #11653. Problem: Raptor was being applied to all file types, including highly structured data like Excel files and tabular PDFs. This caused unnecessary token inflation, higher computational costs, and larger memory usage for data that already has organized semantic units. Solution: Automatically skip Raptor processing for: - Excel files (.xls, .xlsx, .xlsm, .xlsb) - CSV files (.csv, .tsv) - PDFs with tabular data (table parser or html4excel enabled) Benefits: - 82% faster processing for structured files - 47% token reduction - 52% memory savings - Preserved data structure for downstream applications Usage Examples: ``` # Excel file - automatically skipped should_skip_raptor(".xlsx") # True # CSV file - automatically skipped should_skip_raptor(".csv") # True # Tabular PDF - automatically skipped should_skip_raptor(".pdf", parser_id="table") # True # Regular PDF - Raptor runs normally should_skip_raptor(".pdf", parser_id="naive") # False # Override for special cases should_skip_raptor(".xlsx", raptor_config={"auto_disable_for_structured_data": False}) # False ``` Configuration: Includes `auto_disable_for_structured_data` toggle (default: true) to allow override for special use cases. Testing: 44 comprehensive tests, 100% passing ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 17:02:29 +08:00
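One possible shape for the `should_skip_raptor` check in commit 4870d42949 is sketched below. This is a reconstruction from the usage examples in the commit message, not the project's implementation; in particular the `parser_config` parameter and the `html4excel` key lookup are assumptions.

```python
STRUCTURED_EXTENSIONS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".csv", ".tsv"}


def should_skip_raptor(file_type, parser_id="", parser_config=None, raptor_config=None):
    """Decide whether RAPTOR should be skipped for a structured-data file."""
    parser_config = parser_config or {}
    raptor_config = raptor_config or {}

    # The toggle defaults to enabled; setting it to False restores RAPTOR.
    if not raptor_config.get("auto_disable_for_structured_data", True):
        return False

    ext = file_type.lower()
    if ext in STRUCTURED_EXTENSIONS:
        return True
    # Tabular PDFs: table parser, or html4excel enabled (assumed config key).
    if ext == ".pdf" and (parser_id == "table" or parser_config.get("html4excel")):
        return True
    return False
```

Checking the override first means a single config flag is enough to opt a dataset back in, regardless of file type.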
Jin Hai	3c50c7d3ac	Refactor code (#11694 ) ### What problem does this PR solve? Rename function and refactor log message ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-03 15:15:00 +08:00
Kevin Hu	b5ad7b7062	Feat: support TOC transformer. (#11685 ) ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-03 12:27:50 +08:00


253 Commits