ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
buua436	5751a22444	fix: add toc field to extractor output (#16059 ) ### What problem does this PR solve? TOC chunks now include a toc field so the agent pipeline logs expose the data the frontend expects. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-16 13:27:45 +08:00
Lynn	7355db183f	Fix: model list (#15905 ) ### What problem does this PR solve? Set OpenDataLoader and call in parser and naive ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 17:44:50 +08:00
Lynn	478c9846a1	Fix: model list (#15860 ) ### What problem does this PR solve? Remove tenant_llm call in rag. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-10 14:59:57 +08:00
Wang Qi	9aa81e7cad	Fix paddle ocr / minerU cannot add (#15858 ) Fix paddle ocr / minerU cannot add	2026-06-10 13:04:13 +08:00
buua436	7b8d6f34b3	fix: force image parser json output (#15847 ) ### What problem does this PR solve? Force image parser runtime output format to JSON so downstream chunking reads OCR results from the JSON output and image parser chunks can be displayed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-09 19:02:37 +08:00
euvre	1e80419c21	fix: restore TitleChunker output for json/chunks upstream formats (#15396 ) fix: restore TitleChunker output for json/chunks upstream formats ## Summary The refactor commit `e194027b` (#14247) introduced two regressions that caused `TitleChunker` to produce zero chunks when the upstream Parser node outputs `json` or `chunks` format (e.g. PDF parsing). ## Root Cause ### 1. Dead code in `extract_line_records` (critical) After refactor, when `payload` is `None` (which is the case for `json` and `chunks` output formats), the method returns an empty list immediately via `return []`, so no records are ever extracted from structured upstream output. The original `json`/`chunks` handling code became unreachable dead code. ### 2. Unconditional overwrite in `build_chunks_from_record_groups` The `chunks` variable assigned in the `if` branch for markdown/text/html formats was unconditionally overwritten by the statement below it, due to a missing `else` keyword. ## Fix - Remove the premature `return []` so the `json`/`chunks` branch is reachable again. - Add `else` branch in `build_chunks_from_record_groups` so the two format families are handled independently. ## Test Plan - [x] Verified no lint errors on the changed file - [ ] Tested with a PDF document parsed via DeepDOC → TitleChunker pipeline - [ ] Tested with markdown input through TitleChunker - [ ] Tested hierarchy and group chunking modes ## Impact - Fixes the regression where documents parsed with `json`/`chunks` output format produced no chunks from `TitleChunker`. - No API or configuration changes. Fully backward compatible. Signed-off-by: noob <yixiao121314@outlook.com>	2026-06-01 17:14:22 +08:00
Lynn	dc4b82523b	Feat: tenant llm provider (#14595 ) ### What problem does this PR solve? Python implementation of the Go-based model_provider API suite. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: bill <yibie_jingnian@163.com>	2026-05-29 17:39:41 +08:00
Jack	f0cb7a544b	Refactor: Task Executor (#15154 ) ### What problem does this PR solve? 1. Break the huge function into smaller pieces 2. Add unit tests for the smaller functions 3. Layered design: a. infra layer - task_context.py, recording_context.py, write_operation_interceptor.py, ... b. service layer - *_service.py c. business layer - task_handler.py 4. Default behavior: use the refactored version; switch back to the original version by changing an environment variable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement --------- Co-authored-by: Liu An <asiro@qq.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-05-27 21:54:17 +08:00
07heco	e194027b01	refactor: optimize BaseTitleChunker to improve RAG document chunk quality (#14247 ) ## RAG Optimization Description Optimize the core `BaseTitleChunker` in `rag/flow/chunker/title_chunker/common.py` to improve RAG document chunking quality and retrieval accuracy. ## Key Changes 1. Format-branched text processing: Preserve original whitespace & indentation for Markdown/HTML payloads to maintain document semantics and chunk fidelity; only perform full whitespace cleaning on plain text content. 2. Empty chunk filtering: Thoroughly filter invalid pure-blank lines to reduce noisy data in vector database. 3. Code deduplication: Unified markdown/text/html payload extraction logic, removed redundant repeated code blocks. 4. None serialization fix: Avoid converting `None` value into literal `"None"` string in chunk text fields. 5. Production logging: Added input/output line count logging for filter logic, observable in online environment. 6. 100% backward compatible: No changes to chunking hierarchy rules, output format and all existing workflows. ## RAG Business Value - Preserves document format fidelity for structured Markdown/HTML files - Reduces invalid noisy chunks → improves RAG retrieval precision - Cleans plain text data → optimizes vector embedding quality - Improves code maintainability with no breaking changes - Provides observable logging for chunk filtering behavior ## Compatibility - ✅ No API changes - ✅ No chunk logic modifications - ✅ All document parsing/chunking workflows unaffected - ✅ All pre-checks passed, no code conflicts ### Type of change - [x] Refactoring - [x] Performance Improvement	2026-05-18 10:00:18 +08:00
Magicbook1108	bb3b99f0a5	Feat: add button for remove header & footer in pipeline (#14486 ) ### What problem does this PR solve? Feat: add button for remove header & footer in pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 12:30:41 +08:00
sapienza yoan	811e9826d0	perf: avoid O(n²) array growth in embedding accumulation (#14369 ) ### What problem does this PR solve? Both tokenizer (`rag/flow/tokenizer/tokenizer.py`) and `BuiltinEmbed.encode` (`rag/llm/embedding_model.py`) currently accumulate embedding batches via `np.concatenate` inside the per-batch loop. `np.concatenate` allocates a new array and copies all existing data on every call, so accumulating N batches is O(N²) in both time and peak memory. Replacing the incremental concatenate with a list-of-batches + a single `np.vstack` at the end gives O(N) total work. For tokenizer the title-vector broadcast `np.concatenate([vts[0]] * N)` is also replaced by `np.tile`, which does the same job with a single contiguous allocation instead of building a Python list of references. This is purely a CPU/memory optimisation — output shape and dtype are unchanged. Measured impact grows with document size: - 1k chunks (batch 512, 2 iters): ~negligible - 10k chunks (20 iters): ~10× speedup on this stage - 100k chunks (195 iters): ~100× speedup, and peak RAM drops from O(N) extra to near-zero ### Type of change - [x] Performance Improvement Co-authored-by: yoan sapienza <Yoan Sapienza yoan.sapienza@orange.fr Yoan Sapienza zappy@macbookpro.home>	2026-04-30 11:00:10 +08:00
wdeveloper16	78188ce9e9	Feat: add OpenDataLoader PDF parser backend (#14058 ) (#14097 ) ### What problem does this PR solve? Closes #14058. RAGFlow supports multiple PDF parsing backends (DeepDOC, MinerU, Docling, TCADP, PaddleOCR). This PR adds OpenDataLoader ([opendataloader-project/opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)) as a new optional backend, giving users a deterministic, local-first alternative with competitive table extraction accuracy. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update --- ### Changes #### Backend - `deepdoc/parser/opendataloader_parser.py` — new `OpenDataLoaderParser` class inheriting `RAGFlowPdfParser`. Implements `check_installation()` (guards Python package + Java 11+ runtime), `parse_pdf()` with JSON-first extraction (heading/paragraph/table/list/image/formula) and Markdown fallback, position-tag generation compatible with the shared `@@page\tx0\tx1\ty0\ty1##` format, and temp-dir lifecycle with cleanup. - `rag/app/naive.py` — new `by_opendataloader()` wrapper, registered in `PARSERS` dict, added to `chunk_token_num=0` override list. - `rag/flow/parser/parser.py` — `"opendataloader"` branch in the pipeline PDF handler + check validation list. #### Infrastructure - `docker/entrypoint.sh` — `ensure_opendataloader()` function: opt-in via `USE_OPENDATALOADER=true`, skips gracefully if Java is not on PATH. #### Frontend - `web/src/components/layout-recognize-form-field.tsx` — `OpenDataLoader` added to `ParseDocumentType` enum and parser dropdown. Cascades automatically to the pipeline editor's Parser component. #### Docs - `docs/guides/dataset/select_pdf_parser.md` — added OpenDataLoader entry and full env-var reference. 
---

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `USE_OPENDATALOADER` | `false` | Set `true` to install `opendataloader-pdf` on container startup |
| `OPENDATALOADER_VERSION` | latest | Pin the PyPI release (e.g. `==2.2.1`) |
| `OPENDATALOADER_HYBRID` | _(unset)_ | Enable hybrid AI mode (e.g. `docling-fast`) |
| `OPENDATALOADER_IMAGE_OUTPUT` | _(unset)_ | `off` / `embedded` / `external` |
| `OPENDATALOADER_OUTPUT_DIR` | _(tmp)_ | Persistent output dir; temp dir used + cleaned if unset |
| `OPENDATALOADER_DELETE_OUTPUT` | `1` | `0` to retain intermediate files for debugging |
| `OPENDATALOADER_SANITIZE` | _(unset)_ | `1` to filter prompt-injection patterns from output |

---

### Dependencies

- Runtime: `opendataloader-pdf` (PyPI, Apache 2.0); opt-in, not added to `pyproject.toml` core deps. Installed by `ensure_opendataloader()` at container startup when `USE_OPENDATALOADER=true`.
- System: Java 11+ on PATH (JVM is the underlying engine). The installer skips with a warning if `java` is not found.

---

### How to test

Standalone parser:

```bash
source .venv/bin/activate
uv pip install opendataloader-pdf
python3 -c "
import sys; sys.path.insert(0, '.')
from deepdoc.parser.opendataloader_parser import OpenDataLoaderParser
p = OpenDataLoaderParser()
print('available:', p.check_installation())
s, t = p.parse_pdf('path/to/test.pdf', parse_method='pipeline')
print(f'sections={len(s)} tables={len(t)}')
"
```

### Benchmark vs Docling

```
file             parser          secs   sections  tables
----------------------------------------------------------------------
text-heavy.pdf   docling         45.29  148       10
text-heavy.pdf   opendataloader  3.14   559       0
table-heavy.pdf  docling         7.05   76        3
table-heavy.pdf  opendataloader  3.71   90        0
complex.pdf      docling         42.67  114       8
complex.pdf      opendataloader  3.51   180       0
```

2026-04-25 00:33:02 +08:00
Magicbook1108	25089600d0	Feat: introduce minimum type check for pipeline (#14354 ) ### What problem does this PR solve? Feat: introduce minimum type check for pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-24 21:12:50 +08:00
Magicbook1108	75a5548b85	Feat: optimize title chunk (#14325 ) ### What problem does this PR solve? Feat: optimize title chunk 1. Add a new button to enable "Use root chunk as H0 heading", so that the first chunk is carried on to all remaining chunks. 2. Update resume agent template ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-23 18:55:55 +08:00
Magicbook1108	b3891ba6a4	Fix audio/video in pipeline (#14241 ) ### What problem does this PR solve? Fix audio/video in pipeline ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-21 12:17:57 +08:00
Magicbook1108	19eedeec61	Fix: accept empty value as 0 chunk (#14220 ) ### What problem does this PR solve? Fix: accept empty value as 0 chunk ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-20 12:53:47 +08:00
Magicbook1108	944a90d645	Feat: add button to turn off vlm parsing (#14125 ) ### What problem does this PR solve? Feat: add button to turn off vlm parsing ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: chanx <1243304602@qq.com>	2026-04-15 19:06:00 +08:00
Magicbook1108	d51789e2be	Feat: update templates && add resume template (#14124 ) ### What problem does this PR solve? Feat: update templates && add resume template ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-15 18:42:29 +08:00
Magicbook1108	18cafff790	Fix: markdown parser in pipeline (#14032 ) ### What problem does this PR solve? Fix: markdown parser in pipeline ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-10 14:11:14 +08:00
Magicbook1108	87a87a7122	Feat: pipeline support ONE chunking method (#14024 ) ### What problem does this PR solve? Feat: pipeline support ONE chunking method ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-04-10 13:11:22 +08:00
Magicbook1108	27329b40ed	Refact: refact on parser structure (#14012 ) ### What problem does this PR solve? Refact: refact on parser structure ### Type of change - [x] Refactoring	2026-04-10 10:03:44 +08:00
Magicbook1108	52f5880d21	Fix: support vlm fall back in pipeline (#14007 ) ### What problem does this PR solve? Fix: support vlm fall back in pipeline for img/table parsing ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-09 20:20:11 +08:00
Zhichang Yu	b7744e053e	fix: support dense_vector from ES fields response (ES 9.x compatibility) (#13972 ) fix: support dense_vector from ES fields response (ES 9.x compatibility) - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Configuration Chore (non-breaking change which updates configuration) ## Summary by CodeRabbit * Bug Fixes * More accurate handling and unwrapping of dense-vector fields so returned values have correct shapes. * Field selection reliably limits returned data and falls back to alternate result locations when needed. * Use of consistent result IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased build memory and adjusted build-time flags for the frontend build. * Simplified runtime model/GPU checks and removed an automated runtime GPU-install attempt. * Build Fixes * `web/vite.config.ts`: make `build.minify` and `build.sourcemap` respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from Dockerfile instead of hardcoding `terser` and `true`. * Environment * Allow stack version override and default the runtime image tag to "latest". ## Summary by CodeRabbit * Bug Fixes * Correct unwrapping of dense-vector fields and reliable field selection with fallback locations. * Consistent use of hit-level IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased frontend build memory and added build-time minify/sourcemap flags; build minification and sourcemap now configurable. * Removed runtime GPU detection for model initialization; force CPU initialization. * Environment * Allow stack version override and default runtime image tag to "latest". --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 17:44:13 +08:00
Magicbook1108	107fe6cf90	Feat: support doc for pipeline parser in word (#14005 ) ### What problem does this PR solve? Feat: support doc for pipeline parser in word ### Type of change - [x] New Feature (non-breaking change which adds functionality) ## Summary by CodeRabbit * New Features * Added support for processing legacy Word `.doc` file formats, extending document compatibility. * Bug Fixes * Enhanced error handling during document parsing to improve reliability and prevent processing failures.	2026-04-09 16:40:42 +08:00
Magicbook1108	69264b3a70	Feat: Refact pipeline (#13826 ) ### What problem does this PR solve? ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-03 19:26:45 +08:00
Daniil Sivak	60ad32a0c2	Feat: support epub parsing (#13650 ) Closes #1398 ### What problem does this PR solve? Adds native support for EPUB files. EPUB content is extracted in spine (reading) order and parsed using the existing HTML parser. No new dependencies required. ### Type of change - [x] New Feature (non-breaking change which adds functionality) To check this parser manually:

```bash
uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser

with open('$HOME/some_epub_book.epub', 'rb') as f:
    data = f.read()

sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
    print(f'\n--- Section {i} ---')
    print(s[:200])
"
```

2026-03-17 20:14:06 +08:00
Magicbook1108	eda7835d47	Fix: image pdf in ingestion pipeline (#13563 ) ### What problem does this PR solve? Fix: image pdf in ingestion pipeline #13550 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-12 17:49:02 +08:00
NeedmeFordev	387b0b27c4	feat(parser): support external Docling server via DOCLING_SERVER_URL (#13527 ) ### What problem does this PR solve? This PR adds support for parsing PDFs through an external Docling server, so RAGFlow can connect to remote `docling serve` deployments instead of relying only on local in-process Docling. It addresses the feature request in [#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns with the external-server usage pattern already used by MinerU. ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [x] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### What is changed? - Add external Docling server support in `DoclingParser`: - Use `DOCLING_SERVER_URL` to enable remote parsing mode. - Try `POST /v1/convert/source` first, and fallback to `/v1alpha/convert/source`. - Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not set. - Wire Docling env settings into parser invocation paths: - `rag/app/naive.py` - `rag/flow/parser/parser.py` - Add Docling env hints in constants and update docs: - `docs/guides/dataset/select_pdf_parser.md` - `docs/guides/agent/agent_component_reference/parser.md` - `docs/faq.mdx` ### Why this approach? This keeps the change focused on one issue and one capability (external Docling connectivity), without introducing unrelated provider-model plumbing. 
### Validation - Static checks: - `python -m py_compile` on changed Python files - `python -m ruff check` on changed Python files - Functional checks: - Remote v1 endpoint path works - v1alpha fallback works - Local Docling path remains available when server URL is unset ### Related links - Feature request: [Support external Docling server (issue #13426)](https://github.com/infiniflow/ragflow/issues/13426) - Compare view for this branch: [main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1) ##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)	2026-03-12 17:09:03 +08:00
Lynn	62cb292635	Feat/tenant model (#13072 ) ### What problem does this PR solve? Add id for table tenant_llm and apply in LLMBundle. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:27:17 +08:00
Magicbook1108	f0dd12289c	Feat: add preprocess parameters for ingestion pipeline (#13300 ) ### What problem does this PR solve? Feat: add preprocess parameters for ingestion pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-03-02 13:18:57 +08:00
Magicbook1108	158503a1aa	Feat: optimize ingestion pipeline with preprocess (#13211 ) ### What problem does this PR solve? Feat: optimize ingestion pipeline with preprocess ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-02-26 10:24:13 +08:00
Magicbook1108	109441628b	Fix: upload image files (#13071 ) ### What problem does this PR solve? Fix: upload image files ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-11 09:47:33 +08:00
Magicbook1108	75b2d482e2	Fix: ingestion pipeline (#13012 ) ### What problem does this PR solve? Fix ingestion pipeline Only 1 file is acceptable for ingestion pipeline. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-05 15:55:41 +08:00
Magicbook1108	f11ca54e0e	Fix: docx parser output consistent (#12965 ) ### What problem does this PR solve? Fix: docx parser output consistent > File "/home/bxy/ragflow/rag/flow/parser/parser.py", line 506, in _word > sections, tbls = docx_parser(name, binary=blob) > ^^^^^^^^^^^^^^ > ValueError: too many values to unpack (expected 2) > ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-03 15:36:58 +08:00
Yongteng Lei	f096917eeb	Fix: overlap cannot be properly applied (#12828 ) ### What problem does this PR solve? Overlap cannot be properly applied. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-27 12:43:01 +08:00
Kevin Hu	927db0b373	Refa: asyncio.to_thread to ThreadPoolExecutor to break thread limitat… (#12716 ) ### Type of change - [x] Refactoring	2026-01-20 13:29:37 +08:00
Kevin Hu	cec06bfb5d	Fix: empty chunk issue. (#12638 ) #12570 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-15 17:46:21 +08:00
lys1313013	f72a35188d	refactor: remove debug print statements (#12598 ) ### What problem does this PR solve? This PR eliminates unnecessary debug print statements that were left in hot paths of the codebase. ### Type of change - [x] Refactoring	2026-01-14 10:05:34 +08:00
Yongteng Lei	68e5c86e9c	Fix: image not displaying thumbnails when using pipeline (#12574 ) ### What problem does this PR solve? Fix image not displaying thumbnails when using pipeline. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-13 12:54:13 +08:00
Lin Manhui	4fe3c24198	feat: PaddleOCR PDF parser supports thumbnails and positions (#12565 ) ### What problem does this PR solve? 1. PaddleOCR PDF parser supports thumbnails and positions. 2. Add FAQ documentation for PaddleOCR PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-13 09:51:08 +08:00
Lin Manhui	2e09db02f3	feat: add paddleocr parser (#12513 ) ### What problem does this PR solve? Add PaddleOCR as a new PDF parser. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-01-09 17:48:45 +08:00
Kevin Hu	bc9e1e3b9a	Fix: parent-children pipeline bad case. (#12246 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-26 18:57:16 +08:00
Magicbook1108	e23c8a5dcd	Fix: type check for chunks (#12164 ) ### What problem does this PR solve? Fix: type check for chunks ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-25 12:37:14 +08:00
Yongteng Lei	df0c092b22	Feat: add image table context to pipeline splitter (#12167 ) ### What problem does this PR solve? Add image table context to pipeline splitter. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 12:12:23 +08:00
Kevin Hu	bd76b8ff1a	Fix: Tika server upgrades. (#12073 ) ### What problem does this PR solve? #12037 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-23 09:35:52 +08:00
Kevin Hu	8e4d011b15	Fix: parent-children chunking method. (#11997 ) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2025-12-17 16:50:36 +08:00
Yongteng Lei	03f9be7cbb	Refa: only support MinerU-API now (#11977 ) ### What problem does this PR solve? Only support MinerU-API now, still need to complete frontend for pipeline to allow the configuration of MinerU options. ### Type of change - [x] Refactoring	2025-12-17 12:58:48 +08:00
Kevin Hu	ea4a5cd665	Fix: tokenizer issue. (#11902 ) #11786 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-12-11 17:38:17 +08:00
Yongteng Lei	e9710b7aa9	Refa: treat MinerU as an OCR model 2 (#11905 ) ### What problem does this PR solve? Treat MinerU as an OCR model 2. #11903 ### Type of change - [x] Refactoring	2025-12-11 17:33:12 +08:00
buua436	65a5a56d95	Refa:replace trio with asyncio (#11831 ) ### What problem does this PR solve? change: replace trio with asyncio ### Type of change - [x] Refactoring	2025-12-09 19:23:14 +08:00
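
Note on 811e9826d0 (#14369): that commit message describes replacing a per-batch `np.concatenate` with a list of batches plus a single `np.vstack`, and replacing `np.concatenate([vts[0]] * N)` with `np.tile`. The following is a minimal sketch of that idea only, not the RAGFlow code: the function names, batch sizes, and the assumption that the title vector has shape `(1, dim)` are all hypothetical.

```python
import numpy as np


def accumulate_quadratic(batches):
    """Grow the result on every batch; each call copies all rows seen so far."""
    out = np.empty((0, batches[0].shape[1]), dtype=batches[0].dtype)
    for batch in batches:
        out = np.concatenate((out, batch), axis=0)
    return out


def accumulate_linear(batches):
    """Collect batches in a list and stack once at the end."""
    parts = []
    for batch in batches:
        parts.append(batch)
    return np.vstack(parts)


# Hypothetical data: 5 batches of 4 embeddings with 3 dimensions each.
rng = np.random.default_rng(0)
batches = [rng.random((4, 3), dtype=np.float32) for _ in range(5)]

slow = accumulate_quadratic(batches)
fast = accumulate_linear(batches)

# Broadcasting one title vector to every chunk (assumed shape: (1, dim)).
title_vec = rng.random((1, 3), dtype=np.float32)
repeated = np.concatenate([title_vec] * len(fast), axis=0)  # builds a Python list first
tiled = np.tile(title_vec, (len(fast), 1))                  # single allocation
```

Both accumulation functions return the same array; the difference is only in how many intermediate copies are made along the way.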


94 Commits
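
Note on 1e80419c21 (#15396): the second regression described in that commit message is an assignment made inside an `if` branch being overwritten unconditionally because an `else` was missing. The sketch below illustrates that class of bug in isolation. It is not the actual `build_chunks_from_record_groups` code; the function names, parameters, and the `TEXT_FORMATS` constant are hypothetical.

```python
# Hypothetical format family, taken from the formats named in the commit message.
TEXT_FORMATS = ("markdown", "text", "html")


def build_chunks_buggy(fmt, text_chunks, record_chunks):
    """Missing `else`: the second assignment always runs."""
    if fmt in TEXT_FORMATS:
        chunks = text_chunks
    chunks = record_chunks  # overwrites the branch above for every format
    return chunks


def build_chunks_fixed(fmt, text_chunks, record_chunks):
    """With `else`, each format family is handled independently."""
    if fmt in TEXT_FORMATS:
        chunks = text_chunks
    else:
        chunks = record_chunks
    return chunks


# Markdown input with no structured records: the buggy version loses the chunks.
lost = build_chunks_buggy("markdown", ["# Title"], [])
kept = build_chunks_fixed("markdown", ["# Title"], [])
structured = build_chunks_fixed("json", [], [{"text": "row"}])
```

A linter will not always flag this pattern, since the variable is legitimately assigned on both paths; a small test per format family is a more reliable guard.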