### What problem does this PR solve?
TOC chunks now include a toc field so the agent pipeline logs expose the
data the frontend expects.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Set up OpenDataLoader and call it in the parser and naive paths.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Force image parser runtime output format to JSON so downstream chunking
reads OCR results from the JSON output and image parser chunks can be
displayed.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: Wang Qi <wangq8@outlook.com>
fix: restore TitleChunker output for json/chunks upstream formats
## Summary
The refactor commit e194027b (#14247) introduced two regressions that
caused `TitleChunker` to produce zero chunks when the upstream Parser
node outputs `json` or `chunks` format (e.g. PDF parsing).
## Root Cause
### 1. Dead code in `extract_line_records` (critical)
After the refactor, when `payload` is `None` (which is the case for `json`
and `chunks` output formats), the method returns an empty list
immediately via `return []`, so no records are ever extracted from
structured upstream output. The original `json`/`chunks` handling code
became unreachable dead code.
### 2. Unconditional overwrite in `build_chunks_from_record_groups`
The `chunks` variable assigned in the `if` branch for markdown/text/html
formats was unconditionally overwritten by the statement below it, due
to a missing `else` keyword.
## Fix
- Remove the premature `return []` so the `json`/`chunks` branch is
reachable again.
- Add an `else` branch in `build_chunks_from_record_groups` so the two
format families are handled independently.
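The two regressions can be illustrated with a simplified sketch. Only the names `extract_line_records` and `build_chunks_from_record_groups` come from the description above; the signatures, the record and chunk shapes, and `TEXT_FORMATS` are hypothetical stand-ins, not the actual `TitleChunker` code.

```python
from typing import Any, Optional

# Hypothetical grouping of the text-like upstream formats.
TEXT_FORMATS = {"markdown", "text", "html"}


def extract_line_records(
    payload: Optional[str],
    structured: Optional[list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Return line records from a text payload or from structured output."""
    if payload is not None:
        # markdown/text/html: one record per non-blank line
        return [{"text": line} for line in payload.splitlines() if line.strip()]
    # Regression 1: a premature `return []` sat here, so this branch for
    # json/chunks upstream output was never reached.
    return [{"text": item.get("text", "")} for item in (structured or [])]


def build_chunks_from_record_groups(groups, output_format):
    """Build one chunk per record group, per format family."""
    if output_format in TEXT_FORMATS:
        chunks = ["\n".join(r["text"] for r in g) for g in groups]
    else:
        # Regression 2: without this `else`, the assignment below always
        # overwrote the value computed in the `if` branch.
        chunks = [{"text": "\n".join(r["text"] for r in g)} for g in groups]
    return chunks
```

With both fixes in place, structured upstream output yields records again and each format family keeps its own chunk shape.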
## Test Plan
- [x] Verified no lint errors on the changed file
- [ ] Tested with a PDF document parsed via DeepDOC → TitleChunker
pipeline
- [ ] Tested with markdown input through TitleChunker
- [ ] Tested hierarchy and group chunking modes
## Impact
- Fixes the regression where documents parsed with `json`/`chunks`
output format produced no chunks from `TitleChunker`.
- No API or configuration changes. Fully backward compatible.
Signed-off-by: noob <yixiao121314@outlook.com>
### What problem does this PR solve?
Python implementation of the Go-based model_provider API suite.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: bill <yibie_jingnian@163.com>
### What problem does this PR solve?
1. Break the huge function into smaller pieces.
2. Add unit tests for the smaller functions.
3. Layered design:
   a. infra layer - `task_context.py`, `recording_context.py`,
      `write_operation_interceptor.py`, ...
   b. service layer - `*_service.py`
   c. business layer - `task_handler.py`
4. Default behavior: use the refactored version; switch back to the
   original version by changing an environment variable.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Performance Improvement
---------
Co-authored-by: Liu An <asiro@qq.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
## RAG Optimization Description
Optimize the core `BaseTitleChunker` in
`rag/flow/chunker/title_chunker/common.py` to improve RAG document
chunking quality and retrieval accuracy.
## Key Changes
1. **Format-branched text processing**: Preserve original whitespace &
indentation for Markdown/HTML payloads to maintain document semantics
and chunk fidelity; only perform full whitespace cleaning on plain text
content.
2. **Empty chunk filtering**: Thoroughly filter invalid pure-blank lines
to reduce noisy data in vector database.
3. **Code deduplication**: Unified markdown/text/html payload extraction
logic, removed redundant repeated code blocks.
4. **None serialization fix**: Avoid converting a `None` value into the
literal string `"None"` in chunk text fields.
5. **Production logging**: Added input/output line count logging for the
filter logic, observable in production.
6. **100% backward compatible**: No changes to chunking hierarchy rules,
output format and all existing workflows.
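Changes 1, 2, 4 and 5 above can be sketched roughly as follows. This is a minimal illustration, not the code in `common.py`: `clean_lines`, `to_text`, `STRUCTURED_FORMATS` and the log message are all hypothetical names.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

# Hypothetical: formats whose whitespace and indentation carry meaning.
STRUCTURED_FORMATS = {"markdown", "html"}


def to_text(value: Optional[object]) -> str:
    """Map None to an empty string instead of the literal 'None'."""
    return "" if value is None else str(value)


def clean_lines(lines: list, payload_format: str) -> list[str]:
    """Filter blank lines; collapse whitespace only for plain text."""
    kept = []
    for raw in lines:
        text = to_text(raw)
        if not text.strip():
            continue  # drop blank or None lines for every format
        if payload_format not in STRUCTURED_FORMATS:
            # plain text: full whitespace cleaning
            text = re.sub(r"\s+", " ", text).strip()
        kept.append(text)
    logger.info("filter lines: input=%d output=%d", len(lines), len(kept))
    return kept
```

For a Markdown payload, an indented list item such as `"  - item"` passes through unchanged, while the same line in a plain-text payload is normalised to `"- item"`.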
## RAG Business Value
- Preserves document format fidelity for structured Markdown/HTML files
- Reduces invalid noisy chunks → improves RAG retrieval precision
- Cleans plain text data → optimizes vector embedding quality
- Improves code maintainability with no breaking changes
- Provides observable logging for chunk filtering behavior
## Compatibility
- ✅ No API changes
- ✅ No chunk logic modifications
- ✅ All document parsing/chunking workflows unaffected
- ✅ All pre-checks passed, no code conflicts
### Type of change
- [x] Refactoring
- [x] Performance Improvement
### What problem does this PR solve?
Feat: add a button to remove headers and footers in the pipeline
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Both the tokenizer (`rag/flow/tokenizer/tokenizer.py`) and
`BuiltinEmbed.encode` (`rag/llm/embedding_model.py`) currently accumulate
embedding batches via `np.concatenate` inside the per-batch loop.
`np.concatenate` allocates a new array and copies all existing data on
every call, so accumulating N batches is O(N²) in time and in total bytes
copied.
Replacing the incremental concatenate with a list of batches plus a single
`np.vstack` at the end gives O(N) total work.
For the tokenizer, the title-vector broadcast
`np.concatenate([vts[0]] * N)` is also replaced by `np.tile`, which does
the same job with a single contiguous allocation instead of building a
Python list of references.
This is purely a CPU/memory optimisation: output shape and dtype are
unchanged. Measured impact grows with document size:
- 1k chunks (batch 512, 2 iters): negligible
- 10k chunks (20 iters): ~10× speedup on this stage
- 100k chunks (195 iters): ~100× speedup, and peak RAM drops from O(N)
  extra to near-zero
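The before/after accumulation patterns look roughly like this. It is a standalone sketch with made-up shapes and helper names (`embed_incremental`, `embed_collected`), not the code in `tokenizer.py` or `embedding_model.py`; it assumes each batch is a 2-D array and the title vector has shape `(1, d)`.

```python
import numpy as np


def embed_incremental(batches):
    """Old pattern: re-copies everything accumulated so far on each batch."""
    out = np.empty((0, batches[0].shape[1]), dtype=batches[0].dtype)
    for b in batches:
        out = np.concatenate((out, b), axis=0)
    return out


def embed_collected(batches):
    """New pattern: keep references to the batches, copy once at the end."""
    parts = []
    for b in batches:
        parts.append(b)
    return np.vstack(parts)


rng = np.random.default_rng(0)
batches = [rng.random((4, 8), dtype=np.float32) for _ in range(5)]
old, new = embed_incremental(batches), embed_collected(batches)

# Title-vector broadcast: repeat one (1, d) row n times.
title = rng.random((1, 8), dtype=np.float32)
n = 20
tiled_old = np.concatenate([title] * n, axis=0)
tiled_new = np.tile(title, (n, 1))
```

Both pairs of results match in shape, dtype and values; only the number of intermediate copies differs.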
### Type of change
- [x] Performance Improvement
Co-authored-by: Yoan Sapienza <yoan.sapienza@orange.fr>
### What problem does this PR solve?
Feat: introduce minimum type check for pipeline
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: add button to turn off vlm parsing
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: chanx <1243304602@qq.com>
### What problem does this PR solve?
Feat: update templates && add resume template
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: pipeline supports the ONE chunking method
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
Fix: support VLM fallback in the pipeline for image/table parsing
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
fix: support dense_vector from ES fields response (ES 9.x compatibility)
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Configuration Chore (non-breaking change which updates
configuration)
## Summary by CodeRabbit
* **Bug Fixes**
  * More accurate handling and unwrapping of dense-vector fields so
    returned values have correct shapes.
  * Field selection reliably limits returned data and falls back to
    alternate result locations when needed.
  * Use of consistent result IDs and tolerant handling when score values
    are missing.
* **Chores / Configuration**
  * Increased build memory and adjusted build-time flags for the frontend
    build.
  * Simplified runtime model/GPU checks and removed an automated runtime
    GPU-install attempt.
* **Build Fixes**
  * `web/vite.config.ts`: make `build.minify` and `build.sourcemap`
    respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from
    Dockerfile instead of hardcoding `terser` and `true`.
* **Environment**
  * Allow stack version override and default the runtime image tag to
    "latest".
## Summary by CodeRabbit
* **Bug Fixes**
  * Correct unwrapping of dense-vector fields and reliable field selection
    with fallback locations.
  * Consistent use of hit-level IDs and tolerant handling when score
    values are missing.
* **Chores / Configuration**
  * Increased frontend build memory and added build-time minify/sourcemap
    flags; build minification and sourcemap now configurable.
  * Removed runtime GPU detection for model initialization; force CPU
    initialization.
* **Environment**
  * Allow stack version override and default runtime image tag to
    "latest".
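The dense-vector handling described in the summaries might look something like the sketch below. It relies on one documented behaviour, that Elasticsearch returns every value under a hit's `fields` key as a list, and on several assumptions: that a dense vector may arrive wrapped as `[[...]]`, that `_source` is the fallback location, that a missing score defaults to `0.0`, and that single-element scalar lists are unwrapped. `read_hit`, `unwrap_vector` and the field name `q_4_vec` are hypothetical.

```python
from typing import Any


def unwrap_vector(value: Any) -> Any:
    """Unwrap [[...floats...]] to [...floats...]; leave flat lists alone."""
    if isinstance(value, list) and len(value) == 1 and isinstance(value[0], list):
        return value[0]
    return value


def read_hit(hit: dict[str, Any], wanted: list[str], vector_fields: set[str]) -> dict[str, Any]:
    """Collect only the wanted fields, preferring `fields` over `_source`."""
    fields = hit.get("fields") or {}
    source = hit.get("_source") or {}
    # Use the hit-level id; tolerate a missing or null score (assumed default).
    row: dict[str, Any] = {"id": hit.get("_id"), "score": hit.get("_score") or 0.0}
    for name in wanted:
        if name in fields:
            value = fields[name]
            if name in vector_fields:
                value = unwrap_vector(value)
            elif isinstance(value, list) and len(value) == 1:
                value = value[0]  # `fields` wraps scalar values in a list
        elif name in source:
            value = source[name]  # fallback location
        else:
            continue
        row[name] = value
    return row
```

Fields that were not requested (for example other keys in `_source`) are left out of the returned row.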
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What problem does this PR solve?
Feat: support `.doc` files in the pipeline's Word parser
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
## Summary by CodeRabbit
* **New Features**
  * Added support for processing legacy Word `.doc` file formats,
    extending document compatibility.
* **Bug Fixes**
  * Enhanced error handling during document parsing to improve reliability
    and prevent processing failures.
### What problem does this PR solve?
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
---------
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Closes #1398
### What problem does this PR solve?
Adds native support for EPUB files. EPUB content is extracted in spine
(reading) order and parsed using the existing HTML parser. No new
dependencies required.
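For readers unfamiliar with the format, spine-order extraction can be sketched with the standard library alone. This is an illustration of the general EPUB layout (`META-INF/container.xml` points to the package document, whose manifest maps ids to files and whose spine lists ids in reading order); it is not the `EpubParser` implementation, and `spine_documents` is a hypothetical helper. It does not handle percent-encoded hrefs.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub_bytes: bytes) -> list[tuple[str, bytes]]:
    """Return (path, raw XHTML bytes) for each spine item, in reading order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # container.xml names the package (OPF) document
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]
        opf = ET.fromstring(zf.read(opf_path))
        base = posixpath.dirname(opf_path)
        # manifest: item id -> href, relative to the OPF file
        hrefs = {
            item.attrib["id"]: item.attrib["href"]
            for item in opf.findall("opf:manifest/opf:item", NS)
        }
        docs = []
        # spine: item ids in reading order
        for ref in opf.findall("opf:spine/opf:itemref", NS):
            href = hrefs.get(ref.attrib["idref"])
            if href is None:
                continue  # spine entry without a manifest item
            path = posixpath.normpath(posixpath.join(base, href))
            docs.append((path, zf.read(path)))
        return docs
```

Each returned document is XHTML, so it can then be handed to an HTML parser one section at a time.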
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
To check this parser manually:
```bash
# Run from the repository root; replace the path with a real EPUB file.
uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser

with open('$HOME/some_epub_book.epub', 'rb') as f:
    data = f.read()

sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
    print(f'\n--- Section {i} ---')
    print(s[:200])
"
```
### What problem does this PR solve?
Fix: image PDF in ingestion pipeline (#13550)
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR adds support for parsing PDFs through an external Docling
server, so RAGFlow can connect to remote `docling serve` deployments
instead of relying only on local in-process Docling.
It addresses the feature request in
[#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns
with the external-server usage pattern already used by MinerU.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What is changed?
- Add external Docling server support in `DoclingParser`:
  - Use `DOCLING_SERVER_URL` to enable remote parsing mode.
  - Try `POST /v1/convert/source` first, and fall back to
    `/v1alpha/convert/source`.
  - Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not
    set.
- Wire Docling env settings into parser invocation paths:
  - `rag/app/naive.py`
  - `rag/flow/parser/parser.py`
- Add Docling env hints in constants and update docs:
  - `docs/guides/dataset/select_pdf_parser.md`
  - `docs/guides/agent/agent_component_reference/parser.md`
  - `docs/faq.mdx`
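The remote/local selection and endpoint fallback can be sketched as below. Only `DOCLING_SERVER_URL` and the two endpoint paths come from this description; `convert_remote`, `parse_pdf`, the injected `post` callable and the request payload are hypothetical, and falling back on any non-2xx status is an assumption rather than the confirmed behaviour of `DoclingParser`.

```python
import os
from typing import Any, Callable

ENDPOINTS = ("/v1/convert/source", "/v1alpha/convert/source")

# post(url, payload) -> (http_status, parsed_body). Injected so the sketch
# does not depend on a particular HTTP client.
PostFn = Callable[[str, dict[str, Any]], tuple[int, Any]]


def convert_remote(server_url: str, payload: dict[str, Any], post: PostFn) -> Any:
    """Try the v1 endpoint first, then fall back to v1alpha."""
    last_status = None
    for path in ENDPOINTS:
        status, body = post(server_url.rstrip("/") + path, payload)
        if 200 <= status < 300:
            return body
        last_status = status  # assumed: any non-2xx triggers the fallback
    raise RuntimeError(
        f"Docling server rejected both endpoints (last status {last_status})"
    )


def parse_pdf(
    payload: dict[str, Any],
    post: PostFn,
    parse_locally: Callable[[dict[str, Any]], Any],
) -> Any:
    """Use the remote server when DOCLING_SERVER_URL is set, else parse locally."""
    server_url = os.environ.get("DOCLING_SERVER_URL", "").strip()
    if server_url:
        return convert_remote(server_url, payload, post)
    return parse_locally(payload)
```

Leaving `DOCLING_SERVER_URL` unset keeps the local in-process path, which matches the backward-compatible behaviour described in the list above.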
### Why this approach?
This keeps the change focused on one issue and one capability (external
Docling connectivity), without introducing unrelated provider-model
plumbing.
### Validation
- Static checks:
  - `python -m py_compile` on changed Python files
  - `python -m ruff check` on changed Python files
- Functional checks:
  - Remote v1 endpoint path works
  - v1alpha fallback works
  - Local Docling path remains available when server URL is unset
### Related links
- Feature request: [Support external Docling server (issue
#13426)](https://github.com/infiniflow/ragflow/issues/13426)
- Compare view for this branch:
[main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1)
##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)
### What problem does this PR solve?
Add an `id` column to the `tenant_llm` table and use it in `LLMBundle`.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Co-authored-by: Liu An <asiro@qq.com>
### What problem does this PR solve?
Feat: add preprocess parameters for ingestion pipeline
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Feat: optimize ingestion pipeline with preprocess
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Fix the ingestion pipeline.
The ingestion pipeline accepts only one file.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Fix: make docx parser output consistent
> File "/home/bxy/ragflow/rag/flow/parser/parser.py", line 506, in _word
> sections, tbls = docx_parser(name, binary=blob)
> ^^^^^^^^^^^^^^
> ValueError: too many values to unpack (expected 2)
>
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR eliminates unnecessary debug print statements that were left in
hot paths of the codebase.
### Type of change
- [x] Refactoring
### What problem does this PR solve?
Fix images not displaying thumbnails when using the pipeline.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
1. PaddleOCR PDF parser supports thumbnails and positions.
2. Add FAQ documentation for PaddleOCR PDF parser.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add PaddleOCR as a new PDF parser.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add image table context to pipeline splitter.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Only MinerU-API is supported for now; the pipeline frontend still needs
to be completed to allow configuring MinerU options.
### Type of change
- [x] Refactoring