## Summary
This PR fixes two issues discovered during testing of the PaddleOCR
async API refactoring:
### 1. PP-OCRv6 returns `ocrResults` instead of `layoutParsingResults`
Models like PP-OCRv6 are pure text recognition models that return
results in `ocrResults.prunedResult.rec_texts` format rather than the
`layoutParsingResults.prunedResult.parsing_res_list` format used by
layout-aware models (PaddleOCR-VL series).
**Changes:**
- `deepdoc/parser/paddleocr_parser.py`: Extract `ocrResults` alongside
`layoutParsingResults` in `_send_request()`, add fallback logic in
`_transfer_to_sections()` and `parse_image()`
- `internal/entity/models/paddleocr.go`: Add `ocrResults` struct and
fallback extraction in Go OCR handler
### 2. Image parsing not integrated into picture chunker
The `parse_image()` method existed in PaddleOCRParser but was never
called from `rag/app/picture.py` (the module that handles image file
uploads). Users configuring PaddleOCR as their layout recognizer would
still get local deepdoc OCR for images.
**Changes:**
- `rag/app/picture.py`: When `layout_recognize` is set to PaddleOCR, use
`PaddleOCROcrModel.parse_image()` instead of local OCR. Falls back
gracefully to local OCR on failure.
## Testing
Verified end-to-end in Docker:
- PaddleOCR-VL-1.6 PDF parsing: ✅ (10 text blocks with bbox)
- PaddleOCR-VL-1.6 image parsing: ✅ (219 chars)
- PP-OCRv6 PDF parsing with ocrResults fallback: ✅ (10 text blocks)
- PP-OCRv6 image parsing with ocrResults fallback: ✅ (136 chars)
## Related PRs
- #15967 (merged) - PaddleOCR async Job API refactoring + new models
- #16086 (merged) - PaddleOCR image parsing support
## Summary
Fixes#15487 — lone markdown headers are no longer isolated as empty
chunks when a custom `delimiter` is set.
- Merge consecutive lone headers before attaching to the following prose
body
- Skip code fences, tables, lists, and blockquotes via
`_is_attachable_body()`
- Unit tests include the `# Title / ## Intro / Body` regression from
CodeRabbit review
## Validation
- `pytest test/unit_test/deepdoc/parser/test_markdown_parser.py` (11
passed locally)
Closes#15487
## Summary
Add image parsing capability to PaddleOCR integration, building on top
of #15967 (async Job API migration).
## Changes
### `deepdoc/parser/paddleocr_parser.py`
- Add `parse_image()` method that uses the same async Job API flow as
`parse_pdf()`
- Extracts text from `layoutParsingResults` → `prunedResult` →
`parsing_res_list`
- Returns concatenated block content as a single string
### `rag/llm/ocr_model.py`
- Add `parse_image()` wrapper to `PaddleOCROcrModel` with availability
check and logging
## Relationship to other PRs
- **Depends on**: #15967 (async Job API migration) — this PR is based on
that branch
- **Replaces**: #14826 (original image processing PR based on old sync
API)
## Notes
This PR uses `base_url` and the async Job API (submit → poll → fetch)
consistent with #15967, rather than the old `api_url` + sync POST
pattern from #14826.
## Summary
Migrate PaddleOCR integration from the deprecated synchronous HTTP API
to the new asynchronous Job API (`submit → poll → fetch`), aligning with
PaddleOCR 3.6.0+ architecture.
## Changes
### Python (`deepdoc/parser/paddleocr_parser.py`)
- Replace synchronous `requests.post()` with async Job API flow (submit
→ poll → fetch)
- Authentication: `token {token}` → `Bearer {token}`
- File transfer: base64 JSON body → multipart file upload
- Polling: exponential backoff (initial 3s, ×1.5, max 15s, timeout
controlled by `request_timeout`)
- Result: fetch full JSONL from result URL, preserving `prunedResult`
with bbox info for crop functionality
- Rename `api_url` → `base_url` (backward compatible: `api_url` still
accepted as fallback)
### Python (`rag/llm/ocr_model.py`)
- Prefer `paddleocr_base_url` / `PADDLEOCR_BASE_URL`, fallback to
`paddleocr_api_url` / `PADDLEOCR_API_URL`
### Go (`internal/entity/models/paddleocr.go`)
- Add `Client-Platform: ragflow` header to submit and poll requests
- Change polling from fixed 3s to exponential backoff (initial 3s, ×1.5,
max 15s)
### Python (`common/constants.py`)
- Add `PADDLEOCR_BASE_URL` to env keys and default config
## Backward Compatibility
- Old env var `PADDLEOCR_API_URL` still works (used as fallback)
- Frontend field `paddleocr_api_url` still works (backend reads it as
fallback)
- No user-facing configuration changes required for existing setups
## Why not use the `paddleocr` SDK package directly?
RAGFlow's `_transfer_to_sections()` relies on `prunedResult` (containing
`block_bbox`, `block_label`, `parsing_res_list`) from the raw API
response for PDF crop functionality. The SDK's public `parse_document()`
API only returns `DocParsingResult` with `markdown_text`, discarding the
bbox data. Therefore we implement the async Job API flow directly via
HTTP, following the same logic as the SDK internally.
### What problem does this PR solve?
`RAGFlowExcelParser.html()` iterates `(len(rows) - 1) // chunk_rows + 1`
times. `rows[0]` is the header, so `len(rows) - 1` is the data-row
count. When that count is an exact multiple of `chunk_rows`, the `+ 1`
over-counts by one: the final iteration's data slice is empty, but the
header row is still appended — producing a chunk that contains only the
table header and no data.
This is reachable via `rag/app/naive.py` (`html4excel`, `chunk_rows=12`)
and `rag/app/one.py`. A sheet with 12/24/36… data rows (or 256/512… with
the default `chunk_rows=256`) produces an extra
`<table><caption>…</caption><tr><th>…</th></tr></table>` chunk. It is
non-empty, so it passes the `if _` filter and gets indexed as a real
(empty) chunk.
| data rows (chunk_rows=12) | before | after |
|---|---|---|
| 12 | 2 chunks (1 header-only) | 1 |
| 24 | 3 chunks (1 header-only) | 2 |
| 13 | 2 (unchanged) | 2 |
### Fix
Iterate `ceil(n_data / chunk_rows)` times instead of `n_data //
chunk_rows + 1`. Adds
`test/unit_test/deepdoc/parser/test_excel_parser.py`; the
header-only-chunk cases fail before this change and pass after.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Used the Claude CLI while working on this.
### What problem does this PR solve?
Markdown extraction can split tables row by row when delimiter-based
extraction uses a newline delimiter. That loses table structure during
chunking even though delimiters should still split normally outside
tables.
This PR keeps the follow-up to #15482 intentionally narrow:
- preserve Markdown pipe tables during delimiter-based extraction
- preserve borderless pipe tables during delimiter-based extraction
- preserve multiline HTML tables during delimiter-based extraction
- keep delimiter splitting unchanged outside protected table ranges
Refs #15482
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Testing
- `ruff check deepdoc/parser/markdown_parser.py
test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `python3 run_tests.py -t
test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `git diff --check`
## Summary
- keep the native Docling chunking path when it returns usable chunks
- fall back to the standard Docling response parser when a chunked
request gets HTTP 200 but returns no usable chunks
- add a regression test for older Docling servers that accept the
chunking request but return a standard conversion payload
## Why
Older external Docling servers can accept a request containing
`do_chunking: true` and still return the standard conversion response
shape. The current code treats any HTTP 200 from the chunked request as
a native chunk response, finds no chunk entries, and returns zero
sections without trying the standard response parser.
Fixes#15569.
## Validation
- `python -m pytest
test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py -q`
- `python -m py_compile deepdoc\\parser\\docling_parser.py
test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py`
- `python -m ruff check deepdoc\\parser\\docling_parser.py
test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py`
- `git diff --check`
### What problem does this PR solve?
Markdown extraction currently applies custom delimiters before
respecting fenced code blocks. When a delimiter such as a newline is
configured, fenced code can be split into separate chunks, and longer
outer fences can be closed incorrectly by shorter nested fences.
This PR keeps the fix intentionally narrow for the Markdown chunking
discussion in #15482:
- preserve fenced code blocks when delimiter-based extraction is used
- support both backtick and tilde fences
- respect fence length so longer outer fences can contain shorter inner
fences
- keep delimiter splitting unchanged outside fenced blocks
Refs #15482
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### Testing
- `ruff check deepdoc/parser/markdown_parser.py
test/unit_test/deepdoc/parser/test_markdown_parser.py`
- `python3 run_tests.py -t
test/unit_test/deepdoc/parser/test_markdown_parser.py`
## Summary
- Skip MinerU `header`, `footer`, and `page_number` blocks when
converting `content_list.json` into sections.
- Ignore unsupported block types explicitly so future MinerU output
types cannot re-emit the previous text block.
Fixes duplicate text in General/naive chunks when parsing PDFs via
MinerU (reported with repeated page headers and body text in slices).
Closes#15335
## Test plan
- [x] `pytest test/unit_test/deepdoc/parser/test_mineru_parser.py -v`
(4/4 passed)
## Summary
This change fixes ingestion quality issues where MinerU parser output
may contain HTML fragments (for example, table-related tags like `<tr>`,
`<td>`, `<br>`), which were previously passed directly into
chunking/tokenization and degraded chunk quality.
The fix adds a sanitization step in the MinerU parser path so parsed
sections are normalized to clean text before chunking.
## Change Type (select all)
- [x] Bug fix
- [x] Ingestion pipeline improvement
- [x] Parser/chunking quality fix
## Related Issue
- https://github.com/infiniflow/ragflow/issues/14831
## Summary
Closes#14869.
Adds VLM-based semantic descriptions to **image chunks produced by the
MinerU parser**, closing a long-standing parity gap with the deepdoc
parser's `VisionFigureParser`. A maintainer flagged this in #13342
("We may add the VLM enhancement to MinerU parser as well") and an
earlier proposal exists in #13824; this PR lands the change end-to-end
inside the existing parser plumbing.
## Why
Today the MinerU parser returns image chunks containing only the
native `image_caption` and `image_footnote` strings from MinerU's
JSON. When neither is present (or when both are sparse), the chunk
carries effectively no searchable content for the figure and
retrieval misses it entirely. Users who configured a local VLM
(reporter's case: Gemma-4-31B) had to post-process MinerU's
`tmp/*.json` themselves.
The deepdoc parser already solves this via
[`VisionFigureParser`](deepdoc/parser/figure_parser.py): when the
tenant has an `IMAGE2TEXT` model configured, each figure gets a
semantic description merged into its chunk. This PR brings the same
behavior to MinerU.
## What changed
### `deepdoc/parser/mineru_parser.py`
- **New method `_enhance_images_with_vlm(outputs, vision_model,
callback=None)`** —
collects every `IMAGE` block with a readable `img_path`, runs
`rag.app.picture.vision_llm_chunk` in a 10-worker
`ThreadPoolExecutor` using the existing
`vision_llm_figure_describe_prompt`, and writes the result back as
`vlm_description`. Per-image failures are logged and skipped — they
never abort the run.
- **`_transfer_to_sections` (IMAGE branch)** — folds
`vlm_description` into the section text alongside caption +
footnote, so the description becomes part of the chunk and is
searchable / retrievable.
- **`parse_pdf`** — after `_read_output`, calls
`_enhance_images_with_vlm(outputs, vision_model, callback=callback)`
when a `vision_model` kwarg is supplied. Wrapped in `try / except`
so a VLM outage cannot break parsing.
### `rag/app/naive.py` (`by_mineru`)
After successfully resolving the MinerU OCR parser, also resolves the
tenant's default `LLMType.IMAGE2TEXT` model via
`get_tenant_default_model_by_type`, wraps it in an `LLMBundle`, and
injects it as `kwargs["vision_model"]` before delegating to
`parse_pdf`.
## Behavior
| Tenant config | Behavior |
|---|---|
| `IMAGE2TEXT` model configured | MinerU image chunks contain `caption +
footnote + VLM description`. Retrieval against figures now actually
works. |
| No `IMAGE2TEXT` model configured | Exact same output as today (caption
+ footnote only). Lookup fails silently with an info log; no error, no
regression. |
| VLM call fails for a single image | That image silently falls back to
caption + footnote; other images proceed. |
| Caller already passes `vision_model` in kwargs | We don't override it
— `if "vision_model" not in kwargs` guards the lookup. |
## Files
- `deepdoc/parser/mineru_parser.py` (+56)
- `rag/app/naive.py` (+13)
## Problem
When using MinerU with `vlm-http-client` backend, the parser fails to
find the output files because they are located in a `vlm/` subdirectory,
but the `_read_output`
method doesn't check this location.
## Error Message
[ERROR]MinerU not found.
[MinerU] Missing output file, tried: ...
## Root Cause
The MinerU API with `vlm-http-client` backend returns output files in
the following structure:
output_dir/
vlm/
filename_content_list.json
filename.md
images/
However, the `_read_output` method in `mineru_parser.py` only checks:
1. `output_dir/filename_content_list.json`
2. `output_dir/sanitized_filename_content_list.json`
3. `output_dir/sanitized_filename/sanitized_filename_content_list.json`
It doesn't check the `vlm/` subdirectory.
## Solution
Added two additional fallback paths to check the `vlm/` subdirectory:
- `output_dir/vlm/filename_content_list.json`
- `output_dir/vlm/sanitized_filename_content_list.json`
## Testing
Tested with MinerU API using `vlm-http-client` backend. The parser now
successfully finds and processes the output files.
## Related
This issue occurs specifically when using:
- MinerU backend: `vlm-http-client`
- MinerU server URL configured for remote vLLM inference
Closes#14753
## What changed
| File | Change |
|---|---|
| `pyproject.toml` | `requires-python` → `>=3.13,<3.15`; remove
`strenum==0.4.15` |
| `Dockerfile` | `uv python install 3.13`, `uv sync --python 3.13` |
| `.github/workflows/tests.yml` | `uv sync --python 3.13` on both matrix
legs |
| `CLAUDE.md` | dev setup command + requirements note updated |
| `deepdoc/parser/mineru_parser.py` | `from strenum import StrEnum` →
`from enum import StrEnum` |
| `agent/tools/code_exec.py` | same |
`StrEnum` has been in the stdlib since Python 3.11 — the `strenum`
backport package is no longer needed once the floor is 3.13.
## Why uv.lock is not regenerated
`uv lock --python 3.13` fails because:
1. The infiniflow/graspologic fork pins `numpy>=1.26.4,<2.0.0`
2. `tensorflow-cpu>=2.20.0` (the first release with cp313 wheels)
depends on `ml-dtypes>=0.5.1`, which requires `numpy>=2.1.0`
3. These two constraints are irreconcilable on Python 3.13
The lockfile regeneration requires loosening the `numpy` upper bound in
the `infiniflow/graspologic` fork. Once that fork commit is updated and
the SHA in `pyproject.toml:49` is bumped, `uv lock --python 3.13` will
succeed.
## RFC corrections
Two claims in the original RFC (#14753) did not hold up under code
review:
- **"graspologic hard-blocks 3.13"** — the infiniflow fork at the pinned
commit has no `<3.13` Python constraint. The blocker is the transitive
`numpy<2.0.0` conflict with tensorflow-cpu's test dependency, not a
direct Python version cap.
- **"free-threading throughput gains for I/O-bound workload"** — Python
3.13 free-threading requires a special `--disable-gil` build and
provides no benefit for async I/O code (the GIL is already released
during I/O). The real motivation is forward compatibility and improved
error messages.
## Summary
- Rename misspelled attribute `model_speciess` to `model_species` across
4 files
- The extra `s` is a typo — `species` is already plural
## Test plan
- [ ] Verify PDF parsing with laws/manual/paper parser types still works
correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: yuj <yuj@ztjzsoft.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### What problem does this PR solve?
This fixes a MinerU parsing failure where output JSON was not found in
nested v0.24.0 layouts, and also fixes a `content_names` NameError in
`_read_output()`. As a result, successful MinerU API runs no longer end
with false “MinerU not found” parsing failures.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Feat: add button for remove header & footer in pipeline
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
## Summary
- **Lazy img_np loading**: `np.array(img)` is now deferred until the
first OCR text extraction is actually needed, avoiding unnecessary
memory allocation for pages that already have text.
- **Chunked parse_into_bboxes**: Large PDFs (>50 pages, configurable via
`PDF_PARSER_PAGE_BATCH_SIZE`) are processed in batches. Each chunk's
boxes are normalized with `_to_global_boxes` to produce globally
consistent page numbers and position tags.
- **DLA early init**: Move remote-client initialization before model
loading in `LayoutRecognizer.__init__` so `DEEPDOC_URL` (or legacy
`TENSORRT_DLA_SVR`) short-circuits unnecessary model download for parser
containers relying on remote inference.
- **Fix outline regression**: Restore `self.outlines =
extract_pdf_outlines(fnm)` in `parse_into_bboxes`; this was dropped
during refactoring and is required by downstream `remove_toc` and
metadata handling in `rag/flow/parser/parser.py`.
## Test plan
- [ ] Small PDF (<=50 pages): verify parse succeeds and `self.outlines`
is populated
- [ ] Large PDF (>50 pages): verify chunked processing produces globally
consistent page numbers
- [ ] With `DEEPDOC_URL` set: verify remote DLA client is used and local
model is not downloaded
- [ ] With legacy `TENSORRT_DLA_SVR` set: verify backward compatibility
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
### What problem does this PR solve?
Fixes#14196
## Problem
When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:
- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports
Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.
## Root Cause
```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
# Only the first 300 pages were rendered; everything beyond was silently dropped
```
While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.
## Solution
### 1. Define constants in `common/constants.py`
```python
MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer
```
### 2. Replace all hardcoded sentinel values
| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`,
`docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`,
`docx_parser.py` | `299`, `600`, `10**9`, `100000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`,
`manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`,
`email.py`, `table.py` | `100000`, `10000`, `10000000000` |
`MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`,
`document_service.py`, `file_service.py` | `100000000` |
`MAXIMUM_TASK_PAGE_NUMBER` |
### 3. Fix `parse_into_bboxes()` missing parameters
Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
## Files Changed (22)
- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Signed-off-by: noob <yixiao121314@outlook.com>
Resolves#14211
**Background:** Currently, RAGFlow routes all Docling parsing through
the standard `/convert/source` endpoint. For large documents, this
returns massive, unchunked text that exceeds RAGFlow's internal
embedding model context limits, causing pipeline failures.
**Solution:**
This PR updates the `_parse_pdf_remote` ingestion logic in
`docling_parser.py` to prioritize `docling-serve`'s native chunking
endpoints (`/v1/chunk/source` and `/v1alpha/chunk/source`).
- By receiving pre-sliced chunk objects directly from Docling, RAGFlow
natively bypasses token limit overflows.
- Included a graceful fallback mechanism to the standard
`/convert/source` endpoints to maintain backwards compatibility for
users running older versions of the Docling server that return 404s on
the new routes.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
update MinerU parser to most recent minerU v3 logic
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
update MinerU endpoint to /pdf_parse which has been exposed since v3.x.
fixes#14263
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
- Replace `json.load(open(...))` with `with open(...) as f:
json.load(f)` in 2 resume parser files
- Fixes 4 leaked file descriptors in `corporations.py` (3) and
`schools.py` (1)
## Why
In a long-running server process like RAGFlow, leaked file handles can
accumulate and hit the OS file descriptor limit (`OSError: [Errno 24]
Too many open files`). The other instances mentioned in the issue
(`infinity_conn_base.py` and `init_data.py`) have already been fixed.
## Test plan
- [x] Verified affected files use `with` statement after fix
- [x] Grep confirms no remaining `json.load(open(` patterns in codebase
Fixes#13996🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: support dense_vector from ES fields response (ES 9.x compatibility)
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Configuration Chore (non-breaking change which updates
configuration)
## Summary by CodeRabbit
* **Bug Fixes**
* More accurate handling and unwrapping of dense-vector fields so
returned values have correct shapes.
* Field selection reliably limits returned data and falls back to
alternate result locations when needed.
* Use of consistent result IDs and tolerant handling when score values
are missing.
* **Chores / Configuration**
* Increased build memory and adjusted build-time flags for the frontend
build.
* Simplified runtime model/GPU checks and removed an automated runtime
GPU-install attempt.
* **Build Fixes**
* `web/vite.config.ts`: make `build.minify` and `build.sourcemap`
respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from
Dockerfile instead of hardcoding `terser` and `true`.
* **Environment**
* Allow stack version override and default the runtime image tag to
"latest".
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Correct unwrapping of dense-vector fields and reliable field selection
with fallback locations.
* Consistent use of hit-level IDs and tolerant handling when score
values are missing.
* **Chores / Configuration**
* Increased frontend build memory and added build-time minify/sourcemap
flags; build minification and sourcemap now configurable.
* Removed runtime GPU detection for model initialization; force CPU
initialization.
* **Environment**
* Allow stack version override and default runtime image tag to
"latest".
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Fix `a image` → `an image` in README and log message
- Fix `colomn` → `column` in table structure recognizer comment
- Fix `formated` → `formatted` in confluence connector docstring
- Fix `tabel of content` → `table of contents` in TOC prompt
## Test plan
- [ ] Documentation and comment changes, no functional impact
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: yuj <yuj@ztjzsoft.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
---------
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What problem does this PR solve?
Fixes markdown tables being parsed twice (once as markdown and again as
generated HTML), which caused duplicate table chunks in the chunk list
UI.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
## Summary
- Fix incorrect Markdown heading mapping for `h4` in `TITLE_TAGS`
dictionary
- `h4` was mapped to `"#####"` (h5 level) instead of `"####"` (correct
h4 level)
Closes#13819
## Details
In `deepdoc/parser/html_parser.py`, the `TITLE_TAGS` dictionary had a
typo where `h4` was assigned 5 `#` characters instead of 4, causing h4
headings to be converted to h5-level Markdown headings during HTML
parsing.
## Test plan
- [ ] Parse an HTML document containing `<h4>` tags and verify the
output uses `####` (4 hashes)
- [ ] Verify other heading levels remain correct
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
Closes#13803
The `__images__` method in `paddleocr_parser.py` defaulted to
`page_to=100`, only loading the first 100 pages for image cropping.
However, the PaddleOCR API processes **all** pages of the PDF. For PDFs
with more than 100 pages, page indices beyond 99 were rejected as out of
range during crop validation, causing content loss.
## Root Cause
```
__images__(page_to=100) → loads pages 0-99 → page_images has 100 entries
PaddleOCR API → processes all 226 pages → tags reference pages 1-226
extract_positions() → converts tag "101" to index 100
crop() validation → 0 <= 100 < 100 → False → "All page indices [100] out of range"
```
## Fix
Changed `page_to` default from `100` to `10**9`, so all PDF pages are
loaded for cropping. Python's list slicing safely handles oversized
indices.
## Test plan
- [ ] Parse a PDF with >100 pages using PaddleOCR — no more "out of
range" warnings
- [ ] Parse a PDF with <100 pages — behavior unchanged
- [ ] Verify cropped images are generated correctly for all pages
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Asksksn <Asksksn@noreply.gitcode.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What problem does this PR solve?
Minor fix.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
---------
Co-authored-by: Hu Di <812791840@qq.com>
### What problem does this PR solve?
let excel use lazy image loader
### Type of change
- [x] Refactoring
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
RAGFlow had no Turkish language support. This PR adds Turkish (tr)
locale translations to the UI.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Co-authored-by: Mustafa YILDIZ <mustafa.yildiz@cilek.com>
Closes#1398
### What problem does this PR solve?
Adds native support for EPUB files. EPUB content is extracted in spine
(reading) order and parsed using the existing HTML parser. No new
dependencies required.
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
To check this parser manually:
```python
uv run --python 3.12 python -c "
from deepdoc.parser import EpubParser
with open('$HOME/some_epub_book.epub', 'rb') as f:
data = f.read()
sections = EpubParser()(None, binary=data, chunk_token_num=512)
print(f'Got {len(sections)} sections')
for i, s in enumerate(sections[:5]):
print(f'\n--- Section {i} ---')
print(s[:200])
"
```
### What problem does this PR solve?
Fix: paddle ocr coordinate lower > upper #13618
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
### What problem does this PR solve?
Fix: image pdf in ingestion pipeline #13550
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
This PR adds support for parsing PDFs through an external Docling
server, so RAGFlow can connect to remote `docling serve` deployments
instead of relying only on local in-process Docling.
It addresses the feature request in
[#13426](https://github.com/infiniflow/ragflow/issues/13426) and aligns
with the external-server usage pattern already used by MinerU.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What is changed?
- Add external Docling server support in `DoclingParser`:
- Use `DOCLING_SERVER_URL` to enable remote parsing mode.
- Try `POST /v1/convert/source` first, and fallback to
`/v1alpha/convert/source`.
- Keep existing local Docling behavior when `DOCLING_SERVER_URL` is not
set.
- Wire Docling env settings into parser invocation paths:
- `rag/app/naive.py`
- `rag/flow/parser/parser.py`
- Add Docling env hints in constants and update docs:
- `docs/guides/dataset/select_pdf_parser.md`
- `docs/guides/agent/agent_component_reference/parser.md`
- `docs/faq.mdx`
### Why this approach?
This keeps the change focused on one issue and one capability (external
Docling connectivity), without introducing unrelated provider-model
plumbing.
### Validation
- Static checks:
- `python -m py_compile` on changed Python files
- `python -m ruff check` on changed Python files
- Functional checks:
- Remote v1 endpoint path works
- v1alpha fallback works
- Local Docling path remains available when server URL is unset
### Related links
- Feature request: [Support external Docling server (issue
#13426)](https://github.com/infiniflow/ragflow/issues/13426)
- Compare view for this branch:
[main...feat/docling-server](https://github.com/infiniflow/ragflow/compare/main...spider-yamet:ragflow:feat/docling-server?expand=1)
##### Fixes [#13426](https://github.com/infiniflow/ragflow/issues/13426)
## Summary
This PR is the direct successor to the previous `docx` lazy-loading
implementation. It addresses the technical debt intentionally left out
in the last PR by fully migrating the `qa` and `manual` parsing
strategies to the new lazy-loading model.
Additionally, this PR comprehensively refactors the underlying `docx`
parsing pipeline to eliminate significant code redundancy and introduces
robust fallback mechanisms to handle completely corrupted image streams
safely.
## What's Changed
* **Centralized Abstraction (`docx_parser.py`)**: Moved the
`get_picture` extraction logic up to the `RAGFlowDocxParser` base class.
Previously, `naive`, `qa`, and `manual` parsers maintained separate,
redundant copies of this method. All downstream strategies now natively
gather raw blobs and return `LazyDocxImage` objects automatically.
* **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge
cases where `python-docx` encounters critically malformed magic headers.
Implemented an explicit `try-except` structure that safely intercepts
`UnrecognizedImageError` (and similar exceptions) and seamlessly falls
back to retrieving the raw binary via `getattr(related_part, "blob",
None)`, preventing parser crashes on damaged documents.
* **Legacy Code & Redundancy Purge**:
* Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`,
and `manual.py`.
* Removed the standalone, immediate-decoding `concat_img` method in
`manual.py`. It has been completely replaced by the globally unified,
lazy-loading-compatible `rag.nlp.concat_img`.
* Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception
packages) across all updated strategy files.
## Scope
To keep this PR focused, I have restricted these changes strictly to the
unification of `docx` extraction logic and the lazy-load migration of
`qa` and `manual`.
## Validation & Testing
I've tested this to ensure no regressions and validated the fallback
logic:
* **Output Consistency**: Compared identical `.docx` inputs using `qa`
and `manual` strategies before and after this branch: chunk counts,
extracted text, table HTML, and attached images match perfectly.
* **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory
usage when processing image-dense documents through the `qa` and
`manual` pipelines, bringing them up to parity with the `naive`
strategy's performance gains.
## Breaking Changes
* None.
### What problem does this PR solve?
Fixes#6004#7142#11959
Unlike #9207 we actually normalize the coordinates here
### Type of change
- [X] Bug Fix (non-breaking change which fixes an issue)
## Problem
When PDF fonts lack ToUnicode/CMap mappings, pdfplumber (pdfminer)
cannot map CIDs to correct Unicode characters, outputting PUA characters
(U+E000~U+F8FF) or `(cid:xxx)` placeholders. The original code fully
trusted pdfplumber text without any garbled detection, causing garbled
output in the final parsed result.
Relates to #13366
## Solution
### 1. Garbled text detection functions
- `_is_garbled_char(ch)`: Detects PUA characters (BMP/Plane 15/16),
replacement character U+FFFD, control characters, and
unassigned/surrogate codepoints
- `_is_garbled_text(text, threshold)`: Calculates garbled ratio and
detects `(cid:xxx)` patterns
### 2. Box-level fallback (in `__ocr()`)
When a text box has ≥50% garbled characters, discard pdfplumber text and
fallback to OCR recognition.
### 3. Page-level detection (in `__images__()`)
Sample characters from each page; if garbled rate ≥30%, clear all
pdfplumber characters for that page, forcing full OCR.
### 4. Layout recognizer CID filtering
Filter out `(cid:xxx)` patterns in `layout_recognizer.py` text
processing to prevent them from polluting layout analysis.
## Testing
- 29 unit tests covering: normal CJK/English text, PUA characters, CID
patterns, mixed text, boundary thresholds, edge cases
- All 85 existing project unit tests pass without regression
### What problem does this PR solve?
Fix: paddle ocr missing outlines #13422
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Add id for table tenant_llm and apply in LLMBundle.
### Type of change
- [x] Refactoring
---------
Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>
Co-authored-by: Liu An <asiro@qq.com>
### What problem does this PR solve?
When multiple columns are used as content columns in RDBMS connector,
the generated document text gets chunked by TxtParser which strips
newline delimiters during merge. This causes field names and values from
different columns to be concatenated without any separator, making the
content unreadable.
Changes:
- txt_parser.py: restore newline separator when merging adjacent text
segments within a chunk, so that split sections are not directly
concatenated
- rdbms_connector.py: use double newline between fields and place field
value on a new line after the field name bracket, giving TxtParser
clearer boundaries to work with
Closes#13001
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Co-authored-by: tunsuytang <tunsuytang@tencent.com>
### What problem does this PR solve?
Summary:
This PR addresses critical indexing issues in
deepdoc/parser/pdf_parser.py that occur when parsing long PDFs with
chunk-based pagination:
Normalize rotated table page numbering: Rotated-table re-OCR now writes
page_number in chunk-local 1-based form, eliminating double-addition of
page_from offset that caused misalignment between table positions and
document boxes.
Convert absolute positions to chunk-local coordinates: When inserting
tables/figures extracted via _extract_table_figure, positions are now
converted from absolute (0-based) to chunk-local indices before distance
matching and box insertion. This prevents IndexError and out-of-range
accesses during paged parsing of long documents.
Root Cause:
The parser mixed absolute (0-based, document-global) and relative
(1-based, chunk-local) page numbering systems. Table/figure positions
from layout extraction carried absolute page numbers, but insertion
logic expected chunk-local coordinates aligned with self.boxes and
page_cum_height.
Testing(I do):
Manual verification: Parse a 200+ page PDF with from_page > 0 and table
rotation enabled. Confirm that:
Tables and figures appear on correct pages
No IndexError or position mismatches occur
Page numbers in output match expected chunk-local offsets
Automated testing: 我没做
## Separate Discussion: Memory Optimization Strategy(from codex-5.2-max
and claude 4.5 opus and me)
### Context
The current implementation loads entire page ranges into memory
(`__images__`, `page_chars`, intermediates), which can cause RAM
exhaustion on large documents. While the page numbering fix resolves
correctness issues, scalability remains a concern.
### Proposed Architecture
**Pipeline-Driven Chunking with Explicit Resource Management:**
1. **Authoritative chunk planning**: Accept page-range specifications
from upstream pipeline as the single source of truth. The parser should
be a stateless worker that processes assigned chunks without making
independent pagination decisions.
2. **Granular memory lifecycle**:
```python
for chunk_spec in chunk_plan:
# Load only chunk_spec.pages into __images__
page_images = load_page_range(chunk_spec.start, chunk_spec.end)
# Process with offset tracking
results = process_chunk(page_images, offset=chunk_spec.start)
# Explicit cleanup before next iteration
del page_images, page_chars, layout_intermediates
gc.collect() # Force collection of large objects
```
3. **Persistent lightweight state**: Keep model instances (layout
detector, OCR engine), document metadata (outlines, PDF structure), and
configuration across chunks to avoid reinitialization overhead (~2-5s
per chunk for model loading).
4. **Adaptive fallback**: Provide `max_pages_per_chunk` (default: 50)
only when pipeline doesn't supply a plan. Never exceed
pipeline-specified ranges to maintain predictable memory bounds.
5. **Optional: Dynamic budgeting**: Expose a memory budget parameter
that adjusts chunk size based on observed image dimensions and format
(e.g., reduce chunk size for high-DPI scanned documents).
### Benefits
- **Predictable memory footprint**: RAM usage bounded by `chunk_size ×
avg_page_size` rather than total document size
- **Horizontal scalability**: Enables parallel chunk processing across
workers
- **Failure isolation**: Page extraction errors affect only current
chunk, not entire document
- **Cloud-friendly**: Works within container memory limits (e.g., 2-4GB
per worker)
### Trade-offs
- **Increased I/O**: Re-opening PDF for each chunk vs. keeping file
handle (mitigated by page-range seeks)
- **Complexity**: Requires careful offset tracking and stateful
coordination between pipeline and parser
- **Warmup cost**: Model initialization overhead amortized across chunks
(acceptable for documents >100 pages)
### Implementation Priority
This optimization should be **deferred to a separate PR** after the
current correctness fix is merged, as:
1. It requires broader architectural changes across the pipeline
2. Current fix is critical for correctness and can be backported
3. Memory optimization needs comprehensive benchmarking on
representative document corpus
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.
This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.
**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:
* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.
**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.
**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.
**Validation & Testing**
I've tested this to ensure no regressions:
* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5GB.
**Breaking Changes**
None.