#
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import copy
import logging
import re
from collections import defaultdict
from io import BytesIO

from pypdf import PdfReader as pdf2_read

from deepdoc.parser import PdfParser, PlainParser
from deepdoc.parser.ppt_parser import RAGFlowPptParser
from rag.app.naive import by_plaintext, PARSERS
Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)
### What problem does this PR solve?
Fixes #14196
## Problem
When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:
- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate with
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports
Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.
## Root Cause
```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
    ...
```
While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.
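The fallback described above can be reproduced in a few lines. The sketch below is illustrative only: `render_pages`, `parse_without_forwarding`, and `parse_with_forwarding` are hypothetical stand-ins rather than RAGFlow functions, and treating `page_to` as an exclusive slice end is an assumption of the sketch.

```python
DEFAULT_PAGE_TO = 299  # the restrictive default quoted above


def render_pages(total_pages, page_from=0, page_to=DEFAULT_PAGE_TO):
    """Hypothetical renderer: return the page indices that would be processed."""
    return list(range(total_pages))[page_from:page_to]


def parse_without_forwarding(total_pages, from_page=0, to_page=100000):
    # Bug pattern: the caller's range is accepted but never passed on,
    # so render_pages silently uses its own default.
    return render_pages(total_pages)


def parse_with_forwarding(total_pages, from_page=0, to_page=100000):
    # Fix pattern: forward the caller's range explicitly.
    return render_pages(total_pages, from_page, to_page)


truncated = parse_without_forwarding(1000)
complete = parse_with_forwarding(1000)
print(len(truncated), len(complete))
```

In this sketch the non-forwarding wrapper stops at the default limit no matter what the caller asked for, which is the shape of the bug the commit describes.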
## Solution
### 1. Define constants in `common/constants.py`
```python
MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer
```
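A large sentinel can stand for "parse all pages" because Python slices clamp an out-of-range end index to the sequence length. The following is a minimal sketch, assuming pages are selected by slicing; `select_pages` is a hypothetical helper, and only the two constant values come from this commit message.

```python
MAXIMUM_PAGE_NUMBER = 100000  # value quoted in this commit message
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000


def select_pages(pages, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
    """Hypothetical helper: return the requested page window."""
    # A slice end beyond len(pages) is clamped, so the sentinel never raises.
    return pages[from_page:to_page]


pages = [f"page-{i}" for i in range(1200)]
everything = select_pages(pages)
window = select_pages(pages, from_page=300, to_page=600)
print(len(everything), len(window), window[0])
```

A single named constant keeps the "no upper bound" intent visible at every call site, which the scattered literals did not.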
### 2. Replace all hardcoded sentinel values
| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `10**9`, `100000000` | `MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` |
### 3. Fix `parse_into_bboxes()` missing parameters
Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
## Files Changed (22)
- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 06:57:20 +00:00
from common.constants import MAXIMUM_PAGE_NUMBER
from common.parser_config_utils import normalize_layout_recognizer
from rag.nlp import rag_tokenizer
from rag.nlp import tokenize
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.
This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.
**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:
* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.
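The lazy-loading idea in the list above can be sketched as follows. This is not the `LazyDocxImage` implementation from `lazy_image.py`: `LazyDecoded` is a hypothetical class, and `zlib.decompress` stands in for image decoding so the example needs only the standard library.

```python
import zlib
from typing import Callable


class LazyDecoded:
    """Hypothetical holder: keep the raw blob, decode only on first access."""

    def __init__(self, blob: bytes, decoder: Callable[[bytes], bytes]):
        self._blob = blob
        self._decoder = decoder
        self._decoded = None
        self.decode_count = 0

    def get(self) -> bytes:
        # Decode at the use site, then reuse the result until close().
        if self._decoded is None:
            self._decoded = self._decoder(self._blob)
            self.decode_count += 1
        return self._decoded

    def close(self) -> None:
        # Drop the decoded copy; the compact raw blob stays available.
        self._decoded = None


lazy = LazyDecoded(zlib.compress(b"pixels" * 1000), zlib.decompress)
count_before_use = lazy.decode_count
first = lazy.get()
second = lazy.get()
count_after_use = lazy.decode_count
lazy.close()
```

Only the compact blob is held until a consumer asks for the decoded form, which is the general mechanism behind the memory reduction described in this commit.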
**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.
**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.
**Validation & Testing**
I've tested this to ensure no regressions:
* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5GB.
**Breaking Changes**
None.
2026-02-28 11:22:31 +08:00
from rag.utils.lazy_image import ensure_pil_image, is_image_like


class Pdf(PdfParser):
    def __init__(self):
        super().__init__()

    def __call__(self, filename, binary=None, from_page=0, to_page=MAXIMUM_PAGE_NUMBER, zoomin=3, callback=None, **kwargs):
        # 1. OCR
        callback(msg="OCR started")
        self.__images__(filename if not binary else binary, zoomin, from_page, to_page, callback)

        # 2. Layout Analysis
        callback(msg="Layout Analysis")
        self._layouts_rec(zoomin)

        # 3. Table Analysis
        callback(msg="Table Analysis")
        self._table_transformer_job(zoomin)

        # 4. Text Merge
        self._text_merge()

        # 5. Extract Tables (Force HTML)
        tbls = self._extract_table_figure(True, zoomin, True, True)

        # 6. Re-assemble Page Content
        page_items = defaultdict(list)

        # (A) Add text
        for b in self.boxes:
            # b["page_number"] is relative to from_page, so add from_page to get the global page number
            global_page_num = b["page_number"] + from_page
            if not (from_page < global_page_num <= to_page + from_page):
                continue
            page_items[global_page_num].append({"top": b["top"], "x0": b["x0"], "text": b["text"], "type": "text"})

        # (B) Add table and figure
        for (img, content), positions in tbls:
            if not positions:
                continue

            if isinstance(content, list):
                final_text = "\n".join(content)
            elif isinstance(content, str):
                final_text = content
            else:
                final_text = str(content)

            try:
                pn_index = positions[0][0]
                if isinstance(pn_index, list):
                    pn_index = pn_index[0]

                # pn_index in tbls is an absolute page number
                current_page_num = int(pn_index) + 1
            except Exception as e:
                print(f"Error parsing position: {e}")
                continue

            if not (from_page < current_page_num <= to_page + from_page):
                continue

            top = positions[0][3]
            left = positions[0][1]

            page_items[current_page_num].append({"top": top, "x0": left, "text": final_text, "type": "table_or_figure"})

        # 7. Generate result
        res = []
        for i in range(len(self.page_images)):
            current_pn = from_page + i + 1
            items = page_items.get(current_pn, [])
            # Sort by vertical position
            items.sort(key=lambda x: (x["top"], x["x0"]))
            full_page_text = "\n\n".join([item["text"] for item in items])
            if not full_page_text.strip():
                full_page_text = f"[No text or data found in Page {current_pn}]"
            page_img = self.page_images[i]
            res.append((full_page_text, page_img))

        callback(0.9, "Parsing finished")

        return res, []


class PlainPdf(PlainParser):
    def __call__(self, filename, binary=None, from_page=0, to_page=MAXIMUM_PAGE_NUMBER, callback=None, **kwargs):
        self.pdf = pdf2_read(filename if not binary else BytesIO(binary))
        page_txt = []
        for page in self.pdf.pages[from_page:to_page]:
            page_txt.append(page.extract_text())
        callback(0.9, "Parsing finished")
        return [(txt, None) for txt in page_txt], []


def chunk(filename, binary=None, from_page=0, to_page=MAXIMUM_PAGE_NUMBER, lang="Chinese", callback=None, parser_config=None, **kwargs):
    """
    The supported file formats are pdf, ppt, pptx.
    Every page is treated as a chunk, and the thumbnail of every page is stored.
    PPT files are parsed with this method automatically; no per-file setup is necessary.
    """
    if parser_config is None:
        parser_config = {}
    eng = lang.lower() == "english"
    doc = {"docnm_kwd": filename, "title_tks": rag_tokenizer.tokenize(re.sub(r"\.[a-zA-Z]+$", "", filename))}
    doc["title_sm_tks"] = rag_tokenizer.fine_grained_tokenize(doc["title_tks"])
    res = []
    if re.search(r"\.pptx?$", filename, re.IGNORECASE):
        try:
            ppt_parser = RAGFlowPptParser()
            for pn, txt in enumerate(ppt_parser(filename if not binary else binary, from_page, MAXIMUM_PAGE_NUMBER, callback)):
                d = copy.deepcopy(doc)
                pn += from_page
                d["doc_type_kwd"] = "image"
                d["page_num_int"] = [pn + 1]
                d["top_int"] = [0]
                d["position_int"] = [(pn + 1, 0, 0, 0, 0)]
                tokenize(d, txt, eng)
                res.append(d)
            return res
        except Exception as e:
            logging.warning(f"python-pptx parsing failed for {filename}: {e}, trying tika as fallback")
            if callback:
                callback(0.1, "python-pptx failed, trying tika as fallback")

            try:
                from tika import parser as tika_parser
            except Exception as tika_error:
                error_msg = f"tika not available: {tika_error}. Unsupported .ppt/.pptx parsing."
                if callback:
                    callback(0.8, error_msg)
                logging.warning(f"{error_msg} for {filename}.")
                raise NotImplementedError(error_msg)

            if binary:
                binary_data = binary
            else:
                with open(filename, 'rb') as f:
                    binary_data = f.read()
            doc_parsed = tika_parser.from_buffer(BytesIO(binary_data))

            if doc_parsed.get("content", None) is not None:
                sections = doc_parsed["content"].split("\n")
                sections = [s for s in sections if s.strip()]

                for pn, txt in enumerate(sections):
                    d = copy.deepcopy(doc)
                    pn += from_page
                    d["doc_type_kwd"] = "text"
                    d["page_num_int"] = [pn + 1]
                    d["top_int"] = [0]
                    d["position_int"] = [(pn + 1, 0, 0, 0, 0)]
                    tokenize(d, txt, eng)
                    res.append(d)

                if callback:
                    callback(0.8, "Finish parsing with tika.")
                return res
            else:
                error_msg = f"tika.parser got empty content from {filename}."
                if callback:
                    callback(0.8, error_msg)
                logging.warning(error_msg)
                raise NotImplementedError(error_msg)
    elif re.search(r"\.pdf$", filename, re.IGNORECASE):
        layout_recognizer, parser_model_name = normalize_layout_recognizer(parser_config.get("layout_recognize", "DeepDOC"))

        if isinstance(layout_recognizer, bool):
            layout_recognizer = "DeepDOC" if layout_recognizer else "Plain Text"

        name = layout_recognizer.strip().lower()
        parser = PARSERS.get(name, by_plaintext)
        callback(0.1, "Start to parse.")

        sections, _, _ = parser(
            filename=filename,
            binary=binary,
            from_page=from_page,
            to_page=to_page,
            lang=lang,
            callback=callback,
            pdf_cls=Pdf,
            layout_recognizer=layout_recognizer,
            mineru_llm_name=parser_model_name,
            paddleocr_llm_name=parser_model_name,
            **kwargs,
        )

        if not sections:
            return []

        if name in ["tcadp", "docling", "mineru", "paddleocr"]:
            parser_config["chunk_token_num"] = 0

        callback(0.8, "Finish parsing.")

        for pn, (txt, img) in enumerate(sections):
            d = copy.deepcopy(doc)
            pn += from_page
refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)
**Summary**
This PR tackles a significant memory bottleneck when processing
image-heavy Word documents. Previously, our pipeline eagerly decoded
DOCX images into `PIL.Image` objects, which caused high peak memory
usage. To solve this, I've introduced a **lazy-loading approach**:
images are now stored as raw blobs and only decoded exactly when and
where they are consumed.
This successfully reduces the memory footprint while keeping the parsing
output completely identical to before.
**What's Changed**
Instead of a dry file-by-file list, here is the logical breakdown of the
updates:
* **The Core Abstraction (`lazy_image.py`)**: Introduced `LazyDocxImage`
along with helper APIs to handle lazy decoding, image-type checks, and
NumPy compatibility. It also supports `.close()` and detached PIL access
to ensure safe lifecycle management and prevent memory leaks.
* **Pipeline Integration (`naive.py`, `figure_parser.py`, etc.)**:
Updated the general DOCX picture extraction to return these new lazy
images. Downstream consumers (like the figure/VLM flow and base64
encoding paths) now decode images right at the use site using detached
PIL instances, avoiding shared-instance side effects.
* **Compatibility Hooks (`operators.py`, `book.py`, etc.)**: Added
necessary compatibility conversions so these lazy images flow smoothly
through existing merging, filtering, and presentation steps without
breaking.
**Scope & What is Intentionally Left Out**
To keep this PR focused, I have restricted these changes strictly to the
**general Word pipeline** and its downstream consumers.
The `QA` and `manual` Word parsing pipelines are explicitly **not
modified** in this PR. They can be safely migrated to this new lazy-load
model in a subsequent, standalone PR.
**Design Considerations**
I briefly considered adding image compression during processing, but
decided against it to avoid any potential quality degradation in the
derived outputs. I also held off on a massive pipeline re-architecture
to avoid overly invasive changes right now.
**Validation & Testing**
I've tested this to ensure no regressions:
* Compared identical DOCX inputs before and after this branch: chunk
counts, extracted text, table HTML, and image descriptions match
perfectly.
* **Confirmed a noticeable drop in peak memory usage when processing
image-dense documents.** For a 30 MB Word document containing 243 1080p
screenshots, memory consumption is reduced by approximately 1.5 GB.
**Breaking Changes**
None.
2026-02-28 11:22:31 +08:00
|
|
|
|
if not is_image_like(img):
|
2025-12-13 11:37:42 +08:00
|
|
|
|
img = None
|
2026-02-28 11:22:31 +08:00
|
|
|
|
else:
|
|
|
|
|
|
img = ensure_pil_image(img)
|
2025-12-13 11:37:42 +08:00
|
|
|
|
d["image"] = img
|
2024-12-10 16:32:58 +08:00
|
|
|
|
d["page_num_int"] = [pn + 1]
|
|
|
|
|
|
d["top_int"] = [0]
|
2026-01-09 17:48:45 +08:00
|
|
|
|
d["position_int"] = [(pn + 1, 0, img.size[0] if img else 0, 0, img.size[1] if img else 0)]
|
2024-08-15 09:17:36 +08:00
|
|
|
|
tokenize(d, txt, eng)
|
|
|
|
|
|
res.append(d)
|
|
|
|
|
|
return res
|
|
|
|
|
|
|
2026-02-02 13:40:51 +08:00
|
|
|
|
raise NotImplementedError("file type not supported yet (ppt, pptx, pdf supported)")
|
2024-08-15 09:17:36 +08:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
if __name__ == "__main__":
|
|
|
|
|
|
import sys
|
|
|
|
|
|
|
|
|
|
|
|
def dummy(a, b):
|
|
|
|
|
|
pass
|
2025-12-02 17:35:14 +08:00
|
|
|
|
|
2024-08-15 09:17:36 +08:00
|
|
|
|
chunk(sys.argv[1], callback=dummy)
|