#
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from docx import Document
import re
import pandas as pd
from collections import Counter
from rag.nlp import rag_tokenizer
from io import BytesIO
Refa: implement unified lazy image loading for Docx parsers (qa/manual) (#13329)
## Summary
This PR is the direct successor to the previous `docx` lazy-loading
implementation. It addresses the technical debt intentionally left out
in the last PR by fully migrating the `qa` and `manual` parsing
strategies to the new lazy-loading model.
Additionally, this PR comprehensively refactors the underlying `docx`
parsing pipeline to eliminate significant code redundancy and introduces
robust fallback mechanisms to handle completely corrupted image streams
safely.
## What's Changed
* **Centralized Abstraction (`docx_parser.py`)**: Moved the
`get_picture` extraction logic up to the `RAGFlowDocxParser` base class.
Previously, `naive`, `qa`, and `manual` parsers maintained separate,
redundant copies of this method. All downstream strategies now natively
gather raw blobs and return `LazyDocxImage` objects automatically.
* **Robust Corrupted Image Fallback (`docx_parser.py`)**: Handled edge
cases where `python-docx` encounters critically malformed magic headers.
Implemented an explicit `try-except` structure that safely intercepts
`UnrecognizedImageError` (and similar exceptions) and seamlessly falls
back to retrieving the raw binary via `getattr(related_part, "blob",
None)`, preventing parser crashes on damaged documents.
* **Legacy Code & Redundancy Purge**:
* Removed the duplicate `get_picture` methods from `naive.py`, `qa.py`,
and `manual.py`.
* Removed the standalone, immediate-decoding `concat_img` method in
`manual.py`. It has been completely replaced by the globally unified,
lazy-loading-compatible `rag.nlp.concat_img`.
* Cleaned up unused legacy imports (e.g., `PIL.Image`, docx exception
packages) across all updated strategy files.
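
The fallback described above can be sketched without `python-docx` installed. This is a minimal illustration, not the project's code: `ImageDecodeError`, `HealthyPart`, `DamagedPart`, and `extract_blob` are hypothetical stand-ins for the real exception classes and document parts, and only the try/except-then-`getattr` shape is taken from the description.

```python
import logging


class ImageDecodeError(Exception):
    """Hypothetical stand-in for errors such as UnrecognizedImageError."""


class HealthyPart:
    """Hypothetical part whose image decodes normally."""

    def __init__(self, blob):
        self.blob = blob

    @property
    def image(self):
        # Return an object exposing .blob, as a decoded image would.
        return self


class DamagedPart:
    """Hypothetical part with a malformed header but readable raw bytes."""

    def __init__(self, blob):
        self.blob = blob

    @property
    def image(self):
        raise ImageDecodeError("malformed magic header")


def extract_blob(related_part):
    """Return image bytes, falling back to the part's raw blob on errors."""
    image_blob = None
    try:
        image = related_part.image
        if image is not None:
            image_blob = image.blob
    except ImageDecodeError as e:
        logging.info(f"Damaged image encountered, attempting blob fallback: {e}")
    except Exception as e:
        logging.warning(f"Unexpected error getting image, attempting blob fallback: {e}")

    # Fallback: read the raw bytes directly if decoding produced nothing.
    if image_blob is None:
        image_blob = getattr(related_part, "blob", None)
    return image_blob
```

With these stand-ins, a damaged part still yields its raw bytes, and an object with neither `image` nor `blob` yields `None` instead of raising.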
## Scope
To keep this PR focused, I have restricted these changes strictly to the
unification of `docx` extraction logic and the lazy-load migration of
`qa` and `manual`.
## Validation & Testing
I've tested this to ensure no regressions and validated the fallback
logic:
* **Output Consistency**: Compared identical `.docx` inputs using `qa`
and `manual` strategies before and after this branch: chunk counts,
extracted text, table HTML, and attached images match perfectly.
* **Memory Footprint Drop**: Confirmed a noticeable drop in peak memory
usage when processing image-dense documents through the `qa` and
`manual` pipelines, bringing them up to parity with the `naive`
strategy's performance gains.
## Breaking Changes
* None.
2026-03-11 10:00:07 +08:00
import logging
Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)
### What problem does this PR solve?
Fixes #14196
## Problem
When using DeepDOC to parse large PDFs (over 1000 pages), the parser
silently truncated processing at 300 pages due to a hardcoded default
`page_to=299` in `RAGFlowPdfParser.__images__()`. This caused:
- **Errors** on pages beyond the limit
- **Poor image quality** as the parser attempted to compensate for
missing page data
- **Inconsistent chunk splitting** between full PDF imports and partial
imports
Additionally, the codebase scattered magic numbers (`299`, `600`,
`10000`, `100000`, `100000000`, `10000000000`, `10**9`) across 22 files
as sentinel values for "parse all pages", making future maintenance
error-prone.
## Root Cause
```python
# deepdoc/parser/pdf_parser.py (before)
def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None):
    # Only the first 300 pages were rendered; everything beyond was silently dropped
    ...
```
While most callers in `rag/app/*.py` correctly passed `to_page=100000`,
the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()`
invoked `__images__` **without** forwarding `page_from`/`page_to`,
falling back to the restrictive default of 299.
## Solution
### 1. Define constants in `common/constants.py`
```python
MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer
```
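
As a quick arithmetic check (a sketch that only restates the two definitions above), the derived task-level constant works out to the same `100000000` sentinel that the task/DB layer used before:

```python
MAXIMUM_PAGE_NUMBER = 100000
MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000

# 100000 * 1000 == 100000000, the old hardcoded task/DB sentinel.
assert MAXIMUM_TASK_PAGE_NUMBER == 100_000_000
```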
### 2. Replace all hardcoded sentinel values
| Layer | Files Changed | Old Values | New Value |
|---|---|---|---|
| **Deepdoc parsers** | `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` | `299`, `600`, `10**9`, `100000000` | `MAXIMUM_PAGE_NUMBER` |
| **Chunk parsers** | `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` | `100000`, `10000`, `10000000000` | `MAXIMUM_PAGE_NUMBER` |
| **Task/DB layer** | `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` | `100000000` | `MAXIMUM_TASK_PAGE_NUMBER` |
### 3. Fix `parse_into_bboxes()` missing parameters
Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that
the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the
restrictive default.
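
The forwarding fix can be illustrated with a small sketch. `SketchParser` and `render_pages` are hypothetical names used only for illustration; the real `parse_into_bboxes()` and `__images__()` take more parameters, and only the idea of passing the caller's page range through is taken from the description above.

```python
MAXIMUM_PAGE_NUMBER = 100000  # value taken from the PR description


class SketchParser:
    """Hypothetical parser; only the parameter forwarding mirrors the fix."""

    def __init__(self, total_pages):
        self.total_pages = total_pages

    def render_pages(self, page_from=0, page_to=MAXIMUM_PAGE_NUMBER):
        # Clamp the requested range to the pages that actually exist.
        return list(range(page_from, min(page_to, self.total_pages)))

    def parse_into_bboxes(self, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
        # Forward the caller's range instead of relying on the callee's default.
        return self.render_pages(page_from=from_page, page_to=to_page)
```

Under these assumptions, a 1200-page document is rendered in full by default, while an explicit range such as `parse_into_bboxes(10, 20)` still limits the work to those pages.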
## Files Changed (22)
- `common/constants.py`
- `deepdoc/parser/pdf_parser.py`
- `deepdoc/parser/mineru_parser.py`
- `deepdoc/parser/docling_parser.py`
- `deepdoc/parser/opendataloader_parser.py`
- `deepdoc/parser/paddleocr_parser.py`
- `deepdoc/parser/docx_parser.py`
- `rag/app/naive.py`
- `rag/app/book.py`
- `rag/app/qa.py`
- `rag/app/one.py`
- `rag/app/manual.py`
- `rag/app/paper.py`
- `rag/app/presentation.py`
- `rag/app/laws.py`
- `rag/app/resume.py`
- `rag/app/email.py`
- `rag/app/table.py`
- `api/db/db_models.py`
- `api/db/services/task_service.py`
- `api/db/services/document_service.py`
- `api/db/services/file_service.py`
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
---------
Signed-off-by: noob <yixiao121314@outlook.com>
2026-04-27 06:57:20 +00:00
from common.constants import MAXIMUM_PAGE_NUMBER
from docx.image.exceptions import (
    InvalidImageStreamError,
    UnexpectedEndOfFileError,
    UnrecognizedImageError,
)
from rag.utils.lazy_image import LazyImage


class RAGFlowDocxParser:
    def get_picture(self, document, paragraph):
        """Collect the raw bytes of every picture embedded in a paragraph.

        Returns a LazyImage wrapping the blobs, or None if no usable image is found.
        """
        imgs = paragraph._element.xpath(".//pic:pic")
        if not imgs:
            return None
        image_blobs = []
        for img in imgs:
            embed = img.xpath(".//a:blip/@r:embed")
            if not embed:
                continue
            embed = embed[0]
            image_blob = None
            try:
                related_part = document.part.related_parts[embed]
            except Exception as e:
                logging.warning(f"Skipping image due to unexpected error getting related_part: {e}")
                continue

            try:
                image = related_part.image
                if image is not None:
                    image_blob = image.blob
            except (
                UnrecognizedImageError,
                UnexpectedEndOfFileError,
                InvalidImageStreamError,
                UnicodeDecodeError,
            ) as e:
                logging.info(f"Damaged image encountered, attempting blob fallback: {e}")
            except Exception as e:
                logging.warning(f"Unexpected error getting image, attempting blob fallback: {e}")

            # Fall back to the part's raw bytes when the image could not be decoded.
            if image_blob is None:
                image_blob = getattr(related_part, "blob", None)
            if image_blob:
                image_blobs.append(image_blob)
        if not image_blobs:
            return None
        return LazyImage(image_blobs)

    def __extract_table_content(self, tb):
        df = []
        for row in tb.rows:
            df.append([c.text for c in row.cells])
        return self.__compose_table_content(pd.DataFrame(df))

    def __compose_table_content(self, df):

        def blockType(b):
            # (regex, label) pairs are tried in order; the first match wins.
            pattern = [
                ("^(20|19)[0-9]{2}[年/-][0-9]{1,2}[月/-][0-9]{1,2}日*$", "Dt"),
                (r"^(20|19)[0-9]{2}年$", "Dt"),
                (r"^(20|19)[0-9]{2}[年/-][0-9]{1,2}月*$", "Dt"),
                ("^[0-9]{1,2}[月/-][0-9]{1,2}日*$", "Dt"),
                (r"^第*[一二三四1-4]季度$", "Dt"),
                (r"^(20|19)[0-9]{2}年*[一二三四1-4]季度$", "Dt"),
                (r"^(20|19)[0-9]{2}[ABCDE]$", "DT"),
                ("^[0-9.,+%/ -]+$", "Nu"),
                (r"^[0-9A-Z/\._~-]+$", "Ca"),
                (r"^[A-Z]*[a-z' -]+$", "En"),
                (r"^[0-9.,+-]+[0-9A-Za-z/$¥%<>()()' -]+$", "NE"),
                (r"^.{1}$", "Sg")
            ]
            for p, n in pattern:
                if re.search(p, b):
                    return n
            # No pattern matched: classify by the number of multi-character tokens.
            tks = [t for t in rag_tokenizer.tokenize(b).split() if len(t) > 1]
            if len(tks) > 3:
                if len(tks) < 12:
                    return "Tx"
                else:
                    return "Lx"

            if len(tks) == 1 and rag_tokenizer.tag(tks[0]) == "nr":
                return "Nr"

            return "Ot"

        if len(df) < 2:
            return []
        max_type = Counter([blockType(str(df.iloc[i, j])) for i in range(
            1, len(df)) for j in range(len(df.iloc[i, :]))])
        max_type = max(max_type.items(), key=lambda x: x[1])[0]

        colnm = len(df.iloc[0, :])
        hdrows = [0]  # the header does not necessarily appear in the first row
        if max_type == "Nu":
            for r in range(1, len(df)):
                tys = Counter([blockType(str(df.iloc[r, j]))
                               for j in range(len(df.iloc[r, :]))])
                tys = max(tys.items(), key=lambda x: x[1])[0]
                if tys != max_type:
                    hdrows.append(r)

        lines = []
        for i in range(1, len(df)):
            if i in hdrows:
                continue
            # Offsets of the header rows that lie above row i.
            hr = [r - i for r in hdrows]
            hr = [r for r in hr if r < 0]
            t = len(hr) - 1
            while t > 0:
                if hr[t] - hr[t - 1] > 1:
                    hr = hr[t:]
                    break
                t -= 1
            headers = []
            for j in range(len(df.iloc[i, :])):
                t = []
                for h in hr:
                    x = str(df.iloc[i + h, j]).strip()
                    if x in t:
                        continue
                    t.append(x)
                t = ",".join(t)
                if t:
                    t += ": "
                headers.append(t)
            cells = []
            for j in range(len(df.iloc[i, :])):
                if not str(df.iloc[i, j]):
                    continue
                cells.append(headers[j] + str(df.iloc[i, j]))
            lines.append(";".join(cells))

        if colnm > 3:
            return lines
        return ["\n".join(lines)]

    def __call__(self, fnm, from_page=0, to_page=MAXIMUM_PAGE_NUMBER):
        self.doc = Document(fnm) if isinstance(
            fnm, str) else Document(BytesIO(fnm))
        pn = 0  # parsed page
        secs = []  # parsed contents
        for p in self.doc.paragraphs:
            if pn > to_page:
                break

            runs_within_single_paragraph = []  # save runs within the range of pages
            for run in p.runs:
                if pn > to_page:
                    break
                if from_page <= pn < to_page and p.text.strip():
                    runs_within_single_paragraph.append(run.text)  # append run.text first

                # wrap page break checker into a static method
                if 'lastRenderedPageBreak' in run._element.xml:
                    pn += 1

            secs.append(("".join(runs_within_single_paragraph), p.style.name if hasattr(p.style, 'name') else ''))  # then concat run.text as part of the paragraph

        tbls = [self.__extract_table_content(tb) for tb in self.doc.tables]
        return secs, tbls