fix: paginate non-DeepDOC PDF parsing tasks to prevent OOM (#15951)

### What problem does this PR solve?

The parser pods suffer from OOM kills when processing large PDF documents. The root cause is in `api/db/services/task_service.py`: when `layout_recognize` is not DeepDOC (e.g. Plain Text), `page_size` was set to `MAXIMUM_TASK_PAGE_NUMBER` (100 million), causing the entire PDF to be processed as a single task with all pages loaded into memory simultaneously.

This PR fixes the issue by paginating non-DeepDOC PDF parsing tasks the same way DeepDOC already does.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [x] Performance Improvement
- [ ] Other (please describe):
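
To illustrate why the page size matters, here is a minimal sketch of how a page count might be split into per-task page windows. The helper `plan_page_tasks` is hypothetical and is not the project's actual `queue_tasks` logic; only the constant name and its 100 million value come from the description above, and 12 is the default `task_page_size` fallback visible in the diff.

```python
# Value quoted in the PR description; the real constant lives in the project.
MAXIMUM_TASK_PAGE_NUMBER = 100_000_000


def plan_page_tasks(total_pages: int, page_size: int) -> list[tuple[int, int]]:
    """Hypothetical helper: split [0, total_pages) into half-open page
    windows that each cover at most page_size pages."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return [
        (start, min(start + page_size, total_pages))
        for start in range(0, total_pages, page_size)
    ]


# With the old page size, a 500-page PDF becomes one task covering every page.
single_task = plan_page_tasks(500, MAXIMUM_TASK_PAGE_NUMBER)

# With a page size of 12, the same PDF is split into many small tasks.
paginated_tasks = plan_page_tasks(500, 12)
```

Under these assumptions, `single_task` holds one window spanning the whole document, while `paginated_tasks` holds windows of at most 12 pages each, so no single task has to load the whole document at once.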
2026-06-29 15:31:05 +08:00 · 2026-06-16 05:07:19 -07:00
parent 62698725ca
commit d2a18d5c46
1 changed file with 2 additions and 3 deletions
--- a/api/db/services/task_service.py
+++ b/api/db/services/task_service.py
@@ -390,15 +390,14 @@ def queue_tasks(doc: dict, bucket: str, name: str, priority: int):

    if doc["type"] == FileType.PDF.value:
        file_bin = settings.STORAGE_IMPL.get(bucket, name)
-        do_layout = doc["parser_config"].get("layout_recognize", "DeepDOC")
        pages = PdfParser.total_page_number(doc["name"], file_bin)
        if pages is None:
            pages = 0
        page_size = doc["parser_config"].get("task_page_size") or 12
        if doc["parser_id"] == "paper":
            page_size = doc["parser_config"].get("task_page_size") or 22
-        if doc["parser_id"] in ["one", "knowledge_graph"] or do_layout != "DeepDOC" or doc["parser_config"].get("toc_extraction", False):
-            page_size = MAXIMUM_TASK_PAGE_NUMBER
+        if doc["parser_id"] in ["one", "knowledge_graph"] or doc["parser_config"].get("toc_extraction", False):
+            page_size = doc["parser_config"].get("task_page_size") or 30
        page_ranges = doc["parser_config"].get("pages") or [(1, MAXIMUM_PAGE_NUMBER)]
        for s, e in page_ranges:
            s -= 1