perf: lazy img_np loading and chunked parse_into_bboxes for large PDFs (#14385)

## Summary

- **Lazy img_np loading**: `np.array(img)` is now deferred until the first OCR text extraction is actually needed, avoiding unnecessary memory allocation for pages that already have text.
- **Chunked parse_into_bboxes**: Large PDFs (>50 pages, configurable via `PDF_PARSER_PAGE_BATCH_SIZE`) are processed in batches. Each chunk's boxes are normalized with `_to_global_boxes` to produce globally consistent page numbers and position tags.
- **DLA early init**: Move remote-client initialization before model loading in `LayoutRecognizer.__init__` so `DEEPDOC_URL` (or legacy `TENSORRT_DLA_SVR`) short-circuits unnecessary model download for parser containers relying on remote inference.
- **Fix outline regression**: Restore `self.outlines = extract_pdf_outlines(fnm)` in `parse_into_bboxes`; this was dropped during refactoring and is required by downstream `remove_toc` and metadata handling in `rag/flow/parser/parser.py`.

## Test plan

- [ ] Small PDF (<=50 pages): verify parse succeeds and `self.outlines` is populated
- [ ] Large PDF (>50 pages): verify chunked processing produces globally consistent page numbers
- [ ] With `DEEPDOC_URL` set: verify remote DLA client is used and local model is not downloaded
- [ ] With legacy `TENSORRT_DLA_SVR` set: verify backward compatibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
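
The chunked processing described in the summary can be illustrated with a minimal sketch. It is not the code from this change: `page_batches`, `to_global_boxes_sketch`, `parse_in_chunks`, the `page_number` field, and the assumption that each chunk reports 1-based page numbers relative to its own start are all hypothetical stand-ins, chosen only to show how a per-chunk page offset could yield globally consistent page numbers.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Box = Dict[str, object]


def page_batches(total_pages: int, batch_size: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) page ranges covering [0, total_pages)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def to_global_boxes_sketch(boxes: List[Box], page_offset: int) -> List[Box]:
    """Shift chunk-local page numbers by the chunk's starting page.

    Hypothetical stand-in for the `_to_global_boxes` step named in the summary;
    the real box layout and signature may differ.
    """
    return [{**b, "page_number": int(b["page_number"]) + page_offset} for b in boxes]


def parse_in_chunks(
    total_pages: int,
    parse_chunk: Callable[[int, int], List[Box]],
    batch_size: int = 50,
) -> List[Box]:
    """Parse a document batch by batch and merge boxes with global page numbers."""
    merged: List[Box] = []
    for start, end in page_batches(total_pages, batch_size):
        local_boxes = parse_chunk(start, end)  # page numbers restart at 1 in each chunk
        merged.extend(to_global_boxes_sketch(local_boxes, start))
    return merged


def fake_parse_chunk(start: int, end: int) -> List[Box]:
    """Toy parser: one box per page, numbered relative to the chunk."""
    return [{"page_number": i + 1, "text": f"page {start + i + 1}"} for i in range(end - start)]


boxes = parse_in_chunks(120, fake_parse_chunk, batch_size=50)
```

With 120 pages and a batch size of 50, the toy parser is called three times, and the merged boxes carry page numbers that run continuously across chunk boundaries rather than restarting at 1 in each batch.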
2026-06-29 15:31:05 +08:00 · 2026-04-27 16:52:43 +08:00
parent 4303be223f
commit c446c403de
2 changed files with 82 additions and 18 deletions
--- a/deepdoc/vision/layout_recognizer.py
+++ b/deepdoc/vision/layout_recognizer.py
@@ -46,6 +46,18 @@ class LayoutRecognizer(Recognizer):
    ]

    def __init__(self, domain):
+        self.garbage_layouts = ["footer", "header", "reference"]
+        self.client = None
+
+        dla_url = os.environ.get("DEEPDOC_URL") or os.environ.get("TENSORRT_DLA_SVR")
+        if dla_url:
+            from deepdoc.vision.dla_cli import DLAClient
+
+            self.client = DLAClient(dla_url)
+            env_used = "DEEPDOC_URL" if os.environ.get("DEEPDOC_URL") else "TENSORRT_DLA_SVR"
+            logging.info(f"LayoutRecognizer using remote DLA client at {dla_url} (via {env_used})")
+            return
+
        try:
            model_dir = os.path.join(get_project_base_directory(), "rag/res/deepdoc")
            super().__init__(self.labels, domain, model_dir)
@@ -53,13 +65,6 @@ class LayoutRecognizer(Recognizer):
            model_dir = snapshot_download(repo_id="InfiniFlow/deepdoc", local_dir=os.path.join(get_project_base_directory(), "rag/res/deepdoc"), local_dir_use_symlinks=False)
            super().__init__(self.labels, domain, model_dir)

-        self.garbage_layouts = ["footer", "header", "reference"]
-        self.client = None
-        if os.environ.get("TENSORRT_DLA_SVR"):
-            from deepdoc.vision.dla_cli import DLAClient
-
-            self.client = DLAClient(os.environ["TENSORRT_DLA_SVR"])
-
    def __call__(self, image_list, ocr_res, scale_factor=3, thr=0.2, batch_size=16, drop=True):
        def __is_garbage(b):
            patt = [r"\(cid\s*:\s*\d+\s*\)"]