fix(paddleocr): support PP-OCRv6 ocrResults fallback and integrate image parsing (#16150)

## Summary

This PR fixes two issues discovered during testing of the PaddleOCR async API refactoring:

### 1. PP-OCRv6 returns `ocrResults` instead of `layoutParsingResults`

Models like PP-OCRv6 are pure text recognition models that return results in `ocrResults.prunedResult.rec_texts` format rather than the `layoutParsingResults.prunedResult.parsing_res_list` format used by layout-aware models (PaddleOCR-VL series).

**Changes:**
- `deepdoc/parser/paddleocr_parser.py`: Extract `ocrResults` alongside `layoutParsingResults` in `_send_request()`, add fallback logic in `_transfer_to_sections()` and `parse_image()`
- `internal/entity/models/paddleocr.go`: Add `ocrResults` struct and fallback extraction in Go OCR handler

### 2. Image parsing not integrated into picture chunker

The `parse_image()` method existed in `PaddleOCRParser` but was never called from `rag/app/picture.py` (the module that handles image file uploads). Users configuring PaddleOCR as their layout recognizer would still get local deepdoc OCR for images.

**Changes:**
- `rag/app/picture.py`: When `layout_recognize` is set to PaddleOCR, use `PaddleOCROcrModel.parse_image()` instead of local OCR. Falls back gracefully to local OCR on failure.

## Testing

Verified end-to-end in Docker:
- PaddleOCR-VL-1.6 PDF parsing: ✅ (10 text blocks with bbox)
- PaddleOCR-VL-1.6 image parsing: ✅ (219 chars)
- PP-OCRv6 PDF parsing with ocrResults fallback: ✅ (10 text blocks)
- PP-OCRv6 image parsing with ocrResults fallback: ✅ (136 chars)

## Related PRs

- #15967 (merged) - PaddleOCR async Job API refactoring + new models
- #16086 (merged) - PaddleOCR image parsing support
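
## Illustrative sketch of the fallback

The fallback described in the summary can be sketched in isolation as follows. This is a minimal standalone illustration, not the code in `paddleocr_parser.py`: the function names `extract_texts` and `box_tag`, the `block_content` dictionary key, and the `zoom` parameter (standing in for `self._ZOOMIN`) are assumptions made for the example, and the payloads in the usage note are hypothetical.

```python
from typing import Any


def extract_texts(result: dict[str, Any]) -> list[str]:
    """Collect text blocks, preferring layout output over plain OCR output."""
    texts: list[str] = []

    # Layout-aware models: layoutParsingResults[].prunedResult.parsing_res_list
    for layout in result.get("layoutParsingResults", []):
        pruned = layout.get("prunedResult", {})
        for block in pruned.get("parsing_res_list", []):
            # "block_content" is an assumed key name for this sketch.
            content = block.get("block_content", "")
            if content.strip():
                texts.append(content.strip())

    # Fallback for pure recognition models: ocrResults[].prunedResult.rec_texts
    if not texts:
        for ocr in result.get("ocrResults", []):
            rec_texts = ocr.get("prunedResult", {}).get("rec_texts", [])
            texts.extend(t.strip() for t in rec_texts if t.strip())

    return texts


def box_tag(page_idx: int, box: list[int], zoom: int) -> str:
    """Build a position tag from a [left, top, right, bottom] box.

    The tag lists left, right, top, bottom (in that order), each scaled
    down by `zoom`, mirroring the tag string assembled in the diff.
    """
    left, top, right, bottom = box[0], box[1], box[2], box[3]
    return (
        f"@@{page_idx + 1}\t{left // zoom}\t{right // zoom}"
        f"\t{top // zoom}\t{bottom // zoom}##"
    )
```

With a hypothetical payload such as `{"ocrResults": [{"prunedResult": {"rec_texts": ["hello ", "  ", "world"]}}]}`, `extract_texts` skips the whitespace-only entry and returns the two stripped strings. Checking `if not texts` rather than testing for the presence of the `layoutParsingResults` key means an empty layout list also triggers the fallback.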
2026-06-29 15:31:05 +08:00 · 2026-06-23 22:02:54 +08:00
parent b4a8a90c73
commit 017adf841f
3 changed files with 104 additions and 4 deletions
--- a/deepdoc/parser/paddleocr_parser.py
+++ b/deepdoc/parser/paddleocr_parser.py
@@ -380,6 +380,14 @@ class PaddleOCRParser(RAGFlowPdfParser):
                    if block_content.strip():
                        texts.append(block_content.strip())

+        # Fallback to ocrResults for models like PP-OCRv6
+        if not texts:
+            ocr_results = result.get("ocrResults", [])
+            for ocr_result in ocr_results:
+                pruned = ocr_result.get("prunedResult", {})
+                rec_texts = pruned.get("rec_texts", [])
+                texts.extend(t.strip() for t in rec_texts if t.strip())
+
        if callback:
            callback(0.9, f"[PaddleOCR] image done, blocks: {len(texts)}")

@@ -556,11 +564,13 @@ class PaddleOCRParser(RAGFlowPdfParser):
            callback(0.8, "[PaddleOCR] result received")

        # Extract raw result (preserving prunedResult with bbox info)
-        combined_result: dict[str, Any] = {"layoutParsingResults": []}
+        combined_result: dict[str, Any] = {"layoutParsingResults": [], "ocrResults": []}
        for line_obj in jsonl_data:
            result = line_obj.get("result", {})
            layout_results = result.get("layoutParsingResults", [])
            combined_result["layoutParsingResults"].extend(layout_results)
+            ocr_results = result.get("ocrResults", [])
+            combined_result["ocrResults"].extend(ocr_results)

        return combined_result

@@ -571,6 +581,26 @@ class PaddleOCRParser(RAGFlowPdfParser):
        if algorithm in SUPPORTED_PADDLEOCR_ALGORITHMS:
            layout_parsing_results = result.get("layoutParsingResults", [])

+            # Fallback to ocrResults for models like PP-OCRv6 that only return text recognition
+            if not layout_parsing_results:
+                ocr_results = result.get("ocrResults", [])
+                for page_idx, ocr_result in enumerate(ocr_results):
+                    pruned = ocr_result.get("prunedResult", {})
+                    rec_texts = pruned.get("rec_texts", [])
+                    rec_boxes = pruned.get("rec_boxes", [])
+                    for i, text in enumerate(rec_texts):
+                        text = text.strip()
+                        if not text:
+                            continue
+                        if i < len(rec_boxes):
+                            box = rec_boxes[i]
+                            left, top, right, bottom = box[0], box[1], box[2], box[3]
+                        else:
+                            left, top, right, bottom = 0, 0, 0, 0
+                        tag = f"@@{page_idx + 1}\t{left // self._ZOOMIN}\t{right // self._ZOOMIN}\t{top // self._ZOOMIN}\t{bottom // self._ZOOMIN}##"
+                        sections.append((text, tag))
+                return sections
+
            for page_idx, layout_result in enumerate(layout_parsing_results):
                pruned_result = layout_result.get("prunedResult", {})
                parsing_res_list = pruned_result.get("parsing_res_list", [])