feat(paddleocr): add image parsing support with async Job API (#16086)

## Summary

Add image parsing to the PaddleOCR integration, building on #15967
(async Job API migration).

## Changes

### `deepdoc/parser/paddleocr_parser.py`
- Add a `parse_image()` method that uses the same async Job API flow as
`parse_pdf()`
- Extract text from `layoutParsingResults` → `prunedResult` →
`parsing_res_list`
- Return the concatenated block content as a single string
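
The extraction path described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `extract_block_text`, the `block_content` key, and the newline separator are assumptions; only the `layoutParsingResults` → `prunedResult` → `parsing_res_list` nesting comes from the description.

```python
from typing import Any


def extract_block_text(result: dict[str, Any], sep: str = "\n") -> str:
    """Concatenate block text found under the nested layout result.

    Hypothetical helper: walks layoutParsingResults -> prunedResult ->
    parsing_res_list and joins each block's text into one string.
    """
    parts: list[str] = []
    for page in result.get("layoutParsingResults") or []:
        pruned = page.get("prunedResult") or {}
        for block in pruned.get("parsing_res_list") or []:
            # "block_content" is an assumed key name for the block text.
            content = (block.get("block_content") or "").strip()
            if content:
                parts.append(content)
    return sep.join(parts)
```

Missing or empty levels are treated as "no text" rather than as errors, so a response without any recognised blocks yields an empty string.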

### `rag/llm/ocr_model.py`
- Add `parse_image()` wrapper to `PaddleOCROcrModel` with availability
check and logging

## Relationship to other PRs

- **Depends on**: #15967 (async Job API migration) — this PR is based on
that branch
- **Replaces**: #14826 (original image processing PR based on old sync
API)

## Notes

This PR uses `base_url` and the async Job API (submit → poll → fetch)
consistent with #15967, rather than the old `api_url` + sync POST
pattern from #14826.
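
For readers unfamiliar with the pattern, the submit → poll → fetch flow might look like the sketch below. It is a generic illustration under stated assumptions, not the parser's implementation: `run_job`, the three callables, and the `"done"` / `"failed"` status strings are all hypothetical, and the real endpoints and status values are defined by #15967.

```python
import time
from typing import Any, Callable


def run_job(
    submit: Callable[[], str],
    poll: Callable[[str], str],
    fetch: Callable[[str], Any],
    interval: float = 1.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Submit a job, poll until it finishes, then fetch its result.

    All callables are placeholders for whatever HTTP calls the real
    Job API needs; the status strings are assumed, not documented.
    """
    job_id = submit()
    deadline = time.monotonic() + timeout
    while True:
        status = poll(job_id)
        if status == "done":
            return fetch(job_id)
        if status == "failed":
            raise RuntimeError(f"job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
        sleep(interval)
```

Passing the three steps in as callables keeps the control flow testable without a running server; the same loop can then serve both `parse_pdf()` and `parse_image()` style callers.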
Rander
2026-06-16 19:34:38 +08:00
committed by GitHub
parent 1235da7093
commit 62698725ca
2 changed files with 73 additions and 0 deletions


@@ -148,6 +148,16 @@ class PaddleOCROcrModel(Base, PaddleOCRParser):
        sections, tables = PaddleOCRParser.parse_pdf(self, filepath=filepath, binary=binary, callback=callback, parse_method=parse_method, **kwargs)
        return sections, tables

    def parse_image(self, filepath: str, binary=None, callback=None, **kwargs) -> str:
        ok, reason = self.check_available()
        if not ok:
            raise RuntimeError(f"PaddleOCR server not accessible: {reason}")
        logging.info(f"PaddleOCR parse_image start: {filepath}")
        result = PaddleOCRParser.parse_image(self, filepath=filepath, binary=binary, callback=callback, **kwargs)
        logging.info(f"PaddleOCR parse_image done: {filepath}, text length: {len(result)}")
        return result


class OpenDataLoaderOcrModel(Base, OpenDataLoaderParser):
    _FACTORY_NAME = "OpenDataLoader"