ragflow/deepdoc/server/README.md


OSS DeepDoc HTTP API Service

Serves DLA (Document Layout Analysis), OCR (Optical Character Recognition), and TSR (Table Structure Recognition) models via a unified HTTP API using LitServe and OSS ONNX Runtime models.

Quick Start

# Build
docker build -f Dockerfile_deepdoc_oss -t deepdoc_oss:latest .

# Run (CPU only; no GPU required)
docker run -p 9390:9390 deepdoc_oss:latest

# Or via docker compose
docker compose -f docker/docker-compose.yml up -d

The service listens on port 9390 by default. Pass --port to change it:

python deepdoc/server/deepdoc_server.py --port 9000 --model-dir /path/to/models

Endpoints

All prediction endpoints accept JPEG images via multipart/form-data. The form field for file uploads is named request.

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Liveness probe. Returns ok. |
| GET | /model | Model metadata. Returns {"model":"oss","version":"1.0"}. |
| POST | /predict/dla | Document Layout Analysis. |
| POST | /predict/tsr | Table Structure Recognition. |
| POST | /predict/ocr | OCR. Use form field operator=det for detection or operator=rec for recognition. |
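
For callers who prefer Python to curl, the multipart upload described above can be sketched with the standard library alone. The helper names build_multipart and predict are hypothetical and not part of the service; the only details taken from this README are the form field names request and operator and the JPEG content type.

```python
import json
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes,
                    content_type="image/jpeg"):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields, e.g. operator=det for the OCR endpoint.
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    # The image itself; this README names the upload field "request".
    header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n\r\n'
    ).encode()
    parts.append(header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def predict(base_url, path, image_bytes, operator=None):
    """POST one JPEG to a prediction endpoint and return the decoded JSON.

    base_url is an assumption, e.g. "http://localhost:9390".
    """
    fields = {"operator": operator} if operator else {}
    body, ctype = build_multipart(fields, "request", "image.jpg", image_bytes)
    req = urllib.request.Request(
        base_url + path, data=body,
        headers={"Content-Type": ctype}, method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With a server running locally, a call might look like predict("http://localhost:9390", "/predict/dla", open("page.jpg", "rb").read()).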

POST /predict/dla

Analyzes a full page image and returns labelled layout regions.

Request

curl -X POST http://localhost:9390/predict/dla \
  -F "request=@page.jpg;type=image/jpeg"

Response

{
  "bboxes": [
    [x0, y0, x1, y1, score, class_id],
    ...
  ]
}

| class_id | Label |
| --- | --- |
| 0 | title |
| 1 | text |
| 2 | reference |
| 3 | figure |
| 4 | figure caption |
| 5 | table |
| 6 | table caption |
| 8 | equation |

The OSS model emits 8 distinct class IDs (0 to 6, and 8). IDs 7 and 9 are reserved for compatibility with the SaaS label scheme and are never produced by the OSS model.
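
A client will usually want labels rather than numeric IDs. The following is a minimal sketch of decoding a DLA response, assuming the row layout shown above; parse_dla and its min_score parameter are hypothetical names introduced for illustration.

```python
# class_id -> label, copied from the table above (7 and 9 are reserved).
DLA_LABELS = {
    0: "title",
    1: "text",
    2: "reference",
    3: "figure",
    4: "figure caption",
    5: "table",
    6: "table caption",
    8: "equation",
}


def parse_dla(response, min_score=0.0):
    """Turn [x0, y0, x1, y1, score, class_id] rows into labelled dicts.

    Rows scoring below min_score are dropped; IDs missing from the
    table are reported as "unknown" instead of raising.
    """
    regions = []
    for x0, y0, x1, y1, score, class_id in response["bboxes"]:
        if score < min_score:
            continue
        regions.append({
            "box": (x0, y0, x1, y1),
            "score": score,
            "label": DLA_LABELS.get(int(class_id), "unknown"),
        })
    return regions
```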

POST /predict/tsr

Recognizes table structure from a cropped table image.

Request

curl -X POST http://localhost:9390/predict/tsr \
  -F "request=@table_crop.jpg;type=image/jpeg"

Response

{
  "bboxes": [
    [x0, y0, x1, y1, score, class_id],
    ...
  ]
}

| class_id | Label |
| --- | --- |
| 0 | table |
| 1 | table column |
| 2 | table row |
| 3 | table column header |
| 4 | table projected row header |
| 5 | table spanning cell |
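
To rebuild a grid from a TSR response, a caller typically groups the boxes by label and orders rows and columns by position. The sketch below assumes the row layout shown above; group_tsr is a hypothetical helper, and sorting rows by y0 and columns by x0 is one reasonable convention rather than something the service prescribes.

```python
from collections import defaultdict

# class_id -> label, copied from the table above.
TSR_LABELS = {
    0: "table",
    1: "table column",
    2: "table row",
    3: "table column header",
    4: "table projected row header",
    5: "table spanning cell",
}


def group_tsr(response):
    """Group TSR boxes by label; rows sorted top-down, columns left-right."""
    groups = defaultdict(list)
    for x0, y0, x1, y1, _score, class_id in response["bboxes"]:
        label = TSR_LABELS.get(int(class_id), "unknown")
        groups[label].append((x0, y0, x1, y1))
    # Indexing the defaultdict also guarantees both keys exist in the result.
    groups["table row"].sort(key=lambda box: box[1])     # by y0
    groups["table column"].sort(key=lambda box: box[0])  # by x0
    return dict(groups)
```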

POST /predict/ocr

The operator form field selects one of two modes.

Detection (operator=det)

Returns quadrilateral bounding boxes for detected text regions.

curl -X POST "http://localhost:9390/predict/ocr" \
  -F "operator=det" \
  -F "request=@page.jpg;type=image/jpeg"

Response (5-level nested array):

{
  "output": [
    [
      [
        [
          [[x0,y0],[x1,y1],[x2,y2],[x3,y3]],
          ...
        ]
      ]
    ]
  ]
}
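
Because the quadrilaterals sit under several wrapper arrays, it can be convenient to walk the structure recursively instead of hard-coding the depth. The sketch below is one way to do that; iter_quads and quad_to_bbox are hypothetical helpers, and the only assumption about the payload is that each quadrilateral is a list of four [x, y] points.

```python
def _is_point(value):
    """True for a two-element [x, y] pair of numbers."""
    return (
        isinstance(value, (list, tuple))
        and len(value) == 2
        and all(isinstance(c, (int, float)) for c in value)
    )


def _is_quad(value):
    """True for a list of exactly four points."""
    return (
        isinstance(value, (list, tuple))
        and len(value) == 4
        and all(_is_point(p) for p in value)
    )


def iter_quads(node):
    """Yield every quadrilateral found anywhere inside a nested list."""
    if _is_quad(node):
        yield [tuple(p) for p in node]
    elif isinstance(node, (list, tuple)):
        for child in node:
            yield from iter_quads(child)


def quad_to_bbox(quad):
    """Axis-aligned (x0, y0, x1, y1) box enclosing a quadrilateral."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))
```

Calling list(iter_quads(response["output"])) returns a flat list of quadrilaterals regardless of how many wrapper arrays surround them.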

Recognition (operator=rec)

Recognizes text within a cropped region.

curl -X POST "http://localhost:9390/predict/ocr" \
  -F "operator=rec" \
  -F "request=@char_crop.jpg;type=image/jpeg"

Response (4-level nested array):

{
  "output": [
    [
      [
        ["recognized text", 1.0],
        ...
      ]
    ]
  ]
}

The confidence value is always 1.0 because the OSS recognition model does not return per-character confidence scores.
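
Since the confidence carries no information, a client usually needs only the strings. The following is a minimal sketch that assumes the nesting shown above; extract_texts is a hypothetical helper name.

```python
def extract_texts(response):
    """Collect recognized strings from a /predict/ocr (operator=rec) response.

    Walks the wrapper arrays shown above and discards the constant
    confidence value attached to each string.
    """
    texts = []
    for batch in response["output"]:
        for group in batch:
            for text, _confidence in group:
                texts.append(text)
    return texts
```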

Error Responses

| Scenario | HTTP Status |
| --- | --- |
| Missing operator field (OCR) | 400 |
| Invalid operator value | 400 |
| Empty or corrupt image | 400 |
| Image exceeds 4096×4096 | 400 |
| Internal inference error | 500 |
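
A client can avoid some of these failures before uploading and can treat the two status classes differently afterwards. The sketch below is illustrative only: fits_size_limit and classify_status are hypothetical helpers, and reading "exceeds 4096×4096" as "either side is longer than 4096 pixels" is an assumption that this README does not spell out.

```python
MAX_SIDE = 4096  # limit taken from the table above


def fits_size_limit(width, height, max_side=MAX_SIDE):
    """True when neither side is longer than max_side pixels.

    Assumption: the server rejects an image if either dimension
    exceeds the limit.
    """
    return 0 < width <= max_side and 0 < height <= max_side


def classify_status(status):
    """Map an HTTP status code to a coarse client-side outcome."""
    if status == 200:
        return "ok"
    if status == 400:
        # Request problem (operator field or image); resending the
        # same request will fail the same way.
        return "bad_request"
    if status == 500:
        # Inference failed on the server side.
        return "server_error"
    return "unexpected"
```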

Models

All ONNX models are from the InfiniFlow/deepdoc HuggingFace repository (Apache 2.0 license):

| File | Size | Purpose |
| --- | --- | --- |
| layout.onnx | 75.7 MB | DLA (YOLOv10) |
| det.onnx | 4.7 MB | OCR text detection (PP-OCRv4) |
| rec.onnx | 10.8 MB | OCR text recognition (PP-OCRv4) |
| tsr.onnx | 12.2 MB | TSR (PaddleDetection) |
| ocr.res | 26 KB | OCR character dictionary |

Architecture

deepdoc/server/
├── deepdoc_server.py     # LitServe entry point
├── endpoints/            # LitAPI endpoints (HTTP layer)
│   ├── dla_endpoint.py
│   ├── tsr_endpoint.py
│   └── ocr_endpoint.py
└── adapters/             # Model wrappers (inference + format conversion)
    ├── dla_adapter.py
    ├── tsr_adapter.py
    └── ocr_adapter.py

Endpoints → Adapters → deepdoc/vision/ (reused OSS model classes) → ONNX Runtime.