
# OSS DeepDoc HTTP API Service

Serves DLA (Document Layout Analysis), OCR (Optical Character Recognition), and
TSR (Table Structure Recognition) models through a unified HTTP API built on
[LitServe](https://github.com/Lightning-AI/litserve). All models are open-source (OSS) ONNX models executed with ONNX Runtime.

## Quick Start

```bash
# Build
docker build -f Dockerfile_deepdoc_oss -t deepdoc_oss:latest .

# Run (CPU only; no GPU required)
docker run -p 9390:9390 deepdoc_oss:latest

# Or via docker compose
docker compose -f docker/docker-compose.yml up -d
```

The service listens on port **9390** by default. When launching the server script directly, pass `--port` to change the port and `--model-dir` to point at the model files:

```bash
python deepdoc/server/deepdoc_server.py --port 9000 --model-dir /path/to/models
```

## Endpoints

All prediction endpoints accept JPEG images via `multipart/form-data`. The form
field for file uploads is named `request`.

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Liveness probe. Returns `ok`. |
| `GET` | `/model` | Model metadata. Returns `{"model":"oss","version":"1.0"}`. |
| `POST` | `/predict/dla` | Document Layout Analysis. |
| `POST` | `/predict/tsr` | Table Structure Recognition. |
| `POST` | `/predict/ocr` | OCR. Set form field `operator=det` for detection or `operator=rec` for recognition. |
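
For callers that prefer Python over curl, the upload described above can be sketched with the standard library alone. `build_multipart` and `predict` are hypothetical helper names, not part of DeepDoc, and the base URL assumes the default port.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, filename, image_bytes):
    """Encode text fields plus one JPEG under the form field `request`."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="request"; filename="{filename}"\r\n'
        f'Content-Type: image/jpeg\r\n\r\n'.encode()
        + image_bytes
        + b'\r\n'
    )
    parts.append(f'--{boundary}--\r\n'.encode())
    return b''.join(parts), f'multipart/form-data; boundary={boundary}'


def predict(path, image_path, fields=None, base_url="http://localhost:9390"):
    """POST an image to a /predict/* endpoint and return the decoded JSON."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart(
            fields or {}, os.path.basename(image_path), fh.read()
        )
    req = urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the service running, `predict("/predict/dla", "page.jpg")` or `predict("/predict/ocr", "page.jpg", {"operator": "det"})` would mirror the curl calls in the sections below.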

### `POST /predict/dla`

Analyzes a full page image and returns labelled layout regions.

**Request**

```bash
curl -X POST http://localhost:9390/predict/dla \
  -F "request=@page.jpg;type=image/jpeg"
```

**Response**

```json
{
  "bboxes": [
    [x0, y0, x1, y1, score, class_id],
    ...
  ]
}
```

| class_id | Label |
|:--------:|-------|
| 0 | title |
| 1 | text |
| 2 | reference |
| 3 | figure |
| 4 | figure caption |
| 5 | table |
| 6 | table caption |
| 8 | equation |

> The OSS model uses 8 unique class IDs. IDs 7 and 9 are reserved for
> compatibility with the SaaS label scheme but are never produced by the
> OSS model.
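
A client will usually want label names rather than numeric IDs. The sketch below applies the table above to a decoded response; `label_regions` and its `min_score` threshold are illustrative assumptions, not part of the service.

```python
# class_id -> label, copied from the table above (IDs 7 and 9 are never produced).
DLA_LABELS = {
    0: "title",
    1: "text",
    2: "reference",
    3: "figure",
    4: "figure caption",
    5: "table",
    6: "table caption",
    8: "equation",
}


def label_regions(response, min_score=0.0):
    """Turn raw DLA rows into dicts, dropping rows below min_score."""
    regions = []
    for x0, y0, x1, y1, score, class_id in response["bboxes"]:
        if score < min_score:
            continue
        regions.append({
            "box": (x0, y0, x1, y1),
            "score": score,
            # int() guards against the ID arriving as a float in the mixed row.
            "label": DLA_LABELS.get(int(class_id), "unknown"),
        })
    return regions
```

For example, `label_regions(resp, min_score=0.5)` keeps only regions the model scored at 0.5 or higher.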

### `POST /predict/tsr`

Recognizes table structure from a cropped table image.

**Request**

```bash
curl -X POST http://localhost:9390/predict/tsr \
  -F "request=@table_crop.jpg;type=image/jpeg"
```

**Response**

```json
{
  "bboxes": [
    [x0, y0, x1, y1, score, class_id],
    ...
  ]
}
```

| class_id | Label |
|:--------:|-------|
| 0 | table |
| 1 | table column |
| 2 | table row |
| 3 | table column header |
| 4 | table projected row header |
| 5 | table spanning cell |
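
One way to use these classes is to pair row and column boxes into an approximate cell grid. The sketch below is a simplification under stated assumptions: the helper names are hypothetical, coordinates are taken to have their origin at the top-left with `y` increasing downward, and header and spanning-cell classes are ignored.

```python
# class_id -> label, copied from the table above.
TSR_LABELS = {
    0: "table",
    1: "table column",
    2: "table row",
    3: "table column header",
    4: "table projected row header",
    5: "table spanning cell",
}


def rows_and_columns(response):
    """Split TSR boxes into rows (top to bottom) and columns (left to right)."""
    rows, cols = [], []
    for x0, y0, x1, y1, _score, class_id in response["bboxes"]:
        label = TSR_LABELS.get(int(class_id))
        if label == "table row":
            rows.append((x0, y0, x1, y1))
        elif label == "table column":
            cols.append((x0, y0, x1, y1))
    rows.sort(key=lambda box: box[1])  # order by top edge
    cols.sort(key=lambda box: box[0])  # order by left edge
    return rows, cols


def cell_grid(rows, cols):
    """Intersect every row with every column to approximate cell boxes."""
    return [[(c[0], r[1], c[2], r[3]) for c in cols] for r in rows]
```

`cell_grid(*rows_and_columns(resp))[i][j]` then gives the box for row `i`, column `j`.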

### `POST /predict/ocr`

Two modes controlled by the `operator` form field.

#### Detection (`operator=det`)

Returns quadrilateral bounding boxes for detected text regions.

```bash
curl -X POST "http://localhost:9390/predict/ocr" \
  -F "operator=det" \
  -F "request=@page.jpg;type=image/jpeg"
```

**Response** (5 levels of array nesting down to each quadrilateral, which holds four `[x, y]` corner points):

```json
{
  "output": [
    [
      [
        [
          [[x0,y0],[x1,y1],[x2,y2],[x3,y3]],
          ...
        ]
      ]
    ]
  ]
}
```
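
Because the quadrilaterals sit several arrays deep, a small flattening helper keeps client code readable. The sketch below follows the nesting shown above; `iter_quads` and `bounding_rect` are hypothetical names.

```python
def iter_quads(response):
    """Yield each detected quadrilateral as a list of four (x, y) tuples."""
    for level2 in response["output"]:
        for level3 in level2:
            for level4 in level3:
                for quad in level4:
                    yield [tuple(point) for point in quad]


def bounding_rect(quad):
    """Axis-aligned (x0, y0, x1, y1) rectangle enclosing a quadrilateral."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)
```

The enclosing rectangle is one simple way to crop a region before sending it to `operator=rec`; rotated text may need a perspective transform instead.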

#### Recognition (`operator=rec`)

Recognizes text within a cropped region.

```bash
curl -X POST "http://localhost:9390/predict/ocr" \
  -F "operator=rec" \
  -F "request=@char_crop.jpg;type=image/jpeg"
```

**Response** (4-level nested array):

```json
{
  "output": [
    [
      [
        ["recognized text", 1.0],
        ...
      ]
    ]
  ]
}
```

> Confidence is always `1.0` because the OSS recognition model does not return
> per-character confidence scores.
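
The recognition output can be flattened the same way. This sketch assumes the nesting shown above; `iter_texts` is a hypothetical helper name.

```python
def iter_texts(response):
    """Yield (text, confidence) pairs from a recognition response."""
    for level2 in response["output"]:
        for level3 in level2:
            for text, confidence in level3:
                yield text, confidence
```

For example, `" ".join(text for text, _ in iter_texts(resp))` joins all recognized strings into one line.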

## Error Responses

| Scenario | HTTP Status |
|----------|:-----------:|
| Missing `operator` field (OCR) | 400 |
| Invalid `operator` value | 400 |
| Empty or corrupt image | 400 |
| Image exceeds 4096×4096 | 400 |
| Internal inference error | 500 |
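
A client can use these codes to decide whether a retry is worthwhile and can check the image size before uploading. The sketch below rests on assumptions the table does not confirm: the 4096 limit is taken to apply to width and height separately, any 2xx status is treated as success, and the advice in the comments is a suggestion rather than service behavior. Both function names are hypothetical.

```python
MAX_SIDE = 4096  # from the table above; assumed to be a per-dimension limit


def fits_size_limit(width, height, limit=MAX_SIDE):
    """Return True if neither dimension exceeds the limit."""
    return width <= limit and height <= limit


def classify_status(status):
    """Map an HTTP status code to a coarse outcome for client-side handling."""
    if 200 <= status < 300:
        return "ok"
    if status == 400:
        return "bad_request"   # fix the operator field or the image; retrying will not help
    if status >= 500:
        return "server_error"  # inference failed; a later retry may succeed
    return "unexpected"
```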

## Models

All ONNX models are from the [InfiniFlow/deepdoc](https://huggingface.co/InfiniFlow/deepdoc)
HuggingFace repository (Apache 2.0 license):

| File | Size | Purpose |
|------|------|---------|
| `layout.onnx` | 75.7 MB | DLA (YOLOv10) |
| `det.onnx` | 4.7 MB | OCR text detection (PP-OCRv4) |
| `rec.onnx` | 10.8 MB | OCR text recognition (PP-OCRv4) |
| `tsr.onnx` | 12.2 MB | TSR (PaddleDetection) |
| `ocr.res` | 26 KB | OCR character dictionary |

## Architecture

```
deepdoc/server/
├── deepdoc_server.py     # LitServe entry point
├── endpoints/            # LitAPI endpoints (HTTP layer)
│   ├── dla_endpoint.py
│   ├── tsr_endpoint.py
│   └── ocr_endpoint.py
└── adapters/             # Model wrappers (inference + format conversion)
    ├── dla_adapter.py
    ├── tsr_adapter.py
    └── ocr_adapter.py
```

Request flow: Endpoints → Adapters → `deepdoc/vision/` (reused OSS model classes) → ONNX Runtime.