ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 15:31:05 +08:00

Files

euvre fb3bd3de02 fix(deepdoc): add English caption patterns to fix missing figure/table numbering (#15481)

### What problem does this PR solve?
## Problem

When parsing PDFs containing English figure/table captions (e.g. "Fig.
20", "Figure 20", "Table 20"), the `is_caption` method in
`TableStructureRecognizer` failed to recognize them as captions. This
caused figure numbering gaps in the parsed output (e.g. Fig. 19 → Fig.
21, skipping Fig. 20).

## Root Cause

The `is_caption` regex only matched Chinese caption formats:

```python
patt = [r"[图表]+[ 0-9:：]{2,}"]
```

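The regex is only one of two signals: a text block also counts as a caption
when the layout recognizer assigns it a `caption` layout type. When the
recognizer failed to do so for a given text block, an English caption had no
remaining path to detection and was missed entirely.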
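
The gap can be reproduced in isolation. The sketch below applies the pattern quoted above to a few sample strings. The helper name `matches_old_patterns`, the sample captions, and the use of `re.match` on stripped text are assumptions made for illustration; they are not taken from the PR.

```python
import re

# Pattern quoted in the PR description (Chinese captions only).
OLD_PATTERNS = [r"[图表]+[ 0-9:：]{2,}"]


def matches_old_patterns(text: str) -> bool:
    """Hypothetical helper: True if the stripped text starts with an old pattern."""
    stripped = text.strip()
    return any(re.match(p, stripped) for p in OLD_PATTERNS)


# Illustrative sample captions (not from the PR's test data).
samples = ["图 20：系统架构", "表 3: 结果", "Fig. 20", "Figure 20", "Table 20"]
results = {s: matches_old_patterns(s) for s in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Under these assumptions, only the two Chinese samples match; all three English captions fall through, which is consistent with the numbering gap described above.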

## Fix

Added three case-insensitive English caption patterns to `is_caption` in
`deepdoc/vision/table_structure_recognizer.py`:

- `(?i)Fig\.?\s*\d+` — matches `Fig. 20`, `Fig 20`, `FIG. 20`, etc.
- `(?i)Figure\s+\d+` — matches `Figure 20`, `FIGURE 20`, etc.
- `(?i)Table\s+\d+` — matches `Table 20`, `TABLE 20`, etc.
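
As a rough check of the patterns listed above, the following sketch combines them with the original Chinese pattern and tests a few strings. The helper name `looks_like_caption` and the choice of `re.match` (anchored at the start of the stripped text) are assumptions for illustration; the actual `is_caption` method may combine the patterns differently and also consults the layout type.

```python
import re

# The original pattern plus the three English patterns named in the PR.
PATTERNS = [
    r"[图表]+[ 0-9:：]{2,}",
    r"(?i)Fig\.?\s*\d+",
    r"(?i)Figure\s+\d+",
    r"(?i)Table\s+\d+",
]


def looks_like_caption(text: str) -> bool:
    """Hypothetical helper: True if the stripped text starts with any pattern."""
    stripped = text.strip()
    return any(re.match(p, stripped) for p in PATTERNS)


# Strings taken from the PR description, plus two negative examples.
for text in ["Fig. 20", "Fig 20", "FIG. 20", "Figure 20", "TABLE 20",
             "Table of contents", "See Fig. 20"]:
    print(f"{text!r}: {looks_like_caption(text)}")
```

With start-anchored matching, an in-sentence mention such as `See Fig. 20` is not treated as a caption, and `Table of contents` is rejected because `\d+` requires a number. If the real method uses unanchored search instead, in-sentence references would also match, so the anchoring assumption matters here.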

## Files Changed

- `deepdoc/vision/table_structure_recognizer.py` — extended `is_caption`
regex patterns


- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: noob <yixiao121314@outlook.com>

2026-06-01 19:22:11 +08:00

__init__.py

fix: use context managers for file handles to prevent resource leaks (#13514)

2026-03-11 16:47:06 +08:00

layout_recognizer.py

perf: lazy img_np loading and chunked parse_into_bboxes for large PDFs (#14385)

2026-04-27 16:52:43 +08:00

ocr.py

fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969)

2026-04-09 19:10:34 +08:00

operators.py

refactor(word): lazy-load DOCX images to reduce peak memory without changing output (#13233)