ragflow/deepdoc/parser/pdf_parser.py at 89ba7abe30fbb9de0c65168836dfd801805955d6

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-04 18:45:38 +08:00


FallingSnowFlake 1033a3ae26 Fix: improve PDF text type detection by expanding regex content (#11432)

- Add whitespace validation to the PDF English text checking regex
- Reduce false negatives in English PDF content recognition

### What problem does this PR solve?

The core idea is to **expand the regex used for English text
detection** so that it accepts more of the valid characters commonly
found in English PDFs. The modifications include:

- Adding support for **space** in the regex.
- Ensuring the update does not reduce existing detection accuracy.
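
A minimal sketch of why allowing a space matters, assuming the detector looks for a long unbroken run of "English-looking" characters. The patterns, the 30-character threshold, and the `looks_english` helper below are hypothetical illustrations, not the code in `pdf_parser.py`:

```python
import re

# Hypothetical patterns for illustration only; the real regex in
# pdf_parser.py may use a different character set and threshold.
# Both require a run of at least 30 accepted characters (assumed value).
STRICT = re.compile(r"[a-zA-Z0-9,.;:'\"!?()\-]{30,}")       # no space allowed
WITH_SPACE = re.compile(r"[ a-zA-Z0-9,.;:'\"!?()\-]{30,}")  # space allowed


def looks_english(text: str, pattern: re.Pattern) -> bool:
    """Return True if the text contains a long enough run of accepted characters."""
    return pattern.search(text) is not None


sample = "The quick brown fox jumps over the lazy dog."

# Without a space in the class, every word boundary breaks the run,
# so ordinary English prose is rejected (a false negative).
print(looks_english(sample, STRICT))      # False
# With a space in the class, the whole sentence is a single run.
print(looks_english(sample, WITH_SPACE))  # True
```

Under these assumptions, text that contains none of the accepted characters is still rejected by both patterns, so adding the space alone should not turn such text into a match.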

### Type of change

- [✅] Bug Fix (non-breaking change which fixes an issue)

2025-11-21 14:33:29 +08:00

58 KiB
