mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-06-29 23:41:12 +08:00
### What problem does this PR solve?
`is_english()` in `rag/nlp/__init__.py` compiles a **single-character**
regex class and `fullmatch`es it against each item:
```python
pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]") # no quantifier
...
eng = sum(1 for t in texts if pattern.fullmatch(t.strip()))
```
For a **string** argument the text is first split into single characters
(`texts = list(texts)`), so each `fullmatch` sees one character and
works. But for a **list** argument each item is a whole multi-character
string, and `fullmatch` of a one-character pattern against a
multi-character string always fails — so `is_english()` returns `False`
for **any** list, regardless of content.
```python
is_english("This is English") # True (ok)
is_english(["The quick brown fox jumps.", "Hello world."]) # False (bug — should be True)
is_english(["这是中文。"]) # False (right answer, wrong reason)
```
Many call sites pass lists and were therefore silently always-`False`,
e.g.:
- `rag/llm/chat_model.py:1088`, `rag/llm/cv_model.py:168,1155` —
`is_english([ans])` when an answer is truncated at `max_tokens`, so an
English reply gets the Chinese "······由于长度的原因,回答被截断了,要继续吗?" continuation
suffix instead of the English one.
- `rag/app/book.py` — `remove_contents_table(...,
eng=is_english([...sections...]))`, so English books have their contents
table stripped in Chinese mode.
- `common/doc_store/es_conn_base.py:339`,
`rag/utils/opensearch_conn.py:733` — `is_english(txt.split())` in
highlight handling.
- plus `rag/app/qa.py`, `rag/flow/parser/utils.py`,
`common/doc_store/infinity_conn_base.py`.
### Fix
Add a `+` quantifier so an all-English multi-character item matches:
```python
pattern = re.compile(r"[`a-zA-Z0-9\s.,':;/\"?<>!\(\)\-]+")
```
The string path is unchanged (single characters still match) and
non-English lists still return `False`. Adds
`test/unit_test/rag/test_is_english.py`; the two list cases fail before
this change and pass after.
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
Used the Claude CLI while working on this.