mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-06-29 15:31:05 +08:00
fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969)
Fixes #13823

## Problem

When querying with words like `cat`, RAGFlow's query expansion system looks up synonyms via WordNet, which can return terms containing single quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document store, these unescaped single quotes in the query string cause a `TokenError` because Infinity's lexer treats `'` as a string delimiter.

```
TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531
```

## Solution

Strip single quotes from synonym terms before they are inserted into query expressions, consistent with how single quotes are already stripped from the input query text (line 51 of `query.py`):

- **`common/query_base.py`**: In `sub_special_char()`, strip `'` before escaping other special characters. This fixes the Chinese text processing path and the `paragraph()` method.
- **`rag/nlp/query.py`**: In the English text path, strip `'` from tokenized synonym terms.
- **`memory/services/query.py`**: Same fix for the memory query English text path.

## Testing

The fix can be verified by:

1. Using Infinity as the document store (`DOC_ENGINE=infinity`)
2. Creating a dataset and running a retrieval test with the keyword `cat`
3. Confirming no `TokenError` is raised and results are returned normally

## Summary by CodeRabbit

* **Bug Fixes**
  * Enhanced special character handling in query processing and synonym expansion by properly sanitizing single quotes before text processing.
  * Simplified OCR detection output by removing timing metadata while preserving core detection accuracy.

---------

Co-authored-by: ximi <octo-patch@github.com>
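
The failure mode described above can be illustrated with a minimal sketch. The helper names `strip_single_quotes` and `or_expression` are hypothetical and are not part of RAGFlow or Infinity, and the OR expression format is simplified for illustration. Counting single quotes is only a rough proxy for what a lexer would treat as an unterminated string.

```python
import re


def strip_single_quotes(terms):
    """Remove every single quote from each term and drop terms left empty."""
    cleaned = (re.sub(r"'", "", term) for term in terms)
    return [term for term in cleaned if term.strip()]


def or_expression(terms):
    """Join terms into a quoted OR expression (illustrative format only)."""
    return " OR ".join('"{}"'.format(term) for term in terms)


# Sample synonyms taken from the terms visible in the commit message.
synonyms = ["cat-o'-nine-tails", "big cat", "computerized tomography"]

raw_expr = or_expression(synonyms)
safe_expr = or_expression(strip_single_quotes(synonyms))

# An odd number of single quotes means one of them can never be closed.
has_dangling_quote = raw_expr.count("'") % 2 == 1

print(has_dangling_quote)
print(safe_expr)
```

Stripping the quote changes the term slightly (`cat-o-nine-tails`), which matches the approach the commit message describes for the input query text.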
@@ -72,7 +72,9 @@ class MsgTextQuery(QueryBase):
             syns = []
             for tk, w in tks_w[:256]:
                 syn = self.syn.lookup(tk)
-                syn = rag_tokenizer.tokenize(" ".join(syn)).split()
+                # Strip single quotes to avoid Infinity lexer TokenError
+                # (e.g. WordNet returns "cat-o'-nine-tails" for "cat")
+                syn = re.sub(r"'", "", rag_tokenizer.tokenize(" ".join(syn))).split()
                 keywords.extend(syn)
                 syn = ["\"{}\"^{:.4f}".format(s, w / 4.) for s in syn if s.strip()]
                 syns.append(" ".join(syn))
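
The patched loop can be exercised in isolation with a small sketch. Everything outside the loop body is an assumption for illustration: `SYNONYMS`, `lookup`, and `tokenize` are hypothetical stand-ins for `self.syn.lookup` and `rag_tokenizer.tokenize`, whose real behavior (for example, how hyphenated terms are split) may differ, and `expand` is a made-up wrapper name.

```python
import re

# Hypothetical synonym table standing in for the WordNet-backed lookup.
SYNONYMS = {"cat": ["cat-o'-nine-tails", "big cat"]}


def lookup(token):
    """Stand-in for self.syn.lookup: return synonyms or an empty list."""
    return SYNONYMS.get(token, [])


def tokenize(text):
    """Stand-in for rag_tokenizer.tokenize: only normalizes whitespace."""
    return " ".join(text.split())


def expand(tks_w):
    """Mirror the patched loop: strip quotes, then weight each synonym at w/4."""
    keywords, syns = [], []
    for tk, w in tks_w[:256]:
        syn = lookup(tk)
        # Same substitution as the patch: drop single quotes before splitting.
        syn = re.sub(r"'", "", tokenize(" ".join(syn))).split()
        keywords.extend(syn)
        syn = ["\"{}\"^{:.4f}".format(s, w / 4.) for s in syn if s.strip()]
        syns.append(" ".join(syn))
    return keywords, syns


keywords, syns = expand([("cat", 0.8), ("sat", 0.2)])
print(keywords)
print(syns)
```

Because the result is split on whitespace after joining, a multi-word synonym such as `big cat` becomes two separately weighted terms, and a token with no synonyms contributes an empty string to `syns`.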