## Summary
Closes #15720
`FulltextQueryer.paragraph` normalized its `content_tks` token string
with `[c.strip() for c in content_tks.strip() ...]`, which iterates the
string **character by character** — `"machine learning model"` becomes
20 single characters instead of 3 tokens. Those single chars are fed to
`tw.weights(..., preprocess=False)`, producing meaningless term weights
and a garbage `MatchTextExpr`.
`paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature),
so tag retrieval/scoring is silently broken for tag-enabled knowledge
bases. Every other method in this file tokenizes with `.split()` — this
is a `.strip()`-vs-`.split()` typo.
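The difference can be reproduced in isolation. The snippet below is a minimal sketch, not the code from `rag/nlp/query.py`: the `if c.strip()` filter stands in for the part of the comprehension that is elided above and is an assumption.

```python
content_tks = "machine learning model"

# Buggy form: str.strip() returns a string, so the comprehension
# walks it one character at a time.
# The `if c.strip()` filter is assumed; the original comprehension is elided.
buggy = [c.strip() for c in content_tks.strip() if c.strip()]

# Fixed form: str.split() returns a list of whitespace-separated tokens.
fixed = [c.strip() for c in content_tks.split() if c.strip()]

print(len(buggy), buggy[:3])
print(len(fixed), fixed)
```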
## Change
- `rag/nlp/query.py` — change `content_tks.strip()` to
`content_tks.split()` in the `paragraph` token-normalization line.
## Why it's safe
- The caller passes a space-separated token string; `.split()` recovers
the real tokens, matching the contract of `tw.weights` and the
`.split()` tokenization used by the sibling methods (`similarity`,
`question`).
- No behavior depends on the per-character expansion.
## Verification
- `python -m py_compile rag/nlp/query.py` — OK.
- Demonstrated: `"machine learning model"` → 20 single-character entries
before, 3 real tokens after. No test references `paragraph`.
Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Fixes #13823
## Problem
When querying with words like `cat`, RAGFlow's query expansion system
looks up synonyms via WordNet, which can return terms containing single
quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document
store, these unescaped single quotes in the query string cause a
`TokenError` because Infinity's lexer treats `'` as a string delimiter.
```
TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531
```
## Solution
Strip single quotes from synonym terms before they are inserted into
query expressions, consistent with how single quotes are already
stripped from the input query text (line 51 of `query.py`):
- **`common/query_base.py`**: In `sub_special_char()`, strip `'` before
escaping other special characters. This fixes the Chinese text
processing path and the `paragraph()` method.
- **`rag/nlp/query.py`**: In the English text path, strip `'` from
tokenized synonym terms.
- **`memory/services/query.py`**: Same fix for the memory query English
text path.
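A minimal sketch of the idea, assuming synonyms are joined into a quoted `OR` clause with a boost, as the error message above suggests. `strip_single_quotes` and `build_synonym_clause` are hypothetical names for illustration, not functions from the RAGFlow code base.

```python
def strip_single_quotes(term: str) -> str:
    """Drop single quotes so the store's lexer never sees an unbalanced '."""
    return term.replace("'", "")


def build_synonym_clause(synonyms: list[str], boost: float = 0.7) -> str:
    """Join sanitized synonyms into a boosted OR clause (hypothetical format)."""
    cleaned = [strip_single_quotes(s) for s in synonyms]
    quoted = " OR ".join(f'"{s}"' for s in cleaned if s)
    return f"({quoted})^{boost}"


clause = build_synonym_clause(["cat-o'-nine-tails", "big cat"])
print(clause)
```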
## Testing
The fix can be verified by:
1. Using Infinity as the document store (`DOC_ENGINE=infinity`)
2. Creating a dataset and running a retrieval test with the keyword
`cat`
3. Confirming no `TokenError` is raised and results are returned
normally
## Summary by CodeRabbit
* **Bug Fixes**
* Enhanced special character handling in query processing and synonym
expansion by properly sanitizing single quotes before text processing.
* Simplified OCR detection output by removing timing metadata while
preserving core detection accuracy.
---------
Co-authored-by: ximi <octo-patch@github.com>
### What problem does this PR solve?
Fix the handling of special characters in the matching text of `search()`.
Special characters such as `?`, `*`, and `:` should be escaped before the
text is passed to `matching_text` of `search()`.
Fix https://github.com/infiniflow/ragflow/issues/13729
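A minimal sketch of this kind of escaping, assuming a backslash is the escape character and that only the three characters named above need handling. `escape_special` is a hypothetical helper, and the set of characters the PR actually escapes may be larger.

```python
import re

# Characters named in the description; the real list may differ.
_SPECIAL = re.compile(r"([?*:])")


def escape_special(text: str) -> str:
    """Prefix each special character with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


print(escape_special("what is a*b: c?"))
```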
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
In `paragraph()` of class `FulltextQueryer`, `len(keywords) / 10` should be
rounded to an integer before it is assigned to `minimum_should_match`.
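A minimal sketch of the issue: in Python 3, `/` always yields a float, so the value needs an explicit integer conversion. The description does not say which rounding mode the PR uses; floor division and the cap of 3 below are assumptions, and `min_should_match` is a hypothetical helper.

```python
def min_should_match(keywords: list[str], cap: int = 3) -> int:
    """Integer threshold: one required match per ten keywords, capped (assumed)."""
    return min(cap, len(keywords) // 10)


keywords = [f"kw{i}" for i in range(25)]
print(len(keywords) / 10)          # float, not usable as an integer count
print(min_should_match(keywords))  # integer
```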
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Manage messages and use them in the agent.
Issue #4213
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add the OceanBase doc engine. Close #5350
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Add "怎么办" ("what to do") to the regex pattern in the `rmWWW` method to
improve query cleaning by removing this common question phrase along with
other question words.
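A minimal sketch of question-word removal with a regex alternation. The pattern is a small illustrative subset, not the pattern used in `rmWWW`, and the fallback to the original text when everything is removed is an assumption of this sketch. It also shows why alternation order matters: a longer phrase has to come before its own prefix.

```python
import re

# Illustrative subset: "what to do", "how about", "how", "what", question particle.
QUESTION_WORDS = re.compile(r"(怎么办|怎么样|怎么|什么|吗)")


def remove_question_words(text: str) -> str:
    """Strip question words; keep the original if nothing would remain."""
    cleaned = QUESTION_WORDS.sub("", text)
    return cleaned if cleaned else text


print(remove_question_words("电脑蓝屏了怎么办"))

# With the shorter alternative first, a stray character is left behind.
naive = re.compile(r"(怎么|怎么办)")
print(naive.sub("", "电脑蓝屏了怎么办"))
```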
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Editing a chunk should update it instead of inserting a new one. Close #3679
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?
Use consistent log file names; introduce `initLogger`.
### Type of change
- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
### What problem does this PR solve?
Integration with Infinity
- Replaced ELASTICSEARCH with dataStoreConn
- Renamed deleteByQuery to delete
- Renamed bulk to upsertBulk
- getHighlight, getAggregation
- Fix KGSearch.search
- Moved Dealer.sql_retrieval to es_conn.py
### Type of change
- [x] Refactoring
### What problem does this PR solve?
_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._
### Type of change
- [x] Refactoring
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?
#724 #162
### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
feat: add rerank models to the project #724 #162
### Type of change
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?
Optimize the task broker and executor to reduce memory usage and
deployment complexity.
### Type of change
- [x] Performance Improvement
- [x] Refactoring
### Change Log
- Enhance the Redis utils for use as a message queue (using streams)
- Rework the task broker logic around the message queue: (1) get parse
events from the message queue; (2) run them asynchronously with a
`ThreadPoolExecutor`
- Rename the `process_duation` column of the document and task tables to
`process_duration` (apparently a spelling mistake)
- Reformat some code style (only where noticed)
- Add requirement_dev.txt for developers
- Add a Redis container to Docker Compose
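
The consume-then-dispatch flow in the change log can be sketched as follows. This is an illustration only: an in-process `queue.Queue` stands in for the Redis stream, and `handle_parse_event` and `drain` are hypothetical names, not functions from this PR.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def handle_parse_event(event: dict) -> str:
    """Placeholder for the real parsing work done per event."""
    return f"parsed {event['doc_id']}"


def drain(events: "queue.Queue[dict]", max_workers: int = 4) -> list[str]:
    """Pull every queued event and run each one on the thread pool."""
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            try:
                event = events.get_nowait()
            except queue.Empty:
                break  # queue.Queue stands in for the Redis stream here
            futures.append(pool.submit(handle_parse_event, event))
    # Results are collected in submission order.
    return [f.result() for f in futures]


events: "queue.Queue[dict]" = queue.Queue()
for doc_id in ("a", "b", "c"):
    events.put({"doc_id": doc_id})
results = drain(events)
print(results)
```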
---------
Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>