mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-06-29 15:31:05 +08:00
## Summary Closes #15720 `FulltextQueryer.paragraph` normalized its `content_tks` token string with `[c.strip() for c in content_tks.strip() ...]`, which iterates the string **character by character** — `"machine learning model"` becomes 20 single characters instead of 3 tokens. Those single chars are fed to `tw.weights(..., preprocess=False)`, producing meaningless term weights and a garbage `MatchTextExpr`. `paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature), so tag retrieval/scoring is silently broken for tag-enabled knowledge bases. Every other method in this file tokenizes with `.split()` — this is a `.strip()`-vs-`.split()` typo. ## Change - `rag/nlp/query.py` — change `content_tks.strip()` to `content_tks.split()` in the `paragraph` token-normalization line. ## Why it's safe - The caller passes a space-separated token string; `.split()` recovers the real tokens, matching the contract of `tw.weights` and the `.split()` tokenization used by the sibling methods (`similarity`, `question`). - No behavior depends on the per-character expansion. ## Verification - `python -m py_compile rag/nlp/query.py` — OK. - Demonstrated: `"machine learning model"` → 20 single-character entries before, 3 real tokens after. No test references `paragraph`. Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>