Mirror of https://github.com/infiniflow/ragflow.git (synced 2026-07-01 00:05:43 +08:00)
fix(nlp): tokenize content_tks by whitespace in FulltextQueryer.paragraph (#15721)
## Summary

Closes #15720

`FulltextQueryer.paragraph` normalized its `content_tks` token string with `[c.strip() for c in content_tks.strip() ...]`, which iterates the string **character by character**: `"machine learning model"` becomes 20 single characters instead of 3 tokens. Those single characters are fed to `tw.weights(..., preprocess=False)`, producing meaningless term weights and a garbage `MatchTextExpr`.

`paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature), so tag retrieval/scoring is silently broken for tag-enabled knowledge bases. Every other method in this file tokenizes with `.split()`; this is a `.strip()`-vs-`.split()` typo.

## Change

- `rag/nlp/query.py`: change `content_tks.strip()` to `content_tks.split()` in the `paragraph` token-normalization line.

## Why it's safe

- The caller passes a space-separated token string; `.split()` recovers the real tokens, matching the contract of `tw.weights` and the `.split()` tokenization used by the sibling methods (`similarity`, `question`).
- No behavior depends on the per-character expansion.

## Verification

- `python -m py_compile rag/nlp/query.py`: OK.
- Demonstrated: `"machine learning model"` → 20 single-character entries before, 3 real tokens after.

No test references `paragraph`.

Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@@ -223,7 +223,7 @@ class FulltextQueryer(QueryBase):
 
     def paragraph(self, content_tks: str, keywords: list = [], keywords_topn=30):
         if isinstance(content_tks, str):
-            content_tks = [c.strip() for c in content_tks.strip() if c.strip()]
+            content_tks = [c.strip() for c in content_tks.split() if c.strip()]
         tks_w = self.tw.weights(content_tks, preprocess=False)
 
         origin_keywords = keywords.copy()
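
The effect of the one-word change can be reproduced outside the repository. The sketch below copies only the normalization line from the diff into two standalone functions; `normalize_before` and `normalize_after` are hypothetical names used for illustration and do not exist in `rag/nlp/query.py`. It does not call `tw.weights` or any other RAGFlow code.

```python
def normalize_before(content_tks):
    """Pre-fix line: iterating a str yields one character at a time."""
    if isinstance(content_tks, str):
        content_tks = [c.strip() for c in content_tks.strip() if c.strip()]
    return content_tks


def normalize_after(content_tks):
    """Post-fix line: split() yields whitespace-delimited tokens."""
    if isinstance(content_tks, str):
        content_tks = [c.strip() for c in content_tks.split() if c.strip()]
    return content_tks


sample = "machine learning model"
before = normalize_before(sample)
after = normalize_after(sample)

print(len(before), before[:7])  # 20 ['m', 'a', 'c', 'h', 'i', 'n', 'e']
print(len(after), after)        # 3 ['machine', 'learning', 'model']
```

The spaces are dropped by the `if c.strip()` filter in both versions, which is why the pre-fix result has 20 entries rather than 22. Input that is already a list skips the `isinstance` branch and is returned unchanged by either function.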