From 68b93605362f88d4a71e40ff293ffd7bd9fb386d Mon Sep 17 00:00:00 2001
From: seekmistar01
Date: Mon, 8 Jun 2026 02:16:30 -0700
Subject: [PATCH] fix(nlp): tokenize content_tks by whitespace in FulltextQueryer.paragraph (#15721)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Closes #15720

`FulltextQueryer.paragraph` normalized its `content_tks` token string with `[c.strip() for c in content_tks.strip() ...]`, which iterates the string **character by character**: `"machine learning model"` becomes 20 single characters instead of 3 tokens. Those single chars are fed to `tw.weights(..., preprocess=False)`, producing meaningless term weights and a garbage `MatchTextExpr`.

`paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature), so tag retrieval/scoring is silently broken for tag-enabled knowledge bases. Every other method in this file tokenizes with `.split()`; this is a `.strip()`-vs-`.split()` typo.

## Change

- `rag/nlp/query.py`: change `content_tks.strip()` to `content_tks.split()` in the `paragraph` token-normalization line.

## Why it's safe

- The caller passes a space-separated token string; `.split()` recovers the real tokens, matching the contract of `tw.weights` and the `.split()` tokenization used by the sibling methods (`similarity`, `question`).
- No behavior depends on the per-character expansion.

## Verification

- `python -m py_compile rag/nlp/query.py`: OK.
- Demonstrated: `"machine learning model"` → 20 single-character entries before, 3 real tokens after. No test references `paragraph`.
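
The before/after counts quoted under Verification can be reproduced with a minimal standalone sketch. It copies only the two list comprehensions from the diff and does not import `rag.nlp.query`; the variable names `before` and `after` are illustrative, not part of the codebase.

```python
# Standalone illustration of the .strip() vs .split() difference.
# Nothing here touches FulltextQueryer; only the comprehension is reproduced.
content_tks = "machine learning model"

# Old expression: str.strip() returns a str, and iterating a str yields
# single characters. The `if c.strip()` filter only drops the spaces.
before = [c.strip() for c in content_tks.strip() if c.strip()]

# Fixed expression: str.split() with no arguments splits on runs of
# whitespace and returns the actual tokens.
after = [c.strip() for c in content_tks.split() if c.strip()]

print(len(before))  # 20
print(after)        # ['machine', 'learning', 'model']
```

The old expression does not raise, which is presumably why the bug stayed silent: downstream code still receives a list of non-empty strings, just the wrong ones.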

Co-authored-by: seekmistar01
Co-authored-by: Claude Opus 4.8
---
 rag/nlp/query.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rag/nlp/query.py b/rag/nlp/query.py
index db04eb3753..aefc15ed4e 100644
--- a/rag/nlp/query.py
+++ b/rag/nlp/query.py
@@ -223,7 +223,7 @@ class FulltextQueryer(QueryBase):
 
     def paragraph(self, content_tks: str, keywords: list = [], keywords_topn=30):
         if isinstance(content_tks, str):
-            content_tks = [c.strip() for c in content_tks.strip() if c.strip()]
+            content_tks = [c.strip() for c in content_tks.split() if c.strip()]
         tks_w = self.tw.weights(content_tks, preprocess=False)
 
         origin_keywords = keywords.copy()