From 68b93605362f88d4a71e40ff293ffd7bd9fb386d Mon Sep 17 00:00:00 2001
From: seekmistar01 <seekmistar@proton.me>
Date: Mon, 8 Jun 2026 02:16:30 -0700
Subject: [PATCH] fix(nlp): tokenize content_tks by whitespace in
 FulltextQueryer.paragraph (#15721)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary
Closes #15720

`FulltextQueryer.paragraph` normalized its `content_tks` token string
with `[c.strip() for c in content_tks.strip() ...]`. Iterating the result
of `.strip()` walks the string **character by character**, so
`"machine learning model"` becomes 20 single characters (the two spaces
are dropped by the `if c.strip()` filter) instead of 3 tokens. Those
single characters are fed to `tw.weights(..., preprocess=False)`,
producing meaningless term weights and a garbage `MatchTextExpr`.
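
A minimal standalone sketch of the difference, using only the two list
comprehensions from the diff. The wrapper names `normalize_before` and
`normalize_after` are hypothetical and exist only for this illustration;
they are not part of `rag/nlp/query.py`.

```python
def normalize_before(content_tks: str) -> list[str]:
    # Pre-patch expression: .strip() returns a str, and iterating a str
    # yields one character at a time.
    return [c.strip() for c in content_tks.strip() if c.strip()]


def normalize_after(content_tks: str) -> list[str]:
    # Patched expression: .split() returns whitespace-separated tokens.
    return [c.strip() for c in content_tks.split() if c.strip()]


sample = "machine learning model"
before = normalize_before(sample)
after = normalize_after(sample)

print(len(before), before[:7])  # 20 ['m', 'a', 'c', 'h', 'i', 'n', 'e']
print(len(after), after)        # 3 ['machine', 'learning', 'model']
```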

`paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature),
so tag retrieval/scoring is silently broken for tag-enabled knowledge
bases. Every other method in this file tokenizes with `.split()` — this
is a `.strip()`-vs-`.split()` typo.

## Change
- `rag/nlp/query.py` — change `content_tks.strip()` to
`content_tks.split()` in the `paragraph` token-normalization line.

## Why it's safe
- The caller passes a space-separated token string; `.split()` recovers
the real tokens, matching the contract of `tw.weights` and the
`.split()` tokenization used by the sibling methods (`similarity`,
`question`).
- No behavior depends on the per-character expansion.
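
As a rough check of the whitespace handling, the sketch below relies only
on the documented behavior of `str.split()` with no arguments: runs of
whitespace are collapsed, and leading or trailing whitespace produces no
empty entries. The `messy` input is a made-up example, not data from the
repository.

```python
# Hypothetical input with irregular spacing, a tab, and a trailing newline.
messy = "  machine   learning\tmodel \n"

tokens = messy.split()
print(tokens)  # ['machine', 'learning', 'model']

# The comprehension kept by the patch is a no-op on already-split tokens,
# so leaving the c.strip() guard in place is harmless.
filtered = [c.strip() for c in tokens if c.strip()]
print(filtered == tokens)  # True

# Empty or whitespace-only strings yield an empty token list.
print("".split(), "   ".split())  # [] []
```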

## Verification
- `python -m py_compile rag/nlp/query.py` — OK.
- Demonstrated: `"machine learning model"` → 20 single-character entries
before, 3 real tokens after.
- No existing test references `paragraph`.

Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
---
 rag/nlp/query.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rag/nlp/query.py b/rag/nlp/query.py
index db04eb3753..aefc15ed4e 100644
--- a/rag/nlp/query.py
+++ b/rag/nlp/query.py
@@ -223,7 +223,7 @@ class FulltextQueryer(QueryBase):
 
     def paragraph(self, content_tks: str, keywords: list = [], keywords_topn=30):
         if isinstance(content_tks, str):
-            content_tks = [c.strip() for c in content_tks.strip() if c.strip()]
+            content_tks = [c.strip() for c in content_tks.split() if c.strip()]
         tks_w = self.tw.weights(content_tks, preprocess=False)
 
         origin_keywords = keywords.copy()