ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Files

seekmistar01 68b9360536 fix(nlp): tokenize content_tks by whitespace in FulltextQueryer.paragraph (#15721)

## Summary
Closes #15720

`FulltextQueryer.paragraph` normalized its `content_tks` token string
with `[c.strip() for c in content_tks.strip() ...]`, which iterates the
string **character by character**: `"machine learning model"` becomes
20 single non-whitespace characters instead of 3 tokens. Those single
characters are fed to `tw.weights(..., preprocess=False)`, producing
meaningless term weights and a garbage `MatchTextExpr`.

`paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature),
so tag retrieval/scoring is silently broken for tag-enabled knowledge
bases. Every other method in this file tokenizes with `.split()` — this
is a `.strip()`-vs-`.split()` typo.

## Change
- `rag/nlp/query.py` — change `content_tks.strip()` to
`content_tks.split()` in the `paragraph` token-normalization line.
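The difference can be reproduced outside the codebase with a minimal sketch. The helper names `normalize_before` and `normalize_after` are hypothetical, and the `if c.strip()` filter is an assumption standing in for the elided tail of the comprehension (it is consistent with the 20-entry count reported in this description); the real line in `rag/nlp/query.py` may differ.

```python
def normalize_before(content_tks: str) -> list[str]:
    # Iterating over a str yields single characters, so this returns one
    # entry per non-whitespace character.
    # ASSUMPTION: the `if c.strip()` filter stands in for the elided part
    # of the original comprehension.
    return [c.strip() for c in content_tks.strip() if c.strip()]


def normalize_after(content_tks: str) -> list[str]:
    # str.split() with no arguments splits on runs of whitespace and
    # returns whole tokens.
    return [c.strip() for c in content_tks.split() if c.strip()]


sample = "machine learning model"
before = normalize_before(sample)
after = normalize_after(sample)

print(len(before), before[:7])  # 20 ['m', 'a', 'c', 'h', 'i', 'n', 'e']
print(len(after), after)        # 3 ['machine', 'learning', 'model']
```

Under these assumptions, only the iterable changes: `.strip()` returns a string (iterated per character), while `.split()` returns a list of tokens.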

## Why it's safe
- The caller passes a space-separated token string; `.split()` recovers
the real tokens, matching the contract of `tw.weights` and the
`.split()` tokenization used by the sibling methods (`similarity`,
`question`).
- No behavior depends on the per-character expansion.

## Verification
- `python -m py_compile rag/nlp/query.py` — OK.
- Demonstrated: `"machine learning model"` → 20 single-character entries
before, 3 real tokens after. No test references `paragraph`.

Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-08 17:16:30 +08:00

__init__.py

Feat: Refact pipeline (#13826)

2026-04-03 19:26:45 +08:00

query.py

fix(nlp): tokenize content_tks by whitespace in FulltextQueryer.paragraph (#15721)

2026-06-08 17:16:30 +08:00

rag_tokenizer.py

Support operator constraints in semi-automatic metadata filtering (#12956)

2026-02-03 11:11:34 +08:00

search.py

fix(rerank): normalize reranker scores onto a single scale before hybrid blend (#15429)