Commit Graph

64 Commits

Author SHA1 Message Date
seekmistar01
68b9360536 fix(nlp): tokenize content_tks by whitespace in FulltextQueryer.paragraph (#15721)
## Summary
Closes #15720

`FulltextQueryer.paragraph` normalized its `content_tks` token string
with `[c.strip() for c in content_tks.strip() ...]`, which iterates the
string **character by character** — `"machine learning model"` becomes
20 single characters instead of 3 tokens. Those single chars are fed to
`tw.weights(..., preprocess=False)`, producing meaningless term weights
and a garbage `MatchTextExpr`.

`paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature),
so tag retrieval/scoring is silently broken for tag-enabled knowledge
bases. Every other method in this file tokenizes with `.split()` — this
is a `.strip()`-vs-`.split()` typo.

## Change
- `rag/nlp/query.py` — change `content_tks.strip()` to
`content_tks.split()` in the `paragraph` token-normalization line.

## Why it's safe
- The caller passes a space-separated token string; `.split()` recovers
the real tokens, matching the contract of `tw.weights` and the
`.split()` tokenization used by the sibling methods (`similarity`,
`question`).
- No behavior depends on the per-character expansion.

## Verification
- `python -m py_compile rag/nlp/query.py` — OK.
- Demonstrated: `"machine learning model"` → 20 single-character entries
before, 3 real tokens after. No test references `paragraph`.

Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 17:16:30 +08:00
Ramin M.
765cdc2ec2 [Bug]: REDIS error #12870 (#13875)
Fix for: [Bug]: REDIS error #12870
2026-05-12 09:31:47 +08:00
Octopus
c2ce49e037 fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969)
Fixes #13823

## Problem

When querying with words like `cat`, RAGFlow's query expansion system
looks up synonyms via WordNet, which can return terms containing single
quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document
store, these unescaped single quotes in the query string cause a
`TokenError` because Infinity's lexer treats `'` as a string delimiter.

```
TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531
```

## Solution

Strip single quotes from synonym terms before they are inserted into
query expressions, consistent with how single quotes are already
stripped from the input query text (line 51 of `query.py`):

- **`common/query_base.py`**: In `sub_special_char()`, strip `'` before
escaping other special characters. This fixes the Chinese text
processing path and the `paragraph()` method.
- **`rag/nlp/query.py`**: In the English text path, strip `'` from
tokenized synonym terms.
- **`memory/services/query.py`**: Same fix for the memory query English
text path.

## Testing

The fix can be verified by:
1. Using Infinity as the document store (`DOC_ENGINE=infinity`)
2. Creating a dataset and running a retrieval test with the keyword
`cat`
3. Confirming no `TokenError` is raised and results are returned
normally

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced special character handling in query processing and synonym
expansion by properly sanitizing single quotes before text processing.
* Simplified OCR detection output by removing timing metadata while
preserving core detection accuracy.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: ximi <octo-patch@github.com>
2026-04-09 19:10:34 +08:00
qinling0210
0462c20113 Fix special characters in matching text of search() (#13852)
### What problem does this PR solve?

Fix special characters in matching text of search(). We should escape
some special characters(such as ?, *,:) before passing to matching_text
of search()

Fix https://github.com/infiniflow/ragflow/issues/13729

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-03-30 18:47:10 +08:00
Kevin Hu
1262533b74 Feat: support verify to set llm key and boost bigrams. (#12980)
#12863

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2026-02-05 19:19:09 +08:00
qinling0210
828ae1e82f Round float value of minimum_should_match (#12688)
### What problem does this PR solve?

In paragraph() of class FulltextQueryer, "len(keywords) / 10" should be
rounded to integer before set to minimum_should_match.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2026-01-19 11:39:33 +08:00
Jin Hai
01f0ced1e6 Fix IDE warnings (#12281)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-12-29 12:01:18 +08:00
Lynn
6e9691a419 Feat: message manage (#12196)
### What problem does this PR solve?

Manage message and use in agent.

Issue #4213 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-12-25 21:18:13 +08:00
He Wang
38234aca53 feat: add OceanBase doc engine (#11228)
### What problem does this PR solve?

Add OceanBase doc engine. Close #5350

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-11-20 10:00:14 +08:00
Jin Hai
296476ab89 Refactor function name (#11210)
### What problem does this PR solve?

As title

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2025-11-12 19:00:15 +08:00
Liu An
9e323a9351 Feat(nlp): add "怎么办" pattern to question word removal (#10284)
### What problem does this PR solve?

Added "怎么办" to the regex pattern in rmWWW method to improve query
cleaning by removing this common question phrase along with other
question words.

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-09-25 16:47:56 +08:00
Zhichang Yu
342a04ec8a Added infinity rank_feature support (#9044)
### What problem does this PR solve?

Added infinity rank_feature support

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-07-29 09:14:23 +08:00
Sol
0d7cfce6e1 Update rag/nlp/query.py (#7816)
### What problem does this PR solve?
Fix tokenizer resulting in low recall

![37743d3a495f734aa69f1e173fa77457](https://github.com/user-attachments/assets/1394757e-8fcb-4f87-96af-a92716144884)

![4aba633a17f34269a4e17e84fafb34c4](https://github.com/user-attachments/assets/a1828e32-3e17-4394-a633-ba3f09bd506d)

![image](https://github.com/user-attachments/assets/61308f32-2a4f-44d5-a034-d65bbec554ef)



### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2025-05-23 17:13:37 +08:00
Kevin Hu
a14865e6bb Fix: empty query issue. (#7551)
### What problem does this PR solve?

#5214

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-05-09 12:20:19 +08:00
Kevin Hu
c7310f7fb2 Refa: similarity calculations. (#7381)
### What problem does this PR solve?


### Type of change

- [x] Refactoring
2025-04-28 19:17:11 +08:00
Kevin Hu
0758c04941 Refa: token similarity calculations. (#6614)
### What problem does this PR solve?

#6507

### Type of change

- [x] Performance Improvement
2025-03-28 09:33:08 +08:00
Kevin Hu
15736c57c3 Fix: empty query issue. (#5830)
### What problem does this PR solve?

#5214

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-03-10 13:56:56 +08:00
Kevin Hu
4f40f685d9 Code refactor (#5371)
### What problem does this PR solve?

#5173

### Type of change

- [x] Refactoring
2025-02-26 15:40:52 +08:00
Kevin Hu
53b9e7b52f Add tavily as web searh tool. (#5349)
### What problem does this PR solve?

#5198

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-02-26 10:21:04 +08:00
Kevin Hu
cdb3e6434a Fix empty question issue. (#5225)
### What problem does this PR solve?

#5241

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-21 15:47:39 +08:00
Kevin Hu
6f2c3a3c3c Fix too long query exception. (#4729)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2025-02-06 10:11:52 +08:00
Kevin Hu
c5da3cdd97 Tagging (#4426)
### What problem does this PR solve?

#4367

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2025-01-09 17:07:21 +08:00
Kevin Hu
f948c0d9f1 Clean query. (#4259)
### What problem does this PR solve?

#4239

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-27 14:25:03 +08:00
Kevin Hu
927873bfa6 Fix syn error. (#3953)
### What problem does this PR solve?

Close #3696
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-12-10 10:54:54 +08:00
Zhichang Yu
0d68a6cd1b Fix errors detected by Ruff (#3918)
### What problem does this PR solve?

Fix errors detected by Ruff

### Type of change

- [x] Refactoring
2024-12-08 14:21:12 +08:00
Kevin Hu
56f473b680 Feat: Add question parameter to edit chunk modal (#3875)
### What problem does this PR solve?

Close #3873

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-12-05 14:51:19 +08:00
Kevin Hu
1b817a5b4c Refine synonym query. (#3855)
### What problem does this PR solve?

### Type of change

- [x] Performance Improvement
2024-12-04 17:20:12 +08:00
Zhichang Yu
bc701d7b4c Edit chunk shall update instead of insert it (#3709)
### What problem does this PR solve?

Edit chunk shall update instead of insert it. Close #3679 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-28 13:00:38 +08:00
Kevin Hu
57208d8e53 Fix batch size issue. (#3675)
### What problem does this PR solve?

#3657

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-27 18:06:43 +08:00
Kevin Hu
ca9e97d2f2 Enlarge the term weight difference (#3435)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2024-11-15 15:41:50 +08:00
Kevin Hu
48e060aa53 rm es query escape chars (#3428)
### What problem does this PR solve?

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-15 13:19:07 +08:00
Kevin Hu
a1ba228bc2 fix: empty token bug (#3424)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-15 10:33:03 +08:00
Kevin Hu
220aaddc62 fix: synonym bug (#3423)
### What problem does this PR solve?


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-11-15 10:14:51 +08:00
Zhichang Yu
30f6421760 Use consistent log file names, introduced initLogger (#3403)
### What problem does this PR solve?

Use consistent log file names, introduced initLogger

### Type of change

- [ ] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [x] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
2024-11-14 17:13:48 +08:00
Kevin Hu
91332fa0f8 Refine english synonym (#3371)
### What problem does this PR solve?

#3361

### Type of change

- [x] Performance Improvement
2024-11-13 12:58:37 +08:00
Zhichang Yu
f4c52371ab Integration with Infinity (#2894)
### What problem does this PR solve?

Integration with Infinity

- Replaced ELASTICSEARCH with dataStoreConn
- Renamed deleteByQuery with delete
- Renamed bulk to upsertBulk
- getHighlight, getAggregation
- Fix KGSearch.search
- Moved Dealer.sql_retrieval to es_conn.py


### Type of change

- [x] Refactoring
2024-11-12 14:59:41 +08:00
Kevin Hu
d88f0d43ea make language judgement robuster (#3287)
### What problem does this PR solve?



### Type of change

- [x] Performance Improvement
2024-11-08 12:48:11 +08:00
Kevin Hu
55953819c1 accelerate term weight calculation (#3206)
### What problem does this PR solve?



### Type of change

- [x] Performance Improvement
2024-11-05 13:11:26 +08:00
Kevin Hu
b164116277 refine token similarity (#2824)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2024-10-14 13:33:18 +08:00
Kevin Hu
54342ae0a2 boost highlight performace (#2419)
### What problem does this PR solve?

#2415

### Type of change

- [x] Performance Improvement
2024-09-13 18:10:32 +08:00
Kevin Hu
5a2c542ce2 make term similarity robust (#2212)
### What problem does this PR solve?


### Type of change

- [x] Performance Improvement
2024-09-03 14:30:07 +08:00
Kevin Hu
6d232f1bdb enable 3 char words to finegrind tokenize (#2210)
### What problem does this PR solve?


### Type of change


- [x] Performance Improvement
2024-09-03 13:37:32 +08:00
Kevin Hu
642006c8e2 filter out + in es query (#2046)
### What problem does this PR solve?

#2028

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ]
2024-08-22 10:02:04 +08:00
KevinHuSh
e35f7610e7 fix too long query exception (#1195)
### What problem does this PR solve?

#1161 
### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-06-18 09:50:59 +08:00
KevinHuSh
4454ba7a1e add self-rag (#1070)
### What problem does this PR solve?

#1069 

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-06-06 11:13:39 +08:00
Jin Hai
9ed0e50f6b Update info (#1005)
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
2024-05-31 09:53:04 +08:00
KevinHuSh
758eb03ccb fix jina adding issure and term weight refinement (#974)
### What problem does this PR solve?

#724 #162

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
2024-05-29 19:38:57 +08:00
KevinHuSh
614defec21 add rerank model (#969)
### What problem does this PR solve?

feat: add rerank models to the project #724 #162

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
2024-05-29 16:50:02 +08:00
KevinHuSh
2b36283712 fix english query bug (#840)
### What problem does this PR solve?

#834 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
2024-05-20 12:23:51 +08:00
Fakai Zhao
de839fc3f0 optimize srv broker and executor logic (#630)
### What problem does this PR solve?

Optimize task broker and executor for reduce memory usage and deployment
complexity.

### Type of change
- [x] Performance Improvement
- [x] Refactoring

### Change Log
- Enhance redis utils for message queue(use stream)
- Modify task broker logic via message queue (1.get parse event from
message queue 2.use ThreadPoolExecutor async executor )
- Modify the table column name of document and task (process_duation ->
process_duration maybe just a spelling mistake)
- Reformat some code style(just what i see)
- Add requirement_dev.txt for developer
- Add redis container on docker compose

---------

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
2024-05-07 11:43:33 +08:00