ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-02 08:45:42 +08:00

Author	SHA1	Message	Date
seekmistar01	68b9360536	fix(nlp): tokenize content_tks by whitespace in FulltextQueryer.paragraph (#15721 ) ## Summary Closes #15720 `FulltextQueryer.paragraph` normalized its `content_tks` token string with `[c.strip() for c in content_tks.strip() ...]`, which iterates the string character by character — `"machine learning model"` becomes 20 single characters instead of 3 tokens. Those single chars are fed to `tw.weights(..., preprocess=False)`, producing meaningless term weights and a garbage `MatchTextExpr`. `paragraph()` backs `Dealer.tag_content` (the KB auto-tagging feature), so tag retrieval/scoring is silently broken for tag-enabled knowledge bases. Every other method in this file tokenizes with `.split()` — this is a `.strip()`-vs-`.split()` typo. ## Change - `rag/nlp/query.py` — change `content_tks.strip()` to `content_tks.split()` in the `paragraph` token-normalization line. ## Why it's safe - The caller passes a space-separated token string; `.split()` recovers the real tokens, matching the contract of `tw.weights` and the `.split()` tokenization used by the sibling methods (`similarity`, `question`). - No behavior depends on the per-character expansion. ## Verification - `python -m py_compile rag/nlp/query.py` — OK. - Demonstrated: `"machine learning model"` → 20 single-character entries before, 3 real tokens after. No test references `paragraph`. Co-authored-by: seekmistar01 <seekmistar01@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-08 17:16:30 +08:00
Ramin M.	765cdc2ec2	[Bug]: REDIS error #12870 (#13875 ) Fix for: [Bug]: REDIS error #12870	2026-05-12 09:31:47 +08:00
Octopus	c2ce49e037	fix: strip single quotes from synonym terms to prevent Infinity TokenError (#13969 ) Fixes #13823 ## Problem When querying with words like `cat`, RAGFlow's query expansion system looks up synonyms via WordNet, which can return terms containing single quotes (e.g., `cat-o'-nine-tails`). When using Infinity as the document store, these unescaped single quotes in the query string cause a `TokenError` because Infinity's lexer treats `'` as a string delimiter. ``` TokenError: Error tokenizing ' OR "big cat" OR "computerized tomography")^0.7)': Missing ' from 1:531 ``` ## Solution Strip single quotes from synonym terms before they are inserted into query expressions, consistent with how single quotes are already stripped from the input query text (line 51 of `query.py`): - `common/query_base.py`: In `sub_special_char()`, strip `'` before escaping other special characters. This fixes the Chinese text processing path and the `paragraph()` method. - `rag/nlp/query.py`: In the English text path, strip `'` from tokenized synonym terms. - `memory/services/query.py`: Same fix for the memory query English text path. ## Testing The fix can be verified by: 1. Using Infinity as the document store (`DOC_ENGINE=infinity`) 2. Creating a dataset and running a retrieval test with the keyword `cat` 3. Confirming no `TokenError` is raised and results are returned normally <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * Bug Fixes * Enhanced special character handling in query processing and synonym expansion by properly sanitizing single quotes before text processing. * Simplified OCR detection output by removing timing metadata while preserving core detection accuracy. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: ximi <octo-patch@github.com>	2026-04-09 19:10:34 +08:00
qinling0210	0462c20113	Fix special characters in matching text of search() (#13852 ) ### What problem does this PR solve? Fix special characters in matching text of search(). We should escape some special characters(such as ?, *,:) before passing to matching_text of search() Fix https://github.com/infiniflow/ragflow/issues/13729 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-30 18:47:10 +08:00
Kevin Hu	1262533b74	Feat: support verify to set llm key and boost bigrams. (#12980 ) #12863 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-02-05 19:19:09 +08:00
qinling0210	828ae1e82f	Round float value of minimum_should_match (#12688 ) ### What problem does this PR solve? In paragraph() of class FulltextQueryer, "len(keywords) / 10" should be rounded to integer before set to minimum_should_match. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-19 11:39:33 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Lynn	6e9691a419	Feat: message manage (#12196 ) ### What problem does this PR solve? Manage message and use in agent. Issue #4213 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 21:18:13 +08:00
He Wang	38234aca53	feat: add OceanBase doc engine (#11228 ) ### What problem does this PR solve? Add OceanBase doc engine. Close #5350 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-11-20 10:00:14 +08:00
Jin Hai	296476ab89	Refactor function name (#11210 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-12 19:00:15 +08:00
Liu An	9e323a9351	Feat(nlp): add "怎么办" pattern to question word removal (#10284 ) ### What problem does this PR solve? Added "怎么办" to the regex pattern in rmWWW method to improve query cleaning by removing this common question phrase along with other question words. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-09-25 16:47:56 +08:00
Zhichang Yu	342a04ec8a	Added infinity rank_feature support (#9044 ) ### What problem does this PR solve? Added infinity rank_feature support ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-29 09:14:23 +08:00
Sol	0d7cfce6e1	Update rag/nlp/query.py (#7816 ) ### What problem does this PR solve? Fix tokenizer resulting in low recall ![37743d3a495f734aa69f1e173fa77457](https://github.com/user-attachments/assets/1394757e-8fcb-4f87-96af-a92716144884) ![4aba633a17f34269a4e17e84fafb34c4](https://github.com/user-attachments/assets/a1828e32-3e17-4394-a633-ba3f09bd506d) ![image](https://github.com/user-attachments/assets/61308f32-2a4f-44d5-a034-d65bbec554ef) ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-05-23 17:13:37 +08:00
Kevin Hu	a14865e6bb	Fix: empty query issue. (#7551 ) ### What problem does this PR solve? #5214 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-05-09 12:20:19 +08:00
Kevin Hu	c7310f7fb2	Refa: similarity calculations. (#7381 ) ### What problem does this PR solve? ### Type of change - [x] Refactoring	2025-04-28 19:17:11 +08:00
Kevin Hu	0758c04941	Refa: token similarity calculations. (#6614 ) ### What problem does this PR solve? #6507 ### Type of change - [x] Performance Improvement	2025-03-28 09:33:08 +08:00
Kevin Hu	15736c57c3	Fix: empty query issue. (#5830 ) ### What problem does this PR solve? #5214 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-03-10 13:56:56 +08:00
Kevin Hu	4f40f685d9	Code refactor (#5371 ) ### What problem does this PR solve? #5173 ### Type of change - [x] Refactoring	2025-02-26 15:40:52 +08:00
Kevin Hu	53b9e7b52f	Add tavily as web search tool. (#5349 ) ### What problem does this PR solve? #5198 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-02-26 10:21:04 +08:00
Kevin Hu	cdb3e6434a	Fix empty question issue. (#5225 ) ### What problem does this PR solve? #5241 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-21 15:47:39 +08:00
Kevin Hu	6f2c3a3c3c	Fix too long query exception. (#4729 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-02-06 10:11:52 +08:00
Kevin Hu	c5da3cdd97	Tagging (#4426 ) ### What problem does this PR solve? #4367 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-01-09 17:07:21 +08:00
Kevin Hu	f948c0d9f1	Clean query. (#4259 ) ### What problem does this PR solve? #4239 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-12-27 14:25:03 +08:00
Kevin Hu	927873bfa6	Fix syn error. (#3953 ) ### What problem does this PR solve? Close #3696 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-12-10 10:54:54 +08:00
Zhichang Yu	0d68a6cd1b	Fix errors detected by Ruff (#3918 ) ### What problem does this PR solve? Fix errors detected by Ruff ### Type of change - [x] Refactoring	2024-12-08 14:21:12 +08:00
Kevin Hu	56f473b680	Feat: Add question parameter to edit chunk modal (#3875 ) ### What problem does this PR solve? Close #3873 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-12-05 14:51:19 +08:00
Kevin Hu	1b817a5b4c	Refine synonym query. (#3855 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2024-12-04 17:20:12 +08:00
Zhichang Yu	bc701d7b4c	Edit chunk shall update instead of insert it (#3709 ) ### What problem does this PR solve? Edit chunk shall update instead of insert it. Close #3679 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-28 13:00:38 +08:00
Kevin Hu	57208d8e53	Fix batch size issue. (#3675 ) ### What problem does this PR solve? #3657 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-27 18:06:43 +08:00
Kevin Hu	ca9e97d2f2	Enlarge the term weight difference (#3435 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2024-11-15 15:41:50 +08:00
Kevin Hu	48e060aa53	rm es query escape chars (#3428 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-15 13:19:07 +08:00
Kevin Hu	a1ba228bc2	fix: empty token bug (#3424 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-15 10:33:03 +08:00
Kevin Hu	220aaddc62	fix: synonym bug (#3423 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-11-15 10:14:51 +08:00
Zhichang Yu	30f6421760	Use consistent log file names, introduced initLogger (#3403 ) ### What problem does this PR solve? Use consistent log file names, introduced initLogger ### Type of change - [x] Refactoring	2024-11-14 17:13:48 +08:00
Kevin Hu	91332fa0f8	Refine english synonym (#3371 ) ### What problem does this PR solve? #3361 ### Type of change - [x] Performance Improvement	2024-11-13 12:58:37 +08:00
Zhichang Yu	f4c52371ab	Integration with Infinity (#2894 ) ### What problem does this PR solve? Integration with Infinity - Replaced ELASTICSEARCH with dataStoreConn - Renamed deleteByQuery with delete - Renamed bulk to upsertBulk - getHighlight, getAggregation - Fix KGSearch.search - Moved Dealer.sql_retrieval to es_conn.py ### Type of change - [x] Refactoring	2024-11-12 14:59:41 +08:00
Kevin Hu	d88f0d43ea	make language judgement more robust (#3287 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2024-11-08 12:48:11 +08:00
Kevin Hu	55953819c1	accelerate term weight calculation (#3206 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2024-11-05 13:11:26 +08:00
Kevin Hu	b164116277	refine token similarity (#2824 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2024-10-14 13:33:18 +08:00
Kevin Hu	54342ae0a2	boost highlight performance (#2419 ) ### What problem does this PR solve? #2415 ### Type of change - [x] Performance Improvement	2024-09-13 18:10:32 +08:00
Kevin Hu	5a2c542ce2	make term similarity robust (#2212 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2024-09-03 14:30:07 +08:00
Kevin Hu	6d232f1bdb	enable fine-grained tokenization of 3-char words (#2210 ) ### What problem does this PR solve? ### Type of change - [x] Performance Improvement	2024-09-03 13:37:32 +08:00
Kevin Hu	642006c8e2	filter out + in es query (#2046 ) ### What problem does this PR solve? #2028 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-08-22 10:02:04 +08:00
KevinHuSh	e35f7610e7	fix too long query exception (#1195 ) ### What problem does this PR solve? #1161 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-06-18 09:50:59 +08:00
KevinHuSh	4454ba7a1e	add self-rag (#1070 ) ### What problem does this PR solve? #1069 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-06-06 11:13:39 +08:00
Jin Hai	9ed0e50f6b	Update info (#1005 ) ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2024-05-31 09:53:04 +08:00
KevinHuSh	758eb03ccb	fix jina adding issue and term weight refinement (#974 ) ### What problem does this PR solve? #724 #162 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2024-05-29 19:38:57 +08:00
KevinHuSh	614defec21	add rerank model (#969 ) ### What problem does this PR solve? feat: add rerank models to the project #724 #162 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2024-05-29 16:50:02 +08:00
KevinHuSh	2b36283712	fix english query bug (#840 ) ### What problem does this PR solve? #834 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2024-05-20 12:23:51 +08:00
Fakai Zhao	de839fc3f0	optimize srv broker and executor logic (#630 ) ### What problem does this PR solve? Optimize the task broker and executor to reduce memory usage and deployment complexity. ### Type of change - [x] Performance Improvement - [x] Refactoring ### Change Log - Enhance redis utils for message queue (use stream) - Modify task broker logic via message queue (1. get parse event from message queue 2. use ThreadPoolExecutor async executor) - Rename the document and task table column process_duation -> process_duration (apparently a spelling mistake) - Reformat some code style (only where noticed) - Add requirement_dev.txt for developers - Add redis container to docker compose --------- Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2024-05-07 11:43:33 +08:00
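
The `.strip()`-vs-`.split()` bug described in commit 68b9360536 can be reproduced in isolation. The sketch below is not the code from `rag/nlp/query.py`: the function names are hypothetical, and the `if c.strip()` filter is an assumption, since the commit message elides that part of the comprehension. It only illustrates why iterating over a stripped string yields characters rather than tokens.

```python
def tokens_by_strip(content_tks: str) -> list[str]:
    # Hypothetical stand-in for the buggy line: str.strip() returns a str,
    # and iterating over a str yields one character at a time.
    # The `if c.strip()` filter is an assumption (elided in the commit message).
    return [c.strip() for c in content_tks.strip() if c.strip()]


def tokens_by_split(content_tks: str) -> list[str]:
    # Hypothetical stand-in for the fixed line: str.split() with no arguments
    # splits on runs of whitespace and returns whole tokens.
    return [c.strip() for c in content_tks.split() if c.strip()]


sample = "machine learning model"
print(len(tokens_by_strip(sample)))  # number of single-character entries
print(tokens_by_split(sample))       # whitespace-separated tokens
```

With the sample string from the commit message, the first variant yields 20 single-character entries and the second yields the 3 real tokens, matching the counts reported in the Verification section.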


64 Commits
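
The single-quote fix described in commit c2ce49e037 amounts to sanitizing synonym terms before they are placed into a query expression. The following is a minimal sketch under stated assumptions: `strip_single_quotes` and `build_synonym_clause` are hypothetical helpers, and the `("a" OR "b")^0.7` clause shape is only inferred from the `TokenError` text quoted in that commit, not taken from the RAGFlow source.

```python
def strip_single_quotes(term: str) -> str:
    # Remove every single quote so a lexer that treats ' as a string
    # delimiter never sees an unbalanced one.
    return term.replace("'", "")


def build_synonym_clause(synonyms: list[str], boost: float = 0.7) -> str:
    # Hypothetical clause builder: quote each cleaned synonym, join with OR,
    # and append a boost factor. Empty terms are skipped.
    cleaned = [strip_single_quotes(s) for s in synonyms]
    quoted = " OR ".join(f'"{s}"' for s in cleaned if s)
    return f"({quoted})^{boost}"


clause = build_synonym_clause(["cat-o'-nine-tails", "big cat"])
print(clause)
```

Stripping rather than escaping mirrors what the commit message describes; escaping would preserve the original term but depends on the escaping rules of the document store in use.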