ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-05 19:08:38 +08:00

Author	SHA1	Message	Date
Rintaro	453ade288c	fix(opensearch): keep "id" in _source on insert so document metadata isn't empty (#15473 ) ### What problem does this PR solve? Follow-up to #15393. After #15393 fixed the OpenSearch `search()` signature and the doc-meta mapping, document metadata still renders as "0 fields" for every document on the OpenSearch backend (`DOC_ENGINE=opensearch`). Root cause. `OSConnection.insert()` pops `id` out of the document before indexing: meta_id = d_copy.pop("id", "") # id used as _id, then DROPPED from _source so the stored `_source` never contains an `id` field. But the doc-meta read path filters and sorts on that field: - `DocMetadataService.get_metadata_for_documents()` builds `condition = {"kb_id": kb_id, "id": doc_ids}` -> `OSConnection.search()` emits `Q("terms", id=doc_ids)` (a term query on the `id` field), and - `_search_metadata()` sorts with `order_by.asc("id")`. With `id` absent from `_source`, the terms filter matches nothing, so `get_metadata_for_documents()` returns an empty map and the UI shows "0 fields" -- even though the metadata was written correctly (it is visible via a kb_id-only query). `ESConnection.insert()` already keeps `id` (`d_copy.get("id", "")`) with the comment "also keep 'id' as a regular field for sorting". This is a plain OpenSearch-only divergence (`pop()` vs `get()`). ### Fix Mirror Elasticsearch: use `get("id")` instead of `pop("id")` so `id` survives in `_source`. The doc-meta mapping already declares `id` as `keyword`, so the field is searchable/sortable once populated. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Affected backends OpenSearch only. Elasticsearch already keeps `id`; Infinity / OceanBase unaffected. ### How to reproduce 1. `DOC_ENGINE=opensearch`, create a KB, upload/parse a document, set metadata. 2. Open the document list -> every document shows "0 fields" (the metadata exists in the `ragflow_doc_meta_` index but its `_source` has no `id` field). 
### Risk & backward compatibility `insert()` is shared with the main chunk index; keeping `id` in `_source` brings OpenSearch in line with Elasticsearch (which already does this), so it is parity, not new behavior. No default / ES / Infinity / OceanBase behavior change. Note: affects new inserts only. Existing `ragflow_doc_meta_` indices created before this change have no `id` in `_source`; re-sync metadata, or backfill once with `_update_by_query` (`ctx._source.id = ctx._id`). ### Test plan - [ ] OpenSearch: after the fix the document list shows correct metadata field counts (not "0 fields"); metadata filter/sort by id works. - [ ] Elasticsearch regression: unchanged.	2026-06-08 17:31:04 +08:00
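The `pop()` versus `get()` divergence described in commit 453ade288c can be illustrated with plain dictionaries. This is a minimal sketch, not the actual `OSConnection.insert()` code: the helper names `split_with_pop` and `split_with_get` are hypothetical, and only the `d_copy.pop("id", "")` / `d_copy.get("id", "")` calls come from the commit message.

```python
def split_with_pop(doc: dict) -> tuple:
    """Old OpenSearch behavior: `id` becomes `_id` and is dropped from the body."""
    d_copy = dict(doc)
    meta_id = d_copy.pop("id", "")  # removes "id" from what will be stored as _source
    return meta_id, d_copy


def split_with_get(doc: dict) -> tuple:
    """Fixed behavior (mirrors Elasticsearch): `id` becomes `_id` and stays in the body."""
    d_copy = dict(doc)
    meta_id = d_copy.get("id", "")  # reads "id" without removing it
    return meta_id, d_copy


# Hypothetical document, for illustration only.
doc = {"id": "doc-1", "kb_id": "kb-1", "meta_fields": {"author": "alice"}}

_, source_old = split_with_pop(doc)
_, source_new = split_with_get(doc)

print("id" in source_old)  # a terms filter on "id" cannot match this body
print("id" in source_new)  # a terms filter or sort on "id" can use this body
```

With the `pop()` variant the stored body has no `id` field, which is why a filter or sort on `id` finds nothing even though the metadata itself was written.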
Danut Matei	e2b0da9eea	fix(opensearch): keep the BM25 leg in hybrid search (#15760 ) ### What problem does this PR solve? Fixes the OpenSearch side of #10747: hybrid search drops the keyword (BM25) leg and ends up doing plain vector search. When a search has both a text and a vector leg, `OSConnection.search()` throws the text query away: del q["query"] q["query"] = {"knn": knn_query} The text clause only stays on as a filter inside the knn query, so it narrows the candidate set but doesn't count towards scoring. So hybrid search on OpenSearch behaves like plain vector search, unlike the Elasticsearch backend. What I changed: - when both legs are present, send a real hybrid query `{"hybrid": {"queries": [bm25, {"knn": ...}]}}` and let a normalization-processor search pipeline score and combine the two legs - only the actual filters (kb_id, available_int, ...) go in the knn filter, not the text must clause - create the pipeline on startup if it's missing, so there's no separate provisioning step. name and weights can be set under `os:` in service_conf.yaml, or via `OS_HYBRID_PIPELINE`; defaults are `ragflow_hybrid_pipeline` and `[0.5, 0.5]` - normalization-processor needs OpenSearch 2.10+. on older clusters, or when the pipeline can't be created, log a warning and fall back to vector-only instead of pointing at a pipeline that doesn't exist This is only the hybrid-search fix; `create_doc_meta_idx` is already on main. Testing (there's no OpenSearch path in CI): added a unit test (`test/unit_test/rag/utils/test_opensearch_hybrid_search.py`, no services needed) that checks the query built in each case — hybrid + pipeline param for text+vector, plain knn for vector-only, plain bool for text-only, the knn filter never carrying the text query_string, and the vector-only fallback when the pipeline isn't available. 
Also ran it against a real OpenSearch 2.19.1 container with a doc that matches the keyword but sits outside the knn top-k: pure knn returns `['D1','D2','D5']` (keyword doc missing), the hybrid query returns `['A','D1','D2','D5']` (keyword doc present). ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: Danut Matei <matei.danut.dm@gmail.com>	2026-06-08 16:17:47 +08:00
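The query shape described in commit e2b0da9eea can be sketched as plain request bodies. This is an illustration under stated assumptions, not the code from the PR: the function names, the `q_vec` field name, and the `min_max` / `arithmetic_mean` techniques are assumptions; only the `{"hybrid": {"queries": [bm25, {"knn": ...}]}}` shape, the normalization-processor, and the `[0.5, 0.5]` default weights come from the commit message.

```python
def build_pipeline_body(weights: list) -> dict:
    """Body for a search pipeline with a normalization-processor (techniques assumed)."""
    return {
        "description": "Hybrid search score normalization (illustrative)",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": weights},
                    },
                }
            }
        ],
    }


def build_hybrid_body(bm25_query: dict, vector_field: str, vector: list,
                      k: int, filters: list) -> dict:
    """Combine a text leg and a knn leg; only real filters go into the knn filter."""
    knn_params = {"vector": vector, "k": k}
    if filters:
        # The text clause is deliberately NOT placed here, so it counts towards scoring.
        knn_params["filter"] = {"bool": {"filter": filters}}
    knn_leg = {"knn": {vector_field: knn_params}}
    return {"query": {"hybrid": {"queries": [bm25_query, knn_leg]}}}


body = build_hybrid_body(
    bm25_query={"query_string": {"query": "refund policy"}},
    vector_field="q_vec",  # hypothetical field name
    vector=[0.1, 0.2, 0.3],
    k=3,
    filters=[{"term": {"kb_id": "kb-1"}}],
)
pipeline = build_pipeline_body([0.5, 0.5])
```

The hybrid body is then sent together with the pipeline name as the `search_pipeline` request parameter, so that the two legs are normalized and combined on the server side.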
Rintaro	11af34a895	fix(opensearch): repair document-metadata path broken by #14577 (#15393 ) ### What problem does this PR solve? Document metadata is completely broken on the OpenSearch backend (`DOC_ENGINE=opensearch`). Both failures were introduced by #14577, which added a doc-metadata dispatch surface but only validated it against Elasticsearch. 1. Index creation rejected (`mapper_parsing_exception`). `OSConnection.create_doc_meta_idx` feeds `conf/doc_meta_es_mapping.json` verbatim to OpenSearch. That file declares a top-level `"dynamic": "runtime"`. Runtime fields are Elasticsearch-only; OpenSearch cannot parse the value: mapper_parsing_exception: Could not convert [dynamic.dynamic] to boolean (400) 2. `search()` signature mismatch (`TypeError`). `DocMetadataService` (added by #14577) calls `docStoreConn.search(...)` with snake_case kwargs (`select_fields=`, `index_names=`, `knowledgebase_ids=`, …), matching `ESConnection.search`. But `OSConnection.search` still uses camelCase parameters (`selectFields`, `indexNames`, `knowledgebaseIds`, …): TypeError: OSConnection.search() got an unexpected keyword argument 'select_fields' The UI then shows "0 fields" for every document on OpenSearch. ### Fix 1. In `OSConnection.create_doc_meta_idx`, normalize a top-level `"dynamic": "runtime"` to `True` for the OpenSearch request only. The shared mapping file is left untouched, so the Elasticsearch backend keeps its runtime-field behavior. Dynamic field discovery is preserved on OpenSearch. 2. Rename the `OSConnection.search()` parameters (and their in-method local uses) from camelCase to snake_case so they match `ESConnection.search()` and the `DocMetadataService` call sites. The change is confined to `search()`; `get/insert/update/delete` keep their existing positional signatures (they are called positionally from `rag/nlp/search.py`). ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Affected backends OpenSearch only. 
Elasticsearch, Infinity and OceanBase are untouched. ### How to reproduce 1. `DOC_ENGINE=opensearch`, restart the stack. 2. Upload/parse a document, then open the dataset's document list / set metadata. - Before: index creation 400s (`Could not convert [dynamic.dynamic]`), and/or `TypeError ... 'select_fields'`; document metadata shows 0 fields. ### Risk & backward compatibility - ES default deployment: no change. `doc_meta_es_mapping.json` is not modified, so ES still receives `"dynamic": "runtime"`. - `search()` rename is internal; the only kwarg caller (`DocMetadataService`) already uses the snake_case names this PR aligns to. ### Test plan - [ ] `DOC_ENGINE=opensearch`: per-tenant `ragflow_doc_meta_*` index is created (no `mapper_parsing_exception`); document metadata reads/writes work. - [ ] `DOC_ENGINE=elasticsearch` regression: doc-meta index still created with runtime mapping; metadata unchanged.	2026-05-29 21:49:36 +08:00
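The first fix in commit 11af34a895 (normalizing `"dynamic": "runtime"` for the OpenSearch request only) might look roughly like the following. This is a sketch under assumptions: the function name `normalize_dynamic_for_opensearch` is hypothetical, and the mapping is assumed to be a dict that carries `dynamic` at its top level.

```python
import copy


def normalize_dynamic_for_opensearch(mapping: dict) -> dict:
    """Return a copy of the mapping with a top-level "dynamic": "runtime" set to True.

    The input is not modified, so the shared mapping used by Elasticsearch
    keeps its runtime-field setting.
    """
    normalized = copy.deepcopy(mapping)
    if normalized.get("dynamic") == "runtime":
        normalized["dynamic"] = True
    return normalized


# Hypothetical, heavily shortened mapping for illustration.
shared_mapping = {"dynamic": "runtime", "properties": {"id": {"type": "keyword"}}}
os_mapping = normalize_dynamic_for_opensearch(shared_mapping)
```

Working on a copy is what keeps the change confined to the OpenSearch request: the shared mapping still says `"runtime"` for the Elasticsearch backend.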
Rintaro	3dfc16973c	fix(opensearch): implement get_scores for KNN second-pass scoring (#15390 ) ### What problem does this PR solve? On the OpenSearch backend (`DOC_ENGINE=opensearch`), every retrieval that performs the KNN second-pass scoring crashes with: AttributeError: 'OSConnection' object has no attribute 'get_scores' Root cause. #14970 ("Refactor: Drop the vector fetch for ES") added a `get_scores()` helper to `ESConnectionBase` (`common/doc_store/es_conn_base.py`) and introduced `Dealer._knn_scores()` in `rag/nlp/search.py`, which calls `self.dataStore.get_scores(res)`. `search.py` routes Infinity and OceanBase to their own similarity paths via `DOC_ENGINE_INFINITY` / `DOC_ENGINE_OCEANBASE`, but OpenSearch sets neither flag, so it falls into the Elasticsearch branch and calls `get_scores`. `OSConnection` (which subclasses `DocStoreConnection` directly, not `ESConnectionBase`) never received that method, so any vector-search hit triggers the crash. It reproduces with any normal embedding (e.g. 1024-dim mistral-embed) as soon as a KNN query returns hits. ### Fix Add `OSConnection.get_scores()`, mirroring `ESConnectionBase.get_scores()`. OpenSearch hit headers expose `_score` exactly like Elasticsearch (the existing `OSConnection.__getSource` already reads `d["_score"]`), so the implementation is identical. Scope note: Infinity and OceanBase deliberately do not use `get_scores` (#14970 routes them elsewhere), so this fix is intentionally limited to the OpenSearch backend, which is the only one reaching the ES KNN-score path. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Affected backends OpenSearch only. Elasticsearch already implements `get_scores`; Infinity / OceanBase are routed away from it. ### How to reproduce 1. `DOC_ENGINE=opensearch` (docker `.env`), restart the stack. 2. Create a knowledge base with any dense embedding model and parse a document. 3. 
Run a retrieval / chat over that KB -> 500 with the AttributeError above. ### Risk & backward compatibility None for the default Elasticsearch deployment -- the change only adds a method to `OSConnection`. No default values or ES/Infinity/OceanBase behavior change. ### Test plan - [ ] With `DOC_ENGINE=opensearch`, retrieval over a KB returns scored chunks (no AttributeError). - [ ] `DOC_ENGINE=elasticsearch` regression: retrieval unchanged. - [ ] Empty-result path: `_knn_scores` early-returns `{}` (guarded), get_scores handles an empty `hits` list gracefully.	2026-05-29 21:49:15 +08:00
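For commit 3dfc16973c, a helper that reads `_score` from hit headers could be sketched as below. The commit message confirms only that hits expose `_score` and that an empty `hits` list must be handled; the return shape (a mapping from `_id` to `_score`) and the standalone function form are assumptions, not the actual `get_scores()` implementation.

```python
def get_scores(res: dict) -> dict:
    """Map each hit's _id to its _score; return {} when there are no hits."""
    hits = res.get("hits", {}).get("hits", [])
    return {hit["_id"]: hit["_score"] for hit in hits}


# Hypothetical search response, shaped like an OpenSearch/Elasticsearch result.
response = {
    "hits": {
        "hits": [
            {"_id": "chunk-1", "_score": 0.91, "_source": {}},
            {"_id": "chunk-2", "_score": 0.47, "_source": {}},
        ]
    }
}
scores = get_scores(response)
```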
tmimmanuel	663fc1d42c	fix(opensearch): implement doc-meta dispatch surface on OSConnection (#14577 ) ### What problem does this PR solve? Fixes #14570. On OpenSearch backends (`DOC_ENGINE=opensearch`) every document-metadata write failed with `'OSConnection' object has no attribute 'create_doc_meta_idx'`, so both `PATCH /api/v1/datasets/{ds}/documents/{doc}` with `meta_fields` and `POST /api/v1/datasets/{ds}/metadata/update` were unusable while every other document operation (retrieval, parsing, name update, chunk management) worked correctly on the same OpenSearch cluster. The bug runs deeper than the missing method name in the error message suggests. `DocMetadataService` also reached into `settings.docStoreConn.es.*` directly for the index refresh, the scripted partial update, and the count call, which means that even after adding `create_doc_meta_idx` to `OSConnection` the very next call in the same metadata flow would still raise `AttributeError` because `OSConnection` exposes `self.os` rather than `self.es`. Fixing only the reported symptom would have moved the failure one line down without restoring the feature. This PR adds a uniform document-metadata dispatch surface to both connection classes so they present the same abstract API, and routes the service layer through that surface via `getattr` guards instead of poking at backend-specific attributes. The four new methods on `OSConnection` and `ESConnectionBase` are `create_doc_meta_idx`, `refresh_idx`, `count_idx`, and `replace_meta_fields`. `OSConnection.create_doc_meta_idx` reuses the existing `conf/doc_meta_es_mapping.json` schema in the OpenSearch `body=` form because OpenSearch and Elasticsearch share the same index-creation payload, and `replace_meta_fields` emits a full scripted assignment (`ctx._source.meta_fields = params.meta_fields`) on both backends so removed keys actually disappear instead of being preserved by deep-merge semantics. 
The `getattr`-guarded dispatch in `DocMetadataService` keeps the existing fall-through paths intact for Infinity and OceanBase, which continue to rely on their search-based count fallback and on the delete-then-insert metadata replacement they used before, so this change is strictly additive for those two backends. Verification: `pytest test/unit_test/rag/utils/test_opensearch_doc_meta.py` runs 16 new unit tests that pass locally and pin the `OSConnection` dispatch surface, the `create_doc_meta_idx` short-circuit when the index already exists, the mapping-file payload routing, the `IndicesClient.create` failure path, the `refresh_idx` and `count_idx` success and error sentinels, and the full-assignment script emitted by `replace_meta_fields`. The test module stubs `common.settings` and `rag.nlp` at import time so the suite runs without the heavy backend SDKs that the rest of the repository pulls in transitively. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com>	2026-05-11 17:04:28 +08:00
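The `getattr`-guarded dispatch described in commit 663fc1d42c can be sketched as follows. The method name `count_idx` comes from the commit message; the function `count_documents`, the two stand-in connection classes, and the fallback callable are hypothetical and exist only to make the example runnable.

```python
from typing import Callable


def count_documents(conn: object, index_name: str,
                    fallback: Callable[[str], int]) -> int:
    """Use conn.count_idx when the backend provides it, else the fallback path."""
    count_idx = getattr(conn, "count_idx", None)
    if callable(count_idx):
        return count_idx(index_name)
    # Backends without the dispatch surface keep their existing behavior.
    return fallback(index_name)


class FakeOpenSearchConn:
    """Stand-in for a connection that implements the dispatch surface."""

    def count_idx(self, index_name: str) -> int:
        return 42


class FakeInfinityConn:
    """Stand-in for a connection without count_idx."""


def search_based_count(index_name: str) -> int:
    return 7


with_surface = count_documents(FakeOpenSearchConn(), "ragflow_doc_meta_x", search_based_count)
without_surface = count_documents(FakeInfinityConn(), "ragflow_doc_meta_x", search_based_count)
```

The service layer never touches a backend-specific attribute such as `.es` or `.os`, which is the property the commit relies on to keep the change additive for backends that lack the new methods.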
as-ondewo	6fb8c31c22	Fix: Document parse status set to DONE before chunks are retrievable (#13352 ) ### What problem does this PR solve? The document parse status was set to DONE before the document chunks were actually retrievable from Elasticsearch/OpenSearch, because the code did not wait for the index refresh. As a result, the API could report a parse status of DONE while a chunk retrieval still returned nothing. Since the index refreshes every 1 second, this was quite likely to happen when waiting for document parsing by polling at a short interval and then immediately trying to retrieve chunks once the status was DONE. I fixed this bug and added a test case that would have caught it. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-11 16:04:08 +08:00
MkDev11	cfee2bc9db	feat: Auto-adjust chunk recall weights based on user feedback (#12689 ) ### What problem does this PR solve? Implements automatic adjustment of knowledge base chunk recall weights based on user feedback (upvotes/downvotes). When users upvote or downvote a response, the system locates the corresponding knowledge snippets and adjusts their recall weight to improve future retrieval quality. Closes #12670 How it works: 1. User upvotes/downvotes a response via `POST /thumbup` 2. System extracts chunk IDs from the conversation reference 3. For each referenced chunk: - Reads current `pagerank_fea` value from document store - Increments (+1) for upvote or decrements (-1) for downvote - Clamps weight to [0, 100] range - Updates chunk in ES/Infinity/OceanBase 4. Future retrievals score these chunks higher/lower based on accumulated feedback Files changed: - `api/db/services/chunk_feedback_service.py` - New service for updating chunk pagerank weights - `api/apps/conversation_app.py` - Integrated feedback service into thumbup endpoint - `test/testcases/test_web_api/test_chunk_feedback/` - Unit tests ### Type of change - [x] New Feature (non-breaking change which adds functionality) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * New Features * Chat message feedback now updates per-chunk relevance weights (feature-flag gated), with configurable weighting and atomic updates across storage backends. * Bug Fixes * Stricter validation for message feedback inputs and more robust handling of feedback transitions. * Tests * Expanded test coverage for chunk-feedback behavior, weighting strategies, storage backends, and thumb-flip scenarios. * Chores * CI workflow extended to run the new chunk-feedback web API tests. 
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: mkdev11 <YOUR_GITHUB_ID+MkDev11@users.noreply.github.com> Co-authored-by: mkdev11 <MkDev11@users.noreply.github.com>	2026-04-08 09:52:18 +08:00
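The weight adjustment in commit cfee2bc9db (+1 for an upvote, -1 for a downvote, clamped to [0, 100]) amounts to a small piece of arithmetic. The sketch below shows only that step; the function name `adjust_weight` is hypothetical, and reading or writing `pagerank_fea` in the document store is left out.

```python
def adjust_weight(current: int, upvote: bool, lower: int = 0, upper: int = 100) -> int:
    """Apply +1 (upvote) or -1 (downvote) and clamp the result to [lower, upper]."""
    delta = 1 if upvote else -1
    return max(lower, min(upper, current + delta))


print(adjust_weight(10, upvote=True))    # one step up
print(adjust_weight(0, upvote=False))    # already at the lower bound, stays there
print(adjust_weight(100, upvote=True))   # already at the upper bound, stays there
```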
Phives	87305cb08c	fix: close file handles when loading JSON mapping in doc store connectors (#12904 ) What problem does this PR solve? When loading JSON mapping/schema files, the code used json.load(open(path)) without closing the file. The file handle stayed open until garbage collection, which can leak file descriptors under load (e.g. repeated reconnects or migrations). Type of change [x] Bug Fix (non-breaking change which fixes an issue) Change Replaced json.load(open(...)) with a context manager so the file is closed after loading: with open(fp_mapping, "r") as f: ... = json.load(f) Files updated rag/utils/opensearch_conn.py – mapping load (1 place) common/doc_store/es_conn_base.py – mapping load + doc_meta_mapping load (2 places) common/doc_store/infinity_conn_base.py – schema loads in _migrate_db, doc metadata table creation, and SQL field mapping (4 places) Behavior is unchanged; only resource handling is fixed. Co-authored-by: Gittensor Miner <miner@gittensor.io>	2026-01-30 14:07:51 +08:00
Stephen Hu	3a8c848af5	Fix:OSConnection.create_idx 4 arguments (#12862 ) ### What problem does this PR solve? https://github.com/infiniflow/ragflow/issues/12858 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-01-28 12:41:01 +08:00
Vedant Madane	ac936005e6	fix: ensure deleted chunks are not returned in retrieval (#12520 ) (#12546 ) ## Summary Fixes #12520 - Deleted chunks should not appear in retrieval/reference results. ## Changes ### Core Fix - api/apps/chunk_app.py: Include `doc_id` in delete condition to properly scope the delete operation ### Improved Error Handling - api/db/services/document_service.py: Better separation of concerns with individual try-catch blocks and proper logging for each cleanup operation ### Doc Store Updates - rag/utils/es_conn.py: Updated delete query construction to support compound conditions - rag/utils/opensearch_conn.py: Same updates for OpenSearch compatibility ### Tests - test/testcases/.../test_retrieval_chunks.py: Added `TestDeletedChunksNotRetrievable` class with regression tests - test/unit/test_delete_query_construction.py: Unit tests for delete query construction ## Testing - Added regression tests that verify deleted chunks are not returned by retrieval API - Tests cover single chunk deletion and batch deletion scenarios	2026-01-15 14:45:55 +08:00
Jin Hai	01f0ced1e6	Fix IDE warnings (#12281 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-12-29 12:01:18 +08:00
Lynn	6e9691a419	Feat: message manage (#12196 ) ### What problem does this PR solve? Manage message and use in agent. Issue #4213 ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2025-12-25 21:18:13 +08:00
Jin Hai	296476ab89	Refactor function name (#11210 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-12 19:00:15 +08:00
Jin Hai	f98b24c9bf	Move api.settings to common.settings (#11036 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-06 09:36:38 +08:00
Jin Hai	02d10f8eda	Move var from rag.settings to common.globals (#11022 ) ### What problem does this PR solve? As title. ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-05 15:48:50 +08:00
Jin Hai	44f2d6f5da	Move 'get_project_base_directory' to common directory (#10940 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-02 21:05:28 +08:00
Jin Hai	6447b737ab	Move singleton to common directory (#10935 ) ### What problem does this PR solve? As title ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-11-02 12:24:08 +08:00
Jin Hai	766d900a41	Refactor: rename rmSpace to remove_redundant_spaces (#10796 ) ### What problem does this PR solve? - rename rmSpace to remove_redundant_spaces - move clean_markdown_block to common module - add unit tests for remove_redundant_spaces and clean_markdown_block ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2025-10-28 09:46:32 +08:00
pyyuhao	49f3f26622	Bug fix: OpenSearch chunk update some api error (#9032 ) ### What problem does this PR solve? Fix a small, non-blocking bug in the main workflow: chunk update failed when OpenSearch is the doc engine. The bug occurred when enabling or disabling a chunk on the “Knowledge Base / Dataset / Chunk” web page. <img width="2388" height="662" alt="image" src="https://github.com/user-attachments/assets/575987a0-c929-4589-bfa0-ba54e137cfd9" /> The reason it occurred is that some API parameters differ between OpenSearch and ES. After the fix, enabling, disabling, and rewriting a chunk all work correctly. I also checked the result on the chat web page. <img width="2394" height="660" alt="image" src="https://github.com/user-attachments/assets/8b899dc6-d769-4e80-8dd8-ad0fbbca5f78" /> I will continue to focus on vector databases, especially OpenSearch. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Co-authored-by: 张雨豪 <zhangyh80@chinatelecom.cn> Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>	2025-07-25 09:57:24 +08:00
Kevin Hu	fffb7c0bba	Fix: anthropic llm issue. (#8633 ) ### What problem does this PR solve? ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2025-07-02 18:37:34 +08:00

20 Commits