6 Commits

Author SHA1 Message Date
Danut Matei
e2b0da9eea fix(opensearch): keep the BM25 leg in hybrid search (#15760)
### What problem does this PR solve?

Fixes the OpenSearch side of #10747: hybrid search drops the keyword
(BM25) leg and
ends up doing plain vector search.

When a search has both a text and a vector leg, `OSConnection.search()`
throws the text
query away:

    del q["query"]
    q["query"] = {"knn": knn_query}

The text clause only stays on as a filter inside the knn query, so it
narrows the
candidate set but doesn't count towards scoring. So hybrid search on
OpenSearch behaves
like plain vector search, unlike the Elasticsearch backend.

What I changed:

- when both legs are present, send a real hybrid query
`{"hybrid": {"queries": [bm25, {"knn": ...}]}}` and let a
normalization-processor
  search pipeline score and combine the two legs
- only the actual filters (kb_id, available_int, ...) go in the knn
filter, not the
  text must clause
- create the pipeline on startup if it's missing, so there's no separate
provisioning
step. name and weights can be set under `os:` in service_conf.yaml, or
via
`OS_HYBRID_PIPELINE`; defaults are `ragflow_hybrid_pipeline` and `[0.5,
0.5]`
- normalization-processor needs OpenSearch 2.10+. on older clusters, or
when the
pipeline can't be created, log a warning and fall back to vector-only
instead of
  pointing at a pipeline that doesn't exist

This is only the hybrid-search fix; `create_doc_meta_idx` is already on
main.

Testing (there's no OpenSearch path in CI): added a unit test
(`test/unit_test/rag/utils/test_opensearch_hybrid_search.py`, no
services needed) that
checks the query built in each case — hybrid + pipeline param for
text+vector, plain knn
for vector-only, plain bool for text-only, the knn filter never carrying
the text
query_string, and the vector-only fallback when the pipeline isn't
available. Also ran
it against a real OpenSearch 2.19.1 container with a doc that matches
the keyword but
sits outside the knn top-k: pure knn returns `['D1','D2','D5']` (keyword
doc missing),
the hybrid query returns `['A','D1','D2','D5']` (keyword doc present).

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: Danut Matei <matei.danut.dm@gmail.com>
2026-06-08 16:17:47 +08:00
kpdev
a4bc066f74 fix(rag): id2image parsing for hyphenated storage object keys (#15117) (#15118)
### What problem does this PR solve?

Fixes #15117.

Chunk images are stored with `img_id = f"{bucket}-{objname}"` in
`image2id()` (`rag/utils/base64_image.py`). When loading via
`id2image()`, the code used `image_id.split("-")` and required exactly
two segments. Object keys that contain hyphens (e.g. `page-1.jpg`)
produce more than two segments, so `id2image` returns `None` and chunk
image previews fail even though the blob exists.

This is the same parsing issue as #15115 (HTTP thumbnail route); this PR
fixes the indexing/retrieval path.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):

### Test plan

- [x] `pytest test/unit_test/rag/utils/test_base64_image.py`
- [ ] Manual: index a chunk with an `objname` containing hyphens and
confirm `img_id` resolves to an image in retrieval

Fixes #15117.
2026-06-02 10:52:51 +08:00
Jack
f0cb7a544b Refactor: Task Executor (#15154)
### What problem does this PR solve?

1. Break huge function into smaller pieces
2. Add unit test for the smaller pieces function
3. Layer-ed design
a. infra layer - task_context.py, recording_context.py,
write_operation_interceptor.py, ...
    b. service layer - *_service.py
    c. business layer - task_handler.py
4. Default behavior: use "refactor-ed version" - can switch to original
version by change env variable

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Performance Improvement

---------

Co-authored-by: Liu An <asiro@qq.com>
Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
2026-05-27 21:54:17 +08:00
CaptainTimon
2717ee283f feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679)
### What problem does this PR solve?

Closes #14674.

This PR improves RAPTOR configuration and tree construction while
preserving the existing RAPTOR behavior as the default.

RAPTOR currently builds summary layers with the original UMAP + GMM
clustering path. This PR keeps that default path, and adds:

- A hidden backend tree-builder option:
  - `tree_builder="raptor"`: default, existing RAPTOR behavior.
- `tree_builder="psi"`: rank-aware Psi-style tree builder using original
embedding-space cosine ranking.
- A user-facing clustering method option for the default RAPTOR builder:
  - `clustering_method="gmm"`: existing default.
- `clustering_method="ahc"`: agglomerative hierarchical clustering path.
- A RAPTOR UI setting for `Clustering method` and `Max cluster`.

### What changed

#### Backend

- Added `tree_builder` support for RAPTOR/Psi.
- Added `clustering_method` support for GMM/AHC.
- Kept existing RAPTOR + GMM as the default.
- Added Psi tree building from original-space cosine similarity.
- Added bucketed Psi building controls for large inputs:
  - `raptor.ext.psi_exact_max_leaves`
  - `raptor.ext.psi_bucket_size`
- Added method-aware RAPTOR summary metadata using existing
`extra.raptor_method`.
- Avoided adding a dedicated DB schema field for experimental method
tracking.
- Added cleanup/migration logic to avoid mixing stale RAPTOR summary
trees.
- Added defensive checks for Psi tree construction and summary failures.

#### Frontend/UI

- Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`.
- Added/kept `Max cluster` in RAPTOR settings.
- Enlarged max cluster UI limit to `1024`, matching backend validation.
- Kept AHC editable even when a RAPTOR task has already finished.
- Fixed the UI save payload so `clustering_method` and `tree_builder`
are serialized through `parser_config.raptor.ext`, avoiding backend
validation errors for extra top-level RAPTOR fields.

Example saved RAPTOR config:

```json
{
  "raptor": {
    "max_cluster": 317,
    "ext": {
      "clustering_method": "ahc",
      "tree_builder": "raptor"
    }
  }
}

Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>
2026-05-12 09:42:31 +08:00
tmimmanuel
663fc1d42c fix(opensearch): implement doc-meta dispatch surface on OSConnection (#14577)
### What problem does this PR solve?

Fixes #14570. On OpenSearch backends (`DOC_ENGINE=opensearch`) every
document-metadata write failed with `'OSConnection' object has no
attribute 'create_doc_meta_idx'`, so both `PATCH
/api/v1/datasets/{ds}/documents/{doc}` with `meta_fields` and `POST
/api/v1/datasets/{ds}/metadata/update` were unusable while every other
document operation (retrieval, parsing, name update, chunk management)
worked correctly on the same OpenSearch cluster.

The bug runs deeper than the missing method name in the error message
suggests. `DocMetadataService` also reached into
`settings.docStoreConn.es.*` directly for the index refresh, the
scripted partial update, and the count call, which means that even after
adding `create_doc_meta_idx` to `OSConnection` the very next call in the
same metadata flow would still raise `AttributeError` because
`OSConnection` exposes `self.os` rather than `self.es`. Fixing only the
reported symptom would have moved the failure one line down without
restoring the feature.

This PR adds a uniform document-metadata dispatch surface to both
connection classes so they present the same abstract API, and routes the
service layer through that surface via `getattr` guards instead of
poking at backend-specific attributes. The four new methods on
`OSConnection` and `ESConnectionBase` are `create_doc_meta_idx`,
`refresh_idx`, `count_idx`, and `replace_meta_fields`.
`OSConnection.create_doc_meta_idx` reuses the existing
`conf/doc_meta_es_mapping.json` schema in the OpenSearch `body=` form
because OpenSearch and Elasticsearch share the same index-creation
payload, and `replace_meta_fields` emits a full scripted assignment
(`ctx._source.meta_fields = params.meta_fields`) on both backends so
removed keys actually disappear instead of being preserved by deep-merge
semantics.

The `getattr`-guarded dispatch in `DocMetadataService` keeps the
existing fall-through paths intact for Infinity and OceanBase, which
continue to rely on their search-based count fallback and on the
delete-then-insert metadata replacement they used before, so this change
is strictly additive for those two backends.

Verification: `pytest
test/unit_test/rag/utils/test_opensearch_doc_meta.py` runs 16 new unit
tests that pass locally and pin the `OSConnection` dispatch surface, the
`create_doc_meta_idx` short-circuit when the index already exists, the
mapping-file payload routing, the `IndicesClient.create` failure path,
the `refresh_idx` and `count_idx` success and error sentinels, and the
full-assignment script emitted by `replace_meta_fields`. The test module
stubs `common.settings` and `rag.nlp` at import time so the suite runs
without the heavy backend SDKs that the rest of the repository pulls in
transitively.


### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: tmimmanuel <tmimmanuel@users.noreply.github.com>
2026-05-11 17:04:28 +08:00
Liu An
7715bad04e refactor: reorganize unit test files into appropriate directories (#13343)
### What problem does this PR solve?

Move test files from utils/ to their corresponding functional
directories:
- api/db/ for database related tests
- api/utils/ for API utility tests
- rag/utils/ for RAG utility tests

### Type of change

- [x] Refactoring
2026-03-04 11:02:56 +08:00