fix(opensearch): keep "id" in _source on insert so document metadata isn't empty (#15473)

### What problem does this PR solve?

Follow-up to #15393. After #15393 fixed the OpenSearch `search()`
signature and
the doc-meta mapping, document metadata still renders as **"0 fields"**
for every
document on the OpenSearch backend (`DOC_ENGINE=opensearch`).

**Root cause.** `OSConnection.insert()` pops `id` out of the document
before
indexing:

meta_id = d_copy.pop("id", "") # id used as _id, then DROPPED from
_source

so the stored `_source` never contains an `id` field. But the doc-meta
read path
filters and sorts on that field:

- `DocMetadataService.get_metadata_for_documents()` builds
`condition = {"kb_id": kb_id, "id": doc_ids}` -> `OSConnection.search()`
emits
  `Q("terms", id=doc_ids)` (a term query on the `id` field), and
- `_search_metadata()` sorts with `order_by.asc("id")`.

With `id` absent from `_source`, the terms filter matches nothing, so
`get_metadata_for_documents()` returns an empty map and the UI shows "0
fields"
-- even though the metadata was written correctly (it is visible via a
kb_id-only query).

`ESConnection.insert()` already keeps `id` (`d_copy.get("id", "")`) with
the
comment *"also keep 'id' as a regular field for sorting"*. This is a
plain
OpenSearch-only divergence (`pop()` vs `get()`).

### Fix

Mirror Elasticsearch: use `get("id")` instead of `pop("id")` so `id`
survives in
`_source`. The doc-meta mapping already declares `id` as `keyword`, so
the field
is searchable/sortable once populated.

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)

### Affected backends
OpenSearch only. Elasticsearch already keeps `id`; Infinity / OceanBase
unaffected.

### How to reproduce
1. `DOC_ENGINE=opensearch`, create a KB, upload/parse a document, set
metadata.
2. Open the document list -> every document shows "0 fields" (the
metadata exists
in the `ragflow_doc_meta_*` index but its `_source` has no `id` field).

### Risk & backward compatibility
`insert()` is shared with the main chunk index; keeping `id` in
`_source` brings
OpenSearch in line with Elasticsearch (which already does this), so it
is parity,
not new behavior. No default / ES / Infinity / OceanBase behavior
change.

Note: affects new inserts only. Existing `ragflow_doc_meta_*` indices
created
before this change have no `id` in `_source`; re-sync metadata, or
backfill once
with `_update_by_query` (`ctx._source.id = ctx._id`).

### Test plan
- [ ] OpenSearch: after the fix the document list shows correct metadata
field
      counts (not "0 fields"); metadata filter/sort by id works.
- [ ] Elasticsearch regression: unchanged.
This commit is contained in:
Rintaro
2026-06-08 18:31:04 +09:00
committed by GitHub
parent 14c460a525
commit 453ade288c

View File

@@ -501,7 +501,10 @@ class OSConnection(DocStoreConnection):
assert "_id" not in d
assert "id" in d
d_copy = copy.deepcopy(d)
meta_id = d_copy.pop("id", "")
# Use id as _id for uniqueness, but keep "id" in the document so the
# doc-meta read path (DocMetadataService filters on / sorts by the
# "id" field) can find it, mirroring ESConnection.insert().
meta_id = d_copy.get("id", "")
operations.append(
{"index": {"_index": indexName, "_id": meta_id}})
operations.append(d_copy)