ragflow/conf/infinity_mapping.json

{
"id": {"type": "varchar", "default": ""},
"doc_id": {"type": "varchar", "default": ""},
"kb_id": {"type": "varchar", "default": "", "index_type": {"type": "secondary", "cardinality": "low"}},
"mom_id": {"type": "varchar", "default": ""},
"mom": {"type": "varchar", "default": ""},
"create_time": {"type": "varchar", "default": ""},
"create_timestamp_flt": {"type": "float", "default": 0.0},
"img_id": {"type": "varchar", "default": ""},
"docnm": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "docnm_kwd, title_tks, title_sm_tks"},
"name_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"tag_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"important_kwd_empty_count": {"type": "integer", "default": 0},
"important_keywords": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "important_kwd, important_tks"},
"questions": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "question_kwd, question_tks"},
"content": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "content_with_weight, content_ltks, content_sm_ltks"},
"authors": {"type": "varchar", "default": "", "analyzer": ["rag-coarse", "rag-fine"], "comment": "authors_tks, authors_sm_tks"},
"page_num_int": {"type": "varchar", "default": ""},
"top_int": {"type": "varchar", "default": ""},
"position_int": {"type": "varchar", "default": ""},
"weight_int": {"type": "integer", "default": 0},
"weight_flt": {"type": "float", "default": 0.0},
"chunk_order_int": {"type": "integer", "default": 0},
"rank_int": {"type": "integer", "default": 0},
"rank_flt": {"type": "float", "default": 0},
"available_int": {"type": "integer", "default": 1, "index_type": {"type": "secondary", "cardinality": "low"}},
"knowledge_graph_kwd": {"type": "varchar", "default": ""},
"entities_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"pagerank_fea": {"type": "integer", "default": 0},
"tag_feas": {"type": "varchar", "default": "", "analyzer": "rankfeatures"},
"from_entity_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"to_entity_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"entity_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"entity_type_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"source_id": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"n_hop_with_weight": {"type": "varchar", "default": ""},
"mom_with_weight": {"type": "varchar", "default": ""},
"removed_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"doc_type_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
"toc_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
feat: persist RAPTOR layer metadata on summary chunks (#13286)

## Summary

RAPTOR's recursive clustering builds a `layers` list tracking `(start_idx, end_idx)` boundaries per level, but currently discards this information: only the flat `chunks` list is returned. This makes it impossible to distinguish leaf-level summaries from top-level ones.

This PR:

- Returns `(chunks, layers)` tuple from `raptor.py`'s `__call__`
- Annotates each RAPTOR summary chunk with `raptor_layer_int` (1 = first summary level, 2 = summary-of-summaries, etc.)
- Adds `raptor_layer_int` to `infinity_mapping.json` (Elasticsearch handles it via existing `*_int` dynamic template)

### Why this matters

Downstream features need to know which RAPTOR layer a summary belongs to:

- **Retrieving the top-level document summary** for entity extraction, search snippets, or document comparison
- **Filtering by abstraction level**: users may want only high-level summaries or only leaf-level cluster summaries
- **RAPTOR recall quality**: #10951 reports summaries not being recalled for definition queries; layer metadata enables targeted retrieval

### Changes

| File | Change | LOC |
|------|--------|-----|
| `rag/raptor.py` | Return `(chunks, layers)` tuple | ~3 |
| `rag/svr/task_executor.py` | Build `chunk_layer` mapping, set `raptor_layer_int` | ~12 |
| `conf/infinity_mapping.json` | Add `raptor_layer_int` integer field | ~1 |

### Backward compatibility

- **Additive only**: no existing fields or behavior changed
- Existing RAPTOR chunks continue to work (they'll have `raptor_layer_int = 0` by default)
- New RAPTOR chunks get layer metadata automatically

## Test plan

- [ ] Parse a document with RAPTOR enabled, verify `raptor_layer_int` is set on indexed chunks
- [ ] Verify `raptor_layer_int` values increase with abstraction level (layer 1 < layer 2 < ...)
- [ ] Verify existing RAPTOR deletion (`delete by raptor_kwd`) still works
- [ ] Verify Infinity backend accepts the new field

Fixes #7488
Related: #4104, #11191, #10951

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yuch85 <yuch85.1@gmail.com>
Co-authored-by: Wang Qi <wangq8@outlook.com>
2026-04-27 10:20:46 +08:00
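The `chunk_layer` mapping mentioned in the commit above can be pictured roughly as follows. This is a minimal sketch, not the code in `rag/svr/task_executor.py`: the function name `build_chunk_layer` is hypothetical, and it assumes that `layers[0]` covers the original leaf chunks and that each `(start_idx, end_idx)` pair is a half-open range into the flat `chunks` list.

```python
from typing import Dict, List, Tuple


def build_chunk_layer(layers: List[Tuple[int, int]]) -> Dict[int, int]:
    """Map each chunk index to a layer number (hypothetical helper).

    Assumptions, not confirmed by the source:
    - layers[0] spans the original leaf chunks, which get layer 0;
    - layers[n] for n >= 1 spans the summaries created at level n;
    - each (start_idx, end_idx) is half-open: start <= idx < end.
    """
    chunk_layer: Dict[int, int] = {}
    for layer_no, (start, end) in enumerate(layers):
        for idx in range(start, end):
            chunk_layer[idx] = layer_no
    return chunk_layer


# Four leaf chunks, two first-level summaries, one summary-of-summaries.
example_layers = [(0, 4), (4, 6), (6, 7)]
chunk_layer = build_chunk_layer(example_layers)

# Only summary chunks (layer >= 1) would carry a non-default raptor_layer_int.
summary_fields = {
    idx: {"raptor_layer_int": layer}
    for idx, layer in chunk_layer.items()
    if layer >= 1
}
```

Under these assumptions, leaf chunks keep the mapping default of 0, which matches the backward-compatibility note in the commit message.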
"raptor_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
fix(infinity): declare `extra` field + serialize dict on write to unbreak RAPTOR (#14998)

### What problem does this PR solve?

Fixes #14997.

RAPTOR builds on the Infinity backend have been broken since v0.25.2 introduced the `extra` field in code (`rag/svr/task_executor.py:1011`) without declaring it in `conf/infinity_mapping.json`. Every RAPTOR job fails with:

```
infinity.common.InfinityException: (3013, 'Fail to bind the expression: extra@src/planner/expression_binder_impl.cpp:99')
```

The auto-migration in `common/doc_store/infinity_conn_base.py:_migrate_db()` adds any columns it finds in the mapping JSON to existing tables, so the only thing standing between users and a working RAPTOR build is that one missing declaration.

OceanBase, ES, and OpenSearch were unaffected because they store `extra` as a native JSON type; only Infinity (which has a strict `varchar`/`integer`/`float` schema) needed the addition.

### The fix

Two-part change:

1. **`conf/infinity_mapping.json`**: declare `"extra": {"type": "varchar", "default": ""}`. On next startup, `_migrate_db()` adds the column to all existing chunk tables; no manual DDL is needed for upgrading installations.
2. **`rag/utils/infinity_conn.py` `insert()`**: serialize the `extra` dict to a JSON string at write time, since Infinity's `varchar` can't store a Python dict directly. Modelled on the existing `chunk_data` handling a few lines above.

The read path (`rag/utils/raptor_utils.py:_as_extra_dict`) already normalises both dict and JSON-string inputs, so no read-side change is needed.

Other backends are untouched: `task_executor.py` still writes the dict, and the OceanBase/ES/OpenSearch insert paths handle dicts natively.

### Verification

Tested on a v0.25.4 deployment with the Infinity backend by applying the same two changes via mounted-volume override:

- Confirmed `_migrate_db()` adds the `extra` column to all pre-existing chunk tables on startup (column visible via Infinity's `show_columns()`).
- Triggered RAPTOR builds on four datasets (~21k chunks total) via `POST /api/v1/datasets/<id>/index?type=raptor`.
- All four progressed past the previously-failing `get_raptor_chunk_methods()` call into actual entity-extraction and clustering work without the (3013) error.
- GraphRAG builds (which can trigger the same path indirectly via `task_executor.py:857`) also progressed cleanly.

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)
2026-05-19 15:10:03 +05:30
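The write-side and read-side handling of `extra` described in the commit above might look roughly like this. It is a sketch under stated assumptions rather than the code in `rag/utils/infinity_conn.py` or `rag/utils/raptor_utils.py`: `serialize_extra` and `as_extra_dict` are hypothetical names, and the row is assumed to be a plain dict keyed by column name.

```python
import json
from typing import Any, Dict


def serialize_extra(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` whose `extra` value fits a varchar column.

    Hypothetical helper: a dict is dumped to a JSON string, anything
    else is passed through unchanged. The input row is not mutated.
    """
    out = dict(row)
    if isinstance(out.get("extra"), dict):
        out["extra"] = json.dumps(out["extra"], ensure_ascii=False)
    return out


def as_extra_dict(value: Any) -> Dict[str, Any]:
    """Normalise a stored `extra` value back to a dict (hypothetical).

    Accepts a dict (backends with native JSON) or a JSON string
    (varchar storage); anything unparseable becomes an empty dict.
    """
    if isinstance(value, dict):
        return value
    if isinstance(value, str) and value.strip():
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}


row = {"id": "chunk-1", "extra": {"raptor_layer": 2}}
stored = serialize_extra(row)
restored = as_extra_dict(stored["extra"])
```

Copying the row before serialising keeps the original dict usable by other backends, which the commit says still receive `extra` as a dict.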
"raptor_layer_int": {"type": "integer", "default": 0},
"extra": {"type": "varchar", "default": ""}
}
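The commit message for `extra` says the auto-migration adds any mapping column that an existing table lacks. A minimal sketch of that comparison step is shown here; `missing_columns` is a hypothetical helper, the list of existing column names is supplied by hand, and no Infinity client calls are made because their exact signatures are not given in this file.

```python
import json
from typing import Any, Dict, Iterable


def missing_columns(
    mapping: Dict[str, Dict[str, Any]],
    existing: Iterable[str],
) -> Dict[str, Dict[str, Any]]:
    """Return the mapping entries whose column is absent from a table.

    Hypothetical helper: `existing` stands in for whatever column
    listing the backend provides.
    """
    have = set(existing)
    return {name: spec for name, spec in mapping.items() if name not in have}


# A three-field excerpt of the mapping above, inlined so the sketch runs
# without the file on disk.
mapping = json.loads("""
{
    "raptor_kwd": {"type": "varchar", "default": "", "analyzer": "whitespace-#"},
    "raptor_layer_int": {"type": "integer", "default": 0},
    "extra": {"type": "varchar", "default": ""}
}
""")

# Assumed state of a table created before `extra` was declared.
to_add = missing_columns(mapping, ["raptor_kwd", "raptor_layer_int"])
```

Each entry in `to_add` carries the `type` and `default` needed to create the column, which is why declaring a field in this file is enough for upgraded installations.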