ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
Yufeng He	0d836afd34	fix: keep max pagerank for repeated n-hop edges (#15696 ) ## Summary Fixes #15695. The Python GraphRAG path already accumulates similarity when several N-hop paths produce the same edge, but PageRank was overwritten by the last path. That makes ranking depend on path order for repeated edges. This keeps the strongest PageRank seen for a repeated edge in the Python implementation: - `rag/graphrag/search.py` The similarity score still accumulates exactly as before. ## To verify - `python -m py_compile rag\graphrag\search.py` - `git diff --check` - `git diff --stat upstream/main` -> only `rag/graphrag/search.py` I originally included the Go implementation too, but removed it after maintainer feedback because the Go version is still under development and not released yet.	2026-06-11 20:53:11 +08:00
cleanjunc	88e4d6bddb	Fix: restore GraphRAG entity ranking by indexing pagerank and n-hop paths (#15797 ) ### Summary Closes #15795 Knowledge-graph queries rank entities by `pagerank * sim` in `KGSearch`, but the entity chunks written at index time stopped carrying the values that ranking depends on. `graph_node_to_chunk` only stored `entity_type`, `description`, and `source_id`, dropping the node `pagerank` and the n-hop neighbour paths, while `search.py` still read them back as `rank_flt` and `n_hop_with_weight`. The producer of these fields, `update_nodes_pagerank_nhop_neighbour`, was removed in #6513, but the read side in `KGSearch` was never updated. The result is that on every knowledge-graph query: - `pagerank` resolves to `0`, so the `pagerank * sim` sort key is `0` for every entity and selection falls back to arbitrary order. - Every displayed entity score is `0.00`. - The n-hop relation-enrichment block is dead code because `n_hop_ents` is always empty, leaving `merge_tuples` and `is_continuous_subsequence` orphaned. This PR restores the missing index-time fields so the documented `P(E\|Q) = pagerank * sim` ranking and the n-hop enrichment work again. What changed: - `graph_node_to_chunk` now writes `rank_flt` from the node pagerank and `n_hop_with_weight` from the recomputed n-hop neighbour paths. - Reintroduced the n-hop path computation (`n_neighbor`) in `rag/graphrag/utils.py`, reusing the previously orphaned `merge_tuples` / `is_continuous_subsequence` helpers, with a direction-agnostic edge-weight lookup for undirected graphs. `set_graph` computes the paths per added or updated node and passes them through. - `KGSearch` now selects `n_hop_with_weight` in the entity keyword search so Infinity and OceanBase return it (Elasticsearch and OpenSearch already read it from `_source`), and the read is hardened against missing keys or empty strings before `json.loads`. 
- Added the `n_hop_with_weight` column to OceanBase, including the `EXTRA_COLUMNS` migration entry so existing tables get it. The other engines already map both fields via dynamic templates or the Infinity mapping. Scope note: pagerank and n-hop are re-indexed for the added or updated nodes in each pass, consistent with the existing incremental indexing design. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Testing Added unit tests in `test/unit_test/rag/graphrag/test_graphrag_utils.py`: - `n_neighbor`: path and weight shape, one-hop vs two-hop, isolated nodes, missing weights, and direction-agnostic lookup. - `graph_node_to_chunk`: `rank_flt` populated from pagerank and defaulting to `0`, `n_hop_with_weight` serialized and defaulting to an empty list. ``` uv run pytest test/unit_test/rag/graphrag/ # 106 passed uv run ruff check rag/graphrag/ rag/utils/ob_conn.py ```	2026-06-09 20:50:45 +08:00
Jonathan Chang	c586292993	feat: Implement checkpoint/resume support for GraphRAG community extraction and entity resolution (#15523 ) ## Summary This PR adds checkpoint/resume support for the GraphRAG `extract_community` and `resolve_entities` stages. The implementation stores successful intermediate results in the document store so interrupted ingestion can resume without repeating already-completed LLM work. Checkpoints are loaded before each stage, reused when available, saved after successful batch/community processing, and cleaned up after the stage completes successfully. ## Related Issue Closes: #15518 ## Change Type - [x] Feature - [x] Bug fix - [x] Test - [ ] Refactor - [ ] Documentation - [ ] Breaking change ## Real Behavior Proof Validation commands run locally: ```bash uv run python -m py_compile \ rag/graphrag/checkpoints.py \ rag/graphrag/general/community_reports_extractor.py \ rag/graphrag/entity_resolution.py \ rag/graphrag/general/index.py \ test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text Passed ``` ```bash uv run pytest test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text 4 passed ``` ```bash uv run pytest \ test/unit_test/rag/graphrag/test_phase_markers.py \ test/unit_test/rag/graphrag/test_graphrag_utils.py \ test/unit_test/rag/graphrag/test_checkpoints.py ``` Result: ```text 95 passed ``` ```bash git diff --check ``` Result: ```text Passed ``` ## Checklist - [x] Implemented checkpoint/resume support for `extract_community`. - [x] Implemented checkpoint/resume support for `resolve_entities`. - [x] Avoided touching unrelated API behavior. - [x] Added unit tests for the new checkpoint helper logic. - [x] Verified Python syntax compilation. - [x] Ran related GraphRAG unit tests successfully. - [x] Ran `git diff --check`. - [ ] Ran full project test suite. --------- Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-06-09 15:34:47 +08:00
Wang Qi	10e8690890	GraphRAG - NER - spacy - fix spacy extraction (#14783 ) Fix spacy extraction	2026-06-01 13:05:54 +08:00
Lynn	dc4b82523b	Feat: tenant llm provider (#14595 ) ### What problem does this PR solve? Python implementation of the Go-based model_provider API suite. ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: bill <yibie_jingnian@163.com>	2026-05-29 17:39:41 +08:00
Wang Qi	a9ec78cb9c	Refactor: enhance retry and timeout (#14983) ### What problem does this PR solve? 1. Enhance retry and timeout, and adjust the default timeout 2. NER: do not batch chunks for spacy 3. Extract `_has_cancel_and_exit` 4. Enhance log messages ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-05-22 13:16:39 +08:00
Wang Qi	c5a46fda44	Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a different event loop (#15100 ) Fix: <asyncio.locks.Semaphore object at 0xabcd [locked]> is bound to a different event loop	2026-05-21 19:23:41 +08:00
Wang Qi	13b422037f	Refactor: enhance graphrag - part 2 (#14972) ### What problem does this PR solve? 1. Expose `batch_chunk_token_size` for configuration 2. Retrieve chunks when building the subgraph for each doc, instead of retrieving the chunks of all docs at the beginning 3. Get all chunks for a document; the limit used to be hard-coded to 10000 4. Delete the unused method `run_graphrag` ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring Follow on: #14617	2026-05-18 16:10:21 +08:00
Wang Qi	3838770e7a	GraphRAG feature - Part 1 - add spacy to extract entity and relation (#14670 ) ### What problem does this PR solve? GraphRAG feature - Part 1 - add spacy to extract entity and relation <img width="1621" height="1288" alt="image" src="https://github.com/user-attachments/assets/aadeddad-94da-46c6-adad-9c3784181f61" /> ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-11 12:59:59 +08:00
Preston Percival	e8f19aa338	feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238 ) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. 
When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false\|0\|no\|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done\|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. ## 5. 
Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. 
## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-06 15:01:01 +08:00
NeedmeFordev	38e45a1117	Fix: serialize GraphRAG entity resolution merges to avoid graph mutation races (#14237 ) ### What problem does this PR solve? This PR fixes the merge-phase crash reported in #14236 during GraphRAG entity resolution. The issue happens after candidate pair resolution completes, when multiple merge coroutines mutate the same shared `networkx` graph concurrently. In `_merge_graph_nodes`, the code iterates over `graph.neighbors(node1)` and also awaits during edge/description merging. That allows another coroutine to modify the graph adjacency structure in between, which can trigger `RuntimeError: dictionary keys changed during iteration` and can also lead to unsafe shared-graph mutation. This change keeps the PR scoped to that single issue by: - serializing merge-time graph mutations with a dedicated merge lock - snapshotting `graph.neighbors(node1)` with `list(...)` before iteration Together, these changes prevent concurrent mutation of the shared graph during the merge phase and make the merge loop safe against live-view invalidation. Fixes #14236 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-22 16:42:53 +08:00
euvre	0cd49e14dd	fix: make Infinity connection pool size configurable and add retry logic for GraphRAG write bursts (#14143 ) ### What problem does this PR solve? Resolve #14137 . ### Problem Graph resolution succeeds (nodes/edges merged, pagerank updated), but the subsequent burst of Infinity write operations in `set_graph` exhausts the connection pool with `TOO_MANY_CONNECTIONS` errors. Root causes: 1. Hardcoded pool size — `infinity_conn_pool.py` hardcoded `ConnectionPool(max_size=4)` on initial creation and `max_size=32` on refresh. Operators cannot tune this without patching code. 2. No retry on transient failures — a single `TOO_MANY_CONNECTIONS` on edge deletes or chunk inserts kills the entire resolution+community pipeline with no retry. ### Changes #### `common/doc_store/infinity_conn_pool.py` - Read `ConnectionPool` `max_size` from the `INFINITY_POOL_MAX_SIZE` environment variable (default: `4`), applied consistently to both initial creation and refresh paths. - Log the actual pool size on startup for easier debugging. #### `rag/graphrag/utils.py` — `set_graph()` - Edge deletes: add exponential-backoff retry (3 attempts, 1s/2s/4s delays) so transient `TOO_MANY_CONNECTIONS` errors are retried instead of failing the entire job. Concurrency continues to be gated by the existing `chat_limiter`. - Batch inserts: add exponential-backoff retry (3 attempts, 1s/2s/4s delays) for the same reason. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-16 15:40:54 +08:00
Minal Mahala	f930389311	Refact: improve task resume mechanism for graphrag (#14096 ) ### What problem does this PR solve? Addresses review feedback on #14074 (Checkpoint mechanism for long-running workflow jobs, issue #12494). Changes based on @yuzhichang's review: 1. Renamed `checkpoint_service.py` → `task_checkpoint.py` as suggested. 2. Replaced Redis with direct docEngine queries as suggested — the subgraph already gets persisted to the doc store by `generate_subgraph()`, so we just query for it instead of maintaining a separate checkpoint in Redis. This is simpler, has no extra dependency, and uses a single source of truth. Changes based on CodeRabbit review: 3. Fixed `source_id` query format mismatch — subgraphs are stored with `source_id: [doc_id]` (list), but the original query used `source_id: doc_id` (string). Now follows the same pattern as `does_graph_contains()` in `rag/graphrag/utils.py`: filter by `knowledge_graph_kwd` only, then match `source_id` in Python. This avoids ambiguity across Elasticsearch / Infinity / OceanBase backends. ### Changes \| File \| Change \| \|---\|---\| \| `api/db/services/task_checkpoint.py` (new) \| `load_subgraph_from_store()` and `has_raptor_chunks()` — docEngine-based checkpoint queries \| \| `rag/graphrag/general/index.py` \| `build_one()` calls `load_subgraph_from_store()` before running LLM extraction \| \| `rag/svr/task_executor.py` \| RAPTOR per-doc loop calls `has_raptor_chunks()` before processing \| \| `test/unit_test/rag/graphrag/test_checkpoint_resume.py` (new) \| 10 unit tests covering subgraph loading, source_id filtering, edge cases \| ### How it works - GraphRAG: Before running expensive LLM entity/relation extraction for a doc, checks the doc store for an existing subgraph (saved by a previous interrupted run). If found, loads it directly and skips LLM calls. - RAPTOR: Before processing a doc, checks if RAPTOR chunks (`raptor_kwd="raptor"`) already exist for it. If yes, skips. 
### Testing - 10 new unit tests — all passing - Full existing suite: 617 passed ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring	2026-04-15 17:37:28 +08:00
Yongteng Lei	b33d2fdea5	Refa: GraphRAG to use async chat methods instead of thread pool execution (#14002) ### What problem does this PR solve? GraphRAG _async_chat. ### Type of change - [x] Refactoring - [x] Performance Improvement ## Summary by CodeRabbit * Refactor * Unified chat calls to an async invocation across extractors, improving timeout handling and ensuring task IDs propagate reliably. * Tests * Added and expanded unit tests and mocks to cover extractor behavior, timeout scenarios, and safe test-package imports, reducing regression risk.	2026-04-09 19:57:35 +08:00
Zhichang Yu	b7744e053e	fix: support dense_vector from ES fields response (ES 9.x compatibility) (#13972) fix: support dense_vector from ES fields response (ES 9.x compatibility) - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Configuration Chore (non-breaking change which updates configuration) ## Summary by CodeRabbit * Bug Fixes * More accurate handling and unwrapping of dense-vector fields so returned values have correct shapes. * Field selection reliably limits returned data and falls back to alternate result locations when needed. * Use of consistent result IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased build memory and adjusted build-time flags for the frontend build. * Simplified runtime model/GPU checks and removed an automated runtime GPU-install attempt. * Build Fixes * `web/vite.config.ts`: make `build.minify` and `build.sourcemap` respect `VITE_MINIFY` and `VITE_BUILD_SOURCEMAP` env vars from Dockerfile instead of hardcoding `terser` and `true`. * Environment * Allow stack version override and default the runtime image tag to "latest". ## Summary by CodeRabbit * Bug Fixes * Correct unwrapping of dense-vector fields and reliable field selection with fallback locations. * Consistent use of hit-level IDs and tolerant handling when score values are missing. * Chores / Configuration * Increased frontend build memory and added build-time minify/sourcemap flags; build minification and sourcemap now configurable. * Removed runtime GPU detection for model initialization; force CPU initialization. * Environment * Allow stack version override and default runtime image tag to "latest". --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 17:44:13 +08:00
yH	757d8d42dd	Fix: use configured OrderByExpr in _community_retrieval_ (#13683 ) The `odr` variable was configured with `desc("weight_flt")` but a new empty `OrderByExpr()` was passed to `dataStore.search()` instead, causing the descending sort to have no effect. ### What problem does this PR solve? In `_community_retrieval_`, the configured `OrderByExpr` with `desc("weight_flt")` was discarded — a new empty `OrderByExpr()` was passed to `dataStore.search()` instead, so community reports were never sorted by weight. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-19 17:55:40 +08:00
Idriss Sbaaoui	7827f0fce5	fix: empty mind map (#13693) ### What problem does this PR solve? Fix graphrag extractor chat response parsing and skip truncated cache values ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-19 13:53:06 +08:00
Lynn	62cb292635	Feat/tenant model (#13072 ) ### What problem does this PR solve? Add id for table tenant_llm and apply in LLMBundle. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com> Co-authored-by: Liu An <asiro@qq.com>	2026-03-05 17:27:17 +08:00
TheoG	67937a668e	Fix graphrag extraction (#13113 ) ### What problem does this PR solve? Fix error when extracting the graph. A string is expected, but a tuple was provided. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-11 20:11:56 +08:00
Magicbook1108	7be3dacdaa	Fix: custom delimiter in docx (#12946) ### What problem does this PR solve? Fix: custom delimiter in docx ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-02-03 09:43:18 +08:00
Kevin Hu	32c0161ff1	Refa: Clean the folders. (#12890 ) ### Type of change - [x] Refactoring	2026-01-29 14:23:26 +08:00
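
The behaviour described in commit 0d836afd34 (similarity accumulates across repeated N-hop edges, while PageRank keeps the strongest value seen) can be illustrated with a minimal sketch. The function name `merge_nhop_paths` and the `(edge, sim, pagerank)` tuple shape are assumptions made for illustration; they are not taken from `rag/graphrag/search.py`.

```python
def merge_nhop_paths(paths):
    """Merge repeated edges produced by several N-hop paths.

    `paths` is an iterable of (edge, sim, pagerank) tuples, where `edge`
    is a hashable (src, dst) pair. This shape is hypothetical.
    """
    merged = {}
    for edge, sim, pagerank in paths:
        entry = merged.get(edge)
        if entry is None:
            merged[edge] = {"sim": sim, "pagerank": pagerank}
        else:
            # Similarity still accumulates across repeated edges.
            entry["sim"] += sim
            # PageRank keeps the maximum instead of the last value seen,
            # so the result no longer depends on path order.
            entry["pagerank"] = max(entry["pagerank"], pagerank)
    return merged
```

Because `max` is order-independent, feeding the same paths in a different order gives the same merged result, which is the property the commit message is after.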

21 Commits
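
The bounded insert pipeline described in commit e8f19aa338 (batches of 64 docs, at most 4 batches in flight) might look roughly like the sketch below. It is not the actual `insert_chunks_bounded()` helper: the function name `insert_bounded_sketch`, its parameters, and the `insert_batch` callback are hypothetical, and the retry and timeout handling mentioned in the commit is omitted.

```python
import asyncio


async def insert_bounded_sketch(chunks, insert_batch, bulk_size=64, concurrency=4):
    """Insert `chunks` in batches with a cap on concurrent batches.

    `insert_batch` is a hypothetical async callable that writes one batch
    to the document store. Returns the number of batches submitted.
    """
    # Create the semaphore inside the running coroutine so it is bound
    # to the event loop that actually executes the inserts.
    sem = asyncio.Semaphore(concurrency)

    async def run(batch):
        async with sem:
            await insert_batch(batch)

    batches = [chunks[i:i + bulk_size] for i in range(0, len(chunks), bulk_size)]
    await asyncio.gather(*(run(batch) for batch in batches))
    return len(batches)
```

With the defaults, no more than `bulk_size * concurrency` = 256 documents are in flight at once, which matches the bound quoted in the commit message.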