ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-05 19:08:38 +08:00

Author	SHA1	Message	Date
Wang Qi	4cbe597d7e	Refactor: consolidate to use @login_required (#15652 ) Refactor: consolidate to use @login_required	2026-06-05 11:35:00 +08:00
euvre	f4b8f53b6d	Fix: restore embedding model switching for datasets with existing chunks (#14732 ) ### What problem does this PR solve? ## Problem During the REST API refactoring (#13690), the `/api/v2/kb/check_embedding` endpoint was removed and never migrated to the new RESTful structure. The frontend was pointed to the `/api/v1/datasets/{id}/embedding` endpoint (which is `run_embedding` — a completely different function). Additionally, a hard guard was introduced that rejects any `embd_id` change when `chunk_num > 0`, making it impossible to switch embedding models on datasets with existing chunks. ## Root Cause 1. Missing endpoint: The old `check_embedding` logic (sample random chunks, re-embed with the new model, compare cosine similarity) was not carried over to the new REST API service layer. 2. Wrong frontend URL: `checkEmbedding` in `api.ts` pointed to `/datasets/{id}/embedding` (`run_embedding`) instead of a dedicated check endpoint. 3. Overly restrictive guard: `dataset_api_service.py` line 310 blocked all `embd_id` updates when `chunk_num > 0`. This check did not exist in the pre-refactor code — it was incorrectly introduced during the refactor. ## Changes ### Backend - `api/apps/services/dataset_api_service.py` - Remove the `chunk_num > 0` hard guard on `embd_id` updates - Add `check_embedding()` service function: samples random chunks, re-embeds them with the candidate model, computes cosine similarity, returns compatibility result (avg ≥ 0.9 = compatible) - Add `import re` for the `_clean()` helper - `api/apps/restful_apis/dataset_api.py` - Add `POST /datasets/<dataset_id>/embedding/check` endpoint following the new REST API conventions - Clean up unused top-level imports (`random`, `re`, `numpy`) ### Frontend - `web/src/utils/api.ts` - Fix `checkEmbedding` URL from `/datasets/${datasetId}/embedding` → `/datasets/${datasetId}/embedding/check` ### Tests - `test/testcases/test_http_api/test_dataset_management/test_update_dataset.py` - Update `test_embedding_model_with_existing_chunks` to assert success (`code == 0`) instead of expecting the old `102` error - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py` - Update `test_update_route_branch_matrix_unit` to assert `RetCode.SUCCESS` when updating `embd_id` on a chunked dataset, replacing the old `chunk_num` error assertion ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-05-09 18:48:57 +08:00
Preston Percival	e8f19aa338	feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238 ) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false\|0\|no\|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done\|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. ## 5. Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. ## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-06 15:01:01 +08:00
euvre	4dcc42e0e1	feat(api): add unified index API and dataset management endpoints (#14222 ) ### What problem does this PR solve? ## Summary Refactor the dataset API layer into a clean service/REST separation pattern, add a unified `/index` API for graph/raptor/mindmap operations, and introduce several new dataset management endpoints with full test coverage. ## Changes ### Service Layer (`dataset_api_service.py`) - Added `trace_index(dataset_id, tenant_id, index_type)` — unified trace function for all index types - Added `run_index`, `delete_index` service functions - Added `get_dataset`, `get_ingestion_summary`, `list_ingestion_logs`, `get_ingestion_log` - Added `run_embedding`, `list_tags`, `aggregate_tags`, `delete_tags`, `rename_tag` - Added `get_flattened_metadata`, `get_auto_metadata`, `update_auto_metadata` ### REST API Layer (`dataset_api.py`) New unified routes: \| Method \| Route \| Description \| \|--------\|-------\|-------------\| \| POST \| `/datasets/<id>/index?type=graph\\|raptor\\|mindmap` \| Run index task \| \| GET \| `/datasets/<id>/index?type=graph\\|raptor\\|mindmap` \| Trace index task \| \| DELETE \| `/datasets/<id>/<index_type>` \| Delete index \| \| GET \| `/datasets/<id>` \| Get dataset details \| \| GET \| `/datasets/<id>/ingestions/summary` \| Ingestion summary \| \| GET \| `/datasets/<id>/ingestions` \| List ingestion logs \| \| GET \| `/datasets/<id>/ingestions/<log_id>` \| Get single ingestion log \| \| POST \| `/datasets/<id>/embedding` \| Run embedding \| \| GET \| `/datasets/<id>/tags` \| List tags \| \| GET \| `/datasets/tags/aggregation` \| Aggregate tags across datasets \| \| DELETE \| `/datasets/<id>/tags` \| Delete tags \| \| PUT \| `/datasets/<id>/tags` \| Rename tag \| \| GET \| `/datasets/metadata/flattened` \| Get flattened metadata \| \| GET/PUT \| `/datasets/<id>/metadata/config` \| New metadata config path \| Removed routes (replaced by unified `/index`): - `POST /datasets/<id>/mindmap` - `GET /datasets/<id>/mindmap` Preserved legacy routes (backward compatibility): - `/run_graphrag`, `/trace_graphrag`, `/run_raptor`, `/trace_raptor` - `/auto_metadata` GET/PUT ### Test Suite - Updated `common.py` helpers: added `trace_index`, removed `run_mindmap`/`trace_mindmap` - Added 7 new test files with 39 test cases total: \| Test File \| Cases \| \|-----------\|-------\| \| `test_get_dataset.py` \| 4 \| \| `test_ingestion_summary.py` \| 2 \| \| `test_ingestion_logs.py` \| 5 \| \| `test_index_api.py` \| 14 \| \| `test_embedding.py` \| 2 \| \| `test_tags.py` \| 8 \| \| `test_flattened_metadata.py` \| 4 \| - Deleted `test_mindmap_tasks.py` (covered by unified index tests) ## Design Decisions 1. Unified `/index?type=...` — single endpoint replaces 3 separate route pairs for graph/raptor/mindmap 2. Backward compatibility — old routes (`/run_graphrag`, `/run_raptor`, `/auto_metadata`) preserved alongside new paths 3. `_VALID_INDEX_TYPES = {"graph", "raptor", "mindmap"}` — input validation via constant set 4. `_INDEX_TYPE_TO_TASK_ID_FIELD` — maps index type to KB model task ID field for clean dispatch ## Files Changed - `api/apps/restful_apis/dataset_api.py` - `api/apps/services/dataset_api_service.py` - `sdk/python/ragflow_sdk/modules/dataset.py` - `test/testcases/test_http_api/common.py` - `test/testcases/test_http_api/test_dataset_management/` (7 new files) ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 09:38:01 +08:00
Idriss Sbaaoui	f3b4d6ab0e	Fix: ci fails (#13778 ) ### What problem does this PR solve? fix tests failing at p2 and p3 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-03-25 17:56:13 +08:00
Lynn	4bb1acaa5b	Refactor: dataset / kb API to RESTFul style (#13690 ) ### What problem does this PR solve? 1. Split dataset api to gateway and service, and modify web UI to use restful http api. 2. Old KB releated APIs are commented. ### Type of change - [x] Refactoring --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-03-19 14:41:36 +08:00
Jin Hai	986dcf1cc8	Revert "Refactor: dataset / kb API to RESTFul style" (#13646 ) Reverts infiniflow/ragflow#13619	2026-03-17 12:09:48 +08:00
Lynn	1db5409d82	Refactor: dataset / kb API to RESTFul style (#13619 ) ### What problem does this PR solve? 1. Split dataset api to gateway and service, and modify web UI to use restful http api. 2. Old KB releated APIs are commented. ### Type of change - [x] Refactoring	2026-03-16 22:51:34 +08:00
Jin Hai	a2d72202cf	Revert "Refactor dataset / kb API to RESTFul style" (#13614 ) Reverts infiniflow/ragflow#13263	2026-03-16 10:44:38 +08:00
Lynn	7c32e206be	Refactor dataset / kb API to RESTFul style (#13263 ) ### What problem does this PR solve? 1. Split dataset api to gateway and service, and modify web UI to use restful http api. 2. Old KB releated APIs are commented. ### Type of change - [x] Refactoring	2026-03-13 20:02:35 +08:00
Yongteng Lei	51be1f1442	Refa: empty ids means no-op operation (#13439 ) ### What problem does this PR solve? Empty ids means no-op operation. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Documentation Update - [x] Refactoring --------- Co-authored-by: writinwaters <cai.keith@gmail.com>	2026-03-06 18:16:42 +08:00
6ba3i	22c4d72891	tests: improve RAGFlow coverage based on Codecov report (#13219 ) ### What problem does this PR solve? Codecov’s coverage report shows that several RAGFlow code paths are currently untested or under-tested. This makes it easier for regressions to slip in during refactors and feature work. This PR adds targeted automated tests to cover the files and branches highlighted by Codecov, improving confidence in core behavior while keeping runtime functionality unchanged. ### Type of change - [x] Other (please describe): Test coverage improvement (adds/extends unit and integration tests to address Codecov-reported gaps)	2026-02-26 19:03:26 +08:00

12 Commits