Files
ragflow/rag/graphrag/general/extractor.py

373 lines
17 KiB
Python
Raw Normal View History

#
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import asyncio
import logging
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
import os
import re
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
from collections import Counter, defaultdict
from copy import deepcopy
from typing import Callable
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
import networkx as nx
from api.db.services.task_service import has_canceled
from common.token_utils import truncate
from rag.graphrag.general.graph_prompt import SUMMARIZE_DESCRIPTIONS_PROMPT
from rag.graphrag.utils import (
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
GraphChange,
chat_limiter,
flat_uniq_list,
get_from_to,
get_llm_cache,
handle_single_entity_extraction,
handle_single_relationship_extraction,
set_llm_cache,
split_string_by_multi_markers,
)
from common.misc_utils import thread_pool_exec
from rag.llm.chat_model import Base as CompletionLLM
from rag.prompts.generator import message_fit_in
from common.exceptions import TaskCanceledException
GRAPH_FIELD_SEP = "<SEP>"
DEFAULT_ENTITY_TYPES = ["organization", "person", "geo", "event", "category"]
ENTITY_EXTRACTION_MAX_GLEANINGS = 2
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
MAX_CONCURRENT_PROCESS_AND_EXTRACT_CHUNK = int(os.environ.get("MAX_CONCURRENT_PROCESS_AND_EXTRACT_CHUNK", 10))
class Extractor:
_llm: CompletionLLM
def __init__(
self,
llm_invoker: CompletionLLM,
language: str | None = "English",
entity_types: list[str] | None = None,
):
self._llm = llm_invoker
self._language = language
self._entity_types = entity_types or DEFAULT_ENTITY_TYPES
@staticmethod
def _normalize_response_text(response):
if isinstance(response, (list, tuple)):
response = response[0] if response else ""
if response is None:
return ""
return response if isinstance(response, str) else str(response)
@staticmethod
def _is_truncated_cache(response):
return len((response or "").strip()) <= 1
async def _async_chat(self, system, history, gen_conf={}, task_id=""):
hist = deepcopy(history)
conf = deepcopy(gen_conf)
response = await thread_pool_exec(get_llm_cache, self._llm.llm_name, system, hist, conf)
response = self._normalize_response_text(response)
if self._is_truncated_cache(response):
response = ""
if response:
return response
_, system_msg = message_fit_in([{"role": "system", "content": system}], int(self._llm.max_length * 0.92))
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
response = ""
for attempt in range(3):
if task_id:
if await thread_pool_exec(has_canceled, task_id):
logging.info(f"Task {task_id} cancelled during entity resolution candidate processing.")
raise TaskCanceledException(f"Task {task_id} was cancelled")
try:
response = await asyncio.wait_for(
self._llm.async_chat(system_msg[0]["content"], hist, conf),
timeout=60 * 20,
)
response = self._normalize_response_text(response)
response = re.sub(r"^.*</think>", "", response, flags=re.DOTALL)
if response.find("**ERROR**") >= 0:
raise Exception(response)
if not self._is_truncated_cache(response):
await thread_pool_exec(set_llm_cache, self._llm.llm_name, system, response, history, gen_conf)
break
except asyncio.TimeoutError:
logging.warning("_async_chat timed out after 20 minutes")
raise # timeout is not a transient error; do not retry
except Exception as e:
logging.exception(e)
if attempt == 2:
raise
return response
def _entities_and_relations(self, chunk_key: str, records: list, tuple_delimiter: str):
maybe_nodes = defaultdict(list)
maybe_edges = defaultdict(list)
ent_types = [t.lower() for t in self._entity_types]
for record in records:
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
record_attributes = split_string_by_multi_markers(record, [tuple_delimiter])
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
if_entities = handle_single_entity_extraction(record_attributes, chunk_key)
if if_entities is not None and if_entities.get("entity_type", "unknown").lower() in ent_types:
maybe_nodes[if_entities["entity_name"]].append(if_entities)
continue
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
if_relation = handle_single_relationship_extraction(record_attributes, chunk_key)
if if_relation is not None:
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
maybe_edges[(if_relation["src_id"], if_relation["tgt_id"])].append(if_relation)
return dict(maybe_nodes), dict(maybe_edges)
async def __call__(self, doc_id: str, chunks: list[str], callback: Callable | None = None, task_id: str = ""):
self.callback = callback
start_ts = asyncio.get_running_loop().time()
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
async def extract_all(doc_id, chunks, max_concurrency=MAX_CONCURRENT_PROCESS_AND_EXTRACT_CHUNK, task_id=""):
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
out_results = []
error_count = 0
max_errors = int(os.environ.get("GRAPHRAG_MAX_ERRORS", 3))
limiter = asyncio.Semaphore(max_concurrency)
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
async def worker(chunk_key_dp: tuple[str, str], idx: int, total: int, task_id=""):
nonlocal error_count
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
async with limiter:
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled during entity extraction")
try:
await self._process_single_content(chunk_key_dp, idx, total, out_results, task_id)
except Exception as e:
error_count += 1
error_msg = f"Error processing chunk {idx + 1}/{total}: {str(e)}"
logging.warning(error_msg)
if self.callback:
self.callback(msg=error_msg)
if error_count > max_errors:
raise Exception(f"Maximum error count ({max_errors}) reached. Last errors: {str(e)}")
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
tasks = [
asyncio.create_task(worker((doc_id, ck), i, len(chunks), task_id))
for i, ck in enumerate(chunks)
]
try:
await asyncio.gather(*tasks, return_exceptions=False)
except Exception as e:
logging.error(f"Error in worker: {str(e)}")
for t in tasks:
t.cancel()
await asyncio.gather(*tasks, return_exceptions=True)
raise
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
if error_count > 0:
warning_msg = f"Completed with {error_count} errors (out of {len(chunks)} chunks processed)"
logging.warning(warning_msg)
if self.callback:
self.callback(msg=warning_msg)
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
return out_results
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled before entity extraction")
out_results = await extract_all(doc_id, chunks, max_concurrency=MAX_CONCURRENT_PROCESS_AND_EXTRACT_CHUNK, task_id=task_id)
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled after entity extraction")
maybe_nodes = defaultdict(list)
maybe_edges = defaultdict(list)
sum_token_count = 0
for m_nodes, m_edges, token_count in out_results:
for k, v in m_nodes.items():
maybe_nodes[k].extend(v)
for k, v in m_edges.items():
maybe_edges[tuple(sorted(k))].extend(v)
sum_token_count += token_count
now = asyncio.get_running_loop().time()
if self.callback:
self.callback(msg=f"Entities and relationships extraction done, {len(maybe_nodes)} nodes, {len(maybe_edges)} edges, {sum_token_count} tokens, {now - start_ts:.2f}s.")
start_ts = now
logging.info("Entities merging...")
all_entities_data = []
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled before nodes merging")
tasks = [
asyncio.create_task(self._merge_nodes(en_nm, ents, all_entities_data, task_id))
for en_nm, ents in maybe_nodes.items()
]
try:
await asyncio.gather(*tasks, return_exceptions=False)
except Exception as e:
logging.error(f"Error merging nodes: {e}")
for t in tasks:
t.cancel()
await asyncio.gather(*tasks, return_exceptions=True)
raise
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled after nodes merging")
now = asyncio.get_running_loop().time()
if self.callback:
self.callback(msg=f"Entities merging done, {now - start_ts:.2f}s.")
start_ts = now
logging.info("Relationships merging...")
all_relationships_data = []
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled before relationships merging")
tasks = []
for (src, tgt), rels in maybe_edges.items():
tasks.append(
asyncio.create_task(
self._merge_edges(src, tgt, rels, all_relationships_data, task_id)
)
)
try:
await asyncio.gather(*tasks, return_exceptions=False)
except Exception as e:
logging.error(f"Error during relationships merging: {e}")
for t in tasks:
t.cancel()
await asyncio.gather(*tasks, return_exceptions=True)
raise
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled after relationships merging")
now = asyncio.get_running_loop().time()
if self.callback:
self.callback(msg=f"Relationships merging done, {now - start_ts:.2f}s.")
if not len(all_entities_data) and not len(all_relationships_data):
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
logging.warning("Didn't extract any entities and relationships, maybe your LLM is not working")
if not len(all_entities_data):
logging.warning("Didn't extract any entities")
if not len(all_relationships_data):
logging.warning("Didn't extract any relationships")
return all_entities_data, all_relationships_data
async def _merge_nodes(self, entity_name: str, entities: list[dict], all_relationships_data, task_id=""):
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled during merge nodes")
if not entities:
return
entity_type = sorted(
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
Counter([dp["entity_type"] for dp in entities]).items(),
key=lambda x: x[1],
reverse=True,
)[0][0]
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
description = GRAPH_FIELD_SEP.join(sorted(set([dp["description"] for dp in entities])))
already_source_ids = flat_uniq_list(entities, "source_id")
description = await self._handle_entity_relation_summary(entity_name, description, task_id=task_id)
node_data = dict(
entity_type=entity_type,
description=description,
source_id=already_source_ids,
)
node_data["entity_name"] = entity_name
all_relationships_data.append(node_data)
async def _merge_edges(self, src_id: str, tgt_id: str, edges_data: list[dict], all_relationships_data=None, task_id=""):
if not edges_data:
return
weight = sum([edge["weight"] for edge in edges_data])
description = GRAPH_FIELD_SEP.join(sorted(set([edge["description"] for edge in edges_data])))
description = await self._handle_entity_relation_summary(f"{src_id} -> {tgt_id}", description, task_id=task_id)
keywords = flat_uniq_list(edges_data, "keywords")
source_id = flat_uniq_list(edges_data, "source_id")
Refa: revise the implementation of LightRAG and enable response caching (#9828) ### What problem does this PR solve? This revision performed a comprehensive check on LightRAG to ensure the correctness of its implementation. It **did not involve** Entity Resolution and Community Reports Generation. There is an example using default entity types and the General chunking method, which shows good results in both time and effectiveness. Moreover, response caching is enabled for resuming failed tasks. [The-Necklace.pdf](https://github.com/user-attachments/files/22042432/The-Necklace.pdf) After: ![img_v3_02pk_177dbc6a-e7cc-4732-b202-ad4682d171fg](https://github.com/user-attachments/assets/5ef1d93a-9109-4fe9-8a7b-a65add16f82b) ```bash Begin at: Fri, 29 Aug 2025 16:48:03 GMT Duration: 222.31 s Progress: 16:48:04 Task has been received. 16:48:06 Page(1~7): Start to parse. 16:48:06 Page(1~7): OCR started 16:48:08 Page(1~7): OCR finished (1.89s) 16:48:11 Page(1~7): Layout analysis (3.72s) 16:48:11 Page(1~7): Table analysis (0.00s) 16:48:11 Page(1~7): Text merged (0.00s) 16:48:11 Page(1~7): Finish parsing. 16:48:12 Page(1~7): Generate 7 chunks 16:48:12 Page(1~7): Embedding chunks (0.29s) 16:48:12 Page(1~7): Indexing done (0.04s). Task done (7.84s) 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... 16:48:17 Start processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... 16:49:30 Completed processing for f421fb06849e11f0bdd32724b93a52b2: She had no dresses, no je... after 1 gleanings, 21985 tokens. 16:49:30 Entities extraction of chunk 3 1/7 done, 12 nodes, 13 edges, 21985 tokens. 16:49:40 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Finally, she replied, hes... after 1 gleanings, 22584 tokens. 16:49:40 Entities extraction of chunk 5 2/7 done, 19 nodes, 19 edges, 22584 tokens. 16:50:02 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Then she asked, hesitatin... after 1 gleanings, 24610 tokens. 16:50:02 Entities extraction of chunk 0 3/7 done, 16 nodes, 28 edges, 24610 tokens. 16:50:03 Completed processing for f421fb06849e11f0bdd32724b93a52b2: And this life lasted ten ... after 1 gleanings, 24031 tokens. 16:50:04 Entities extraction of chunk 1 4/7 done, 24 nodes, 22 edges, 24031 tokens. 16:50:14 Completed processing for f421fb06849e11f0bdd32724b93a52b2: So they begged the jewell... after 1 gleanings, 24635 tokens. 16:50:14 Entities extraction of chunk 6 5/7 done, 27 nodes, 26 edges, 24635 tokens. 16:50:29 Completed processing for f421fb06849e11f0bdd32724b93a52b2: Her husband, already half... after 1 gleanings, 25758 tokens. 16:50:29 Entities extraction of chunk 2 6/7 done, 25 nodes, 35 edges, 25758 tokens. 16:51:35 Completed processing for f421fb06849e11f0bdd32724b93a52b2: The Necklace By Guy de Ma... after 1 gleanings, 27491 tokens. 16:51:35 Entities extraction of chunk 4 7/7 done, 39 nodes, 37 edges, 27491 tokens. 16:51:35 Entities and relationships extraction done, 147 nodes, 177 edges, 171094 tokens, 198.58s. 16:51:35 Entities merging done, 0.01s. 16:51:35 Relationships merging done, 0.01s. 16:51:35 ignored 7 relations due to missing entities. 16:51:35 generated subgraph for doc f421fb06849e11f0bdd32724b93a52b2 in 198.68 seconds. 16:51:35 run_graphrag f421fb06849e11f0bdd32724b93a52b2 graphrag_task_lock acquired 16:51:35 set_graph removed 0 nodes and 0 edges from index in 0.00s. 16:51:35 Get embedding of nodes: 9/147 16:51:35 Get embedding of nodes: 109/147 16:51:37 Get embedding of edges: 9/170 16:51:37 Get embedding of edges: 109/170 16:51:40 set_graph converted graph change to 319 chunks in 4.21s. 16:51:40 Insert chunks: 4/319 16:51:40 Insert chunks: 104/319 16:51:40 Insert chunks: 204/319 16:51:40 Insert chunks: 304/319 16:51:40 set_graph added/updated 147 nodes and 170 edges from index in 0.53s. 16:51:40 merging subgraph for doc f421fb06849e11f0bdd32724b93a52b2 into the global graph done in 4.79 seconds. 16:51:40 Knowledge Graph done (204.29s) ``` Before: ![img_v3_02pk_63370edf-ecee-4ee8-8ac8-69c8d2c712fg](https://github.com/user-attachments/assets/1162eb0f-68c2-4de5-abe0-cdfa168f71de) ```bash Begin at: Fri, 29 Aug 2025 17:00:47 GMT processDuration: 173.38 s Progress: 17:00:49 Task has been received. 17:00:51 Page(1~7): Start to parse. 17:00:51 Page(1~7): OCR started 17:00:53 Page(1~7): OCR finished (1.82s) 17:00:57 Page(1~7): Layout analysis (3.64s) 17:00:57 Page(1~7): Table analysis (0.00s) 17:00:57 Page(1~7): Text merged (0.00s) 17:00:57 Page(1~7): Finish parsing. 17:00:57 Page(1~7): Generate 7 chunks 17:00:57 Page(1~7): Embedding chunks (0.31s) 17:00:57 Page(1~7): Indexing done (0.03s). Task done (7.88s) 17:00:57 created task graphrag 17:01:00 Task has been received. 17:02:17 Entities extraction of chunk 1 1/7 done, 9 nodes, 9 edges, 10654 tokens. 17:02:31 Entities extraction of chunk 2 2/7 done, 12 nodes, 13 edges, 11066 tokens. 17:02:33 Entities extraction of chunk 4 3/7 done, 9 nodes, 10 edges, 10433 tokens. 17:02:42 Entities extraction of chunk 5 4/7 done, 11 nodes, 14 edges, 11290 tokens. 17:02:52 Entities extraction of chunk 6 5/7 done, 13 nodes, 15 edges, 11039 tokens. 17:02:55 Entities extraction of chunk 3 6/7 done, 14 nodes, 13 edges, 11466 tokens. 17:03:32 Entities extraction of chunk 0 7/7 done, 19 nodes, 18 edges, 13107 tokens. 17:03:32 Entities and relationships extraction done, 71 nodes, 89 edges, 79055 tokens, 149.66s. 17:03:32 Entities merging done, 0.01s. 17:03:32 Relationships merging done, 0.01s. 17:03:32 ignored 1 relations due to missing entities. 17:03:32 generated subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 in 149.69 seconds. 17:03:32 run_graphrag b1d9d3b6848711f0aacd7ddc0714c4d3 graphrag_task_lock acquired 17:03:32 set_graph removed 0 nodes and 0 edges from index in 0.00s. 17:03:32 Get embedding of nodes: 9/71 17:03:33 Get embedding of edges: 9/88 17:03:34 set_graph converted graph change to 161 chunks in 2.27s. 17:03:34 Insert chunks: 4/161 17:03:34 Insert chunks: 104/161 17:03:34 set_graph added/updated 71 nodes and 88 edges from index in 0.28s. 17:03:34 merging subgraph for doc b1d9d3b6848711f0aacd7ddc0714c4d3 into the global graph done in 2.60 seconds. 17:03:34 Knowledge Graph done (153.18s) ``` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring - [x] Performance Improvement
2025-08-29 17:58:36 +08:00
edge_data = dict(src_id=src_id, tgt_id=tgt_id, description=description, keywords=keywords, weight=weight, source_id=source_id)
all_relationships_data.append(edge_data)
async def _merge_graph_nodes(self, graph: nx.Graph, nodes: list[str], change: GraphChange, task_id=""):
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled during merge graph nodes")
if len(nodes) <= 1:
return
change.added_updated_nodes.add(nodes[0])
change.removed_nodes.update(nodes[1:])
nodes_set = set(nodes)
node0_attrs = graph.nodes[nodes[0]]
node0_neighbors = set(graph.neighbors(nodes[0]))
for node1 in nodes[1:]:
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled during merge_graph nodes")
# Merge two nodes, keep "entity_name", "entity_type", "page_rank" unchanged.
node1_attrs = graph.nodes[node1]
node0_attrs["description"] += f"{GRAPH_FIELD_SEP}{node1_attrs['description']}"
node0_attrs["source_id"] = sorted(set(node0_attrs["source_id"] + node1_attrs["source_id"]))
feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false|0|no|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. **UX impact:** the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. ## 5. Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. ## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):
2026-05-06 02:01:01 -05:00
# Snapshot neighbors before mutation; otherwise networkx raises
# "dictionary keys changed during iteration" when concurrent merges
# or graph.add_edge/remove_node below touch the same adjacency dict.
for neighbor in list(graph.neighbors(node1)):
change.removed_edges.add(get_from_to(node1, neighbor))
if neighbor not in nodes_set:
edge1_attrs = graph.get_edge_data(node1, neighbor)
if neighbor in node0_neighbors:
# Merge two edges
change.added_updated_edges.add(get_from_to(nodes[0], neighbor))
edge0_attrs = graph.get_edge_data(nodes[0], neighbor)
edge0_attrs["weight"] += edge1_attrs["weight"]
edge0_attrs["description"] += f"{GRAPH_FIELD_SEP}{edge1_attrs['description']}"
for attr in ["keywords", "source_id"]:
edge0_attrs[attr] = sorted(set(edge0_attrs[attr] + edge1_attrs[attr]))
edge0_attrs["description"] = await self._handle_entity_relation_summary(f"({nodes[0]}, {neighbor})", edge0_attrs["description"], task_id=task_id)
graph.add_edge(nodes[0], neighbor, **edge0_attrs)
else:
graph.add_edge(nodes[0], neighbor, **edge1_attrs)
feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false|0|no|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. **UX impact:** the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. ## 5. Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. ## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):
2026-05-06 02:01:01 -05:00
# Track the redirected neighbour so a later node1 in this
# merge that also points to it takes the merge branch
# above instead of overwriting the edge we just added.
node0_neighbors.add(neighbor)
graph.remove_node(node1)
node0_attrs["description"] = await self._handle_entity_relation_summary(nodes[0], node0_attrs["description"], task_id=task_id)
graph.nodes[nodes[0]].update(node0_attrs)
async def _handle_entity_relation_summary(self, entity_or_relation_name: str, description: str, task_id="") -> str:
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled during summary handling")
summary_max_tokens = 512
use_description = truncate(description, summary_max_tokens)
description_list = use_description.split(GRAPH_FIELD_SEP)
if len(description_list) <= 12:
return use_description
prompt_template = SUMMARIZE_DESCRIPTIONS_PROMPT
context_base = dict(
entity_name=entity_or_relation_name,
description_list=description_list,
language=self._language,
)
use_prompt = prompt_template.format(**context_base)
logging.info(f"Trigger summary: {entity_or_relation_name}")
if task_id and has_canceled(task_id):
raise TaskCanceledException(f"Task {task_id} was cancelled during summary handling")
async with chat_limiter:
summary = await self._async_chat("", [{"role": "user", "content": use_prompt}], {}, task_id)
return summary