ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-04 01:29:35 +08:00

Author	SHA1	Message	Date
Hunnyboy1217	782084780e	feat(connectors): ETag-based bypass for incremental S3 ingestion (#14628 ) (#14677 ) ### What problem does this PR solve? S3-family connector syncs currently re-download every in-window object just so we can compute `xxhash128(blob)` and compare against `Document.content_hash`. Anything that bumps `LastModified` without changing bytes (`aws s3 cp` touches, bucket re-encryption, etc.) pays full bandwidth and re-parses files that didn't actually change. #14628 covers the broader incremental-ingestion redesign; this PR is the first slice. The fix is a pre-listing short-circuit. `BlobStorageConnector` (S3 / R2 / GCS / OCI / S3-compat) now implements a new `FingerprintConnector` interface: `list_keys()` paginates `list_objects_v2` and yields `KeyRecord(key, fingerprint)` where `fingerprint = xxhash128(ETag)`. The orchestrator joins those against the connector's existing `{doc_id: content_hash}` map and only calls `get_value(key)` when the fingerprint differs. Unchanged keys are skipped entirely — no `GetObject`, no re-parse. No DDL. xxhash128(ETag) is 32 hex chars and reuses the existing `Document.content_hash` column per @yingfeng's suggestion; the connector decides at listing time whether to populate it. Local uploads and connectors that don't opt in fall through to the existing post-download `xxhash128(blob)` path with no behavior change. This is PR-1 of a 4-PR series — full design lives on #14628. Subsequent PRs extend tier 1 to local FS / WebDAV / Dropbox / Seafile / RDBMS (PR-2), wire up tier 2 cursor connectors with `SyncLogs.next_checkpoint` (PR-3), and unify deletion via `KeyRecord(deleted=True)` reconciliation (PR-4). Holding those back keeps this PR additive and reviewable on its own. #### Files touched - `common/data_source/models.py` — new `KeyRecord`; optional `fingerprint` on `Document` - `common/data_source/interfaces.py` — `IncrementalCapability` enum, `FingerprintConnector` ABC - `common/data_source/blob_connector.py` — `BlobStorageConnector` implements `FingerprintConnector`; per-object download factored into `_build_document_from_obj()` so `_yield_blob_objects`, `list_keys`, `get_value` all share it - `rag/svr/sync_data_source.py` — `_BlobLikeBase._fingerprint_filtered_generator` does the bypass loop; `_run_task_logic` plumbs `doc.fingerprint` into the upload dict - `api/db/services/document_service.py` — `list_id_content_hash_map_by_kb_and_source_type()` helper - `api/db/services/connector_service.py` + `file_service.py` — fingerprint flows through `duplicate_and_parse → upload_document` and lands in `content_hash` - `test/unit_test/common/test_blob_connector_fingerprint.py` — 14 tests covering ETag normalization (single-part, multipart, quoted, empty), `list_keys()` not calling `GetObject`, `get_value()` materializing with fingerprint, deterministic/stable fingerprints, and the bypass loop asserting `GetObject` is not called on a match #### Worth flagging for review Old `_BlobLikeBase._generate` called `poll_source(start, now)` with a `LastModified` window when `poll_range_start` was set. New code uses `_fingerprint_filtered_generator` (full bucket listing + fingerprint compare) outside of explicit `reindex=1`. Strictly better for unchanged-bucket cases since it skips `GetObject`, but it does mean every sync now does a full `list_objects_v2` paginate. Should still be cheap for most buckets — flagging in case anyone has a very large bucket where the time-window filter was meaningful. On migration: existing rows have `content_hash = xxhash128(blob)` from the old code. The first sync after this lands sees ETag-derived fingerprints that don't match, re-fetches every object once, and writes the new fingerprint. From the second sync onward the bypass works as expected. "Slow day one, fast every day after." A `fingerprint_backfill: trust` opt-out is sketched in the design doc but not in this PR. #### Test plan - [x] `uv run ruff check` — clean on all 8 touched files - [x] `uv run pytest test/unit_test/common/test_blob_connector_fingerprint.py -v` — 14 passed - [x] Broader unit-test suite — no regressions in anything I touched - [ ] Manual smoke against a real S3 bucket — configure a connector, run sync twice, expect the second sync to log `bypassed=N, fetched=0` and no `GetObject` calls in CloudTrail / bucket access logs - [ ] Manual smoke with `reindex=1` — confirm the full re-download path still works ### Type of change - [x] New Feature (non-breaking change which adds functionality) --------- Co-authored-by: Yingfeng <yingfeng.zhang@gmail.com>	2026-05-09 20:03:56 +08:00
Lynn	efe6d23d61	Fix: handle id as keyword (#14729 ) ### What problem does this PR solve? Update mapping.json to treat id as a keyword. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-09 17:41:08 +08:00
Ricardo-M-L	1046042e01	fix(llm): replace mutable default `gen_conf={}` with None + defensive copy (#14566 ) ### What 19 methods across `rag/llm/chat_model.py` and `rag/llm/cv_model.py` declare `gen_conf={}` (or `gen_conf: dict = {}`) as a parameter default and then mutate `gen_conf` in place — typically `del gen_conf["max_tokens"]`, `gen_conf["penalty_score"] = ...`, or `gen_conf.pop(...)` as part of provider-specific normalization. ### The two bugs in this pattern 1. Mutable default argument (Python footgun). Python evaluates default values once at function-definition time, so the single `{}` dict is shared across every caller that doesn't pass `gen_conf`. The first such call's mutations leak into the default seen by every subsequent call. ```python # Before def chat_streamly(self, system, history, gen_conf={}, kwargs): if "max_tokens" in gen_conf: del gen_conf["max_tokens"] # mutates the SHARED default dict ... ``` After call N with `max_tokens` set, call N+1 that omits `gen_conf` no longer sees `max_tokens` — even though the caller never touched it. 2. Caller-dict pollution.** When the caller does pass a `gen_conf` dict, the same in-place mutations modify the caller's dict. A reused `gen_conf` (very common for chat-loop callers that build the config once and pass it on every turn) silently loses `max_tokens`, `presence_penalty`, etc. after the first round. ### The fix In every affected method: - Change `gen_conf={}` (or `gen_conf: dict = {}`) → `gen_conf=None`. - Add `gen_conf = dict(gen_conf or {})` as the first statement of the body so all subsequent mutations operate on a fresh local copy. ```python # After def chat_streamly(self, system, history, gen_conf=None, kwargs): gen_conf = dict(gen_conf or {}) if "max_tokens" in gen_conf: del gen_conf["max_tokens"] # local copy — safe ... ``` This is byte-for-byte identical provider-side behavior for callers that already pass a fresh `gen_conf` per call. The new `dict(...)` copy is O(small constant) per call. ### Files changed - `rag/llm/chat_model.py` — 17 methods - `rag/llm/cv_model.py` — 2 methods ### Tests Adds `test/unit_test/rag/llm/test_gen_conf_no_mutable_default.py` — an `ast`-based regression guard that walks both modules and asserts no parameter named `gen_conf` ever has a mutable literal (`{}` or `[]`) as its default. The test caught five additional `gen_conf: dict = {}` sites that an initial `gen_conf={}` text grep had missed (annotated parameters with whitespace), and would fail again if the pattern is ever reintroduced. ``` $ pytest test/unit_test/rag/llm/test_gen_conf_no_mutable_default.py -v ============================== 3 passed in 0.04s =============================== ``` `ruff check` passes on all touched files. ### Notes - This PR is intentionally focused on just** the `gen_conf` default + copy fix. There's a related (but separate) `history.insert(0, ...)` pattern in the same files that mutates the caller's history list in 12 places — left for a follow-up so this PR stays mechanical and easy to review. ### Latest revision (`700bb54a7`) — addresses CodeRabbit review - Type annotation: `gen_conf: dict = None` → `gen_conf: dict \| None = None` (5 occurrences in `chat_model.py`). The old annotation was a static-checker mismatch since `None` isn't a `dict`. - Regression test: the AST check accessed `default.keys` directly. `ast.List` has no `.keys` attribute — a future `gen_conf=[]` would crash with `AttributeError` instead of being caught. Use `getattr` for both `.keys` (Dict) and `.elts` (List). Manually verified the updated check correctly catches both `gen_conf={}` and `gen_conf=[]` while ignoring `gen_conf=None` and non-empty literals. --------- Co-authored-by: Ricardo <ricardo@example.com>	2026-05-09 13:11:44 +08:00
VincentLambert	4f3711d37f	fix: handle missing 'total' key causing KeyError in deep research retrieval (#13942 ) ## Summary - When KB retrieval fails (e.g. ES `AssertionError` on empty `index_names`), `kbinfos` falls back to a dict without a `total` key - `_async_update_chunk_info` then iterates over `chunk_info.keys()` (which includes `total`) and tries `kbinfos['total']`, raising a `KeyError` - This error surfaces when using Tavily web retrieval in a chat with no knowledge base attached ## Changes - Add `'total': 0` to all default `kbinfos` dicts in `_retrieve_information` - Add `setdefault('total', 0)` guard after successful KB retrieval to handle cases where the retrieval result omits the key - Accumulate `total` correctly in the merge branch of `_async_update_chunk_info` ## Test plan - [ ] Start a chat with Tavily configured and no knowledge base - [ ] Verify no `KeyError: 'total'` is raised - [ ] Verify Tavily results are returned correctly --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-09 10:57:51 +08:00
Octopus	653b00b94c	fix(sync): scope document IDs per connector to prevent cross-KB collisions (#14378 ) Fixes #14360 ## Problem When the same blob storage bucket is connected to multiple knowledge bases (each through a different data source connector), the sync pipeline hashes only the blob path (`bucket_type:bucket_name:object_key`) to derive the document ID. Every connector pointing at the same bucket therefore produces identical IDs for the same object. The collision guard in `FileService.upload_document` then fires for the second knowledge base: ``` Existing document id collision with another knowledge base; skipping update. ``` This makes it impossible to index the same bucket into more than one KB simultaneously. ## Solution Include `connector_id` in the hash input so that each connector produces a distinct document ID even when the underlying blob path is identical: ```python # Before "id": hash128(doc.id), # After "id": hash128(f"{task['connector_id']}:{doc.id}"), ``` Because each KB connection uses its own connector (with a unique `connector_id`), documents are now namespaced per connector and no collision occurs. Note: This is a breaking change for existing synced data sources. After upgrading, a re-sync will create new documents with the updated ID format. Old documents (indexed under the previous format) will remain in the database but can be manually deleted or cleaned up via a re-sync with reindex enabled. ## Testing - Verified that the one-line change produces unique IDs for two connectors pointing at the same S3 path. - Existing unit test `test_upload_document_skips_cross_kb_document_id_collision` continues to pass — the collision guard in `FileService` is still valid for genuinely colliding IDs from other sources. --------- Co-authored-by: octo-patch <octo-patch@github.com>	2026-05-09 10:33:54 +08:00
qinling0210	4d6e8dffac	Do not bypass threshold for rerank when metadata filter is enabled (#14684 ) ### What problem does this PR solve? Do not bypass threshold for rerank when metadata filter is enabled ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-08 17:48:30 +08:00
web-dev0521	a32ebf32bd	Fix: handle null document_metadata in kb_prompt to prevent citation crash (#14651 ) (#14666 ) ### What problem does this PR solve? Fixes #14651. `kb_prompt()` in `rag/prompts/generator.py` crashes with `AttributeError: 'NoneType' object has no attribute 'items'` during agent citation generation when a retrieved chunk carries `document_metadata: null`. Root cause. The crash happens at `rag/prompts/generator.py:132-133`: ```python meta = ck.get("document_metadata", {}) for k, v in meta.items(): ``` `dict.get(key, default)` only returns the default when the key is missing. When the key is present with an explicit `None` value, `.get()` returns `None`, and `.items()` crashes. How the chunk gets `None`. It's a round-trip inside RAGFlow itself, not bad input from retrieval: 1. The agent stores retrieved chunks via `agent/canvas.py:814`, which routes them through `chunks_format()`. 2. `rag/prompts/generator.py:61` canonicalizes the field with `chunk.get("document_metadata")` (no default), so chunks without metadata become `{"document_metadata": None, ...}`. 3. `agent/component/agent_with_tools.py:314` feeds those canonicalized chunks back into `kb_prompt()` for citation generation, and `.get("document_metadata", {})` no longer protects us. Fix. One-line change at `rag/prompts/generator.py:132`: use `ck.get("document_metadata") or {}` so an explicit `None` is also coerced to `{}`. The line-61 `None` is intentionally part of the API/UI contract — the frontend handles it via optional chaining (`web/src/components/markdown-content/index.tsx:184`, `web/src/pages/next-search/search-view.tsx:217`) — so the fix belongs at the consumer, not the producer. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-08 16:54:33 +08:00
D2758695161	f063e03a24	fix: add bucket prefix to Azure Blob SPN and SAS storage operations (#14347 ) ## Summary Fixes file collision between different datasets when using Azure Blob storage (SPN or SAS authentication). ## Bug azure_spn_conn.py and zure_sas_conn.py ignored the ucket parameter entirely, storing all files flat with just the filename. This caused files with the same name from different datasets (knowledge bases) to overwrite each other. ## Fix Prepend bucket/ as a path prefix in all methods (put, m, get, obj_exist, get_presigned_url, health) to match the behavior of MinIO and S3 implementations. ## Changes - rag/utils/azure_spn_conn.py: Added {bucket}/ prefix to file paths in all operations - rag/utils/azure_sas_conn.py: Same fix applied for consistency (also noted in the original issue) ## Testing Manual verification: files from different datasets now stored under distinct bucket/ prefixes, preventing collisions. Fixes #14159 Co-authored-by: Hunter <hunter@yitong.ai> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-05-08 12:06:28 +08:00
d 🔹	057806d7f1	fix: prepend bucket prefix to Azure SPN and SAS storage paths (#14185 ) ## Summary Fixes #14159 — files from different datasets can overwrite each other in Azure Blob storage. ## Problem Both `azure_spn_conn.py` and `azure_sas_conn.py` ignore the `bucket` parameter in all storage operations (`put`, `get`, `rm`, `obj_exist`, `get_presigned_url`). Files are stored flat using only the filename, so two datasets containing a file with the same name will overwrite each other. The MinIO and S3 implementations correctly use the bucket (typically the knowledge base ID) as a path prefix to create logical folder isolation: - MinIO: uses `use_prefix_path` decorator → `{orig_bucket}/{fnm}` - S3: uses `use_prefix_path` decorator → `{prefix_path}/{bucket}/{fnm}` ## Fix Prepend `{bucket}/` to the file path in all 5 operations across both Azure connector files: \| File \| Methods fixed \| \|------\|---------------\| \| `azure_spn_conn.py` \| `put`, `get`, `rm`, `obj_exist`, `get_presigned_url` \| \| `azure_sas_conn.py` \| `put`, `get`, `rm`, `obj_exist`, `get_presigned_url` \| This matches the existing convention where `bucket` is the knowledge base ID used as a directory prefix. ## ⚠️ Migration Note Existing Azure SPN/SAS deployments have files stored without the bucket prefix. After this fix, new files will be stored under `{bucket}/{filename}` while existing files remain at `{filename}`. A one-time migration script or manual file move may be needed for existing deployments. New deployments are unaffected. ## Testing - Verified the fix is consistent across all 5 methods in both files - The `health()` method is intentionally left unchanged as it uses a hardcoded test filename without bucket semantics Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 20:48:32 +08:00
Jack Storment	59bb184e63	feat(moodle): support deleted-file sync (#14548 ) Fixes #14551 ### What problem does this PR solve? The Moodle connector did not let the sync runner clean up indexed documents that were deleted from the source. Other connectors such as dropbox, seafile, webdav, and rss already do this through a slim snapshot pass. This PR adds the same support for Moodle. When `sync_deleted_files` is on, the runner now asks the Moodle connector for a lightweight list of every module id that could be indexed. The runner then compares this list with the index and removes any indexed document whose id is not in the list. The slim pass does not download files. It only goes through courses and modules and yields ids. The id format matches the ids that the loader produces, so the match is exact. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Notes - `MoodleConnector` now also implements `SlimConnectorWithPermSync`. - New `retrieve_all_slim_docs_perm_sync` yields slim docs with the same ids the loader uses (`moodle_resource_<id>`, `moodle_forum_<id>`, `moodle_page_<id>`, `moodle_book_<id>`, `moodle_assign_<id>`, `moodle_quiz_<id>`). - The `Moodle` sync class now returns `(document_generator, file_list)` so the runner can do the cleanup. If the slim snapshot fails, `file_list` is set back to `None` and the run continues without cleanup. - The web data source map exposes `syncDeletedFiles` for Moodle so the option shows up in the UI. ### How was this tested? - `ruff check` passes on the changed Python files. - Manual review of the produced slim ids against the ids the loader builds in `_process_resource`, `_process_forum`, `_process_page`, `_process_book`, and `_process_activity`. - Behavior parity with the merged dropbox (#14476), seafile (#14499), webdav (#14491), and rss (#14493) PRs.	2026-05-07 17:44:46 +08:00
Octopus	5c9124c3ef	fix: prepend bucket prefix in Azure Blob (SAS/SPN) to prevent cross-dataset file overwrites (#14174 ) Fixes #14159 ## Problem The `put()`, `get()`, `rm()`, and `obj_exist()` methods in both `azure_spn_conn.py` and `azure_sas_conn.py` ignore the `bucket` parameter entirely, storing all files flat using only the filename. This causes files from different datasets to overwrite each other when they share the same filename. By contrast, the MinIO and S3 implementations correctly use the bucket (typically the knowledge base ID) as a path prefix, creating logical folder isolation like `{kb_id}/{filename}`. ## Solution Prepend the `bucket` parameter as a path prefix to all file operations in both Azure storage implementations: - `azure_spn_conn.py`: `create_file`, `delete_file`, `get_file_client` now use `f"{bucket}/{fnm}"` - `azure_sas_conn.py`: `upload_blob`, `delete_blob`, `download_blob`, `get_blob_client` now use `f"{bucket}/{fnm}"` This matches the behavior of all other storage backends (MinIO, S3) and prevents filename collisions across knowledge bases. ## Testing - Verified the fix aligns with how MinIO/S3 connectors handle the bucket parameter - The `health()` method is left unchanged as it uses a fixed test path for connectivity checks only Co-authored-by: octo-patch <octo-patch@github.com> Co-authored-by: Jin Hai <haijin.chn@gmail.com>	2026-05-07 17:13:43 +08:00
Magicbook1108	911671cef0	Feat: enable sync deleted files for RDBMS & fix remove last file issue (#14615 ) ### What problem does this PR solve? Feat: enable sync deleted files for RDBMS & fix remove last file issue ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality)	2026-05-07 13:31:05 +08:00
Zhichang Yu	86fe78c73f	feat(llm): add MiniMax GroupId header support (#14610 ) ## Summary - Add MiniMax provider GroupId query parameter support in `LiteLLMBase` - Extract `group_id` from key configuration in `__init__` - Append `GroupId` as query parameter to `api_base` in `_construct_complete_args` ## Why this change is needed MiniMax provides an OpenAI-compatible API endpoint (`/v1/chat/completions`), but `GroupId` is a MiniMax-specific account identifier required for billing and rate limiting - it is not part of the OpenAI standard. Looking at LiteLLM's `MinimaxChatConfig`: - `get_complete_url()` only constructs the base URL (e.g., `https://api.minimaxi.com/v1/chat/completions`) - LiteLLM does not automatically inject `GroupId` into requests - This must be handled by the caller (ragflow's chat_model.py) The implementation appends `GroupId` as a query parameter to `api_base`: ```python api_base = completion_args.get("api_base", self.base_url) separator = "&" if "?" in api_base else "?" completion_args["api_base"] = f"{api_base}{separator}GroupId={self.group_id}" ``` This matches MiniMax's official API format (as documented by LlamaFactory): ```bash curl --location 'https://api.minimaxi.chat/v1/text/chatcompletion?GroupId=你的GroupId' \ --header 'Authorization: Bearer 你的API_Key' ``` ## Test plan - [ ] Verify MiniMax API calls work with GroupId query parameter - [ ] Verify backward compatibility for other providers 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 11:54:49 +08:00
Preston Percival	e8f19aa338	feat(graphrag): fix merge concurrency and add resume-from-checkpoint (#14238 ) This PR addresses three related GraphRAG reliability issues that together allow long-running GraphRAG tasks (10+ hours of LLM extraction) to be resumed after a crash or pause without re-doing completed work. It builds on #14096 (per-doc subgraph cache) and extends the same idea to the resolution and community-detection phases. Fixes #14236. ## 1. Fix concurrent merge crash Long GraphRAG runs would crash near the end of entity resolution with: ``` RuntimeError: dictionary keys changed during iteration ``` in `Extractor._merge_graph_nodes`. Two changes: - `rag/graphrag/general/extractor.py`: snapshot `graph.neighbors(node1)` via `list(...)` before iterating, so concurrent `add_edge` / `remove_node` mutations on the shared `nx.Graph` cannot invalidate the iterator. Also tracks each redirected neighbour in `node0_neighbors` so a later merged node sharing the same external neighbour takes the edge-merge branch instead of overwriting via `add_edge`. - `rag/graphrag/entity_resolution.py`: serialize the merge step with a dedicated `asyncio.Semaphore(1)`. `nx.Graph` is not thread-safe and concurrent merges on overlapping neighbourhoods can produce incorrect results even with the snapshot fix. ## 2. Don't wipe partial graph on pause Previously the pause / cancel UI path called `settings.docStoreConn.delete({"knowledge_graph_kwd": [...]}, ...)`, destroying every subgraph, entity, relation, and graph row. Re-triggering then started GraphRAG from scratch even though #14096 had already added `load_subgraph_from_store`. After main was merged in (which deleted `api/apps/kb_app.py` per #14394), the pause path now lives on the new REST surface `DELETE /v1/datasets/<id>/<index_type>`: - `api/apps/services/dataset_api_service.py`: `delete_index` accepts a `wipe: bool = True` parameter. When `False` the doc-store rows and GraphRAG phase markers are left intact and only the running task is cancelled. Default preserves historical behaviour. - `api/apps/restful_apis/dataset_api.py`: parses `?wipe=false\|0\|no\|off` from the query string and forwards it. - `web/src/utils/api.ts` + `web/src/services/knowledge-service.ts`: `unbindPipelineTask` appends `?wipe=false` when explicitly false. - The GraphRAG pause action in `web/src/pages/dataset/dataset/generate-button/hook.ts` passes `wipe: false` for `KnowledgeGraph`; raptor is unchanged. UX impact: the pause icon next to a running GraphRAG task no longer wipes graph data. The only path that still wipes is the explicit Delete action in `GenerateLogButton` (trash icon behind a confirmation modal). ## 3. Phase-completion markers (`rag/graphrag/phase_markers.py`) A small Redis-backed marker layer at `graphrag:phase:{kb_id}:{resolution_done\|community_done}` (7-day TTL). `run_graphrag_for_kb` consults the markers on entry and skips phases that already completed in a prior run. Markers are cleared automatically when: - new docs are merged into the graph (which invalidates prior resolution and community results), - `delete_index` wipes the graph, or - `delete_knowledge_graph` is called. Redis failures never block a run -- markers are an optimization, not a gate. ## 4. Idempotent community detection `extract_community` previously did `delete-then-insert` on `community_report` rows; a crash mid-insert left the dataset with no reports. Now report IDs are derived deterministically from `(kb_id, community.title)`, the existing report IDs are snapshotted before insert, new rows are written, then only stale rows are pruned. A failure at any step leaves either the prior or the new report set intact -- never a partial mix. ## 5. Tunable doc-store insert pipeline The GraphRAG insert loop in `rag/graphrag/utils.py` and the `community_report` insert in `rag/graphrag/general/index.py` were both hardcoded to `es_bulk_size = 4` and ran strictly sequentially. On a real KB this meant 1077 chunks took ~21 minutes for a 100-chunk slice -- pure round-trip overhead. - New `insert_chunks_bounded()` helper in `rag/graphrag/utils.py` batches inserts via a bounded `asyncio.Semaphore`. Same retry / timeout semantics as the prior loop. - Defaults: 64 docs per batch, 4 batches in flight (matches the regular ingest pipeline in `document_service.py`). Tunable per-deployment via `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`. - Both `set_graph` and `extract_community` now use the helper. This dropped the same 1077-chunk insert from minutes to seconds in local testing without measurable extra pressure on Infinity (total in-flight docs ≤ `BULK_SIZE × CONCURRENCY` = 256 by default). ## Tests - `test/unit_test/rag/graphrag/test_merge_graph_nodes.py` (3 tests): dense neighbourhood merge, neighbour-snapshot regression, concurrent serialized merges. - `test/unit_test/rag/graphrag/test_phase_markers.py` (4 tests): set/has round-trip, kb-scoped clear, no-op on empty input, graceful Redis failure. - `test/testcases/test_web_api/test_dataset_management/test_dataset_sdk_routes_unit.py`: new `test_delete_index_wipe_flag_unit` covers `wipe=false` for both GraphRAG and raptor on the new REST route, and confirms the default still wipes and clears phase markers. ## Compatibility - Backward compatible: tasks queued before this change behave identically (default `wipe=true`, no markers expected). - No schema/migration changes; all new state lives in Redis. - New optional REST query param `wipe` on `DELETE /v1/datasets/<id>/<index_type>`. - New optional env vars `GRAPHRAG_INSERT_BULK_SIZE` and `GRAPHRAG_INSERT_CONCURRENCY`; defaults preserve safe behaviour. ## Example of resume Screenshot below shows a test resuming knowledge graph generation after applying the concurrency fix and re-deploying. <img width="521" height="677" alt="image" src="https://github.com/user-attachments/assets/9ef0d405-cbb3-420d-a1a1-e51f3e7e9b7a" /> ### Type of change - [X] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-05-06 15:01:01 +08:00
Idriss Sbaaoui	38f6484e98	Fix OpenDataLoader naive parsing by normalizing `@OpenDataLoader` and filtering unsupported parser kwargs (#14581 ) ### What problem does this PR solve? This PR fixes a bug where `layout_recognize="<name>@OpenDataLoader"` was misrouted and then failed during parsing in the naive parser path. It now routes correctly to OpenDataLoader and avoids passing unsupported arguments that caused runtime errors. fixes #14572 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-05-06 15:00:55 +08:00
euvre	8269fa01b4	Fix AttributeError when appending non-streaming tool calls to chat history in Agentic Agent (#14456 ) ### What problem does this PR solve? Fix #14340 ## Problem Description When using an Agentic Agent (not Workflow) with one or more Retrieval tools (e.g., Dataset Retrieval + Memory Retrieval), the agent silently returns an empty response (`agent_response: ""`) after hanging for several minutes. The server logs show: ``` AttributeError: 'ChatCompletionMessageToolCall' object has no attribute 'index' ``` This error propagates as a `GENERIC_ERROR`, causing the canvas to return an empty response. The subsequent Memory save task then receives the empty `agent_response` and logs: ``` Document for referred_document_id XXXX not found ``` ## Reproduction Steps 1. Set `DOC_ENGINE=infinity` (or `elasticsearch` — the engine itself is not the root cause). 2. Create a blank Agentic Agent (not a Workflow). 3. Add two Retrieval tools to the Agent node: - `Retrieval_DS` → Dataset (Knowledge Base) - `Retrieval_Mem` → Memory component 4. Add a Message node with Save to Memory enabled. 5. Launch the agent and send any message (e.g., "hola"). 6. The agent hangs and returns an empty response. ## Root Cause Analysis The crash occurs in `_append_history` and `_append_history_batch` inside `rag/llm/chat_model.py`. These methods directly access `.index` on tool call objects: ```python # _append_history_batch { "index": tc.index, # <-- crashes here ... } ``` However, non-streaming LLM responses (`stream=False`) return `ChatCompletionMessageToolCall` objects, which do not have an `index` field according to the OpenAI API specification. The `index` field only exists on `ChoiceDeltaToolCall` objects returned in streaming responses (`stream=True`). When the agentic agent triggers an internal `full_question` call (used to compress multi-turn conversation history), the request is incorrectly routed through `async_chat_with_tools` because `is_tools=True` is set at the `LLMBundle` level. If the LLM decides to emit `tool_calls` during this auxiliary request, the code enters the non-streaming tool loop and crashes when trying to append history. ## Fix Replaced all direct `.index` accesses with `getattr(..., "index", None)` for safe, backward-compatible access: \| Method \| File \| Line \| Change \| \|--------\|------\|------\|--------\| \| `_append_history` \| `rag/llm/chat_model.py` \| ~L304 \| `tool_call.index` → `getattr(tool_call, "index", None)` \| \| `_append_history_batch` \| `rag/llm/chat_model.py` \| ~L332 \| `tc.index` → `getattr(tc, "index", None)` \| \| `_append_history` \| `rag/llm/chat_model.py` \| ~L1467 \| `tool_call.index` → `getattr(tool_call, "index", None)` \| \| `_append_history_batch` \| `rag/llm/chat_model.py` \| ~L1496 \| `tc.index` → `getattr(tc, "index", None)` \| ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) Signed-off-by: noob <yixiao121314@outlook.com>	2026-05-06 14:39:40 +08:00
buua436	5672be0652	Feat: add IMAP deleted document sync (#14539 ) ### What problem does this PR solve? add IMAP deleted document sync ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-05-06 14:06:46 +08:00
NeedmeFordev	89961962c0	feat(dingtalk-ai-table): support deleted-file sync via slim snapshot (#14525 ) ### What problem does this PR solve? Incremental DingTalk AI Table (Notable) sync did not reconcile rows removed on the remote side with documents already in the knowledge base. This follows the coordinated datasource work in #14362 (“sync deleted files”). This PR adds a full slim snapshot (`retrieve_all_slim_docs_perm_sync`) that lists current record IDs for all sheets without building document blobs, using the same logical document IDs as full ingest (`dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}`). When `sync_deleted_files` is enabled on incremental runs, `DingTalkAITable._generate` returns `(document_generator, file_list)` so `SyncBase` can run `cleanup_stale_documents_for_task` and remove KB rows that no longer exist remotely. Design notes: - `_document_id` centralizes the ID string so slim snapshots and `_convert_record_to_document` stay aligned with `hash128(doc.id)` semantics used during ingestion/cleanup. - `end_ts` is captured before building `file_list`, then `poll_source` uses the same upper bound (consistent with other Dropbox-style connectors). - `batch_size` from connector config is coerced to a positive `int` before constructing the connector. - Slim snapshot failures are caught in `_generate`; `file_list` is set to `None` so cleanup is skipped rather than running on partial/error state. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### Files changed (summary) \| Area \| Change \| \|------\|--------\| \| `common/data_source/dingtalk_ai_table_connector.py` \| `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `_document_id` shared with document conversion \| \| `rag/svr/sync_data_source.py` \| `DingTalkAITable._generate`: slim snapshot + tuple return; `batch_size` validation; shared `end_ts` with `poll_source` \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `syncDeletedFiles` for DingTalk AI Table in `DataSourceFeatureVisibilityMap` \| Closes / relates to: #14362	2026-05-06 14:06:23 +08:00
Attili-sys	24af0875e5	Feat/configurable metadata display (#13464 ) ### What problem does this PR solve? Currently, RAGFlow's Search and Chat interfaces display only raw vectorized text chunks during retrieval, without contextual information about their source documents. Users cannot see document titles, page numbers, upload dates, or custom metadata fields that would help them understand and trust the retrieved results. This PR introduces an optional metadata display feature that enriches retrieved chunks with document-level metadata in both the Search tab and Chatbot interface. Key improvements: - Search results: Display document metadata as styled badges beneath chunk snippets - Chat citations: Show metadata in citation popovers and reference lists for better source context - LLM context: Metadata is injected into the LLM prompt to enable more accurate, citation-aware responses - External API support: Applications using RAGFlow's SDK retrieval endpoints (`/v1/retrieval`, `/v1/searchbots/retrieval_test`) can opt-in via request parameters - User control: Multi-select dropdown UI allows users to choose which metadata fields to display Implementation approach: - ✅ Reuses existing `DocMetadataService` infrastructure (no new database tables or indices) - ✅ Settings stored in existing JSON configuration fields (`search_config.reference_metadata`, `prompt_config.reference_metadata`) - ✅ No database migrations required - ✅ Disabled by default (fully opt-in and backward-compatible) - ✅ Dynamic metadata field selection populated from actual document metadata keys - ✅ Fixed critical bug where Python's builtin `set()` was shadowed by a route handler function Modified endpoints (all backward-compatible): - `POST /v1/retrieval` (Public SDK) - `POST /v1/searchbots/retrieval_test` (Searchbots) - `POST /v1/chunk/retrieval_test` (UI/Internal) - Chat completions endpoints (via `extra_body.reference_metadata` or `prompt_config`) ### Type of change - [x] New Feature (non-breaking change which adds functionality) ###Images - <img width="879" height="1275" alt="image" src="https://github.com/user-attachments/assets/95b2d731-31ae-45a1-b081-bf5893f52aeb" /> <br><br> <br><br> <img width="1532" height="362" alt="image" src="https://github.com/user-attachments/assets/9cebc65b-b7a7-459f-b25e-3b13fa9b638e" /> <br><br> <br><br> <img width="2586" height="1320" alt="image" src="https://github.com/user-attachments/assets/2153d493-d899-461f-a7a9-041391e07776" /> --------- Co-authored-by: Cursor Agent <cursoragent@cursor.com> Co-authored-by: Attili-sys <Attili-sys@users.noreply.github.com> Co-authored-by: Ahmad Intisar <ahmadintisar@Ahmads-MacBook-M4-Pro.local>	2026-04-30 23:13:27 +08:00
Magicbook1108	5fd4579a2f	Fix: sync data source empty list (#14530 ) ### What problem does this PR solve? Fix: sync data source empty list ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 18:56:43 +08:00
bitloi	a69e0c73c7	feat(rss): support deleted-file sync (#14493 ) ### What problem does this PR solve? Partially addresses #14362. This PR enables syncing deleted files for RSS data sources. Previously, RSS incremental sync only returned feed entries whose timestamps were inside the poll window. If an entry was removed from the RSS feed, RAGFlow had no full current RSS snapshot to pass into the shared stale-document cleanup path, so the deleted remote entry could remain in the knowledge base. This PR: - adds `retrieve_all_slim_docs_perm_sync()` to `RSSConnector` - reuses the same `rss:<md5(stable_key)>` document ID derivation used by normal RSS ingest - returns `(document_generator, file_list)` for incremental RSS sync when `sync_deleted_files` is enabled - captures the poll end timestamp before snapshot/poll so cleanup does not race against the same sync window - adds start/end logs around RSS slim snapshot collection - exposes the deleted-file sync toggle for RSS in the data source UI Per maintainer request on related datasource PRs, this PR contains no test-case changes. Local verification was run with an external script. Validation: - `uv run ruff check common/data_source/rss_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` - `git diff --check` - `uv run python /tmp/verify_rss_deleted_sync.py --repo /root/74/ragflow` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 18:56:13 +08:00
NeedmeFordev	bedf9592ef	feat(webdav): support deleted-file sync via slim snapshot (#14491 ) ## What problem does this PR solve? Incremental WebDAV sync only ingested files whose modification time fell inside the poll window; documents removed on the WebDAV server were never removed from the knowledge base. This aligns with [#14362](https://github.com/infiniflow/ragflow/issues/14362) (coordinated datasource “sync deleted files” work). This PR adds a full-tree slim snapshot (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote paths without downloading file contents, using the same logical document IDs as full ingest (`webdav:{base_url}:{file_path}`). When `sync_deleted_files` is enabled on incremental runs, sync returns `(document_generator, file_list)` so `SyncBase` runs `cleanup_stale_documents_for_task` and removes KB rows no longer present remotely. Design notes: - `_list_files_recursive` gains `filter_by_mtime`: snapshot passes `filter_by_mtime=False` (full tree under `remote_path`); `poll_source` keeps mtime-window filtering as before. - Slim snapshot applies the same extension and `size_threshold` rules as `_yield_webdav_documents` so retain IDs match what would be indexed. - `end_ts` is captured before building `file_list`, then `poll_source` uses the same upper bound (consistent with Dropbox-style connectors). ## Type of change - [x] New Feature (non-breaking change which adds functionality) ## Files changed \| Area \| Change \| \|------\|--------\| \| `common/data_source/webdav_connector.py` \| `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `filter_by_mtime` on `_list_files_recursive` \| \| `rag/svr/sync_data_source.py` \| WebDAV `_generate`: `file_list` + tuple return; pass `batch_size` from connector config \| \| `web/src/pages/user-setting/data-source/constant/index.tsx` \| `syncDeletedFiles` for WebDAV in `DataSourceFeatureVisibilityMap` \|	2026-04-30 17:26:27 +08:00
Wang Qi	f45ce00347	Not allow to sort by id (#14526 ) ### What problem does this PR solve? id as "text", not a "keyword", order by it will cause error. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 14:52:43 +08:00
bitloi	17eda04b8d	feat(zendesk): support deleted-file sync (#14487 ) ### What problem does this PR solve? Refs #14362. This PR enables syncing deleted files for Zendesk data sources. Previously, Zendesk incremental sync never returned a slim remote snapshot to the shared stale-document cleanup path, so deleted remote Zendesk records could remain in RAGFlow. The existing Zendesk slim snapshot also included records that ingestion intentionally skips, such as draft articles, articles without bodies, skipped-label articles, empty-body articles, and tickets with `status == "deleted"`. This PR: - exposes the deleted-file sync option for Zendesk in the data source UI - returns Zendesk slim snapshots during incremental sync when `sync_deleted_files` is enabled - reuses Zendesk indexability rules so cleanup compares against the same records ingestion can materialize - adds start/end logs around Zendesk slim snapshot collection for operational visibility Per maintainer request, this PR contains no test-case changes. Manual verification recording will be provided separately. Validation: - `uv run ruff check common/data_source/zendesk_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [ ] Bug Fix (non-breaking change which fixes an issue) - [x] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-04-30 14:44:05 +08:00
bitloi	8f75e52bbf	feat(asana): support deleted-file sync (#14468 ) ### What problem does this PR solve? Partially addresses #14362. Adds deleted-file sync support for the Asana data source. Asana already indexes task attachments as documents, but it did not provide the slim document snapshot required by stale-document reconciliation, and the sync wrapper never returned a `file_list` for cleanup. This PR: - adds `retrieve_all_slim_docs_perm_sync()` to `AsanaConnector` - builds slim IDs with the same `asana:{task_id}:{attachment_gid}` format used by indexed documents - avoids downloading attachment blobs during the snapshot - aborts the snapshot if Asana API errors occur, preventing partial snapshots from deleting valid local docs - captures the incremental poll end time before snapshotting and makes `poll_source()` respect that boundary - exposes the deleted-file sync toggle for Asana in the data source UI Per maintainer request, this PR contains no test-case changes. Manual verification recording will be provided separately. Validation: - `uv run ruff check common/data_source/asana_connector.py rag/svr/sync_data_source.py` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` - `git diff --check` ### Type of change - [x] New Feature	2026-04-30 14:41:36 +08:00
orbisai0security	e992fe39b2	fix: the oceanbase database connector constructs sql... in ob_conn.py (#14470 ) ## Summary Fix critical severity security issue in `rag/utils/ob_conn.py`. ## Vulnerability \| Field \| Value \| \|-------\|-------\| \| ID \| V-003 \| \| Severity \| CRITICAL \| \| Scanner \| multi_agent_ai \| \| Rule \| `V-003` \| \| File \| `rag/utils/ob_conn.py:691` \| Description: The OceanBase database connector constructs SQL WHERE clauses by directly embedding user-controlled filter expressions using Python f-strings at lines 726, 777, 781, 787, 793, 821, and 827. No parameterization or allowlist validation is applied before the expressions are incorporated into live SQL queries. This is the most critical vulnerability in the codebase because it directly exposes the RAG knowledge base — the platform's core business asset — to complete compromise. ## Changes - `rag/utils/ob_conn.py` ## Verification - [x] Build passes - [x] Scanner re-scan confirms fix - [x] LLM code review passed --- Automated security fix by [OrbisAI Security](https://orbisappsec.com)	2026-04-30 14:25:17 +08:00
Magicbook1108	bb3b99f0a5	Feat: add button for remove header & footer in pipeline (#14486 ) ### What problem does this PR solve? Feat: add button for remove header & footer in pipeline ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-30 12:30:41 +08:00
NeedmeFordev	2932b65da6	feat(seafile): support deleted-file sync via slim snapshot (#14499 ) ### What problem does this PR solve? Incremental Seafile sync only ingests files whose modification time falls in the poll window; documents removed in Seafile were never removed from the knowledge base. This contributes to [#14362](https://github.com/infiniflow/ragflow/issues/14362) (datasource “sync deleted files” coordination). This PR adds a slim snapshot (`retrieve_all_slim_docs_perm_sync`) that enumerates current remote file IDs without downloading content, using the same logical IDs as full ingest (`seafile:{repo_id}:{file_id}`). When `sync_deleted_files` is enabled on incremental runs, `SeaFile._generate` returns `(document_generator, file_list)` so `SyncBase` can run `cleanup_stale_documents_for_task` and remove stale KB documents. ### Type of change - [x] New Feature (non-breaking change which adds functionality) ### What changed - `common/data_source/seafile_connector.py`: `SeaFileConnector` implements `SlimConnectorWithPermSync`; `_list_files_recursive(..., filter_by_mtime=...)` supports full-tree listing for snapshots; `retrieve_all_slim_docs_perm_sync()` reuses the same library/root scan as ingest and applies the same size ceiling; logging for snapshot start/end and counts. - `rag/svr/sync_data_source.py`: `SeaFile._generate` validates `batch_size`, captures `end_ts` before snapshot + `poll_source`, wraps slim retrieval in `try`/`except` ( `file_list = None` on failure so ingest continues), returns `(generator, file_list)`. - `web/src/pages/user-setting/data-source/constant/index.tsx`: `syncDeletedFiles` for Seafile in `DataSourceFeatureVisibilityMap`.	2026-04-30 12:05:12 +08:00
Idriss Sbaaoui	9075872435	Fix: Manual/Naive outline tuple unpack crash (#14518 ) ### What problem does this PR solve? This fixes a crash in Manual and Naive parsing when PDF outlines include page numbers as a third tuple value. It makes outline unpacking accept extra values so parsing no longer fails. fixes #14411 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-30 11:55:02 +08:00
sapienza yoan	811e9826d0	perf: avoid O(n²) array growth in embedding accumulation (#14369 ) ### What problem does this PR solve? Both tokenizer (`rag/flow/tokenizer/tokenizer.py`) and `BuiltinEmbed.encode` (`rag/llm/embedding_model.py`) currently accumulate embedding batches via `np.concatenate` inside the per-batch loop. `np.concatenate` allocates a new array and copies all existing data on every call, so accumulating N batches is O(N²) in both time and peak memory. Replacing the incremental concatenate with a list-of-batches + a single `np.vstack` at the end gives O(N) total work. For tokenizer the title-vector broadcast `np.concatenate([vts[0]] * N)` is also replaced by `np.tile`, which does the same job with a single contiguous allocation instead of building a Python list of references. This is purely a CPU/memory optimisation — output shape and dtype are unchanged. Measured impact grows with document size: - 1k chunks (batch 512, 2 iters): ~negligible - 10k chunks (20 iters): ~10× speedup on this stage - 100k chunks (195 iters): ~100× speedup, and peak RAM drops from O(N) extra to near-zero ### Type of change - [x] Performance Improvement Co-authored-by: yoan sapienza <Yoan Sapienza yoan.sapienza@orange.fr Yoan Sapienza zappy@macbookpro.home>	2026-04-30 11:00:10 +08:00
FuturMix	2548c28d65	feat: add FuturMix as model provider (#14419 ) ## Summary Add [FuturMix](https://futurmix.ai) as a new model provider. FuturMix is an OpenAI-compatible unified AI gateway that provides access to 22+ models (GPT, Claude, Gemini, DeepSeek, and more) through a single API endpoint and key. - API Base: `https://futurmix.ai/v1` (OpenAI-compatible) - Supported capabilities: Chat, Embedding, Image2Text, TTS, Speech2Text, Rerank ### Changes \| File \| Change \| \|------\|--------\| \| `rag/llm/__init__.py` \| Add `FuturMix` to `SupportedLiteLLMProvider` enum, `FACTORY_DEFAULT_BASE_URL`, and `LITELLM_PROVIDER_PREFIX` \| \| `rag/llm/chat_model.py` \| Add `FuturMixChat(Base)` — follows Astraflow/Avian pattern \| \| `rag/llm/embedding_model.py` \| Add `FuturMixEmbed(OpenAIEmbed)` — follows Astraflow pattern \| \| `rag/llm/cv_model.py` \| Add `FuturMixCV(GptV4)` — follows SILICONFLOW/OpenRouter pattern \| \| `rag/llm/tts_model.py` \| Add `FuturMixTTS(OpenAITTS)` — follows CometAPI/DeerAPI pattern \| \| `rag/llm/sequence2txt_model.py` \| Add `FuturMixSeq2txt(GPTSeq2txt)` — follows StepFun pattern \| \| `rag/llm/rerank_model.py` \| Add `FuturMixRerank(OpenAI_APIRerank)` \| \| `conf/llm_factories.json` \| Add factory config with 8 chat, 2 embedding, 1 image2text, 2 TTS, 1 speech2text models \| \| `docs/guides/models/supported_models.mdx` \| Add FuturMix to supported models table \| ### Models included - Chat: claude-sonnet-4-20250514, claude-3.5-haiku, gpt-4o, gpt-4o-mini, gemini-2.5-flash, gemini-2.0-flash, deepseek-chat, deepseek-reasoner - Embedding: text-embedding-3-small, text-embedding-3-large - Image2Text: gpt-4o - TTS: tts-1, tts-1-hd - Speech2Text: whisper-1 ## Test plan - [ ] Verify FuturMix appears in the model provider list in RAGFlow UI - [ ] Configure FuturMix with API key and test chat completion - [ ] Test embedding model with document indexing - [ ] Test image2text with a sample image 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-30 10:59:37 +08:00
Magicbook1108	de8c6ad0f3	Feat: enable sync deleted file for Discord (#14451 ) ### What problem does this PR solve? Feat: enable sync deleted file for Discord ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:40 +08:00
bitloi	2bc8c6d35e	feat(dropbox): support deleted-file sync (#14476 ) ### What problem does this PR solve? Partially addresses #14362 by adding deleted-file sync support for the Dropbox data source. Dropbox previously did not provide the slim current-file snapshot required by stale document reconciliation, and its sync runner returned only document batches. As a result, enabling deleted-file sync could not remove local documents that had been deleted from Dropbox. This PR: - Adds `retrieve_all_slim_docs_perm_sync()` to `DropboxConnector`. - Reuses Dropbox metadata traversal to collect current remote file IDs without downloading file contents. - Wires incremental Dropbox sync to return `(document_generator, file_list)` when `sync_deleted_files` is enabled. - Enables the deleted-file sync toggle for Dropbox in the data source settings UI. - Adds regression coverage for slim snapshots, nested folders, paginated listings, duplicate filenames, and full reindex behavior. Tests: - `uv run pytest test/unit_test/common/test_dropbox_connector.py -q` - `uv run pytest test/unit_test/rag/test_sync_data_source.py -q` - `uv run pytest test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py -q` - `uv run ruff check common/data_source/dropbox_connector.py rag/svr/sync_data_source.py test/unit_test/common/test_dropbox_connector.py test/unit_test/rag/test_sync_data_source.py` - `./node_modules/.bin/eslint src/pages/user-setting/data-source/constant/index.tsx` ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:05:11 +08:00
Magicbook1108	db1a73b255	Feat: enable sync deleted files in gitlab (#14481 ) ### What problem does this PR solve? Feat: enable sync deleted files in gitlab ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 19:04:10 +08:00
Magicbook1108	e0b3070012	Feat: enable sync deleted files for Gmail && fix google drive issues (#14462 ) ### What problem does this PR solve? Feat: enable sync deleted files for Gmail && fix google drive issues ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Co-authored-by: bill <yibie_jingnian@163.com> Co-authored-by: balibabu <assassin_cike@163.com>	2026-04-29 17:03:56 +08:00
buua436	c08ced09a7	Fix: add retrieval fallback comments (#14457 ) ### What problem does this PR solve? add retrieval fallback comments ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 14:44:31 +08:00
buua436	a7ce1b1677	Fix: prune deleted doc chunks from retrieval (#14454 ) ### What problem does this PR solve? prune deleted doc chunks from retrieval ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-29 13:03:09 +08:00
Magicbook1108	3b7a6eaa6c	Feat: sync deleted files in Bitbucket (#14450 ) ### What problem does this PR solve? Feat: sync deleted files in Bitbucket ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-29 11:29:17 +08:00
Paras Sondhi	74fa54f122	feat(google-drive): optimize memory payload and enable sync deletion (#14372 ) Addresses the Google Drive integration for #14362 This PR completely overhauls the Google Drive sync logic to accurately detect remote deletions, while drastically reducing the memory footprint during the snapshot phase. ### What changed under the hood: * Killed the memory bloat: Swapped out the massive document dictionary objects for a lightweight `collections.namedtuple` (`SlimDoc = namedtuple('SlimDoc', ['id'])`). This prevents RAM spikes during `retrieve_all_slim_docs_perm_sync` on massive enterprise drives. * Flawless downstream integration: The `SlimDoc` object relies on simple duck typing. It perfectly delivers the `.id` attribute required by `ConnectorService.cleanup_stale_documents_for_task`, meaning your core `hash128` vector cleanup logic runs natively without modification. * Fixed the Shared Drive blindspot: The standard API query was missing team folders. Injected the `corpora="allDrives"` and `includeItemsFromAllDrives=True` override flags so the connector now accurately maps state across both personal workspaces and organizational Shared Drives. ### Testing: Isolated the Google API retrieval logic locally to prove the `SlimDoc` mapping works and correctly registers state drops when a file is trashed remotely. ### Type of change - [x] New Feature (non-breaking change which adds functionality) - [x] Performance Improvement	2026-04-29 10:04:36 +08:00
Stephen Hu	345bec812d	refactor: improve QwenRerank logic (#14388 ) ### What problem does this PR solve? improve QwenRerank logic ### Type of change - [x] Refactoring	2026-04-28 20:17:34 +08:00
Magicbook1108	0d18b293f5	Fix: enable sync deleted file in airtable (#14438 ) ### What problem does this PR solve? Fix: enable sync deleted file in airtable ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 20:09:08 +08:00
buua436	e6e80041f5	Fix: agent toolcall null response & schema validation & DeepSeek think history (#14425 ) ### What problem does this PR solve? agent toolcall null response & schema validation & DeepSeek think history ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 17:09:08 +08:00
Magicbook1108	18fbfafca6	Feat: enable sync deleted files for more connectors (#14353 ) ### What problem does this PR solve? Feat: enable sync delted files for connectors ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-04-28 15:07:14 +08:00
Idriss Sbaaoui	2a37562791	Fix manual naive parser position extraction fallback (#14420 ) ### What problem does this PR solve? This PR fixes a regression where Manual pipeline + Naive (Plain Text) PDF parsing crashed with `AttributeError: 'PlainParser' object has no attribute 'extract_positions'` in `rag/app/manual.py`. fixes #14411 ### Type of change: - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-28 14:21:30 +08:00
Jack	872ff08304	Fix: add executor.shutdown (#14403 ) ### What problem does this PR solve? Add executor shutdown in finally clause to free resources. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 22:38:43 +08:00
Idriss Sbaaoui	4303be223f	Fix metadata parsing regression for upgraded v0.24 datasets (#14383 ) ### What problem does this PR solve? This PR fixes issue #14371 where file parsing failed after upgrading from v0.24.0 to v0.25.0, because metadata config could be a JSON Schema object but was handled like a list and later caused `KeyError: 'properties'`. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-04-27 16:18:06 +08:00
euvre	2846a93998	Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382 ) ### What problem does this PR solve? Fixes #14196 ## Problem When using DeepDOC to parse large PDFs (over 1000 pages), the parser silently truncated processing at 300 pages due to a hardcoded default `page_to=299` in `RAGFlowPdfParser.__images__()`. This caused: - Errors on pages beyond the limit - Poor image quality as the parser attempted to compensate with missing page data - Inconsistent chunk splitting between full PDF imports and partial imports Additionally, the codebase scattered magic numbers (`299`, `600`, `10000`, `100000`, `100000000`, `10000000000`, `10*9`) across 22 files as sentinel values for "parse all pages", making future maintenance error-prone. ## Root Cause ```python # deepdoc/parser/pdf_parser.py (before) def __images__(self, fnm, zoomin=3, page_from=0, page_to=299, callback=None): # Only the first 300 pages were rendered; everything beyond was silently dropped ``` While most callers in `rag/app/.py` correctly passed `to_page=100000`, the base class `RAGFlowPdfParser.__call__()` and `parse_into_bboxes()` invoked `__images__` without forwarding `page_from`/`page_to`, falling back to the restrictive default of 299. ## Solution ### 1. Define constants in `common/constants.py` ```python MAXIMUM_PAGE_NUMBER = 100000 # Used by the parsing layer MAXIMUM_TASK_PAGE_NUMBER = MAXIMUM_PAGE_NUMBER * 1000 # Used by the task/DB layer ``` ### 2. Replace all hardcoded sentinel values \| Layer \| Files Changed \| Old Values \| New Value \| \|---\|---\|---\|---\| \| Deepdoc parsers \| `pdf_parser.py`, `mineru_parser.py`, `docling_parser.py`, `opendataloader_parser.py`, `paddleocr_parser.py`, `docx_parser.py` \| `299`, `600`, `109`, `100000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Chunk parsers \| `naive.py`, `book.py`, `qa.py`, `one.py`, `manual.py`, `paper.py`, `presentation.py`, `laws.py`, `resume.py`, `email.py`, `table.py` \| `100000`, `10000`, `10000000000` \| `MAXIMUM_PAGE_NUMBER` \| \| Task/DB layer** \| `db_models.py`, `task_service.py`, `document_service.py`, `file_service.py` \| `100000000` \| `MAXIMUM_TASK_PAGE_NUMBER` \| ### 3. Fix `parse_into_bboxes()` missing parameters Added `from_page`/`to_page` parameters to `parse_into_bboxes()` so that the `rag/flow/parser/parser.py` DeepDOC path no longer falls back to the restrictive default. ## Files Changed (22) - `common/constants.py` - `deepdoc/parser/pdf_parser.py` - `deepdoc/parser/mineru_parser.py` - `deepdoc/parser/docling_parser.py` - `deepdoc/parser/opendataloader_parser.py` - `deepdoc/parser/paddleocr_parser.py` - `deepdoc/parser/docx_parser.py` - `rag/app/naive.py` - `rag/app/book.py` - `rag/app/qa.py` - `rag/app/one.py` - `rag/app/manual.py` - `rag/app/paper.py` - `rag/app/presentation.py` - `rag/app/laws.py` - `rag/app/resume.py` - `rag/app/email.py` - `rag/app/table.py` - `api/db/db_models.py` - `api/db/services/task_service.py` - `api/db/services/document_service.py` - `api/db/services/file_service.py` ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 14:57:20 +08:00
yuch85	0d87cecae2	feat: persist PDF bookmark outline as document metadata (#13287 ) ## Summary PDF files often contain a bookmark/outline tree (table of contents built into the file by the authoring tool). RAGFlow's `pdf_parser.outlines` already extracts these `(title, depth)` tuples via pypdf, but they are used ephemerally during chunking (`manual` parser uses them for hierarchy detection) and then discarded. This PR persists the outline as `doc.meta_fields["outline"]` — a JSON array of `{"title": str, "depth": int}` objects — so downstream features can use the structural information. ### Why this matters - Complementary to `toc_extraction` — the existing `toc_extraction` feature uses LLM calls to generate a TOC and only works for the `naive` parser. The raw PDF outline is free (already extracted by pypdf), works for all parsers, and captures the author's original document structure. - Document navigation — frontends can render a clickable TOC from the outline - Entity extraction — the outline provides a structural map for identifying document sections and key topics - Search result context — knowing which section a chunk belongs to helps users evaluate relevance ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/app/naive.py` \| Attach `pdf_parser.outlines` as `__outline__` on first chunk dict \| ~7 \| \| `rag/app/manual.py` \| Same for the manual parser \| ~5 \| \| `rag/svr/task_executor.py` \| Extract `__outline__`, persist via `DocMetadataService.update_document_metadata()` \| ~12 \| ### Design decisions - Transient key pattern: The outline is passed from parser → task_executor via `__outline__` on the first chunk dict, then removed before indexing. This follows the same pattern as `metadata_obj` for LLM-generated metadata. - No schema changes: Uses the existing `meta_fields` JSON column on the document table. - Graceful degradation: If a PDF has no outline (common for scanned docs), nothing is stored. If persistence fails, it logs a warning and continues — parsing is not interrupted. ### Backward compatibility - Fully backward compatible — no existing fields, behavior, or schemas changed - PDFs without outlines are unaffected - Existing `meta_fields` data is preserved (merged, not overwritten) ## Test plan - [ ] Parse a PDF with bookmarks (e.g. any multi-chapter document), verify `meta_fields["outline"]` is populated - [ ] Parse a PDF without bookmarks, verify no errors and no outline key in meta_fields - [ ] Verify existing `meta_fields` data is preserved (not overwritten) when outline is added - [ ] Verify `manual` parser also persists outlines - [ ] Verify outline JSON structure: `[{"title": "Chapter 1", "depth": 0}, ...]` Related: #9921 (Deterministic Document Access Layer) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 11:57:06 +08:00
euvre	f3b7d55a1e	fix: handle Infinity table-not-exist error (3022) in update() methods (#14153 ) ### What problem does this PR solve? ## Summary Closes #6102 When using Infinity as the document store engine (GPU version), calling `update()` on a non-existent table throws an unhandled `InfinityException` with error code 3022 (`TABLE_NOT_EXIST`). This causes users to see a raw "3022" error when clicking on a parsed document. ## Root Cause The `update()` methods in both `rag/utils/infinity_conn.py` and `memory/utils/infinity_conn.py` call `db_instance.get_table(table_name)` without catching `InfinityException`. In contrast, other CRUD methods (`insert`, `delete`, `search`) all handle this exception gracefully: \| Method \| Handles table-not-exist? \| Behavior \| \|----------\|--------------------------\|----------\| \| `insert` \| ✅ Yes \| Auto-creates the table \| \| `search` \| ✅ Yes \| Skips the table \| \| `delete` \| ✅ Yes \| Returns 0 \| \| `update` \| ❌ No \| Crashes with 3022 \| Additionally, `api/apps/document_app.py` worked around this with a fragile string match (`"3022" in msg`) to detect the error. ## Changes - `rag/utils/infinity_conn.py`: Catch `InfinityException` in `update()`. When `TABLE_NOT_EXIST` is detected, log a warning and return `False` — consistent with `delete()`. - `memory/utils/infinity_conn.py`: Apply the same fix to its `update()` method. - `api/apps/document_app.py`: Remove the fragile `"3022"` string-matching workaround. Table-not-exist is now handled by the `if not ok` path with an improved error message. ### Type of change - [x] Refactoring --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-04-27 11:52:22 +08:00
yuch85	3ad3241ae0	feat: persist RAPTOR layer metadata on summary chunks (#13286 ) ## Summary RAPTOR's recursive clustering builds a `layers` list tracking `(start_idx, end_idx)` boundaries per level, but currently discards this information — only the flat `chunks` list is returned. This makes it impossible to distinguish leaf-level summaries from top-level ones. This PR: - Returns `(chunks, layers)` tuple from `raptor.py`'s `__call__` - Annotates each RAPTOR summary chunk with `raptor_layer_int` (1 = first summary level, 2 = summary-of-summaries, etc.) - Adds `raptor_layer_int` to `infinity_mapping.json` (Elasticsearch handles it via existing `_int` dynamic template) ### Why this matters Downstream features need to know which RAPTOR layer a summary belongs to: - Retrieving the top-level document summary* for entity extraction, search snippets, or document comparison - Filtering by abstraction level — users may want only high-level summaries or only leaf-level cluster summaries - RAPTOR recall quality — #10951 reports summaries not being recalled for definition queries; layer metadata enables targeted retrieval ### Changes \| File \| Change \| LOC \| \|------\|--------\|-----\| \| `rag/raptor.py` \| Return `(chunks, layers)` tuple \| ~3 \| \| `rag/svr/task_executor.py` \| Build `chunk_layer` mapping, set `raptor_layer_int` \| ~12 \| \| `conf/infinity_mapping.json` \| Add `raptor_layer_int` integer field \| ~1 \| ### Backward compatibility - Additive only — no existing fields or behavior changed - Existing RAPTOR chunks continue to work (they'll have `raptor_layer_int = 0` by default) - New RAPTOR chunks get layer metadata automatically ## Test plan - [ ] Parse a document with RAPTOR enabled, verify `raptor_layer_int` is set on indexed chunks - [ ] Verify `raptor_layer_int` values increase with abstraction level (layer 1 < layer 2 < ...) - [ ] Verify existing RAPTOR deletion (`delete by raptor_kwd`) still works - [ ] Verify Infinity backend accepts the new field Fixes #7488 Related: #4104, #11191, #10951 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: yuch85 <yuch85.1@gmail.com> Co-authored-by: Wang Qi <wangq8@outlook.com>	2026-04-27 10:20:46 +08:00

1 2 3 4 5 ...

1430 Commits