feat(dingtalk-ai-table): support deleted-file sync via slim snapshot (#14525)

### What problem does this PR solve?

Incremental DingTalk AI Table (Notable) sync did not reconcile rows
removed on the remote side with documents already in the knowledge base.
This follows the coordinated datasource work in #14362 (“sync deleted
files”).

This PR adds a **full slim snapshot**
(`retrieve_all_slim_docs_perm_sync`) that lists **current record IDs for
all sheets** without building document blobs, using the same logical
document IDs as full ingest
(`dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}`).
`DingTalkAITable._generate` now always returns **`(document_generator,
file_list)`**. When **`sync_deleted_files`** is enabled on incremental
runs, **`file_list`** holds the snapshot so **`SyncBase`** can run
**`cleanup_stale_documents_for_task`** and remove KB documents that no
longer exist remotely. On full reindex runs, or when the option is
disabled, **`file_list`** is **`None`**.
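
The reconciliation this enables can be pictured as a set difference between the IDs stored in the KB and the IDs in the snapshot. The sketch below is illustrative only: `find_stale_ids` and its arguments are hypothetical names, and the real `cleanup_stale_documents_for_task` may hash IDs or query the KB differently.

```python
from typing import Iterable, Optional, Set


def find_stale_ids(
    kb_doc_ids: Iterable[str],
    remote_doc_ids: Optional[Iterable[str]],
) -> Set[str]:
    """Return KB document IDs that are absent from the remote snapshot.

    Hypothetical helper. `None` means no snapshot is available, so
    nothing is reported as stale and cleanup is effectively skipped.
    """
    if remote_doc_ids is None:
        return set()
    return set(kb_doc_ids) - set(remote_doc_ids)


# Made-up IDs in the format quoted in this description.
kb_ids = [
    "dingtalk_ai_table:tbl1:sheetA:rec1",
    "dingtalk_ai_table:tbl1:sheetA:rec2",
]
snapshot = ["dingtalk_ai_table:tbl1:sheetA:rec1"]

stale = find_stale_ids(kb_ids, snapshot)
```

Under this model an empty snapshot and a missing snapshot behave very differently: an empty list marks every KB document as stale, while `None` marks none. That difference is why a failed or partial snapshot should never be passed on as a list.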

Design notes:

- **`_document_id`** centralizes the ID string so slim snapshots and
**`_convert_record_to_document`** stay aligned with
**`hash128(doc.id)`** semantics used during ingestion/cleanup.
- **`end_ts`** is captured before building **`file_list`**, then
**`poll_source`** uses the same upper bound (consistent with other
Dropbox-style connectors).
- **`batch_size`** from connector config is coerced to a positive
**`int`** before constructing the connector.
- Slim snapshot failures are caught in **`_generate`**; **`file_list`**
is set to **`None`** so cleanup is skipped rather than running on
partial/error state.
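
A minimal sketch of the shared-ID idea from the first note, assuming plain string IDs. `document_id` mirrors the format quoted above, while `slim_batches` and its dict input are hypothetical stand-ins: the connector in this PR pages through the DingTalk API and may yield slim document objects rather than strings.

```python
from typing import Dict, Iterable, Iterator, List


def document_id(table_id: str, sheet_id: str, record_id: str) -> str:
    """Build the logical ID in the format quoted in this description."""
    return f"dingtalk_ai_table:{table_id}:{sheet_id}:{record_id}"


def slim_batches(
    table_id: str,
    record_ids_by_sheet: Dict[str, Iterable[str]],
    batch_size: int,
) -> Iterator[List[str]]:
    """Yield batches of document IDs without building document bodies.

    Hypothetical stand-in for the slim snapshot: it reads record IDs
    from a dict instead of calling a remote API.
    """
    batch: List[str] = []
    for sheet_id, record_ids in record_ids_by_sheet.items():
        for record_id in record_ids:
            batch.append(document_id(table_id, sheet_id, record_id))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


batches = list(
    slim_batches("tbl1", {"sheetA": ["r1", "r2"], "sheetB": ["r3"]}, batch_size=2)
)
```

Routing both the snapshot and the full document conversion through one ID builder is what keeps the two sides comparable during cleanup; any drift in the format would make every document look stale.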

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

### Files changed (summary)

| Area | Change |
|------|--------|
| `common/data_source/dingtalk_ai_table_connector.py` | `SlimConnectorWithPermSync`, `retrieve_all_slim_docs_perm_sync`, `_document_id` shared with document conversion |
| `rag/svr/sync_data_source.py` | `DingTalkAITable._generate`: slim snapshot + tuple return; `batch_size` validation; shared `end_ts` with `poll_source` |
| `web/src/pages/user-setting/data-source/constant/index.tsx` | `syncDeletedFiles` for DingTalk AI Table in `DataSourceFeatureVisibilityMap` |

Closes / relates to: #14362
NeedmeFordev
2026-05-06 08:06:23 +02:00
committed by GitHub
parent c502001d9e
commit 89961962c0
3 changed files with 83 additions and 7 deletions


@@ -1547,10 +1547,18 @@ class DingTalkAITable(SyncBase):
         """
         Sync records from DingTalk AI Table (Notable).
         """
+        raw_batch_size = self.conf.get("batch_size", INDEX_BATCH_SIZE)
+        try:
+            batch_size = int(raw_batch_size)
+        except (TypeError, ValueError):
+            batch_size = INDEX_BATCH_SIZE
+        if batch_size <= 0:
+            batch_size = INDEX_BATCH_SIZE
         self.connector = DingTalkAITableConnector(
             table_id=self.conf.get("table_id"),
             operator_id=self.conf.get("operator_id"),
-            batch_size=self.conf.get("batch_size", INDEX_BATCH_SIZE),
+            batch_size=batch_size,
         )
         credentials = self.conf.get("credentials", {})
@@ -1562,14 +1570,36 @@ class DingTalkAITable(SyncBase):
         )
         poll_start = task.get("poll_range_start")
+        file_list = None
         if task.get("reindex") == "1" or poll_start is None:
             document_generator = self.connector.load_from_state()
             _begin_info = "totally"
         else:
+            end_ts = datetime.now(timezone.utc).timestamp()
+            if self.conf.get("sync_deleted_files"):
+                file_list = []
+                logging.info(
+                    "DingTalk AI Table: fetching slim snapshot for stale-document reconciliation "
+                    "(connector_id=%s, kb_id=%s, table_id=%s)",
+                    task["connector_id"],
+                    task["kb_id"],
+                    self.conf.get("table_id"),
+                )
+                try:
+                    for slim_batch in self.connector.retrieve_all_slim_docs_perm_sync():
+                        file_list.extend(slim_batch)
+                except Exception:
+                    logging.exception(
+                        "DingTalk AI Table slim snapshot failed; continuing without stale-document cleanup "
+                        "(connector_id=%s, kb_id=%s)",
+                        task["connector_id"],
+                        task["kb_id"],
+                    )
+                    file_list = None
             document_generator = self.connector.poll_source(
                 poll_start.timestamp(),
-                datetime.now(timezone.utc).timestamp(),
+                end_ts,
             )
             _begin_info = f"from {poll_start}"
@@ -1579,7 +1609,7 @@ class DingTalkAITable(SyncBase):
             task,
         )
-        return document_generator
+        return document_generator, file_list
 class MySQL(SyncBase):