ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Author	SHA1	Message	Date
Jack	e629c0203b	feat: add KG entity/relation/community search functions (#15689 ) ## Summary Knowledge Graph search functions for entity, relation, community report, and type-samples retrieval. Uses DocEngine.SelectFields (PR #15684) for KG-specific fields. ### Functions \| Function \| Description \| \|----------\|-------------\| \| `SearchKGEntities` \| Hybrid search over KG entities (dense + text + fusion) \| \| `SearchKGEntitiesByTypes` \| Entity search filtered by `entity_type_kwd` \| \| `SearchKGRelations` \| Hybrid search over KG relations \| \| `SearchKGCommunityReports` \| Community report search by entity names \| \| `SearchKGTypeSamples` \| Type→entities mapping for query_rewrite \| ### Internal helpers \| Helper \| Description \| \|--------\|-------------\| \| `buildHybridExpr` \| Shared dense+text+fusion expression construction \| \| `buildKGDenseExpr` \| Wraps `Embed()` call for vector search \| \| `Parse*` \| Convert raw chunks to typed structs \| ### Testing 35 tests (pure function + mock integration) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 13:23:04 +08:00
Haruko386	4b2af1347c	feat[Go]: implement Agent/Workflow PUT /api/v1/agents/<canvas_id>/tags (#15641)	2026-06-05 13:22:23 +08:00
buua436	71649db3b0	fix: prevent duplicated post-think text (#15651 ) ### What problem does this PR solve? This fixes duplicated post-think text in streamed chat responses. When the model emits text immediately after `</think>`, the stream state now advances its cursor correctly so the same visible prefix is not emitted twice. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-05 13:21:26 +08:00
Jack	f6ff862a24	fix: restore case-insensitive contains/not contains/not in and consolidate metadata filter pipeline (#15686 ) ## Summary This PR fixes case-sensitivity regressions introduced in #15656 and consolidates the metadata filtering pipeline by removing the duplicate `applySingleCondition` adapter layer. ### Bug fixes 1. contains / not contains: restored case-insensitive matching (was lost when `applySingleCondition` was replaced by `common.MetaFilter.matchValue` which lacked `strings.ToLower`) 2. not in: restored case-insensitive matching (was lost for same reason; uses `strings.EqualFold`) 3. != with date filter values: non-date metadata values now correctly match the `≠` operator (a non-date value IS not equal to any date, but was returning false) ### Architecture 4. Removed `applySingleCondition` (65 lines) — the inline switch was a duplicate of `common.MetaFilter` logic. `ApplyMetaFilter` now converts conditions and delegates to `common.MetaFilter` once per filter set, eliminating ~25 lines of duplicate AND/OR merge logic. 5. Added `filterSet` — O(n+m) hash-map fast path for `in`/`not in` operators, replacing the O(nm) linear scan in `matchValue`. 6. Exported `NormalizeOperator`* from `common` for consistent operator alias handling. ### Cleanup 7. Removed 18 lines of dead code (`matchValue`'s `in`/`not in` branches already bypassed by `filterOut` delegation) 8. Fixed orphaned godoc comment for `convertOperator` 9. Fixed incorrect `filterSet` doc comment (claimed "matching EqualFold" but used `strings.ToLower`) 10. 
Completed `convertToMetaCondition` operator normalization documentation ### Testing - 60 tests (24 service + 36 common), all passing - New tests: `==`, `≠`, `>`, `<`, `≥`, `≤`, `empty`, `not empty` through `ApplyMetaFilter` - New tests: `<`, `≤`, `≠` through `MetaFilter`; `not-in-empty-list` through `filterSet` - All 18 `MetaFilter` tests pass; all 10 `filterSet` unit tests pass --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 12:47:55 +08:00
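The hash-set fast path for `in` / `not in` described in this commit can be pictured roughly as follows. This is a minimal Python sketch rather than the Go code from the PR: the function name `filter_set`, the `negate` flag, and the shape of `docs_by_value` (a mapping from a metadata value to the doc IDs that carry it) are assumptions made for illustration.

```python
def filter_set(docs_by_value, wanted, negate=False):
    """Return doc IDs whose metadata value is (or is not) in `wanted`.

    Hypothetical helper: `docs_by_value` maps a metadata value to the doc IDs
    carrying it. Matching is case-insensitive via lowercasing.
    """
    # Build the lookup set once: O(m), instead of rescanning `wanted` per value.
    wanted_lower = {w.lower() for w in wanted}
    matched = []
    for value, doc_ids in docs_by_value.items():  # single O(n) pass
        hit = value.lower() in wanted_lower
        if hit != negate:
            matched.extend(doc_ids)
    return sorted(matched)


docs_by_value = {"PDF": ["d1"], "docx": ["d2"], "Txt": ["d3"]}
print(filter_set(docs_by_value, ["pdf", "TXT"]))
print(filter_set(docs_by_value, ["pdf", "TXT"], negate=True))
```

With `negate=True` and an empty `wanted` list, every document passes, which corresponds to the `not-in-empty-list` case named in the test list.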
Jack	ee32d91aab	feat: add EnrichChunksWithDocMetadata function to attach document metadata to chunks (#15659 ) ## Summary Add `EnrichChunksWithDocMetadata` as a method on `MetadataService` that attaches document metadata to retrieval chunks in-place. Equivalent to Python's `enrich_chunks_with_document_metadata()` from `api/utils/reference_metadata_utils.py`. ### Usage ```go metadataSvc.EnrichChunksWithDocMetadata(chunks, tenantID, metadataFields) ``` ### Changes - `service/metadata.go`: Added `EnrichChunksWithDocMetadata` method - `service/enrich_metadata_test.go` (new): 7 test cases ### Algorithm 1. Collect unique `(kb_id, doc_id)` pairs from chunks 2. Fetch metadata from ES via `SearchMetadata(kbID, tenantID, docIDs)` 3. Attach `document_metadata` field to each matching chunk 4. Optionally filter to specified `metadataFields` ### Testing All 7 tests pass: ``` === RUN TestEnrichChunksWithDocMetadata_NoChunks --- PASS === RUN TestEnrichChunksWithDocMetadata_EmptyChunks --- PASS === RUN TestEnrichChunksWithDocMetadata_EmptyDocID --- PASS === RUN TestEnrichChunksWithDocMetadata_DuplicateDocIDs --- PASS === RUN TestEnrichChunksWithDocMetadata_MultipleKBs --- PASS === RUN TestEnrichChunksWithDocMetadata_WithMetadataFields --- PASS === RUN TestEnrichChunksWithDocMetadata_MixedFields --- PASS ``` Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 11:42:23 +08:00
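The four-step algorithm listed in this commit can be sketched in Python as below. This is an illustration under stated assumptions, not the repository's implementation: `enrich_chunks` and `fetch_metadata` are hypothetical names, chunks are assumed to be plain dicts with `kb_id` and `doc_id` keys, and the tenant ID argument is left out for brevity.

```python
def enrich_chunks(chunks, fetch_metadata, metadata_fields=None):
    """Attach a `document_metadata` dict to each matching chunk, in place."""
    # 1. Collect unique doc IDs per knowledge base, skipping empty doc IDs.
    doc_ids_by_kb = {}
    for chunk in chunks:
        if chunk.get("doc_id"):
            doc_ids_by_kb.setdefault(chunk.get("kb_id"), set()).add(chunk["doc_id"])

    # 2. One fetch per knowledge base. `fetch_metadata` is a stand-in for the
    #    real lookup and is assumed to return {doc_id: {field: value}}.
    metadata = {}
    for kb_id, doc_ids in doc_ids_by_kb.items():
        metadata.update(fetch_metadata(kb_id, sorted(doc_ids)))

    # 3 and 4. Attach metadata, optionally restricted to the requested fields.
    for chunk in chunks:
        meta = metadata.get(chunk.get("doc_id"))
        if meta is None:
            continue
        if metadata_fields:
            meta = {k: v for k, v in meta.items() if k in metadata_fields}
        chunk["document_metadata"] = meta


store = {"kb1": {"d1": {"author": "Ann", "year": "2024"}}}

def fake_fetch(kb_id, doc_ids):
    # In-memory replacement for the metadata search call.
    return {d: store[kb_id][d] for d in doc_ids if d in store.get(kb_id, {})}

chunks = [{"kb_id": "kb1", "doc_id": "d1"}, {"kb_id": "kb1", "doc_id": ""}]
enrich_chunks(chunks, fake_fetch, metadata_fields=["author"])
print(chunks[0]["document_metadata"])
```

Grouping by knowledge base first means duplicate doc IDs trigger only one lookup, which is the behaviour the `DuplicateDocIDs` and `MultipleKBs` test names suggest.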
Jack	3b1ae3f829	feat: support SelectFields override in DocEngine for KG-specific queries (#15684) ## Summary Both ES and Infinity engines now respect `SearchRequest.SelectFields`, allowing callers to specify output columns for KG entity/relation/community queries instead of the default chunk columns. ### Changes - `internal/engine/elasticsearch/chunk.go`: Added `SelectFields` override after default `outputColumns` - `internal/engine/infinity/chunk.go`: Added `SelectFields` override after default `outputColumns` - `internal/engine/elasticsearch/kg_test.go` (new): Integration test (skipped unless `ES_TEST=1`) ### Usage ```go result, err := docEngine.Search(ctx, &types.SearchRequest{ KbIDs: kbIDs, SelectFields: []string{entity_kwd, entity_type_kwd, rank_flt, n_hop_with_weight}, Filter: map[string]interface{}{knowledge_graph_kwd: entity}, }) ``` Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 11:41:39 +08:00
Wang Qi	4cbe597d7e	Refactor: consolidate to use @login_required (#15652)	2026-06-05 11:35:00 +08:00
bitloi	9f3e289b78	Fix: preserve markdown tables during delimiter extraction (#15632 ) ### What problem does this PR solve? Markdown extraction can split tables row by row when delimiter-based extraction uses a newline delimiter. That loses table structure during chunking even though delimiters should still split normally outside tables. This PR keeps the follow-up to #15482 intentionally narrow: - preserve Markdown pipe tables during delimiter-based extraction - preserve borderless pipe tables during delimiter-based extraction - preserve multiline HTML tables during delimiter-based extraction - keep delimiter splitting unchanged outside protected table ranges Refs #15482 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Testing - `ruff check deepdoc/parser/markdown_parser.py test/unit_test/deepdoc/parser/test_markdown_parser.py` - `python3 run_tests.py -t test/unit_test/deepdoc/parser/test_markdown_parser.py` - `git diff --check`	2026-06-05 10:35:33 +08:00
dripsmvcp	431f52a5d4	feat[Go]: implement GET /agents/templates (issue #15240 ) (#15573 ) ## Summary Port the canvas-template catalogue endpoint to the Go API server. Listed in the Go-API port checklist of #15240. Mirrors `list_agent_template` in `api/apps/restful_apis/agent_api.py`: returns every row from the `canvas_template` table so that the UI can render the template gallery on the New-Agent screen. ## What - `internal/dao/canvas_template.go` — new `CanvasTemplateDAO.GetAll()` ordered by `create_time desc` (newest templates first). - `internal/service/agent.go` — wire the new DAO into `AgentService` and expose `ListTemplates() ([]entity.CanvasTemplate, error)`. - `internal/handler/agent.go` — new `AgentHandler.ListTemplates` HTTP handler (auth-gated, mirrors Python `@login_required`). - `internal/router/router.go` — `agents.GET("/templates", r.agentHandler.ListTemplates)` registered alongside the existing `GET /agents`. - `internal/handler/agent_test.go` — three new tests covering: success path, empty-list → JSON array (not `null`), and the auth gate. ## Notes - `CanvasTemplate` entity, GORM tags, and DB migration already exist in `internal/entity/canvas.go` and `internal/dao/database.go` — no schema change required. - The handler coerces a `nil` slice to `[]entity.CanvasTemplate{}` so the JSON payload is always an array (the frontend does `data.map(...)` on it). ## Test plan - [x] `go vet ./internal/handler ./internal/service ./internal/dao ./internal/router` clean - [x] Three unit tests added; existing `TestListAgents_Success` untouched - [ ] CI runs `go test ./internal/handler` with cgo binding linked ## Related - Tracker: #15240	2026-06-05 10:13:30 +08:00
Jack	a237a89b90	feat: add QueryRewrite prompt builder and response parser (#15669 ) QueryRewrite prompt builder and response parser. Zero external dependencies. ### Functions - `BuildQueryRewritePrompt`: Renders `minirag_query2kwd` prompt with query and type pool - `ParseQueryRewriteResponse`: Parses LLM JSON response with fallback for markdown and extra text ### Testing ``` === RUN TestBuildQueryRewritePrompt --- PASS === RUN TestParseQueryRewriteResponse_ValidJSON --- PASS === RUN TestParseQueryRewriteResponse_MarkdownBlock --- PASS === RUN TestParseQueryRewriteResponse_ExtraText --- PASS === RUN TestParseQueryRewriteResponse_Invalid --- PASS === RUN TestParseQueryRewriteResponse_EmptyEntities --- PASS ``` Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 10:11:14 +08:00
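The "fallback for markdown and extra text" behaviour described in this commit might look something like the Python sketch below. The function name `parse_json_response` is hypothetical, and the order of the fallbacks is an assumption; the sketch does not check any particular keys in the parsed object.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this listing tidy

def parse_json_response(text):
    """Return the first JSON value recoverable from an LLM reply, or None."""
    # Fast path: the reply is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    candidates = []
    # Fallback 1: JSON wrapped in a Markdown code fence, with or without "json".
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    # Fallback 2: the outermost {...} span, ignoring surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


print(parse_json_response('Sure, here it is: {"a": 1} Hope that helps.'))
```

Returning `None` for unparseable input leaves the decision about defaults to the caller, which is one reasonable way to handle the `Invalid` case named in the tests.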
Jack	bf6c091c9f	feat: add KG scoring utilities (#15666 ) KG scoring utilities as pure functions. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-05 10:10:59 +08:00
kpdev	bd49fd70aa	fix(api): set SDK document download Content-Type from filename (#15112 ) (#15113 ) ## Summary - Infer `Content-Type` from the stored document filename on SDK download routes. - Covers `GET /api/v1/datasets/<dataset_id>/documents/<document_id>` and `GET /api/v1/documents/<document_id>`. - Aligns with REST preview/download via `CONTENT_TYPE_MAP`. ## Test plan - [x] `pytest test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py::TestDocRoutesUnit::test_download_mimetype_from_filename` - [x] Manual: `curl -sSI` on SDK dataset document download for a PDF; expect `Content-Type: application/pdf` Fixes #15112.	2026-06-05 10:08:53 +08:00
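Inferring a `Content-Type` from a filename can be sketched with Python's standard `mimetypes` module. Note the swap: the PR aligns with the project's own `CONTENT_TYPE_MAP`, whereas this sketch uses the standard library instead, and the helper name `content_type_for` and its fallback value are assumptions.

```python
import mimetypes

def content_type_for(filename, default="application/octet-stream"):
    """Guess a MIME type from the filename extension, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


print(content_type_for("report.pdf"))
print(content_type_for("blob.unknownext"))
```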
Lynn	794c1f4b25	Fix: volc engine and other json key factories (#15653) ### What problem does this PR solve? Fix: - Adapt VolcEngine to the new api_key format - Save dict api_key as JSON ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-05 09:45:44 +08:00
He Wang	7789862cc5	fix(docker): mount tmpfs on es01 /tmp for entrypoint permissions (#15655 ) ### What problem does this PR solve? On some Linux hosts (e.g. x86_64 with enforced POSIX ACL on overlay storage), the official `elasticsearch` Docker image cannot start because `docker-entrypoint.sh` needs to create temporary files under `/tmp` for bash here-documents, while the image ACL grants `user:elasticsearch` only `r-x` on `/tmp`: ``` /usr/local/bin/docker-entrypoint.sh: line 73/84: cannot create temp file for here-document: Permission denied ``` RAGFlow users hit this when running `docker compose` with the default `es01` service. See also Refs #284. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ## Summary Mount a writable `tmpfs` at `/tmp` for the `es01` service so Elasticsearch entrypoint scripts can run on ACL-enforced environments. Closes the startup failure described in #284 for non-ARM deployments. ## Changes - Add `tmpfs: /tmp:mode=1777,size=512m` to `es01` in `docker/docker-compose-base.yml` - Document why the mount is required (ES image `/tmp` ACL vs entrypoint here-documents) ## Test plan - [x] Verified on Linux (x86_64): `docker run --rm elasticsearch:8.11.3 bash -c 'mktemp'` fails without tmpfs and succeeds with `--tmpfs /tmp:mode=1777,size=512m` - [x] Verified `es01` becomes healthy after `docker compose up -d es01` with this change - [ ] Upstream maintainers: `docker compose -f docker/docker-compose-base.yml --profile elasticsearch up -d es01` on a host where ACL is enforced Made with [Cursor](https://cursor.com) Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-04 23:19:31 +08:00
Jack	eee6ad546f	feat: add ResolveReferenceMetadata utility function (#15663 ) Add `ResolveReferenceMetadata` to parse `include_metadata` / `metadata_fields` from request and config payloads. ### Changes - New: `internal/common/reference_metadata.go` — pure function, zero dependencies - New: `internal/common/reference_metadata_test.go` — 8 test cases Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 22:34:18 +08:00
Jack	96a416629d	refactor: change GetFlattedMetaByKBs return type to common.MetaData (#15656 ) ## Summary Change `GetFlattedMetaByKBs` return type from `map[string]interface{}` to strongly-typed `common.MetaData`. Depends on: #15648 (provides `MetaData`, `MetaValueDocs` types) ### Changes - `service/metadata.go`: Changed return type, removed type assertions - `service/metadata_filter.go`: Updated all metadata function signatures - `service/metadata_filter_test.go` (new): 12 test cases ### Bug fix `applySingleCondition` used `.([]interface{})` assertions on `[]string` data, silently breaking operators like `!=`, `contains`, `start with`, etc. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 22:16:04 +08:00
web-dev0521	98f2a2e60b	feat(connectors): add Azure Blob Storage data source connector (#15466 ) ### What problem does this PR solve? Closes #15465. RAGFlow supports S3, Google Cloud Storage, R2, and OCI as data sources but not Azure Blob Storage, leaving Azure users without a way to index container objects into a knowledge base. This adds a first-class Azure Blob Storage data-source connector — distinct from RAGFlow's existing Azure storage backends (`rag/utils/azure_sas_conn.py`, `rag/utils/azure_spn_conn.py`) which store RAGFlow's own files. Highlights - `common/data_source/azure_blob_connector.py`: new `AzureBlobConnector` (`CheckpointedConnectorWithPermSync` + `SlimConnectorWithPermSync`). - Uses the existing `azure-storage-blob` dependency (already in `pyproject.toml`). - Three auth modes, tried in order of precedence: 1. Account key — `account_name` + `account_key` + `container_name`. 2. Connection string — `connection_string` + `container_name`. 3. SAS token — `container_url` + `sas_token` (same shape as `RAGFlowAzureSasBlob`). - ETag fingerprint stored per blob in `AzureBlobCheckpoint.etags` — unchanged blobs (same ETag as last run) are skipped without a download. Only new/modified blobs are fetched. - Optional `prefix` scopes indexing to a virtual folder. - `validate_connector_settings()` probes `get_container_properties()` and maps `AuthenticationFailed / 403 / ContainerNotFound` to typed connector exceptions. - Slim-doc IDs are blob names so prune reconciles correctly. - `common/constants.py`, `common/data_source/config.py`, `common/data_source/__init__.py`: register `azure_blob` in `FileSource` / `DocumentSource` and export `AzureBlobConnector`. - `rag/svr/sync_data_source.py`: new `AzureBlob(SyncBase)` class routed through `load_from_checkpoint` (ETag fingerprint owns change-detection) and added to `func_factory`. 
- Frontend: - `web/src/pages/user-setting/data-source/constant/index.tsx`: new `DataSourceKey.AZURE_BLOB`, auth-mode selector (account key / connection string / SAS token), all credential fields, prefix + batch-size, `syncDeletedFiles` capability, default form values, tile entry with icon. - `web/src/locales/{en,zh}.ts`: description + per-field tooltips for all 9 new keys. - `web/src/assets/svg/data-source/azure-blob.svg`: Azure-branded stacked-cylinders icon. Verification - `npm run build` (vite + esbuild) passes (37 s). ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-04 21:06:01 +08:00
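The ETag-based change detection described for the Azure Blob connector can be illustrated with a small Python sketch. It does not call the Azure SDK: `blobs_to_fetch` is a hypothetical helper, the listing is assumed to be a sequence of `(name, etag)` pairs, and `checkpoint_etags` stands in for the `AzureBlobCheckpoint.etags` mapping.

```python
def blobs_to_fetch(listing, checkpoint_etags):
    """Return names of new or modified blobs and record their ETags."""
    changed = []
    for name, etag in listing:
        if checkpoint_etags.get(name) == etag:
            continue  # same ETag as the last run: skip without downloading
        changed.append(name)
        checkpoint_etags[name] = etag
    return changed


checkpoint = {"a.txt": "v1", "b.txt": "v1"}
listing = [("a.txt", "v1"), ("b.txt", "v2"), ("c.txt", "v1")]
print(blobs_to_fetch(listing, checkpoint))
```

A real connector would probably record the new ETag only after the download succeeds, so that a failed fetch is retried on the next run; the sketch updates it eagerly to stay short.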
Jack	a78a3fdd47	fix: add nil guard to DocumentDAO.GetByIDs and add tests (#15649 ) ## Summary `DocumentDAO.GetByIDs()` generated `WHERE id IN ()` for empty/nil ID slices, which is invalid SQL and would fail on most databases. This PR adds a nil guard and comprehensive tests. ### Changes - Modified: `internal/dao/document.go` — Added `len(ids) == 0` guard to `GetByIDs` - New: `internal/dao/document_test.go` — 4 test cases covering success, empty IDs, nil IDs, and no-match ### Testing ``` === RUN TestDocumentGetByIDs_Success --- PASS === RUN TestDocumentGetByIDs_EmptyIDs --- PASS === RUN TestDocumentGetByIDs_NilIDs --- PASS === RUN TestDocumentGetByIDs_NoMatch --- PASS ``` Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 21:00:02 +08:00
Jack	461c190c49	feat: migrate meta_filter and convert_conditions to Go (#15648 ) ## Summary Migrate the metadata filtering utilities `meta_filter` and `convert_conditions` from `common/metadata_utils.py` to Go as pure functions with zero external dependencies. These functions are used by `dify/retrieval`, `openai/chat/completions`, `document_api`, and `chunk_api` for filtering documents by metadata conditions. ### Changes - New: `internal/common/metadata_utils.go` — `ConvertConditions()` and `MetaFilter()` with full operator support - New: `internal/common/metadata_utils_test.go` — 18 test cases covering all operators and edge cases ### Supported Operators `=`, `≠`, `>`, `<`, `≥`, `≤`, `contains`, `not contains`, `in`, `not in`, `start with`, `end with`, `empty`, `not empty` ### Design - Numeric comparison via `strconv.ParseFloat` - Date comparison via YYYY-MM-DD format detection - Case-insensitive string comparison fallback - `and` / `or` logic support for multiple conditions - Zero external dependencies — pure functions only	2026-06-04 20:14:27 +08:00
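The comparison strategy listed under Design (numeric first, then `YYYY-MM-DD` dates, then a case-insensitive string fallback) can be sketched in Python as below. The helper name `compare` is hypothetical, only the six ordering operators are covered, and the handling of mixed numeric and non-numeric inputs is an assumption rather than the behaviour of the Go code.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def compare(value, op, target):
    """Compare two metadata strings: numeric, then date, then lowercase text."""
    try:
        a, b = float(value), float(target)
    except ValueError:
        if DATE_RE.match(value) and DATE_RE.match(target):
            a, b = value, target  # zero-padded ISO dates sort correctly as strings
        else:
            a, b = value.lower(), target.lower()
    results = {
        "=": a == b, "≠": a != b,
        ">": a > b, "<": a < b,
        "≥": a >= b, "≤": a <= b,
    }
    return results[op]


print(compare("10", ">", "9"))
print(compare("2024-01-02", "<", "2024-11-01"))
print(compare("Alice", "=", "alice"))
```

Trying the numeric path first matters: compared as text, `"10"` sorts before `"9"`, which would give the wrong answer for `>`.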
Jack	e627f5d8c5	feat: implement POST /api/v1/searchbots/related_questions API (#15639 ) ## Summary Implement the `POST /api/v1/searchbots/related_questions` endpoint in Go, generating related search questions via LLM. ### Changes - New: `internal/handler/related_questions.go` — Handler with injectable LLM interface, prompt constant, and response parsing - New: `internal/handler/related_questions_test.go` — 9 tests (4 handler + 5 parse) - Modified: `internal/router/router.go` — Added route + `RelatedQuestionsHandler` to struct - Modified: `cmd/server_main.go` — Wired handler with `SearchService` and `ModelProviderService` ### Testing All 9 tests pass: ``` === RUN TestRelatedQuestionsHandler_Success --- PASS === RUN TestRelatedQuestionsHandler_EmptyResponse --- PASS === RUN TestRelatedQuestionsHandler_LLMFailure --- PASS === RUN TestRelatedQuestionsHandler_MissingQuestion --- PASS === RUN TestParseRelatedQuestions_Standard --- PASS === RUN TestParseRelatedQuestions_Empty --- PASS === RUN TestParseRelatedQuestions_NoNumberedLines --- PASS === RUN TestParseRelatedQuestions_MixedContent --- PASS === RUN TestParseRelatedQuestions_MultiDigit --- PASS ``` Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 19:13:58 +08:00
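The response-parsing step tested above (`Standard`, `NoNumberedLines`, `MixedContent`, `MultiDigit`) suggests that questions are pulled out of numbered lines in the LLM reply. A minimal Python sketch of that idea follows; the exact line format accepted by the Go handler is not stated in the commit, so the regular expression and the name `parse_related_questions` are assumptions.

```python
import re

# A line such as "1. question" or "12) question"; the captured group is the question.
NUMBERED = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")

def parse_related_questions(reply):
    """Return the text of every numbered line, ignoring all other lines."""
    questions = []
    for line in reply.splitlines():
        match = NUMBERED.match(line)
        if match:
            questions.append(match.group(1))
    return questions


reply = "Here you go:\n1. What is RAG?\n2. How does chunking work?\n10. Why?\nThanks"
print(parse_related_questions(reply))
```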
Jack	6143205b37	feat: implement GET /api/v1/agents/<agent_id>/versions/<version_id> API (#15640 ) ## Summary Implement the `GET /api/v1/agents/<agent_id>/versions/<version_id>` endpoint in Go, returning full version details including DSL. Depends on #15629 which introduced the version list endpoint and `UserCanvasVersionDAO` infrastructure. ### Changes - Modified: `internal/handler/agent.go` — Added `GetAgentVersion` handler with auth check and ownership verification - Modified: `internal/router/router.go` — Registered `GET /:agent_id/versions/:version_id` route - New/Modified tests: Service and handler tests for the version detail endpoint ### Testing ``` === RUN TestGetVersion_Success --- PASS === RUN TestGetVersion_WrongCanvas --- PASS === RUN TestGetVersion_NotFound --- PASS === RUN TestGetAgentVersionHandler_Success --- PASS === RUN TestGetAgentVersionHandler_VersionNotFound --- PASS ``` Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 19:13:26 +08:00
buua436	423fb6faae	fix: duplicate document ingest guard (#15638 ) ### What problem does this PR solve? When a document is rerun or updated concurrently, the previous unconditional update could overwrite a newer task state. This change adds an `update_time`-based optimistic lock so the update only succeeds if the record has not been modified by another flow in the meantime. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 17:57:51 +08:00
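An `update_time`-based optimistic lock of the kind described here can be sketched with an in-memory SQLite table. The table layout, the integer timestamps, and the helper name `update_run_state` are assumptions for illustration; the actual change works against the project's own ORM models.

```python
import sqlite3

def update_run_state(conn, doc_id, new_run, seen_update_time, now):
    """Update only if the row still carries the update_time the caller read."""
    cur = conn.execute(
        "UPDATE document SET run = ?, update_time = ? "
        "WHERE id = ? AND update_time = ?",
        (new_run, now, doc_id, seen_update_time),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means another flow modified the record first


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, run TEXT, update_time INTEGER)")
conn.execute("INSERT INTO document VALUES ('d1', '0', 100)")

# Two flows both read update_time = 100; only the first write lands.
print(update_run_state(conn, "d1", "1", seen_update_time=100, now=101))
print(update_run_state(conn, "d1", "2", seen_update_time=100, now=102))
```

The stale writer gets `False` back and can re-read the row and decide whether to retry, instead of silently overwriting the newer state.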
Haruko386	baeb0c0431	Refactor[Go Model Provider]: refactor baseURL and modelConfig (#15627 ) ### What problem does this PR solve? As Title ### Type of change - [x] Refactoring	2026-06-04 17:50:22 +08:00
buua436	04dc3bb19c	fix: pass search id to searchbots ask (#15646 ) ### What problem does this PR solve? This change ensures `/searchbots/ask` receives `search_id` from the frontend, so the backend can load the matching search configuration when the shared search flow invokes the endpoint. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 17:41:56 +08:00
Jack	23aae19898	feat: implement POST /api/v1/agents/<agent_id>/upload API (#15633 ) ## Summary Implement the `POST /api/v1/agents/<agent_id>/upload` endpoint in Go, allowing file uploads associated with agent canvases. ### Changes - Modified: `internal/service/agent.go` — Added `CheckCanvasAccess` method (owner + team-level permission semantics) - Modified: `internal/handler/agent.go` — Added `UploadAgentFile` handler with auth check, multipart file parsing, and delegation to `FileService`. Added `fileUploader` interface for testability. - Modified: `internal/router/router.go` — Registered `POST /:agent_id/upload` route - Modified: `cmd/server_main.go` — Wired `fileService` into `AgentHandler` - New: `internal/service/agent_test.go` — 4 service-level tests for `CheckCanvasAccess` (owner, team member, private denial, not found) - New: `internal/handler/agent_upload_test.go` — 3 handler-level tests (success with fake file service, cross-user denial, empty file rejection) ### Testing All 7 tests pass with zero mocking of the DB layer (in-memory SQLite): ``` === RUN TestCheckCanvasAccess_Owner --- PASS === RUN TestCheckCanvasAccess_NotOwner --- PASS === RUN TestCheckCanvasAccess_PrivateCanvas_Denied --- PASS === RUN TestCheckCanvasAccess_NotFound --- PASS === RUN TestUploadAgentFileHandler_Success --- PASS === RUN TestUploadAgentFileHandler_NoPermission --- PASS === RUN TestUploadAgentFileHandler_NoFiles --- PASS ``` Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 17:21:47 +08:00
Lynn	b65b18ba4c	Fix: model provider (#15634) ### What problem does this PR solve? Do not display `success` when the check has not passed. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 16:05:00 +08:00
Jack	02d163a177	feat: implement GET /api/v1/agents/<agent_id>/versions API (#15629 ) ## Summary Implement the `GET /api/v1/agents/<agent_id>/versions` endpoint in Go, listing all version snapshots for an agent canvas in descending update time order. ### Changes - New: `internal/dao/user_canvas_version.go` — `UserCanvasVersionDAO` with `ListByCanvasID` (ordered by update_time DESC) and `GetByID` - Modified: `internal/service/agent.go` — Added `CheckCanvasAccess`, `ListVersions`, `GetVersion` methods - Modified: `internal/handler/agent.go` — Added `ListAgentVersions` handler with auth check - Modified: `internal/router/router.go` — Registered `GET /:agent_id/versions` route - New: `internal/service/agent_test.go` — 5 service-level tests (SQLite in-memory DB, zero mock) - Modified: `internal/handler/agent_test.go` — 3 handler-level tests (real DB, pre-authenticated context) ### Testing All 8 tests pass with zero mocking (in-memory SQLite replaces MySQL): ``` === RUN TestListVersions_Success --- PASS === RUN TestListVersions_Empty --- PASS === RUN TestCheckCanvasAccess_Owner --- PASS === RUN TestCheckCanvasAccess_NotOwner --- PASS === RUN TestCheckCanvasAccess_NotFound --- PASS === RUN TestListAgentVersionsHandler_Success --- PASS === RUN TestListAgentVersionsHandler_NoPermission --- PASS === RUN TestListAgentVersionsHandler_CanvasNotFound --- PASS ``` Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-04 15:36:26 +08:00
Jack	c6eee09ed3	feat: migrate POST /api/v1/datasets/<dataset_id>/documents/stop to Go (#15597 ) ## Summary Migrate the stop parse documents endpoint from Python to Go. ### Python endpoint `POST /api/v1/datasets/<dataset_id>/documents/stop` — `api/apps/restful_apis/document_api.py:1542-1641` ### Changes \| File \| Change \| \|------\|--------\| \| `internal/dao/task.go` \| Add `GetByDocID` method \| \| `internal/dao/task_test.go` \| 3 DAO tests (new file) \| \| `internal/service/document.go` \| Add `StopParseDocuments` + refactor shared helpers \| \| `internal/service/document_test.go` \| 8 service tests \| \| `internal/handler/document.go` \| Add handler + request struct + interface \| \| `internal/handler/document_test.go` \| 5 handler tests \| \| `internal/router/router.go` \| Add `POST /:dataset_id/documents/stop` route \| ### How it works 1. Validates all document IDs belong to the dataset 2. For each document in RUNNING/CANCEL state (or with unfinished tasks): - Sets Redis cancel signal `{task_id}-cancel` for each associated task - Updates `document.run` to CANCEL ("2") 3. Returns `{"success_count": N, "errors": [...]}` ### Test strategy - DAO/Service: SQLite in-memory DB, zero mocks. Redis is nil-safe by design. - Handler: `fakeDocumentService` implementing `documentServiceIface` interface. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-04 14:16:13 +08:00
Yufeng He	5db1b296fb	fix: fall back from empty Docling native chunks (#15601 ) ## Summary - keep the native Docling chunking path when it returns usable chunks - fall back to the standard Docling response parser when a chunked request gets HTTP 200 but returns no usable chunks - add a regression test for older Docling servers that accept the chunking request but return a standard conversion payload ## Why Older external Docling servers can accept a request containing `do_chunking: true` and still return the standard conversion response shape. The current code treats any HTTP 200 from the chunked request as a native chunk response, finds no chunk entries, and returns zero sections without trying the standard response parser. Fixes #15569. ## Validation - `python -m pytest test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py -q` - `python -m py_compile deepdoc\\parser\\docling_parser.py test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py` - `python -m ruff check deepdoc\\parser\\docling_parser.py test\\unit_test\\deepdoc\\parser\\test_docling_parser_remote.py` - `git diff --check`	2026-06-04 13:42:58 +08:00
bitloi	01a5598aa5	Fix: markdown fenced code block extraction (#15630 ) ### What problem does this PR solve? Markdown extraction currently applies custom delimiters before respecting fenced code blocks. When a delimiter such as a newline is configured, fenced code can be split into separate chunks, and longer outer fences can be closed incorrectly by shorter nested fences. This PR keeps the fix intentionally narrow for the Markdown chunking discussion in #15482: - preserve fenced code blocks when delimiter-based extraction is used - support both backtick and tilde fences - respect fence length so longer outer fences can contain shorter inner fences - keep delimiter splitting unchanged outside fenced blocks Refs #15482 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) ### Testing - `ruff check deepdoc/parser/markdown_parser.py test/unit_test/deepdoc/parser/test_markdown_parser.py` - `python3 run_tests.py -t test/unit_test/deepdoc/parser/test_markdown_parser.py`	2026-06-04 13:33:46 +08:00
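The fence-aware splitting described in this commit can be sketched in Python as follows. This is a simplified illustration, not the code in `deepdoc/parser/markdown_parser.py`: the name `split_outside_fences` is hypothetical, and the closing-fence rule used here (same character, at least as long as the opener, nothing else on the line) follows the CommonMark convention.

```python
import re

# An opening or closing fence: up to 3 spaces of indent, then 3+ backticks or tildes.
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})")

def split_outside_fences(text, delimiter="\n"):
    """Split on `delimiter`, but keep each fenced code block as one chunk."""
    chunks, plain, fenced = [], [], []
    open_char, open_len = None, 0

    def flush_plain():
        for part in "".join(plain).split(delimiter):
            if part.strip():
                chunks.append(part.strip())
        plain.clear()

    for line in text.splitlines(keepends=True):
        m = FENCE_RE.match(line)
        if open_char is None:
            if m:  # opening fence: start a protected block
                flush_plain()
                open_char, open_len = m.group(1)[0], len(m.group(1))
                fenced.append(line)
            else:
                plain.append(line)
        else:
            fenced.append(line)
            # Shorter inner fences and fences of the other character do not close.
            closes = (
                m is not None
                and m.group(1)[0] == open_char
                and len(m.group(1)) >= open_len
                and line.strip() == m.group(1)
            )
            if closes:
                chunks.append("".join(fenced).strip())
                fenced.clear()
                open_char = None
    flush_plain()
    if fenced:  # unterminated fence: keep the remainder together
        chunks.append("".join(fenced).strip())
    return chunks


outer, inner = "`" * 4, "`" * 3
doc = f"intro\n{outer}md\n{inner}py\nx = 1\n{inner}\n{outer}\noutro\n"
print(split_outside_fences(doc))
```

In the example, the three-backtick fence nested inside the four-backtick fence does not end the outer block, so the whole block comes back as a single chunk between `intro` and `outro`.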
buua436	c70f19e138	Fix: remove duplicate document preview access check (#15625 ) ### What problem does this PR solve? remove duplicate document preview access check ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 13:05:15 +08:00
Lynn	597ac1e900	Fix: search bot and verify model instance (#15588 ) ### What problem does this PR solve? Fix: - Verify provider with empty llm list in llm_factories.json - Set search bot's chat_llm_name, use tenant default chat model as default ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 11:59:55 +08:00
buua436	bbacb31226	Fix: think stream tail handling (#15582 ) ### What problem does this PR solve? think stream tail handling ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 10:04:35 +08:00
kpdev	d26d799467	fix(api): restore accessible check on document preview (#15505 ) Restore `DocumentService.accessible` on `GET /api/v1/documents/{doc_id}/preview` so cross-tenant users cannot stream documents by UUID. Fixes #15501 ### What problem does this PR solve? PR #15146 (`71a52d579`) moved the agent attachment download route and accidentally removed the `DocumentService.accessible(doc_id, current_user.id)` guard from the REST preview handler. The endpoint still requires login, but any authenticated user who knows another tenant's `doc_id` can download the raw file bytes. This restores the same authorization check that existed before #15146, returning a generic `"Document not found!"` when access is denied (no cross-tenant ID enumeration). SDK download routes tracked in #15125 are unchanged. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-04 09:59:07 +08:00
dripsmvcp	2196f2260a	fix(api): restore DocumentService.accessible check on /preview (#15508 ) ## Summary Restore the `DocumentService.accessible(doc_id, current_user.id)` check that PR #15146 dropped from the REST document preview handler. Any authenticated caller could download any tenant's document bytes by guessing/knowing the `doc_id`. ## Root cause `api/apps/restful_apis/document_api.py` — the `GET /documents/<doc_id>/preview` handler called `DocumentService.get_by_id` and went straight to `File2DocumentService.get_storage_address` + `STORAGE_IMPL.get`, with no tenant check between the lookup and the read. The handler's docstring even promises "user must belong to the tenant that owns the document's knowledge base" — the code didn't enforce it. ## Fix - Add `current_user` to the existing `api.apps` import. - Immediately after `get_by_id`, call `DocumentService.accessible(doc_id, current_user.id)`; on denial, return the same `get_data_error_result(message="Document not found!")` shape used for the missing-doc branch. That makes a cross-tenant probe indistinguishable from a missing-doc probe, preventing ID enumeration (the issue body calls this out explicitly). - Emit `logging.warning` with caller user + doc_id for audit. - Restores symmetry with peer routes that already call `accessible(doc_id, user_id)` (e.g. `_run_sync` at `document_api.py:1380`). ## Test plan Adds `test/unit_test/api/apps/restful_apis/test_document_preview_accessible.py`: - `test_cross_tenant_preview_is_denied` — owner tenant ≠ caller tenant; asserts the response shape is `Document not found!` and the storage backend (`thread_pool_exec(STORAGE_IMPL.get, ...)`) is never invoked. - `test_missing_doc_returns_not_found` — missing-doc behaviour unchanged. Stub-loader pattern mirrors `test/unit_test/api/apps/sdk/test_dify_retrieval.py` (added in #15028, passing in CI). 
## Provenance — how this fix was produced This PR was authored against a small cited knowledge base committed in the working tree as a `.vouch/` (see [vouchdev/vouch](https://github.com/vouchdev/vouch)). The loop used here: 1. Grounding first. Before reading the handler, queried the KB for prior context: `vouch context "tenant scoped accessible authorization"` → retrieved a cited claim distilled from PR #15028 (which restored the same `accessible()` check on `/dify/retrieval`). The retrieved rule: > ragflow REST endpoints that load by tenant-scoped id must call `<Service>.accessible(id, tenant_id)` after `get_by_id` and before storage/DB read; deny with code 109 'No authorization.' and log a warning. Established by PR #15028. 2. Applied the pattern with a domain refinement. For an API/JSON endpoint, `No authorization.` is the right denial shape. For a byte-streaming, browser-facing endpoint like `/preview`, leaking existence itself enables enumeration — so per the issue's expected behaviour, this PR denies with `Document not found!` (indistinguishable from missing) instead. Same auth check, narrower response. 3. Recorded the refinement back into the KB as a new cited claim, so the next IDOR-class issue starts already grounded in both the general pattern and the byte-route nuance. Net effect of the workflow: the fix replicates a known-good pattern instead of reinventing it, and the place where the pattern was nuanced is now retrievable for the next pass. Mechanism is fully independent of this PR — it's not a runtime dependency, just process discipline. Closes #15501	2026-06-04 09:58:26 +08:00
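A minimal sketch of the denial shape described in the commit above, assuming in-memory stand-ins for the document table, tenant membership, and storage. `DOCS`, `USER_TENANTS`, `preview`, and the response dictionaries are hypothetical and are not ragflow's actual handler; only the `Document not found!` message and the order of checks come from the commit message.

```python
import logging
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    tenant_id: str
    blob: bytes


# Hypothetical stand-ins for the document table and tenant membership.
DOCS = {"d1": Doc("d1", "tenant-a", b"secret bytes")}
USER_TENANTS = {"alice": {"tenant-a"}, "mallory": {"tenant-b"}}


def accessible(doc_id: str, user_id: str) -> bool:
    """True only when the user belongs to the tenant that owns the document."""
    doc = DOCS.get(doc_id)
    return doc is not None and doc.tenant_id in USER_TENANTS.get(user_id, set())


def not_found() -> dict:
    # One response shape for both "missing" and "denied".
    return {"ok": False, "message": "Document not found!"}


def preview(doc_id: str, user_id: str, storage_reads: list) -> dict:
    doc = DOCS.get(doc_id)
    if doc is None:
        return not_found()
    # Authorization check sits between the lookup and the storage read.
    if not accessible(doc_id, user_id):
        logging.warning("preview denied: user=%s doc_id=%s", user_id, doc_id)
        return not_found()
    storage_reads.append(doc_id)  # stands in for the storage backend call
    return {"ok": True, "data": doc.blob}
```

Because the denied and missing branches return the same value, a cross-tenant caller cannot tell whether a given `doc_id` exists, and the storage read is never reached on denial.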
euvre	9a9d3ddf5f	fix: show default embedding model when provider is not yet registered (#15511 ) ### What problem does this PR solve? ### Problem On the Model Providers page, the Embedding Model dropdown in System Model Settings shows empty (no default selected), even though a default embedding model is configured in `service_conf.yaml`. ### Root Cause Two issues were identified: 1. Backend: `_get_model_info` fails for unregistered providers The tenant's `embd_id` is set to `bge-m3@xxxx` during initialization (from the placeholder config `factory: 'xxxx'`). The `_get_model_info` function requires the provider to exist in `tenant_model_provider` table, but `xxxx` is never a real provider. Even after the user adds a real provider (e.g., ZHIPU-AI), the stale `embd_id` still references the non-existent one, causing the function to return `None`. 2. Frontend: default models cache not invalidated after adding provider `useAddProviderInstance` only invalidates `addedProviders` and `allModels` caches after adding a provider instance, but does not invalidate the `defaultModels` cache. This means the default model list is not re-fetched until the user manually refreshes the page. ### Fix `api/apps/services/models_api_service.py` - Added `_resolve_model_from_tenant_providers()` helper: when the default model's provider doesn't exist (e.g., placeholder `xxxx`), it searches through the tenant's actually registered providers for a model of the same type and returns the first match. - When an instance name doesn't match (e.g., `"default"` vs actual name `"1"`), the function now auto-resolves to the first real instance under that provider. - Falls back to `FACTORY_LLM_INFOS` validation when neither provider nor instance exists. 
`web/src/hooks/use-llm-request.tsx` - Added `queryClient.invalidateQueries({ queryKey: LlmKeys.defaultModels() })` to `useAddProviderInstance` so that the default model list is re-fetched immediately after a provider instance is added, eliminating the need for a manual page refresh. ### Testing - Verified with a tenant whose `embd_id=bge-m3@xxxx` and only provider is ZHIPU-AI (instance `1`): `_resolve_model_from_tenant_providers` correctly resolves to `embedding-2@1@ZHIPU-AI`. - After adding a provider via the UI, the embedding model dropdown now immediately shows the resolved default without requiring a page refresh. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) --------- Signed-off-by: noob <yixiao121314@outlook.com>	2026-06-04 09:55:49 +08:00
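The backend fallback described in the commit above can be sketched roughly as follows. This is an illustration under assumptions, not the real `_resolve_model_from_tenant_providers`: the `providers` mapping layout and the function name `resolve_default_model` are hypothetical, and the `model@instance@provider` output format is inferred from the `embedding-2@1@ZHIPU-AI` example in the Testing notes.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: provider name -> instance name -> [(model name, model type)]
Providers = Dict[str, Dict[str, List[Tuple[str, str]]]]


def resolve_default_model(
    default_id: str, model_type: str, providers: Providers
) -> Optional[str]:
    """Resolve a configured default such as 'bge-m3@xxxx' against real providers."""
    model_name, _, provider = default_id.rpartition("@")

    # Case 1: the provider is registered and offers the configured model;
    # use the first real instance under that provider.
    for instance, models in providers.get(provider, {}).items():
        if (model_name, model_type) in models:
            return f"{model_name}@{instance}@{provider}"

    # Case 2: the provider is a placeholder or stale; fall back to the first
    # model of the same type among the tenant's registered providers.
    for prov, instances in providers.items():
        for instance, models in instances.items():
            for name, mtype in models:
                if mtype == model_type:
                    return f"{name}@{instance}@{prov}"
    return None
```

With a tenant whose only provider is ZHIPU-AI (instance `1`), the stale default `bge-m3@xxxx` resolves to that provider's embedding model, which is the scenario the commit's Testing section describes.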
Jack	67c3e73d70	feat: migrate DELETE /api/v1/datasets/:dataset_id/documents to Go (#15577 ) ## Summary Migrate the batch document deletion endpoint from Python to Go. Two modes supported: explicit `ids` list and `delete_all`. ## Changes \| File \| Change \| \|------\|--------\| \| `internal/dao/file2document.go` \| Add `GetByDocumentID`, `DeleteByDocumentID` \| \| `internal/dao/file2document_test.go` \| 5 new tests \| \| `internal/dao/kb_test.go` \| 2 new tests (`DecreaseDocumentNum`) \| \| `internal/service/document.go` \| Add `deleteDocumentFull` + `DeleteDocuments`, refactor `DeleteDocument` \| \| `internal/service/document_test.go` \| 10 new tests \| \| `internal/handler/document.go` \| Add `documentServiceIface` + `DeleteDocuments` handler \| \| `internal/handler/document_test.go` \| 7 new tests \| \| `internal/router/router.go` \| Register `DELETE /:dataset_id/documents` \| \| `cmd/server_main.go` \| Support `RAGFLOW_DICT_PATH` env var \| \| `internal/binding/rag_analyzer.go` \| Use `-lpcre2-8` dynamic linking \| \| `internal/dao/database.go` \| Skip Error 1091/1138 during migration \| \| `internal/service/llm.go` \| Fix vet warning \| ## Per-document cleanup - Delete tasks from DB - Hard-delete document + decrement KB counters - Delete chunks from document engine (nil-guarded) - Delete metadata from document engine (nil-guarded) - Remove file2document mapping + file record + storage blob ## Test Results 24 unit tests all passing (7 DAO + 10 service + 7 handler) using SQLite :memory: + gin.TestMode. See [test report](docs/test_report_delete_documents.md) for manual integration test results. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 20:55:53 +08:00
Haruko386	df55880b44	feat[Go] implement /connectors/google/oauth (#15584) ### What problem does this PR solve? The following APIs are now available in Go: > /api/v1/connectors/google/oauth/web/start POST > /api/v1/connectors/gmail/oauth/web/callback GET > /api/v1/connectors/google-drive/oauth/web/callback GET > /api/v1/connectors/google/oauth/web/result POST ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-03 20:08:55 +08:00
Wang Qi	b946df8ba2	Fix: consolidate beta auth (#15581)	2026-06-03 19:58:06 +08:00
bitloi	2eed0d4679	refactor(go-models): add unsupported model driver defaults (#15431 ) ### What problem does this PR solve? Adds a shared safe default implementation for unsupported Go model-driver capability methods and migrates the confirmed panic-stub providers to use it. The Go `ModelDriver` interface requires providers to implement many capability methods even when the provider does not support them. XunFei had unsupported capability methods implemented as `panic("implement me")`, Mistral still had a panic in `ParseFile`, and HuaweiCloud carried an unreachable `panic("implement me")` after a normal chat return. ### Type of change - [x] Refactoring Co-authored-by: Haruko386 <tryeverypossible@163.com>	2026-06-03 19:16:28 +08:00
bohdansolovie	ae316b3415	fix(api): guard document rename when linked file row is missing (#15536 ) ## Summary Fixes #15534 — `update_document_name_only()` crashes with `AttributeError` when `File2Document` exists but the linked `File` row was deleted. `update_document_name_only()` in `document_api_service.py` called `FileService.get_by_id()` when a `File2Document` row existed, then accessed `file.id` without checking the lookup result. An orphan `File2Document` link (file deleted, mapping left behind) caused document rename via `PATCH /api/v1/datasets/{dataset_id}/documents/{document_id}` to return HTTP 500. This PR mirrors guards used in `file2document_api.py` and `file_api_service.py`: skip the optional file rename when the file is missing, and still update the document record and search index. ## Changes - `api/apps/services/document_api_service.py` — check `e and file` before `FileService.update_by_id` - `test/unit_test/api/apps/services/test_update_document_name_only.py` — regression tests (orphan link + happy path) ## Test plan - [x] `pytest test/unit_test/api/apps/services/test_update_document_name_only.py -v` - [ ] Manual: PATCH document `name` when `File2Document` points to a non-existent `file_id` → 200, document/index renamed, no 500	2026-06-03 17:57:19 +08:00
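The guard described in the commit above could look roughly like this. It is a sketch with in-memory dictionaries standing in for the `Document`, `File`, and `File2Document` tables; `get_file_by_id` and `rename_document` are hypothetical names, and only the `e and file` check mirrors the commit message.

```python
# Hypothetical in-memory tables. The mapping for doc-1 is an orphan:
# it points at file-9, which no longer exists in FILES.
DOCS = {"doc-1": {"name": "old.pdf"}}
FILES = {}
FILE2DOC = {"doc-1": "file-9"}


def get_file_by_id(file_id):
    """Return (found, row), echoing the (e, file) pair in the commit message."""
    row = FILES.get(file_id)
    return row is not None, row


def rename_document(doc_id, new_name):
    # The document record is always updated.
    DOCS[doc_id]["name"] = new_name

    file_id = FILE2DOC.get(doc_id)
    if file_id is not None:
        e, file = get_file_by_id(file_id)
        # Guard: skip the optional file rename when the linked row is gone,
        # instead of dereferencing a missing file.
        if e and file:
            file["name"] = new_name
    return DOCS[doc_id]
```

With the orphan mapping in place, `rename_document("doc-1", "new.pdf")` renames the document and leaves the missing file alone rather than raising.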
Jin Hai	2061edd308	Remove unused codes (#15579 ) ### What problem does this PR solve? Remove unused code. ### Type of change - [x] Refactoring Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-06-03 17:35:36 +08:00
Jack	b363146997	refactor: overhaul task executor with layered architecture and comprehensive test suite (#15471) ## Summary Decomposes the monolithic `task_executor.py` (1945 lines) into a 6-layer architecture with clear separation of concerns. The refactored code is functionally equivalent to the original, verified through 400 passing tests and a production-vs-dry-run comparison framework. ## Architecture ``` entry (task_manager) └─ orchestration (task_handler) ├─ services (chunk_service, embedding_service, dataflow_service, raptor_service, post_processor) │ └─ utilities (chunk_builder, chunk_post_processor, embedding_utils) └─ infrastructure (task_context, recording_context, interceptor) ``` Key design decisions: - TaskContext: typed facade over the raw task dict; injects rate limiters and callbacks via composition - RecordingContext + Comparator: enables side-by-side production vs dry-run execution for safe migration - NullRecordingContext: zero-allocation no-op for production; uses `__slots__` - WriteOperationInterceptor: FIFO replay of the previous run's function return values for comparison mode ## Migration Strategy The original `handle_task()` in `task_executor.py` uses a 3-way switch via `TE_RUN_MODE`: - `TE_RUN_MODE=0` (default) → runs refactored code - `TE_RUN_MODE=1` → runs both original + refactored, compares all intermediate results - `TE_RUN_MODE=2` → runs original code (fallback) The comparison mode (`TE_RUN_MODE=1`) records ~40 intermediate values (chunks, vectors, token counts, function return values) from the production run and replays them during the dry run, then uses `ContextComparator` to report mismatches. 
## Functional Equivalence Fixes All divergences between original and refactored code were identified and fixed: - Timeout decorators (handle/build_chunks/raptor/embedding) - NullRecordingContext leak in finally block causing RuntimeError - MinIO None-binary check with proper FileNotFoundError - Dataflow dispatch after embedding binding + init_kb - Memory task missing return after processing - RAPTOR checkpoint progress reporting - Tag cache (get_tags_from_cache/set_tags_to_cache) restoration - dataflow_id correction in _load_dsl - Language default Chinese, dead code guard removal - embed_chunks made async with proper thread_pool_exec - Full GraphRAG default configuration (10 parameters) - Hardcoded q_768_vec fallback removal in RAPTOR ## Test Changes - 20 new tests covering table parser manual mode, tag cache, embedding edge cases, RAPTOR checkpoint, dataflow_id correction, storage binary None, cancel cleanup, metadata=None boundary - Unified `make_task_context`/`make_task_dict` factories eliminated 10+ duplicated helpers - DataflowService tests migrated from internal method mocks to IO boundary mocks (real orchestration code executes) - Parametrized duplicate build_chunks post-processor tests - 7 raptor tests modernized to @pytest.mark.asyncio - Mock count per test reduced through boundary-level mocking strategy Test count: 400 passing, 0 warnings, 0 skips ## Files Changed \| File \| Change \| \|------\|--------\| \| `rag/svr/task_executor.py` \| +1 line (NullRecordingContext fix) \| \| `rag/svr/task_executor_refactor/task_handler.py` \| Orchestration layer, 8 logic fixes \| \| `rag/svr/task_executor_refactor/chunk_service.py` \| +timeout + None-check \| \| `rag/svr/task_executor_refactor/embedding_service.py` \| sync→async rewrite \| \| `rag/svr/task_executor_refactor/dataflow_service.py` \| dataflow_id fix + timeout \| \| `rag/svr/task_executor_refactor/raptor_service.py` \| checkpoint fix + assert \| \| `rag/svr/task_executor_refactor/chunk_post_processor.py` 
\| tag cache restore \| \| `rag/svr/task_executor_refactor/task_context.py` \| language default fix \| \| `test/.../conftest.py` \| +294 lines shared helpers \| \| `test/.../*.py` \| 15 test files refactored, 20 new tests \| --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 17:18:31 +08:00
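The three-way `TE_RUN_MODE` switch from the Migration Strategy above can be sketched as below. This is a simplification under stated assumptions: the real comparison mode records intermediate values and replays them through an interceptor, whereas this sketch just runs both callables and diffs their result dictionaries. The `handle_task` signature, the `env` parameter, returning the original's result in mode 1, and treating unknown values as the default are all assumptions.

```python
import os


def handle_task(task, run_original, run_refactored, env=None):
    """Dispatch on TE_RUN_MODE; returns (result, mismatched_keys)."""
    env = os.environ if env is None else env
    mode = env.get("TE_RUN_MODE", "0")

    if mode == "2":
        # Fallback: original code path only.
        return run_original(task), []

    if mode == "1":
        # Comparison: run both paths and report keys whose values differ.
        expected = run_original(task)
        actual = run_refactored(task)
        keys = sorted(set(expected) | set(actual))
        mismatches = [k for k in keys if expected.get(k) != actual.get(k)]
        # Assumption: the original's result is the one returned to the caller.
        return expected, mismatches

    # Default ("0", or any unrecognized value in this sketch): refactored code.
    return run_refactored(task), []
```

Passing `env` explicitly keeps the switch testable without mutating the process environment; an empty mismatch list in mode 1 is the signal that the two paths agree on every recorded key.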
Jin Hai	d736f358ba	Go: refactor model provider (#15568 ) ### What problem does this PR solve? 1. Add license announcement 2. Add sanity check on API config 3. Add base class: BaseModel 4. Add GetBaseURL ### Type of change - [x] Refactoring --------- Signed-off-by: Jin Hai <haijin.chn@gmail.com>	2026-06-03 16:33:58 +08:00
Wang Qi	d6fc50a469	Fix: no more @token_required (#15562)	2026-06-03 16:24:08 +08:00
chanx	a678ed7b1f	Fix: Switching pagesize on a chunk page did not reset the current page. (#15401) ### What problem does this PR solve? Switching the page size on a chunk page did not reset the current page. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-03 15:57:57 +08:00
Idriss Sbaaoui	1134769940	Chore: update cohere models (#15576) ### What problem does this PR solve? Removes old Cohere models and adds the latest ones. ### Type of change - [x] Refactoring - [x] Other (please describe): update models	2026-06-03 15:55:45 +08:00
Haruko386	473d06d1ad	feat[Go]: implement add multi_models (#15563 )	2026-06-03 15:26:46 +08:00
buua436	c0e00a7f6e	Fix: agent template smart_customer_service_specialist.json (#15565) ### What problem does this PR solve? Fixes the agent template `smart_customer_service_specialist.json`. ### Type of change - [x] Refactoring	2026-06-03 15:05:39 +08:00
Lynn	ac3964b6bc	Feat: display intl url for siliconflow and verify model provider without llms in json (#15550 ) ### What problem does this PR solve? As title. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-03 14:43:08 +08:00


6567 Commits