ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-07-01 16:25:44 +08:00

Author	SHA1	Message	Date
Harsh Kashyap	d770217b25	fix(api): fall back to factory max_tokens for tenant models (#16364 )	2026-07-01 16:00:13 +08:00
Lynn	400476f0b3	Feat: SoMark (#16482 ) Follow #15486 Co-authored-by: limuting <limuting233@gmail.com> Co-authored-by: lutianyi <lutianyi233@163.com> Co-authored-by: justinychuang <huangyicheng@soulcode.cn> Co-authored-by: maybehokori <138367708+maybehokori@users.noreply.github.com>	2026-07-01 13:29:28 +08:00
Lynn	b6fa5ce4ea	Fix: ollama provider (#16519 )	2026-07-01 13:24:31 +08:00
RazmikGevorgyan	38f8f8a656	fix: handle non-serializable objects in agent canvas SSE and state serialization (#14210) Agent components (llm.py, agent_with_tools.py, message.py) store functools.partial objects as deferred streaming handles in their output slots. When the canvas state gets serialized for SSE events, Redis commits, or logging, these partials — plus non-copyable objects like Langfuse clients — crash json.dumps and deepcopy. Changes: - canvas_app.py: add default=str to json.dumps for SSE event serialization (lines 238, 296) - canvas.py: wrap deepcopy calls in try/except to handle non-copyable objects (Langfuse clients, etc.), add default=str to final json.dumps - base.py: add default=str to ComponentParamBase.__str__ to handle non-serializable objects in component parameters Closes #14229 ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: yzc <yuzhichang@gmail.com>	2026-07-01 09:33:41 +08:00
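The failure mode and the `default=str` remedy described in the commit above can be reproduced with the standard library alone. This is a minimal sketch, not the project's code: `stream_answer`, `state`, and `safe_deepcopy` are hypothetical names, and a `threading.Lock` stands in for a non-copyable client object.

```python
import copy
import functools
import json
import threading


def stream_answer(prefix):
    """Stand-in for a deferred streaming handle (hypothetical)."""
    return prefix + "..."


# An output slot holding a functools.partial next to plain JSON data.
state = {"content": functools.partial(stream_answer, "hello"), "tokens": 3}

try:
    json.dumps(state)
    plain_error = ""
except TypeError as exc:
    plain_error = str(exc)  # partial objects are not JSON serializable

# default=str is only invoked for values json cannot encode natively,
# so the integer stays an integer and the partial becomes its str() form.
decoded = json.loads(json.dumps(state, default=str))


def safe_deepcopy(value):
    """Return a deep copy, or the original object if it cannot be copied."""
    try:
        return copy.deepcopy(value)
    except Exception:
        return value


lock = threading.Lock()  # stands in for a non-copyable client object
same_lock = safe_deepcopy(lock)
```

The trade-off of `default=str` is that the serialized form is lossy: the receiver gets a descriptive string, not something it can call, which is acceptable for SSE payloads and logs but not for state that must be restored.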
Taranum Wasu	e23f63bd93	fix(agent): prevent empty LLM user message after prompt fitting (#16413 ) ## Summary - Treat `max_tokens=0` as unset (`or 8192`) when building model context budgets, fixing agents that silently zeroed prompts when a vLLM model had `max_tokens: 0` in tenant config - Replace trailing same-role canvas history in `LLM._sys_prompt_and_msg` instead of skipping the current user prompt - Add `LLM.fit_messages()` validation after `message_fit_in` on agent paths so empty user content fails fast with a clear error instead of reaching vLLM Fixes #16411 ## Root cause Agent canvas workflow called `message_fit_in` with `int(max_length * 0.97)`. When `max_length` was `0`, both system and user content were trimmed to empty strings. The `[HISTORY STREAMLY]` log showing only `{"role":"user","content":""}` matches this. A secondary bug skipped appending the formatted user prompt when history ended with a `user` role message. ## Test plan - [x] Added `test/unit_test/agent/component/test_llm_prompt.py` for role-replace, validation, and zero-budget fitting - [x] Added `test_message_fit_in_zero_budget_preserves_non_empty_messages` in `test_generator_message_fit_in.py` - [ ] CI unit tests - [ ] Manual: agent canvas `begin → Retrieval → Agent → Message` with vLLM Qwen3; confirm user message reaches LLM Made with [Cursor](https://cursor.com) --------- Co-authored-by: Taranum Wasu <taranumwasu@Taranums-MacBook-Pro.local> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-07-01 09:30:54 +08:00
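The zero-budget arithmetic in the commit above is easy to check by hand. The sketch below assumes only what the message states (`max_tokens=0` falls back to 8192, and the fitting budget is `int(max_length * 0.97)`); `context_budget` and `ensure_user_content` are hypothetical helper names, not the repository's functions.

```python
DEFAULT_MAX_TOKENS = 8192  # fallback value named in the commit message


def context_budget(max_tokens):
    """Prompt budget in tokens; 0 or None counts as 'not configured'."""
    max_length = max_tokens or DEFAULT_MAX_TOKENS
    return int(max_length * 0.97)


def ensure_user_content(messages):
    """Raise early if no user message with non-blank content survives fitting."""
    for msg in messages:
        if msg.get("role") == "user" and str(msg.get("content", "")).strip():
            return messages
    raise ValueError("user message is empty after prompt fitting")
```

Without the `or` fallback, a configured value of `0` yields a budget of `int(0 * 0.97) == 0`, which is how both the system and user content ended up trimmed to empty strings.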
天海蒼灆	3c946a7e58	fix(agent): add canvas_type filter and field to list_agents API (#15754 ) ### What problem does this PR solve? GET /api/v1/agents (list_agents) already supports filtering by canvas_category, keywords, tags, and owner_ids, but it does not support canvas_type — even though canvas_type is a persisted field on UserCanvas and is already accepted on agent create/update APIs. This gap causes two issues: Filtering — clients cannot list agents by business category (e.g. Marketing, Agent, Ingestion Pipeline) without fetching all agents and filtering client-side. Response payload — list_agents did not return canvas_type in each canvas item, so consumers had to call GET /api/v1/agents/{id} per agent to read it. This PR adds optional canvas_type query parameter support and includes canvas_type in the list response. ### Type of change - [√] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-06-30 17:43:26 +08:00
Rene Arredondo	09dc4c8841	fix(agent): return session_id when chat completion produces no events (#15169 ) (#15228 ) ## Summary Fixes #15169 — `POST /api/v1/agents/chat/completions` returned `data: {}` with no `session_id` when the agent produced no events (e.g. the reporter's payload sent `"query": ""`). ## Root cause For `{"agent_id": "...", "query": "", "stream": false}`: 1. No `session_id` in the request → new-session branch at `agent_api.py:1278`. 2. `session_id = get_uuid()` at `agent_api.py:1294`. 3. Falls into `_run_workflow_session`. 4. `canvas.run(query="")` produces no events, so `final_ans` stays `{}`. 5. Non-streaming path then hit: ```python if not final_ans: await commit_runtime_replica() return get_result(data={}) ``` `session_id` was allocated but silently dropped on the way out. The streaming path had the same shape (only a bare `[DONE]` was yielded — no SSE event carrying `session_id`). The session-continuation path at `agent_api.py:1463` had the same bug for callers that passed `session_id` and got `{}` back. The successful (non-empty) paths were fine because every canvas event has `ans["session_id"] = session_id` attached before being yielded / captured into `final_ans` (see `agent_api.py:255` and `:303`). ## Fix Three minimal changes, all in `api/apps/restful_apis/agent_api.py`: 1. `_run_workflow_session` (non-streaming): `return get_result(data={"session_id": session_id})` instead of `data={}`. 2. `_run_workflow_session` (SSE): if the canvas loop emits no events, yield one `data:{"session_id": "...", "data": {}}` event before `[DONE]`, so the client receives the id over the wire. 3. `agent_chat_completion` session-continuation: echo the caller-supplied `session_id` back in the empty-events case instead of `{}`. No change needed on the happy paths — they already attach `session_id` to every event. ## Test plan - [ ] Repro from the issue: `POST /api/v1/agents/chat/completions` with `{"agent_id": "<id>", "query": "", "stream": false}`. 
Response `data` should now contain `session_id`. - [ ] Same payload with `"stream": true`. SSE stream should contain one event with `session_id` before `data:[DONE]`. - [ ] Same shape but with a real, non-empty `"query"` (new session). Response should be unchanged from before — every event still carries `session_id`, final response still includes it on `final_ans`. - [ ] Pass an existing `session_id` plus `"query": ""`. Response should echo that `session_id` back instead of `{}`. - [ ] Pass an existing `session_id` plus a normal query. Response should be unchanged from before. - [ ] `openai-compatible: true` path is untouched — sanity-check it still works. - [ ] Run `uv run pytest` to make sure no existing tests regress. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe):	2026-06-30 16:41:44 +08:00
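The empty-events behavior described in the fix above can be sketched as follows. The response envelope and the function names `build_result` and `sse_events` are assumptions for illustration; only the rule itself (always return `session_id`, and emit one SSE event carrying it before `[DONE]` when the canvas produced nothing) comes from the commit message.

```python
import json


def build_result(final_ans, session_id):
    """Non-streaming path: never drop the session id, even with no events."""
    if not final_ans:
        return {"data": {"session_id": session_id}}
    return {"data": final_ans}


def sse_events(events, session_id):
    """Streaming path: yield SSE lines, guaranteeing one event with the id."""
    emitted = False
    for ans in events:
        ans["session_id"] = session_id  # happy path: every event carries the id
        emitted = True
        yield "data:" + json.dumps(ans) + "\n\n"
    if not emitted:
        # No canvas events: send a single placeholder event with the id.
        yield "data:" + json.dumps({"session_id": session_id, "data": {}}) + "\n\n"
    yield "data:[DONE]\n\n"
```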
Attili-sys	5fc254eb2e	Feature big query connector (#15871 ) ### What problem does this PR solve? This PR adds Google BigQuery as a first-class data source connector in RAGFlow. It enables users to ingest and sync BigQuery data using the same row-to-document model used by relational database connectors: selected content columns become document text, metadata columns become document metadata, an optional ID column provides stable document IDs, and an optional timestamp column enables cursor-based incremental sync. The connector supports service-account JSON credentials, table mode, custom query mode, GoogleSQL queries, cursor-based incremental sync, deleted-row pruning support, configurable query limits such as `maximum_bytes_billed`, dry-run validation, batch loading, stable document IDs, and BigQuery-aware value serialization.	2026-06-29 22:08:40 +08:00
Hz_	a10a2d8769	fix(py): chat message reference deletion index (#16436 ) Fix the reference index used when deleting a chat message pair. Each user/assistant message pair shares one reference entry, while the first assistant prologue has no reference. Using `i // 2` correctly removes the reference for the deleted pair and avoids deleting the previous turn's reference.	2026-06-29 19:05:25 +08:00
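The index arithmetic in the commit above can be illustrated with a small list model. This sketch assumes that `i` is the position of the user message of the pair being deleted and that the message list starts with one assistant prologue; both are assumptions, since the commit message does not say which index `i` refers to. `delete_pair` is a hypothetical name.

```python
def delete_pair(messages, references, i):
    """Remove the user message at index i, its assistant reply, and their reference.

    Layout assumed: [prologue, user1, assistant1, user2, assistant2, ...]
    with one reference entry per user/assistant pair and none for the prologue.
    """
    del messages[i:i + 2]
    del references[i // 2]  # user1 at i=1 -> ref 0, user2 at i=3 -> ref 1
    return messages, references


messages = [
    {"role": "assistant", "content": "prologue"},
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
]
references = ["ref-for-q1", "ref-for-q2"]
delete_pair(messages, references, 3)
```

Under this layout, any formula that lands one entry lower for `i = 3` would delete the reference of the previous turn, which is the symptom the commit describes.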
Wang Qi	6e82e2726d	Guard: /datasets/{dataset_id}/chunks cannot parse ingestion pipelines; use /documents/ingest instead (#16395)	2026-06-29 13:45:29 +08:00
euvre	a339e8a579	feat: handle partial upload success in document batch upload (#16438 )	2026-06-29 13:06:14 +08:00
jony376	8fb692f10a	fix(agent): enforce document access on POST /api/v1/agents/rerun (#15145 ) ## Related issues Closes #15144 ### What problem does this PR solve? `POST /api/v1/agents/rerun` loaded a pipeline operation log by UUID via `PipelineOperationLogService.get_documents_info` with no authorization, then wiped chunks, reset document counters, deleted tasks, and re-queued dataflow for the victim document. Any authenticated user who knew a victim's pipeline log id could disrupt parsing on documents they did not own. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes \| File \| Change \| \|------\|--------\| \| `api/apps/restful_apis/agent_api.py` \| Call `DocumentService.accessible(doc["id"], tenant_id)` before destructive rerun operations; deny with generic `"Document not found."` \| \| `test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py` \| Unit tests: cross-tenant log rejected, missing/unauthorized same message, authorized rerun proceeds \| ### Security notes - CWE-639: Closes cross-tenant pipeline rerun / chunk wipe via leaked log UUID. - `tenant_id` from `@add_tenant_id_to_kwargs` is `current_user.id`; `DocumentService.accessible` covers team-shared KBs. ### Test plan - [ ] `pytest test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py` - [ ] Manual: attacker cannot rerun victim pipeline log id ```bash cd ragflow uv run pytest test/unit_test/api/apps/restful_apis/test_rerun_agent_authorization.py -q ``` --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-06-29 09:45:17 +08:00
Tim Wang	f0f10b6092	Fix: UserFillUp interactive forms not working in agent explore mode (#14589 ) ## Summary - Backend: `_iter_session_completion_events` in `agent_api.py` was filtering out `user_inputs` and `workflow_finished` SSE events, causing agents with UserFillUp components to silently fail in explore mode — the interactive form never appeared, while the same agent worked correctly in run (editor) mode. - Frontend: `SessionChat` component in explore mode was missing `DebugContent` children rendering inside `MessageItem`, so even if the backend forwarded the events, the form UI would not render. Added `DebugContent`, `MarkdownContent`, `useAwaitCompentData` hook, and input-disabling logic to match the run mode's `chat/box.tsx` behavior. ## What was changed ### Backend (`api/apps/restful_apis/agent_api.py`) - Line 266: Added `"user_inputs"` and `"workflow_finished"` to the allowed event filter in `_iter_session_completion_events` ### Frontend (`web/src/pages/agent/explore/components/session-chat.tsx`) - Added imports: `DebugContent`, `MarkdownContent`, `useAwaitCompentData`, `useParams` - Added `sendFormMessage` from `useSendSessionMessage()` hook - Added `useAwaitCompentData` hook for form state management - Added `DebugContent` as `MessageItem` children for the latest assistant message (renders UserFillUp form) - Added `MarkdownContent` + submitted values display for previous assistant messages - Updated `NextMessageInput` disabled states to respect `isWaitting` (form submission in progress) ## Test plan - [x] Agent with UserFillUp component (e.g., email draft with send/edit/cancel options) shows interactive form in explore mode - [x] Same agent continues to work correctly in run (editor) mode - [x] Form submission sends data back to the agent and workflow continues - [x] Input field is disabled while waiting for form submission - [ ] Agents without UserFillUp components are unaffected in explore mode 🤖 Generated with [Claude 
Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-06-29 09:45:17 +08:00
kpdev	212429bf9d	fix(api): gate sandbox artifact download on agent session ownership (#16169 ) Fixes #16168 ## Summary - Add session-scoped authorization for `GET /api/v1/documents/artifact/<filename>` - Allow download only when the artifact filename appears in the caller's `api_4_conversation` message and `UserCanvasService.accessible(dialog_id, user_id)` passes - Deny with generic `"Artifact not found."` before storage access (no cross-user enumeration) - Return 4xx when the blob is missing (existing behavior preserved) ## Approach Sandbox artifacts are runtime CodeExec outputs, not KB documents — this uses the same session gate pattern as `agent_chat_completion`, not `DocumentService.accessible`. ## Test plan - [x] Unit: denied when filename not referenced in user sessions - [x] Unit: denied when agent canvas is not accessible - [x] Unit: authorized user receives bytes; missing blob returns `"Artifact not found."` - [ ] `pytest test/testcases/test_web_api/test_document_app/test_document_metadata.py -k get_artifact` --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-06-29 09:45:16 +08:00
Renzo	6079ded70b	fix: require explicit anonymous webhook access (#14890 ) ### What problem does this PR solve? Fixes #14882 Agent webhook execution currently fails open when the saved webhook `security` block is missing/empty, or when `auth_type` is set to `none`. This allows unauthenticated webhook invocation without an explicit operator opt-in. This PR makes anonymous webhook access explicit: - Rejects missing or empty webhook security config. - Requires `allow_anonymous: true` when `auth_type` is `none`. - Preserves explicit anonymous webhooks by having the frontend serialize `allow_anonymous: true` when the user selects `None` auth. - Updates webhook unit tests to cover both denied implicit-anonymous configs and allowed explicit-anonymous configs. ### Type of change - [x] Bug Fix - [x] Security hardening - [x] Test ### Tests - [x] `ZHIPU_AI_API_KEY=dummy uv run python -m pytest --confcutdir=test/testcases/test_web_api/test_agent_app test/testcases/test_web_api/test_agent_app/test_agents_webhook_unit.py` - [x] `uv run ruff check api/apps/restful_apis/agent_api.py test/testcases/test_web_api/test_agent_app/test_agents_webhook_unit.py` - [x] `npm exec eslint src/pages/agent/utils.ts src/pages/agent/form/begin-form/schema.ts` --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-06-29 09:45:16 +08:00
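The fail-closed rule from the commit above fits in a few lines. This is an illustrative sketch rather than the handler in `agent_api.py`: `webhook_access` and its three return labels are hypothetical, and treating a config without `auth_type` as denied is an assumption the commit message does not spell out.

```python
def webhook_access(security):
    """Classify a saved webhook security block.

    Returns "anonymous", "authenticate", or "deny".
    """
    if not security:  # missing or empty config is rejected
        return "deny"
    auth_type = security.get("auth_type")
    if auth_type is None:
        return "deny"  # assumption: an absent auth_type counts as misconfigured
    if auth_type == "none":
        # Anonymous access needs an explicit opt-in from the operator.
        return "anonymous" if security.get("allow_anonymous") is True else "deny"
    return "authenticate"  # any other type goes through its own credential check
```

The comparison uses `is True` so that truthy strings such as `"false"` in a hand-edited config cannot enable anonymous access by accident.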
philluiz2323	43a9d53c72	fix(agent): enforce tenant ownership on agentbots completions/inputs (#15457 ) ### What problem does this PR solve? Fixes #15456. The SDK agent-bot routes `POST /api/v1/agentbots/<agent_id>/completions` and `GET /api/v1/agentbots/<agent_id>/inputs` (`api/apps/restful_apis/bot_api.py`) authenticate the caller with a beta API token — which only yields the caller's `tenant_id` — but then load and run the agent named in the URL without verifying the agent belongs to the caller's tenant. `UserCanvasService.get_agent_dsl_with_release` even accepts a `tenant_id` it never uses, and `begin_inputs` calls `get_by_id` directly. Any holder of a single valid beta token could therefore run another tenant's agent (leaking its DSL/prompts/tool config) or read another tenant's agent metadata and begin input form, just by substituting a victim `agent_id`. This PR adds the project's existing ownership gate, `UserCanvasService.accessible(agent_id, tenant_id)`, to both endpoints right after token authentication — mirroring the checks already enforced on the equivalent first-party routes in `api/apps/restful_apis/agent_api.py` (lines 75/578/775) and on the sibling `chatbot_completions` / `create_agent_session` / `delete_agent_session` handlers in the same file. On failure it returns the same `Can't find agent by ID: <id>` message already used by `begin_inputs`, so it does not reveal whether an `agent_id` exists in another tenant. Added a regression test (`test/unit_test/api/apps/restful_apis/test_agentbots_access_control.py`, following the existing stubbed-loader pattern from `test_get_agent_session.py`) asserting that an inaccessible `agent_id` is rejected before the agent is loaded (`begin_inputs`) or executed (`completions`), and that an accessible agent still proceeds. 
### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-06-29 09:45:16 +08:00
Rene Arredondo	7ecc0908ef	fix(agent): authenticate "Thinking" button in shared/embedded chat via beta token (#14985 ) (#15238 ) ## Summary Fixes #14985 — clicking the Thinking button in a shared/embedded chat returns 401 and bounces the user to the login page, even though the same share page can chat with the agent just fine. ## Root cause In shared chat, `useGetSharedChatSearchParams` binds `conversationId` to the URL's `shared_id` query param — which is the beta APIToken, not the real agent id. That `conversationId` propagates through the component tree: ```tsx <WorkFlowTimeline canvasId={conversationId}> → useFetchMessageTrace(canvasId) → GET /api/v1/agents/<sharedId>/logs/<messageId> ``` But `/agents/<agent_id>/logs/<message_id>` is decorated with `@login_required` (`api/apps/restful_apis/agent_api.py:842-846`). The share page only holds the beta token — there is no session JWT — so the request 401s and quart-auth redirects to the login page. The reporter's server log matches exactly: ``` load_user from jwt got exception No b'.' found in value load_user: No APIToken found for token=ULG10SWG3E... Unauthorized request (quart_auth) GET /api/v1/agents/394013f8d42211f0bad6123fa55e8ed9/logs/96fd72e2-... 1.1 401 ``` The `394013f8...` segment in the URL is the `shared_id` (beta token), not an actual agent id. `_load_user` already accepts the regular `APIToken.token` field, but not `APIToken.beta`, by design — beta is a much weaker share-link credential than a personal API key. The sibling endpoints `/agentbots/<id>/completions` and `/agentbots/<id>/inputs` already use the right auth pattern for this scope (beta-token via `_get_sdk_authorization_token` → `APIToken.query(beta=token)`). Trace just didn't have a parallel. ## Fix ### Backend (`api/apps/restful_apis/bot_api.py`) Added a beta-token sibling endpoint: ``` GET /api/v1/agentbots/<shared_id>/logs/<message_id> ``` - Same auth shape as the existing `agentbots` endpoints. 
- The `<shared_id>` path segment is a client-supplied label only. The real `agent_id` used to build the Redis key (`<agent_id>-<message_id>-logs`) is taken from `APIToken.dialog_id` on the looked-up token, so the endpoint never trusts client-supplied identifiers for the data lookup. - Returns the same `{data: ...}` shape as the existing `/agents/<id>/logs/<message_id>` endpoint, so the frontend doesn't need to reshape the response. ### Frontend - `web/src/utils/api.ts`: added `sharedTrace(sharedId, messageId)` URL builder. - `web/src/services/agent-service.ts`: added `fetchSharedTrace({ shared_id, message_id })`. - `web/src/hooks/use-agent-request.ts`: `useFetchMessageTrace` takes an optional `isShare` argument. When set, it calls `fetchSharedTrace`; `isShare` is also folded into the `queryKey` so the two modes never share cached results. - `web/src/pages/agent/log-sheet/workflow-timeline.tsx`: forwards the already-existing `isShare` prop into the hook. All other existing call sites of `useFetchMessageTrace` (webhook timeline, pipeline log, dataflow result) pass no `isShare` argument → undefined → falsy → unchanged behavior. ## Test plan - [ ] In the regular Agent UI (logged-in user): open the trace / log sheet for any message and click into "Thinking" — the timeline should still load via `/agents/<id>/logs/<msg>`, same as before. - [ ] From the Agent page, click Chat in new tab to open `/chat/share?shared_id=<token>&from=agent`. Send a message, wait for a response, then click Thinking on the assistant turn. The trace panel should load instead of redirecting to the login page. - [ ] Same flow but with the agent embedded in an iframe ("Embed into webpage") — confirm there is no login redirect. - [ ] In DevTools → Network, confirm the share-chat trace request goes to `/api/v1/agentbots/<sharedId>/logs/<msgId>` and returns 200 with the same JSON shape as the logged-in path. 
- [ ] Confirm the chat completions, inputs, and upload flows in the share page still work — they were not touched. - [ ] Send a bogus / expired beta token to the new endpoint and confirm it returns the standard "Authentication error: API key is invalid!" response (no traceback, no 500). - [ ] Run `uv run pytest` to make sure no existing tests regress. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): --------- Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-06-29 09:45:16 +08:00
jony376	7b81f63653	fix(agent): bind session_id to path agent_id on GET/DELETE agent sessions (#15374 ) ## Related issues Closes #15128 ### What problem does this PR solve? `GET` and `DELETE` `/api/v1/agents/<agent_id>/sessions/<session_id>` verified canvas access for `agent_id` in the URL but loaded/deleted sessions only by `session_id`, without checking `conv.dialog_id == agent_id`. Any user with access to any agent could read or delete another agent's `API4Conversation` session (messages, references, DSL, etc.) when they knew the session UUID. Agent completions in the same file already enforce this binding; chat sessions do too — these two routes were inconsistent. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue) - [ ] New Feature (non-breaking change which adds functionality) - [ ] Documentation Update - [ ] Refactoring - [ ] Performance Improvement - [ ] Other (please describe): ### Changes \| File \| Change \| \|------\|--------\| \| `api/apps/restful_apis/agent_api.py` \| Require `conv.dialog_id == agent_id` in `get_agent_session` and `delete_agent_session_item`; return generic `"Session not found!"` on mismatch \| \| `test/unit_test/api/apps/restful_apis/test_get_agent_session.py` \| Add IDOR regression tests for GET/DELETE; fix success fixture to include `dialog_id`; track `delete_by_id` calls \| ### Test plan - [x] Unit tests added for GET/DELETE IDOR and success paths - [ ] `pytest test/unit_test/api/apps/restful_apis/test_get_agent_session.py` Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>	2026-06-29 09:45:16 +08:00
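The binding check from the commit above (a session is only visible through the agent it belongs to) can be sketched with an in-memory store. `load_session` and the dictionary layout are hypothetical; the `dialog_id == agent_id` comparison and the generic `"Session not found!"` message come from the commit message.

```python
def load_session(sessions, agent_id, session_id):
    """Return the session only if it belongs to the agent named in the path."""
    conv = sessions.get(session_id)
    # Same error for "missing" and "belongs to another agent", so the
    # response does not reveal whether a session id exists elsewhere.
    if conv is None or conv.get("dialog_id") != agent_id:
        raise LookupError("Session not found!")
    return conv


sessions = {
    "sess-1": {"dialog_id": "agent-a", "messages": ["hi"]},
    "sess-2": {"dialog_id": "agent-b", "messages": ["secret"]},
}
```

Calling `load_session(sessions, "agent-a", "sess-2")` raises even though the caller may legitimately access `agent-a`, which is the cross-agent read the commit closes.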
Zhichang Yu	faef22c18a	Harden closed-advisory fixes (#16409 ) ## Summary - harden reopened advisory fixes across REST connector, invoke, document downloads, and markdown rendering - add targeted regression coverage for redirect-safe SSRF handling, invoke SSRF checks, document access control, and markdown sanitization - verify each referenced GHSA against the original GitHub advisory text and align the closed-advisory plan with the implemented remediation ## What changed - add tenant access checks to document download endpoints to avoid cross-tenant document disclosure - add per-hop SSRF validation, DNS pinning, redirect handling, and redirect limits to the REST API connector - ensure invoke requests validate and pin the resolved host and never follow redirects implicitly - keep the generic rate-limited request path wrapped, not just GET and POST helpers - sanitize markdown HTML before rendering in the highlight markdown component ## Validation - `cd web && npm test -- --runInBand src/components/highlight-markdown/__tests__/index.test.tsx` - `.venv/bin/python -m pytest -q test/unit_test/data_source/test_rest_api_connector.py` - targeted `test/testcases/test_web_api/...` unit additions were reviewed, but the suite cannot be executed end-to-end in this environment because parent `test/testcases/conftest.py` requires a local service on `127.0.0.1:9380` ## Notes - all GHSA entries referenced by the plan were checked against the original GitHub advisory text, not sampled - the closed-advisory plan document was updated locally during review, but is intentionally not included in this PR	2026-06-29 09:45:16 +08:00
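Per-hop SSRF validation with DNS pinning and a redirect limit, as listed in the commit above, might look roughly like the sketch below. It is not the connector's implementation: `follow_safely`, the injected `resolve` and `fetch` callables, and the limit of 5 are all assumptions chosen so the logic can run without network access.

```python
import ipaddress
from urllib.parse import urljoin, urlparse

MAX_REDIRECTS = 5  # hypothetical limit
REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_public(ip):
    """True only for globally routable addresses (no loopback, private, link-local)."""
    return ipaddress.ip_address(ip).is_global


def follow_safely(url, resolve, fetch, max_redirects=MAX_REDIRECTS):
    """Follow redirects manually, validating every hop.

    resolve(host) -> IP string; fetch(url, ip) -> (status, location).
    fetch receives the already-validated IP so the connection is pinned to
    the address that was checked, not to a second DNS lookup.
    """
    for _ in range(max_redirects + 1):
        host = urlparse(url).hostname
        if host is None:
            raise ValueError("URL has no host")
        ip = resolve(host)
        if not is_public(ip):
            raise ValueError(f"blocked address for {host}")
        status, location = fetch(url, ip)
        if status not in REDIRECT_CODES:
            return url, status
        url = urljoin(url, location)  # next hop is validated on the next loop
    raise ValueError("too many redirects")
```

Validating only the first URL is not enough, because a public host can answer with a redirect to an internal address such as a cloud metadata endpoint; checking inside the loop covers that case.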
Zhichang Yu	0c3952147c	fix(codeql): close remaining 44 CodeQL alerts post-merge (#16408 ) ## Summary After #16407 merged, 44 of the original 93 CodeQL alerts were still open on the default branch. This PR closes the remaining ones by: 1. Moving 32 existing `// codeql[...]` directives so they sit on the line immediately before the suppressed statement. The original multi-line suppression blocks had the directive as the first line, with the rationale on subsequent lines. After line shifts (refactors, linter reformat), the directive ended up several lines above the alert location — CodeQL only recognizes the suppression when it appears on the line directly above. (32 alerts across 27 files.) 2. Adding 9 new `// codeql[...]` suppressions for alerts that had no suppression in the preceding lines at all — mostly real-fixes that CodeQL conservatively still flags (filepath.Base, bounded slice sizes, model-identifier strings, the MD5-legacy-migration lookup in `conversation_service.py`). ## Files changed - `api/db/services/conversation_service.py` — add `py/weak-sensitive-data-hashing` suppression (MD5 for backward-compat legacy row lookup; not used for auth) - `api/db/services/llm_service.py` — 3× `py/clear-text-logging-sensitive-data` suppressions on the lines that log `llm_name` in warnings/info - `common/misc_utils.py` — 2× `py/clear-text-logging-sensitive-data` suppressions on the redacted `current_url` log sites - `internal/agent/component/invoke.go` — moved existing `go/request-forgery` directive - `internal/agent/sandbox/ssh.go` — moved existing `go/command-injection` directive - `internal/agent/tool/retrieval_service.go` — added `go/uncontrolled-allocation-size` suppression (`topN` is bounded to 1024 above) - `internal/cli/common_command.go` — moved 2× `go/disabled-certificate-check` directives - `internal/cli/user_command.go` — added `go/clear-text-logging` suppression (filepath.Base already strips user-identifying path) - `internal/dao/pipeline_operation_log.go` 
— moved 2× `go/sql-injection` directives - `internal/dao/user_canvas.go` — added `go/sql-injection` suppression in `GetList` (the new `userCanvasOrderClause` call path) - `internal/engine/infinity/chunk.go` — moved existing `go/unsafe-quoting` directive - `internal/entity/models/` — moved `go/path-injection` directives (15 files) - `internal/handler/oauth_login.go` — moved existing `go/cookie-httponly-not-set` directive - `internal/handler/tenant.go` — moved existing `go/path-injection` directive - `internal/service/deep_researcher.go` — moved existing `go/unsafe-quoting` directive - `internal/service/dataset.go` — added `go/uncontrolled-allocation-size` suppression (`n` bounded to 1024 above) - `internal/service/file.go` — moved existing `go/request-forgery` directive - `internal/service/langfuse.go` — moved 2× `go/request-forgery` directives - `internal/utility/mcp_client.go` — moved 3× `go/request-forgery` directives - `internal/utility/smtp.go` — moved existing `go/email-injection` directive - `rag/prompts/generator.py` — added `py/clear-text-logging-sensitive-data` suppression - `web/.../use-provider-fields.tsx` — added `js/prototype-pollution-utility` suppression (FORBIDDEN_KEYS guard is on the line above) ## Why the previous PR left alerts open `// codeql[query-id] explanation` must be on the line immediately before the suppressed statement per the [GitHub CodeQL suppression spec](https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/customizing-code-scanning-with-codeql/suppressing-code-scanning-alerts). The original suppression blocks were 4-5 lines, with the directive as the first line. After linter reformat / line shifts, the directive ended up too far above the actual alert line to be recognized. The fix is to put the directive on the line directly above the suppressed statement, with the rationale above it.
## Test plan - All 9 modified Python files `ast.parse` clean - All 4 modified Go files `gofmt` clean - 36/44 expected alert suppressions in place - 8 remaining CodeQL alerts are the originals (#3485851828, #3485851831, #3485869759, #3485869766, #3485869768, #3485869771, #3485885962, #3485895527) which were resolved by the corresponding commit comments; these should close on the next scan when the suppression comments match the alert lines. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-29 09:45:16 +08:00
Zhichang Yu	195bfffb5e	fix(security): address 93 CodeQL code-scanning alerts across 61 files (#16407 ) ## Summary Resolves all 93 open alerts at https://github.com/infiniflow/ragflow/security/code-scanning by rule: \| Rule \| Count \| Treatment \| \|------\|-------\|-----------\| \| py/clear-text-logging-sensitive-data \| 23 \| Real fix — log scrubbing \| \| go/path-injection \| 15 \| Real fix where possible, suppression with rationale \| \| go/request-forgery \| 8 \| Suppression with rationale (operator-controlled URLs) \| \| go/clear-text-logging \| 10 \| Real fix — log scrubbing \| \| go/unsafe-quoting \| 5 \| Real fix — escape or refactor \| \| go/sql-injection \| 3 \| Real fix — orderby whitelist + CodeQL comment \| \| go/uncontrolled-allocation-size \| 2 \| Real fix — cap to 1024 \| \| go/incorrect-integer-conversion \| 3 \| Real fix — ParseInt + range check \| \| go/insecure-hostkeycallback \| 1 \| Real fix — known_hosts file \| \| go/disabled-certificate-check \| 2 \| Suppression with rationale \| \| go/command-injection \| 1 \| Suppression (sanitized via shq()) \| \| go/email-injection \| 1 \| Suppression with rationale \| \| go/cookie-httponly-not-set \| 1 \| Suppression (SPA bootstrap) \| \| js/stack-trace-exposure \| 1 \| Real fix — generic client message \| \| js/prototype-pollution-utility \| 1 \| Real fix — reject __proto__/constructor/prototype \| \| py/weak-sensitive-data-hashing \| 1 \| Real fix — MD5 → SHA-256 \| \| py/incomplete-url-substring-sanitization \| 3 \| Real fix — urlparse(hostname) \| \| py/paramiko-missing-host-key-validation \| 1 \| Real fix — load_system_host_keys + RejectPolicy \| \| cpp/integer-multiplication-cast-to-long \| 2 \| Real fix — cast to size_t \| ## Real fixes (with measurable security improvement) SSH host key verification (Go + Python) Replace `InsecureIgnoreHostKey()` / `paramiko.AutoAddPolicy()` with proper host key verification against a known_hosts file (configurable via `SSH_KNOWN_HOSTS` env / `known_hosts` 
config field; fail-closed when unset). Loads `~/.ssh/known_hosts` first via `load_system_host_keys()` so existing setups keep working. SQL injection in `user_canvas` Add `userCanvasOrderableColumns` whitelist + `userCanvasOrderClause` helper. Both `GetList()` and `ListByTenantIDs()` now route the user-supplied `orderby` query param through the helper, defaulting to `create_time` on miss. SQL injection in `pipeline_operation_log` Existing whitelist documented via CodeQL comment. Real SQL injection in `infinity/chunk.go:931` Escape `'` → `''` on user-controlled `questionText` before splicing into `filter_fulltext(...)` SQL filter. Real SQL injection in `elasticsearch/sql.go:75` Defense-in-depth escape on tokenizer output before splicing into `MATCH(...)`. Python code injection in `result_protocol.go` Replace raw JSON literal embedding into Python/JS expressions with base64 + `json.loads` / `JSON.parse(Buffer.from(..., 'base64').toString('utf8'))`. Eliminates both the unsafe-quoting sink and the brittleness of mixing JSON true/false/null with Python syntax. URL substring check bypass in `embedding_model.py` Replace `if "dashscope-intl.aliyuncs.com" in u` with `urlparse(u).hostname == "dashscope-intl.aliyuncs.com"` so a base_url like `https://attacker.example/?u=dashscope-intl.aliyuncs.com` cannot bypass the routing. Prototype pollution in `setNestedValue` (TS) Reject `__proto__`/`constructor`/`prototype` keys before any assignment. Integer overflow - scrypt params via `ParseInt` + non-positive check (`internal/common/password.go`) - `topN` and `n` caps to 1024 (retrieval_service.go, dataset.go) - `nallocstatesize` cast to `size_t` (cpp/re2/onepass.cc) Cookie httponly Set explicitly with rationale: this is the OAuth bootstrap cookie intentionally read by the SPA. Stack trace exposure Replace `error.message` in HTTP 500 response with generic `"internal error"`; full error still logged server-side via `console.error`.
Weak hashing MD5 → SHA-256 for deterministic `conv_id` derivation (`conversation_service.py`). Log scrubbing Remove or redact user-controlled / sensitive content from clear-text logs across 8 ingestion parsers, `llm_service.py` ×11, `tenant_llm_service.py` ×7, `misc_utils.py` ×4, `redis_conn.py` ×10, `conftest.py` ×4, `init_data.py`, `dataset_api_service.py`, `generator.py`, `mysql_migration.py`, `cli.go`, `user_command.go`, `pdf_parser.go`. Most patterns converted to parameterized logging (`logging.info("...: %d", n)`) or static messages. ## CodeQL suppressions (each with rationale) For alerts where the data flow is genuinely safe but CodeQL can't see the context — operator-controlled URLs, sanitized inputs, etc. — I added `// codeql[go/<rule>] <rationale>` annotations rather than dismissing them, so future readers can audit the rationale inline: - `internal/agent/component/invoke.go:135` — Invoke is a generic canvas HTTP client - `internal/service/langfuse.go` ×2 — host is per-tenant operator config - `internal/service/file.go:1184` — already SSRF-guarded by `assertURLSafe` - `internal/utility/mcp_client.go` ×3 — already `AssertURLSafe` + IP-pinned - `internal/entity/models/bedrock.go` — sigv4-signed request, URL can't be tampered - `internal/service/deep_researcher.go:269` — `callback` is SSE display string, not SQL - `internal/engine/infinity/chunk.go:346` — UUIDs can't contain `'` (RFC 4122) - `internal/cli/common_command.go` ×2 — CLI trusts operator-configured URL - `internal/utility/smtp.go:194` — msg is server-built, not user form input - `internal/entity/models/*` ×14 (path-injection) — audio file paths are caller-supplied ## Test plan - ✅ All 13 modified Go packages build cleanly - ✅ 663 tests pass across `internal/agent/sandbox`, `internal/common`, `internal/agent/component`, `internal/engine/infinity`, `internal/dao` - ✅ All 11 modified Python files parse via `ast.parse` - ✅ TypeScript `tsc --noEmit` clean on the modified `use-provider-fields.tsx` - ✅ 
`node --check` clean on the modified JS file 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-29 09:45:16 +08:00
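The URL substring check bypass described in commit 195bfffb5e can be illustrated with a small sketch. This is not the code from `embedding_model.py`; the function names and the `/api/v1` path below are hypothetical, and only the host name and the spoofed URL are taken from the commit message.

```python
from urllib.parse import urlparse

INTL_HOST = "dashscope-intl.aliyuncs.com"


def is_intl_by_substring(base_url: str) -> bool:
    """Old-style check: true if the host name appears anywhere in the URL."""
    return INTL_HOST in base_url


def is_intl_by_hostname(base_url: str) -> bool:
    """Stricter check: compare only the parsed hostname component."""
    return urlparse(base_url).hostname == INTL_HOST


# Hypothetical legitimate endpoint (the path is an assumption).
legit = "https://dashscope-intl.aliyuncs.com/api/v1"
# Spoofed URL from the commit message: the host name sits in the query string.
spoofed = "https://attacker.example/?u=dashscope-intl.aliyuncs.com"

for url in (legit, spoofed):
    print(url, is_intl_by_substring(url), is_intl_by_hostname(url))
```

The substring check accepts both URLs, while the hostname comparison accepts only the first, because `urlparse` reports `attacker.example` as the host of the spoofed URL.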
Zhichang Yu	477f2fcebd	feat[Go]: port agent webhook trigger, agent file upload/download, component input-form + debug endpoints from Python (#16403) - [x] New Feature (non-breaking change which adds functionality)	2026-06-29 09:45:16 +08:00
Zhichang Yu	f58fae5fb7	feat(go-agent): Ported retrieval node, added Keenable web search tool (#16396) - [x] New Feature (non-breaking change which adds functionality)	2026-06-29 09:45:16 +08:00
Wang Qi	638b59fbcd	Fix handle move file failed (#16384 ) Follow on PR: #16350	2026-06-26 18:46:21 +08:00
Wang Qi	985e3c1db5	Fix document progress not set to fail when embedding model error (#16381 )	2026-06-26 16:11:54 +08:00
Harsh Kashyap	8d3c3f868c	fix(api): validate immutable document fields when value is zero (#16309 )	2026-06-25 19:29:12 +08:00
Harsh Kashyap	49312cace3	fix(api): align use_sql Markdown separator with Source header (#16317 )	2026-06-25 19:00:01 +08:00
Wang Qi	ac9469e5f5	Fix add VLLM without apikey will fail (#16352 )	2026-06-25 17:17:29 +08:00
Idriss Sbaaoui	fb8e5ad4b2	Fix multimodal chat image routing for VLM channel requests (#16343 )	2026-06-25 14:38:29 +08:00
buua436	479a9a715e	feat: unify provider id or name routing (#16336 )	2026-06-25 13:04:21 +08:00
Wang Qi	d0fc75f1bb	Fix when empty response not set, it report: ERROR: 'knowledge' (#16338 )	2026-06-25 13:02:24 +08:00
kpdev	68d2ca0ff1	fix(api): use dataset-owner tenant for legacy /chunks docstore cleanup (#15961 )	2026-06-24 14:24:40 +08:00
Ambercssa	e9cdd09b67	fix(agent): handle different reference data formats (#16276 )	2026-06-24 13:33:59 +08:00
Wang Qi	6046bc6a8e	Fix: handle empty folder when link to datasets (#16296 )	2026-06-24 13:31:32 +08:00
Ju Boxiang	39b194453d	Fix: paginate get_flatted_meta_by_kbs to support datasets with >10k documents (#16034 ) (#16095 )	2026-06-24 13:20:07 +08:00
ちー	5928b8b9ae	fix(document_service): prevent NoneType error on progress_msg.strip() (#16289)

### What problem does this PR solve?

When I run RAGFlow_server.py:

```
2026-06-24 10:27:01,938 ERROR 3413485 fetch task exception
Traceback (most recent call last):
  File "/home/infiniflow/Documents/development/ragflow/api/db/services/document_service.py", line 948, in _sync_progress
    if t.progress_msg.strip():
       ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'strip'
```

fixed:

```python
# before:
if t.progress_msg.strip():
# fix:
if (t.progress_msg or "").strip():
```

Fix crash in `_sync_progress` when `progress_msg` is `None`.

#### Root Cause

`progress_msg` from task records can be `None`, causing the `AttributeError` shown above.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-24 13:07:40 +08:00
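The `(t.progress_msg or "").strip()` pattern from commit 5928b8b9ae can be tried in isolation with the sketch below. The `Task` class and the `has_progress_msg` helper are hypothetical stand-ins, not RAGFlow's actual task model or the `_sync_progress` method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical stand-in for a task record whose message may be None."""
    progress_msg: Optional[str] = None


def has_progress_msg(task: Task) -> bool:
    """Return True only for a non-empty, non-whitespace message.

    `task.progress_msg or ""` turns None into an empty string, so the
    following .strip() call can no longer raise AttributeError.
    """
    return bool((task.progress_msg or "").strip())


print(has_progress_msg(Task()))           # message is None
print(has_progress_msg(Task("   ")))      # whitespace only
print(has_progress_msg(Task("parsing")))  # real message
```

Calling `.strip()` directly on `Task().progress_msg` would raise the `AttributeError` quoted in the commit message. With the `or ""` guard, a `None` message is treated the same as an empty one.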
buua436	ba4021a9de	fix: restore dataflow rerun and detail payload (#16292 )	2026-06-24 13:06:06 +08:00
buua436	d5d9d19fbe	fix: keep chat channel bindings consistent (#16274 )	2026-06-24 11:51:35 +08:00
Wang Qi	a4f325be24	Fix: add /v1/document/upload_info -> /api/v1/documents/upload back (#16264 )	2026-06-23 17:47:55 +08:00
buua436	aba5d172bd	feat: add whatsapp web qr chat channel (#16238 ) Adds a WhatsApp chat channel backed by a QR-based web login flow so users can connect without manual token setup.	2026-06-23 17:45:31 +08:00
buua436	b409cfc3d5	feat: add dingtalk chat channel (#16183 ) ### What does this PR do? This PR adds a new DingTalk chat channel integration and hardens the inbound callback path. ### Summary - Adds DingTalk as a selectable chat channel in the UI and backend channel registry. - Adds the DingTalk chat channel icon asset. - Acknowledges DingTalk Stream callbacks and deduplicates repeated inbound messages to avoid duplicate replies.	2026-06-18 20:06:00 +08:00
Wang Qi	5ca1686ac7	Fix that agent cannot be the same name (#16192)	2026-06-18 19:10:21 +08:00
qinling0210	563d855780	Implement OpenAI chat completions in GO (#16177 ) ### What problem does this PR solve? Implement OpenAI chat completions in GO POST /api/v1/openai/<chat_id>/chat/completions OpenAI chat cli: internal/development.md ### Type of change - [x] Refactoring	2026-06-18 18:07:27 +08:00
buua436	a2de7d0060	fix: chat channel defaults and feishu shutdown (#16176 ) This PR keeps the chat-channel default values and Feishu shutdown behavior consistent after the rebase.	2026-06-18 17:44:48 +08:00
Lynn	47bd9dd049	Fix: replace tenant_llm apis (#16131 ) Replace tenant_llm apis with provider-instance apis.	2026-06-18 16:38:32 +08:00
buua436	ea70663f09	feat: support wecom websocket channel (#16175 ) Added WeCom chat channel websocket mode alongside the existing webhook mode, plus frontend support for selecting the connection type.	2026-06-18 13:10:09 +08:00
buua436	43d121ad38	feat: add qqbot chat channel (#16140 ) ### What problem does this PR solve? Adds qqbot as a built-in chat channel so it can be discovered and started by the channel bootstrapper and shown in the chat channel settings UI. ### Type of change - [x] New Feature (non-breaking change which adds functionality)	2026-06-17 18:49:38 +08:00
buua436	be869f5d96	fix: chat channel runtime (#16129 ) ### What problem does this PR solve? Fix chat channel message routing to use the connected `chat_id`, and make the Feishu websocket client bind to the thread-local event loop. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-17 15:52:13 +08:00
buua436	78b4906f7a	fix: tighten embedding truncation threshold (#16123 ) ### What problem does this PR solve? Use a 95% max_length threshold before truncating embedding inputs, which reduces the chance of provider-side invalid-parameter errors on near-limit chunks. ### Type of change - [x] Bug Fix (non-breaking change which fixes an issue)	2026-06-17 14:18:02 +08:00
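The 95% threshold mentioned in commit 78b4906f7a can be sketched as below. This is an illustration only: the function name, the token-list input, and the integer-percent parameter are assumptions, and the actual change may count tokens differently or operate on raw text.

```python
from typing import List


def truncate_for_embedding(tokens: List[str], max_length: int, percent: int = 95) -> List[str]:
    """Keep at most `percent` % of `max_length` tokens (hypothetical helper).

    Integer arithmetic avoids floating-point rounding surprises when
    computing the limit.
    """
    limit = max_length * percent // 100
    return tokens[:limit]


# A chunk exactly at a provider limit of 100 tokens is cut to 95 tokens.
near_limit = ["tok"] * 100
print(len(truncate_for_embedding(near_limit, max_length=100)))

# A short chunk is returned unchanged.
short = ["tok"] * 10
print(len(truncate_for_embedding(short, max_length=100)))
```

The margin below `max_length` gives some slack when the local token count and the provider's own count disagree slightly, which matches the stated goal of reducing invalid-parameter errors on near-limit chunks.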
euvre	9bd53ce675	fix: return full record in get_ingestion_log (#16120)

### What problem does this PR solve?

The `get_ingestion_log` endpoint (both Python `dataset_api_service.get_ingestion_log` and Go `DatasetService.GetIngestionLog`) was returning only the dataset-level field set, which omits critical fields such as `dsl`, `document_id`, `parser_id`, `document_name`, `pipeline_id`, etc. This caused the front-end dataflow-result page to be unable to render the pipeline timeline and chunks when viewing a single ingestion log, regardless of whether the log was a dataset-level operation (graph/raptor/mindmap) or a per-file parse.

### Background

`PipelineOperationLogService` provides two field sets:

| Method | Fields |
|---|---|
| `get_dataset_logs_fields` | Minimal set (progress, status, timestamps, etc.) |
| `get_file_logs_fields` | Superset: includes `document_id`, `dsl`, `parser_id`, `document_name`, `pipeline_id`, … |

When listing logs, the API correctly distinguishes dataset-level vs file-level logs and uses the appropriate converter. However, when fetching a single log by ID, both the Python and Go implementations were hardcoded to the dataset-level set, dropping the extra fields that the front-end needs.	2026-06-17 13:03:51 +08:00

1800 Commits