mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-07-01 16:25:44 +08:00
## Summary Resolves all 93 open alerts at https://github.com/infiniflow/ragflow/security/code-scanning by rule: | Rule | Count | Treatment | |------|-------|-----------| | py/clear-text-logging-sensitive-data | 23 | Real fix — log scrubbing | | go/path-injection | 15 | Real fix where possible, suppression with rationale | | go/request-forgery | 8 | Suppression with rationale (operator-controlled URLs) | | go/clear-text-logging | 10 | Real fix — log scrubbing | | go/unsafe-quoting | 5 | Real fix — escape or refactor | | go/sql-injection | 3 | Real fix — orderby whitelist + CodeQL comment | | go/uncontrolled-allocation-size | 2 | Real fix — cap to 1024 | | go/incorrect-integer-conversion | 3 | Real fix — ParseInt + range check | | go/insecure-hostkeycallback | 1 | Real fix — known_hosts file | | go/disabled-certificate-check | 2 | Suppression with rationale | | go/command-injection | 1 | Suppression (sanitized via shq()) | | go/email-injection | 1 | Suppression with rationale | | go/cookie-httponly-not-set | 1 | Suppression (SPA bootstrap) | | js/stack-trace-exposure | 1 | Real fix — generic client message | | js/prototype-pollution-utility | 1 | Real fix — reject __proto__/constructor/prototype | | py/weak-sensitive-data-hashing | 1 | Real fix — MD5 → SHA-256 | | py/incomplete-url-substring-sanitization | 3 | Real fix — urlparse(hostname) | | py/paramiko-missing-host-key-validation | 1 | Real fix — load_system_host_keys + RejectPolicy | | cpp/integer-multiplication-cast-to-long | 2 | Real fix — cast to size_t | ## Real fixes (with measurable security improvement) **SSH host key verification (Go + Python)** Replace `InsecureIgnoreHostKey()` / `paramiko.AutoAddPolicy()` with proper host key verification against a known_hosts file (configurable via `SSH_KNOWN_HOSTS` env / `known_hosts` config field; fail-closed when unset). Loads `~/.ssh/known_hosts` first via `load_system_host_keys()` so existing setups keep working. **SQL injection in `user_canvas`** Add `userCanvasOrderableColumns` whitelist + `userCanvasOrderClause` helper. Both `GetList()` and `ListByTenantIDs()` now route the user-supplied `orderby` query param through the helper, defaulting to `create_time` on miss. **SQL injection in `pipeline_operation_log`** Existing whitelist documented via CodeQL comment. **Real SQL injection in `infinity/chunk.go:931`** Escape `'` → `''` on user-controlled `questionText` before splicing into `filter_fulltext(...)` SQL filter. **Real SQL injection in `elasticsearch/sql.go:75`** Defense-in-depth escape on tokenizer output before splicing into `MATCH(...)`. **Python code injection in `result_protocol.go`** Replace raw JSON literal embedding into Python/JS expressions with base64 + `json.loads` / `JSON.parse(Buffer.from(..., 'base64').toString('utf8'))`. Eliminates both the unsafe-quoting sink and the brittleness of mixing JSON true/false/null with Python syntax. **URL substring check bypass in `embedding_model.py`** Replace `if "dashscope-intl.aliyuncs.com" in u` with `urlparse(u).hostname == "dashscope-intl.aliyuncs.com"` so a base_url like `https://attacker.example/?u=dashscope-intl.aliyuncs.com` cannot bypass the routing. **Prototype pollution in `setNestedValue` (TS)** Reject `__proto__`/`constructor`/`prototype` keys before any assignment. **Integer overflow** - scrypt params via `ParseInt` + non-positive check (`internal/common/password.go`) - `topN` and `n` caps to 1024 (retrieval_service.go, dataset.go) - `nalloc*statesize` cast to `size_t` (cpp/re2/onepass.cc) **Cookie httponly** Set explicitly with rationale: this is the OAuth bootstrap cookie intentionally read by the SPA. **Stack trace exposure** Replace `error.message` in HTTP 500 response with generic `"internal error"`; full error still logged server-side via `console.error`. **Weak hashing** MD5 → SHA-256 for deterministic `conv_id` derivation (`conversation_service.py`). **Log scrubbing** Remove or redact user-controlled / sensitive content from clear-text logs across 8 ingestion parsers, `llm_service.py` ×11, `tenant_llm_service.py` ×7, `misc_utils.py` ×4, `redis_conn.py` ×10, `conftest.py` ×4, `init_data.py`, `dataset_api_service.py`, `generator.py`, `mysql_migration.py`, `cli.go`, `user_command.go`, `pdf_parser.go`. Most patterns converted to parameterized logging (`logging.info("...: %d", n)`) or static messages. ## CodeQL suppressions (each with rationale) For alerts where the data flow is genuinely safe but CodeQL can't see the context — operator-controlled URLs, sanitized inputs, etc. — I added `// codeql[go/<rule>] <rationale>` annotations rather than dismissing them, so future readers can audit the rationale inline: - `internal/agent/component/invoke.go:135` — Invoke is a generic canvas HTTP client - `internal/service/langfuse.go` ×2 — host is per-tenant operator config - `internal/service/file.go:1184` — already SSRF-guarded by `assertURLSafe` - `internal/utility/mcp_client.go` ×3 — already `AssertURLSafe` + IP-pinned - `internal/entity/models/bedrock.go` — sigv4-signed request, URL can't be tampered - `internal/service/deep_researcher.go:269` — `callback` is SSE display string, not SQL - `internal/engine/infinity/chunk.go:346` — UUIDs can't contain `'` (RFC 4122) - `internal/cli/common_command.go` ×2 — CLI trusts operator-configured URL - `internal/utility/smtp.go:194` — msg is server-built, not user form input - `internal/entity/models/*` ×14 (path-injection) — audio file paths are caller-supplied ## Test plan - ✅ All 13 modified Go packages build cleanly - ✅ 663 tests pass across `internal/agent/sandbox`, `internal/common`, `internal/agent/component`, `internal/engine/infinity`, `internal/dao` - ✅ All 11 modified Python files parse via `ast.parse` - ✅ TypeScript `tsc --noEmit` clean on the modified `use-provider-fields.tsx` - ✅ `node --check` clean on the modified JS file 🤖 Generated with [Claude Code](https://claude.com/claude-code)
197 lines
7.5 KiB
Go
197 lines
7.5 KiB
Go
//
|
|
// Copyright 2026 The InfiniFlow Authors. All Rights Reserved.
|
|
//
|
|
// Licensed under the Apache License, Version 2.0 (the "License");
|
|
// you may not use this file except in compliance with the License.
|
|
// You may obtain a copy of the License at
|
|
//
|
|
// http://www.apache.org/licenses/LICENSE-2.0
|
|
//
|
|
// Unless required by applicable law or agreed to in writing, software
|
|
// distributed under the License is distributed on an "AS IS" BASIS,
|
|
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
// See the License for the specific language governing permissions and
|
|
// limitations under the License.
|
|
//
|
|
|
|
// result_protocol.go is the Go port of `agent/sandbox/result_protocol.py`.
|
|
//
|
|
// The contract:
|
|
//
|
|
// 1. The user's code is expected to define a `main(**args)` function
|
|
// (Python) or export a `main(args)` function (JavaScript).
|
|
// 2. The provider wraps the code in a small driver that calls main
|
|
// with the agent-supplied arguments and emits a marker line
|
|
// carrying main's return value as base64-JSON. The marker prefix
|
|
// is `__RAGFLOW_RESULT__:`. This is the ONLY line the agent code
|
|
// parser keeps from the synthesized output — the rest is the
|
|
// user's stdout, surfaced verbatim.
|
|
//
|
|
// The marker protocol is a contract with `executor_manager`
|
|
// (Python FastAPI service that runs the actual code). Renaming the
|
|
// marker is a wire-format break — `executor_manager` parses for this
|
|
// exact prefix. See `agent/sandbox/executor_manager/services/execution.py`
|
|
// for the Python side that depends on it.
|
|
|
|
package sandbox
|
|
|
|
import (
|
|
"encoding/base64"
|
|
"encoding/json"
|
|
"fmt"
|
|
"strings"
|
|
)
|
|
|
|
// resultMarkerPrefix is the wire-level marker the executor_manager
|
|
// Python service scans stdout for. Keep in sync with Python
|
|
// `agent/sandbox/result_protocol.py::RESULT_MARKER_PREFIX`.
|
|
const resultMarkerPrefix = "__RAGFLOW_RESULT__:"
|
|
|
|
// BuildPythonWrapper wraps a Python source so that:
|
|
//
|
|
// - When executed as `python -c <wrapped>`, the user-defined
|
|
// `main(**args)` is invoked with the JSON-decoded args.
|
|
// - main's return value is JSON-encoded, prefixed with the
|
|
// marker, and printed to stdout.
|
|
//
|
|
// argsJSON is base64-encoded and decoded inside Python via
|
|
// json.loads(base64.b64decode(...)). The base64 alphabet has no
|
|
// characters that conflict with Python syntax, so splicing the
|
|
// encoded string into a Python literal is safe. This avoids the
|
|
// fragility of embedding raw JSON directly (true/false/null vs
|
|
// Python's True/False/None) and removes the unsafe-quoting sink
|
|
// from CodeQL's view.
|
|
func BuildPythonWrapper(code, argsJSON string) string {
|
|
argsB64 := base64.StdEncoding.EncodeToString([]byte(argsJSON))
|
|
return code + `
|
|
|
|
if __name__ == "__main__":
|
|
import base64
|
|
import json
|
|
|
|
result = main(**json.loads(base64.b64decode("` + argsB64 + `").decode("utf-8")))
|
|
payload = json.dumps({"present": True, "value": result, "type": "json"}, ensure_ascii=False, separators=(",", ":"))
|
|
print("` + resultMarkerPrefix + `" + base64.b64encode(payload.encode("utf-8")).decode("ascii"))
|
|
`
|
|
}
|
|
|
|
// BuildJavaScriptWrapper wraps a JavaScript source so that:
|
|
//
|
|
// - When executed as `node -e <wrapped>`, the user-defined
|
|
// `main(args)` (or `module.exports.main`) is awaited with the
|
|
// JSON-decoded args object.
|
|
// - main's return value is JSON-encoded, prefixed with the
|
|
// marker, and printed to stdout.
|
|
//
|
|
// JavaScript lacks a "module" boundary in `node -e`, so we look for
|
|
// `main` in (a) the global scope and (b) `module.exports.main`,
|
|
// matching the Python wrapper.
|
|
//
|
|
// argsJSON is embedded as a base64 literal (alphabet contains no JS
|
|
// syntax-significant characters) and decoded at runtime via
|
|
// JSON.parse(Buffer.from(..., 'base64').toString('utf8')), so the
|
|
// only Go-side dataflow into the JS source is the base64 string.
|
|
func BuildJavaScriptWrapper(code, argsJSON string) string {
|
|
argsB64 := base64.StdEncoding.EncodeToString([]byte(argsJSON))
|
|
// Note: this string is *embedded inside* a Go raw string, but the
|
|
// Go raw string and the JS source are independent languages. We
|
|
// need the final JS to be valid; the doubled braces {{ }} are JS
|
|
// template-literal escapes only on the JS side. We pass them
|
|
// through as-is.
|
|
return code + `
|
|
|
|
const __ragflowArgsB64 = "` + argsB64 + `";
|
|
const __ragflowArgs = JSON.parse(Buffer.from(__ragflowArgsB64, 'base64').toString('utf8'));
|
|
|
|
(async () => {
|
|
const __ragflowMain = typeof main !== 'undefined' ? main : module.exports && module.exports.main;
|
|
if (typeof __ragflowMain !== 'function') {
|
|
throw new Error('main() must be defined or exported.');
|
|
}
|
|
const output = await Promise.resolve(__ragflowMain(__ragflowArgs));
|
|
if (typeof output === 'undefined') {
|
|
throw new Error('main() must return a value. Use null for an empty result.');
|
|
}
|
|
const payload = JSON.stringify({ present: true, value: output, type: 'json' });
|
|
if (typeof payload === 'undefined') {
|
|
throw new Error('main() returned a non-JSON-serializable value.');
|
|
}
|
|
console.log('` + resultMarkerPrefix + `' + Buffer.from(payload, 'utf8').toString('base64'));
|
|
})();
|
|
`
|
|
}
|
|
|
|
// ExtractStructuredResult scans stdout for the marker line, decodes
|
|
// the JSON payload after it, and returns the user-visible stdout
|
|
// (with the marker line removed) plus the parsed structured result.
|
|
//
|
|
// The Python side returns `(cleaned_stdout, structured_result_dict)`.
|
|
// On Go the dict is `map[string]any`.
|
|
//
|
|
// Edge cases (matching the Python implementation):
|
|
// - empty stdout → ("", empty map).
|
|
// - multiple marker lines → only the LAST one wins (later result
|
|
// overrides earlier). The Python implementation does the same
|
|
// because the loop overwrites `structured_result`.
|
|
// - undecodable payload → the marker line is kept in the cleaned
|
|
// stdout (the user gets to see the raw base64) and the map stays
|
|
// empty. Python's `except Exception: cleaned_lines.append(line)`
|
|
// does the same.
|
|
// - the trailing newline is preserved if the input had one.
|
|
func ExtractStructuredResult(stdout string) (string, map[string]any) {
|
|
if stdout == "" {
|
|
return "", map[string]any{}
|
|
}
|
|
|
|
cleanedLines := []string{}
|
|
structured := map[string]any{}
|
|
|
|
for _, line := range strings.Split(stdout, "\n") {
|
|
if strings.HasPrefix(line, resultMarkerPrefix) {
|
|
payloadB64 := strings.TrimSpace(line[len(resultMarkerPrefix):])
|
|
if payloadB64 == "" {
|
|
cleanedLines = append(cleanedLines, line)
|
|
continue
|
|
}
|
|
raw, err := base64.StdEncoding.DecodeString(payloadB64)
|
|
if err != nil {
|
|
cleanedLines = append(cleanedLines, line)
|
|
continue
|
|
}
|
|
var decoded map[string]any
|
|
if err := json.Unmarshal(raw, &decoded); err != nil {
|
|
cleanedLines = append(cleanedLines, line)
|
|
continue
|
|
}
|
|
structured = decoded
|
|
continue
|
|
}
|
|
cleanedLines = append(cleanedLines, line)
|
|
}
|
|
|
|
cleaned := strings.Join(cleanedLines, "\n")
|
|
if strings.HasSuffix(stdout, "\n") && cleaned != "" && !strings.HasSuffix(cleaned, "\n") {
|
|
cleaned += "\n"
|
|
}
|
|
return cleaned, structured
|
|
}
|
|
|
|
// argsToJSON is a small helper used by the providers to build the
|
|
// args string the wrapper expects. Empty/nil maps serialize to "{}"
|
|
// so the wrapper can always json.loads safely.
|
|
func argsToJSON(args map[string]any) (string, error) {
|
|
if args == nil {
|
|
return "{}", nil
|
|
}
|
|
// json.Marshal of a nil map produces "null" — replace with "{}"
|
|
// so the wrappers see an object literal in both languages.
|
|
b, err := json.Marshal(args)
|
|
if err != nil {
|
|
return "", fmt.Errorf("sandbox: marshal args: %w", err)
|
|
}
|
|
if string(b) == "null" {
|
|
return "{}", nil
|
|
}
|
|
return string(b), nil
|
|
}
|