mirror of
https://github.com/infiniflow/ragflow.git
synced 2026-07-03 09:11:59 +08:00
## Summary Resolves all 93 open alerts at https://github.com/infiniflow/ragflow/security/code-scanning by rule: | Rule | Count | Treatment | |------|-------|-----------| | py/clear-text-logging-sensitive-data | 23 | Real fix — log scrubbing | | go/path-injection | 15 | Real fix where possible, suppression with rationale | | go/request-forgery | 8 | Suppression with rationale (operator-controlled URLs) | | go/clear-text-logging | 10 | Real fix — log scrubbing | | go/unsafe-quoting | 5 | Real fix — escape or refactor | | go/sql-injection | 3 | Real fix — orderby whitelist + CodeQL comment | | go/uncontrolled-allocation-size | 2 | Real fix — cap to 1024 | | go/incorrect-integer-conversion | 3 | Real fix — ParseInt + range check | | go/insecure-hostkeycallback | 1 | Real fix — known_hosts file | | go/disabled-certificate-check | 2 | Suppression with rationale | | go/command-injection | 1 | Suppression (sanitized via shq()) | | go/email-injection | 1 | Suppression with rationale | | go/cookie-httponly-not-set | 1 | Suppression (SPA bootstrap) | | js/stack-trace-exposure | 1 | Real fix — generic client message | | js/prototype-pollution-utility | 1 | Real fix — reject __proto__/constructor/prototype | | py/weak-sensitive-data-hashing | 1 | Real fix — MD5 → SHA-256 | | py/incomplete-url-substring-sanitization | 3 | Real fix — urlparse(hostname) | | py/paramiko-missing-host-key-validation | 1 | Real fix — load_system_host_keys + RejectPolicy | | cpp/integer-multiplication-cast-to-long | 2 | Real fix — cast to size_t | ## Real fixes (with measurable security improvement) **SSH host key verification (Go + Python)** Replace `InsecureIgnoreHostKey()` / `paramiko.AutoAddPolicy()` with proper host key verification against a known_hosts file (configurable via `SSH_KNOWN_HOSTS` env / `known_hosts` config field; fail-closed when unset). Loads `~/.ssh/known_hosts` first via `load_system_host_keys()` so existing setups keep working. **SQL injection in `user_canvas`** Add `userCanvasOrderableColumns` whitelist + `userCanvasOrderClause` helper. Both `GetList()` and `ListByTenantIDs()` now route the user-supplied `orderby` query param through the helper, defaulting to `create_time` on miss. **SQL injection in `pipeline_operation_log`** Existing whitelist documented via CodeQL comment. **Real SQL injection in `infinity/chunk.go:931`** Escape `'` → `''` on user-controlled `questionText` before splicing into `filter_fulltext(...)` SQL filter. **Real SQL injection in `elasticsearch/sql.go:75`** Defense-in-depth escape on tokenizer output before splicing into `MATCH(...)`. **Python code injection in `result_protocol.go`** Replace raw JSON literal embedding into Python/JS expressions with base64 + `json.loads` / `JSON.parse(Buffer.from(..., 'base64').toString('utf8'))`. Eliminates both the unsafe-quoting sink and the brittleness of mixing JSON true/false/null with Python syntax. **URL substring check bypass in `embedding_model.py`** Replace `if "dashscope-intl.aliyuncs.com" in u` with `urlparse(u).hostname == "dashscope-intl.aliyuncs.com"` so a base_url like `https://attacker.example/?u=dashscope-intl.aliyuncs.com` cannot bypass the routing. **Prototype pollution in `setNestedValue` (TS)** Reject `__proto__`/`constructor`/`prototype` keys before any assignment. **Integer overflow** - scrypt params via `ParseInt` + non-positive check (`internal/common/password.go`) - `topN` and `n` caps to 1024 (retrieval_service.go, dataset.go) - `nalloc*statesize` cast to `size_t` (cpp/re2/onepass.cc) **Cookie httponly** Set explicitly with rationale: this is the OAuth bootstrap cookie intentionally read by the SPA. **Stack trace exposure** Replace `error.message` in HTTP 500 response with generic `"internal error"`; full error still logged server-side via `console.error`. **Weak hashing** MD5 → SHA-256 for deterministic `conv_id` derivation (`conversation_service.py`). **Log scrubbing** Remove or redact user-controlled / sensitive content from clear-text logs across 8 ingestion parsers, `llm_service.py` ×11, `tenant_llm_service.py` ×7, `misc_utils.py` ×4, `redis_conn.py` ×10, `conftest.py` ×4, `init_data.py`, `dataset_api_service.py`, `generator.py`, `mysql_migration.py`, `cli.go`, `user_command.go`, `pdf_parser.go`. Most patterns converted to parameterized logging (`logging.info("...: %d", n)`) or static messages. ## CodeQL suppressions (each with rationale) For alerts where the data flow is genuinely safe but CodeQL can't see the context — operator-controlled URLs, sanitized inputs, etc. — I added `// codeql[go/<rule>] <rationale>` annotations rather than dismissing them, so future readers can audit the rationale inline: - `internal/agent/component/invoke.go:135` — Invoke is a generic canvas HTTP client - `internal/service/langfuse.go` ×2 — host is per-tenant operator config - `internal/service/file.go:1184` — already SSRF-guarded by `assertURLSafe` - `internal/utility/mcp_client.go` ×3 — already `AssertURLSafe` + IP-pinned - `internal/entity/models/bedrock.go` — sigv4-signed request, URL can't be tampered - `internal/service/deep_researcher.go:269` — `callback` is SSE display string, not SQL - `internal/engine/infinity/chunk.go:346` — UUIDs can't contain `'` (RFC 4122) - `internal/cli/common_command.go` ×2 — CLI trusts operator-configured URL - `internal/utility/smtp.go:194` — msg is server-built, not user form input - `internal/entity/models/*` ×14 (path-injection) — audio file paths are caller-supplied ## Test plan - ✅ All 13 modified Go packages build cleanly - ✅ 663 tests pass across `internal/agent/sandbox`, `internal/common`, `internal/agent/component`, `internal/engine/infinity`, `internal/dao` - ✅ All 11 modified Python files parse via `ast.parse` - ✅ TypeScript `tsc --noEmit` clean on the modified `use-provider-fields.tsx` - ✅ `node --check` clean on the modified JS file 🤖 Generated with [Claude Code](https://claude.com/claude-code)
240 lines
7.7 KiB
Go
240 lines
7.7 KiB
Go
//
|
|
// Copyright 2026 The InfiniFlow Authors. All Rights Reserved.
|
|
//
|
|
// Licensed under the Apache License, Version 2.0 (the "License");
|
|
// you may not use this file except in compliance with the License.
|
|
// You may obtain a copy of the License at
|
|
//
|
|
// http://www.apache.org/licenses/LICENSE-2.0
|
|
//
|
|
// Unless required by applicable law or agreed to in writing, software
|
|
// distributed under the License is distributed on an "AS IS" BASIS,
|
|
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
// See the License for the specific language governing permissions and
|
|
// limitations under the License.
|
|
//
|
|
|
|
package sandbox
|
|
|
|
import (
|
|
"encoding/base64"
|
|
"encoding/json"
|
|
"strings"
|
|
"testing"
|
|
)
|
|
|
|
func TestBuildPythonWrapper_ContainsMainAndArgs(t *testing.T) {
|
|
t.Parallel()
|
|
wrapped := BuildPythonWrapper("def main(x): return x * 2", `{"x": 21}`)
|
|
if !strings.Contains(wrapped, "def main(x): return x * 2") {
|
|
t.Errorf("wrapper missing user code; got: %s", wrapped)
|
|
}
|
|
// argsJSON is base64-encoded in the wrapper, so the call site
|
|
// must use json.loads(base64.b64decode(...)).
|
|
if !strings.Contains(wrapped, "main(**json.loads(base64.b64decode(") {
|
|
t.Errorf("wrapper missing main(**json.loads(base64.b64decode(...))) call; got: %s", wrapped)
|
|
}
|
|
if !strings.Contains(wrapped, resultMarkerPrefix) {
|
|
t.Errorf("wrapper missing result marker; got: %s", wrapped)
|
|
}
|
|
// The marker line carries a base64-encoded JSON payload. We do
|
|
// not decode it here — the round-trip is exercised in
|
|
// TestExtractStructuredResult_RoundTrip.
|
|
}
|
|
|
|
func TestBuildJavaScriptWrapper_ContainsMainAndArgs(t *testing.T) {
|
|
t.Parallel()
|
|
wrapped := BuildJavaScriptWrapper("async function main(args) { return args.x * 2 }", `{"x": 21}`)
|
|
if !strings.Contains(wrapped, "async function main(args)") {
|
|
t.Errorf("wrapper missing user code; got: %s", wrapped)
|
|
}
|
|
if !strings.Contains(wrapped, "const __ragflowArgsB64 = ") {
|
|
t.Errorf("wrapper missing base64 args literal; got: %s", wrapped)
|
|
}
|
|
if !strings.Contains(wrapped, "JSON.parse(Buffer.from(__ragflowArgsB64") {
|
|
t.Errorf("wrapper missing args decoding; got: %s", wrapped)
|
|
}
|
|
if !strings.Contains(wrapped, resultMarkerPrefix) {
|
|
t.Errorf("wrapper missing result marker; got: %s", wrapped)
|
|
}
|
|
}
|
|
|
|
func TestExtractStructuredResult_RoundTrip(t *testing.T) {
|
|
t.Parallel()
|
|
|
|
// Simulate what the wrapper prints: stdout with a marker
|
|
// carrying a base64 JSON payload, then any trailing output.
|
|
payload := map[string]any{"present": true, "value": 42, "type": "json"}
|
|
raw, _ := json.Marshal(payload)
|
|
markerLine := resultMarkerPrefix + base64.StdEncoding.EncodeToString(raw)
|
|
stdout := "Hello, world!\n" + markerLine + "\nTrailing line\n"
|
|
|
|
cleaned, structured := ExtractStructuredResult(stdout)
|
|
if structured == nil {
|
|
t.Fatalf("structured result is nil; cleaned=%q", cleaned)
|
|
}
|
|
if v, ok := structured["value"].(float64); !ok || v != 42 {
|
|
t.Errorf("structured[value] = %v, want 42", structured["value"])
|
|
}
|
|
if strings.Contains(cleaned, resultMarkerPrefix) {
|
|
t.Errorf("cleaned stdout still contains marker: %q", cleaned)
|
|
}
|
|
if !strings.Contains(cleaned, "Hello, world!") || !strings.Contains(cleaned, "Trailing line") {
|
|
t.Errorf("cleaned stdout lost user lines: %q", cleaned)
|
|
}
|
|
}
|
|
|
|
func TestExtractStructuredResult_NoMarker(t *testing.T) {
|
|
t.Parallel()
|
|
stdout := "Plain output\nNothing fancy"
|
|
cleaned, structured := ExtractStructuredResult(stdout)
|
|
if cleaned != stdout {
|
|
t.Errorf("cleaned = %q, want %q", cleaned, stdout)
|
|
}
|
|
if len(structured) != 0 {
|
|
t.Errorf("structured = %v, want empty map", structured)
|
|
}
|
|
}
|
|
|
|
func TestExtractStructuredResult_Empty(t *testing.T) {
|
|
t.Parallel()
|
|
cleaned, structured := ExtractStructuredResult("")
|
|
if cleaned != "" {
|
|
t.Errorf("cleaned = %q, want empty", cleaned)
|
|
}
|
|
if structured == nil {
|
|
t.Errorf("structured is nil, want empty map (not nil)")
|
|
}
|
|
if len(structured) != 0 {
|
|
t.Errorf("structured = %v, want empty", structured)
|
|
}
|
|
}
|
|
|
|
func TestExtractStructuredResult_UndecodableMarker(t *testing.T) {
|
|
t.Parallel()
|
|
stdout := "good line\n" + resultMarkerPrefix + "!!!not-base64!!!\nmore output"
|
|
cleaned, structured := ExtractStructuredResult(stdout)
|
|
// Undecodable marker lines are kept in cleaned stdout (Python
|
|
// behavior: except branch appends the line back). The user can
|
|
// see the raw base64 and debug.
|
|
if !strings.Contains(cleaned, "!!!not-base64!!!") {
|
|
t.Errorf("cleaned dropped the undecodable marker line: %q", cleaned)
|
|
}
|
|
if len(structured) != 0 {
|
|
t.Errorf("structured = %v, want empty", structured)
|
|
}
|
|
}
|
|
|
|
func TestExtractStructuredResult_MultipleMarkers_LastWins(t *testing.T) {
|
|
t.Parallel()
|
|
// Python implementation overwrites structured_result on each
|
|
// marker line, so the last marker wins.
|
|
payload1 := map[string]any{"present": true, "value": "first", "type": "json"}
|
|
payload2 := map[string]any{"present": true, "value": "second", "type": "json"}
|
|
raw1, _ := json.Marshal(payload1)
|
|
raw2, _ := json.Marshal(payload2)
|
|
stdout := resultMarkerPrefix + base64.StdEncoding.EncodeToString(raw1) + "\n" +
|
|
resultMarkerPrefix + base64.StdEncoding.EncodeToString(raw2) + "\n"
|
|
_, structured := ExtractStructuredResult(stdout)
|
|
if v, _ := structured["value"].(string); v != "second" {
|
|
t.Errorf("structured[value] = %v, want 'second'", structured["value"])
|
|
}
|
|
}
|
|
|
|
func TestExtractStructuredResult_PreservesTrailingNewline(t *testing.T) {
|
|
t.Parallel()
|
|
payload := map[string]any{"present": true, "value": 1, "type": "json"}
|
|
raw, _ := json.Marshal(payload)
|
|
stdout := "line\n" + resultMarkerPrefix + base64.StdEncoding.EncodeToString(raw) + "\n"
|
|
cleaned, _ := ExtractStructuredResult(stdout)
|
|
if !strings.HasSuffix(cleaned, "\n") {
|
|
t.Errorf("cleaned lost trailing newline: %q", cleaned)
|
|
}
|
|
}
|
|
|
|
func TestArgsToJSON(t *testing.T) {
|
|
t.Parallel()
|
|
|
|
t.Run("nil map", func(t *testing.T) {
|
|
s, err := argsToJSON(nil)
|
|
if err != nil {
|
|
t.Fatalf("err: %v", err)
|
|
}
|
|
if s != "{}" {
|
|
t.Errorf("got %q, want %q", s, "{}")
|
|
}
|
|
})
|
|
|
|
t.Run("empty map", func(t *testing.T) {
|
|
s, err := argsToJSON(map[string]any{})
|
|
if err != nil {
|
|
t.Fatalf("err: %v", err)
|
|
}
|
|
if s != "{}" {
|
|
t.Errorf("got %q, want %q", s, "{}")
|
|
}
|
|
})
|
|
|
|
t.Run("populated map", func(t *testing.T) {
|
|
s, err := argsToJSON(map[string]any{"x": 1, "y": "hi"})
|
|
if err != nil {
|
|
t.Fatalf("err: %v", err)
|
|
}
|
|
// The exact JSON key order is non-deterministic; check both
|
|
// fields independently.
|
|
var got map[string]any
|
|
if err := json.Unmarshal([]byte(s), &got); err != nil {
|
|
t.Fatalf("invalid JSON: %v", err)
|
|
}
|
|
if got["x"].(float64) != 1 {
|
|
t.Errorf("x = %v, want 1", got["x"])
|
|
}
|
|
if got["y"].(string) != "hi" {
|
|
t.Errorf("y = %v, want 'hi'", got["y"])
|
|
}
|
|
})
|
|
}
|
|
|
|
func TestNormalizeLanguage(t *testing.T) {
|
|
t.Parallel()
|
|
|
|
cases := []struct {
|
|
in, want string
|
|
}{
|
|
{"python", "python"},
|
|
{"python3", "python"},
|
|
{"Python", "python"},
|
|
{"javascript", "nodejs"},
|
|
{"js", "nodejs"},
|
|
{"nodejs", "nodejs"},
|
|
{"node", "nodejs"},
|
|
{"", ""},
|
|
{"ruby", ""},
|
|
}
|
|
for _, tc := range cases {
|
|
if got := normalizeLanguage(tc.in); got != tc.want {
|
|
t.Errorf("normalizeLanguage(%q) = %q, want %q", tc.in, got, tc.want)
|
|
}
|
|
}
|
|
}
|
|
|
|
func TestValidateTimeout(t *testing.T) {
|
|
t.Parallel()
|
|
|
|
if _, err := validateTimeout(0); err == nil {
|
|
t.Errorf("validateTimeout(0) returned no error, want one")
|
|
}
|
|
if _, err := validateTimeout(-1); err == nil {
|
|
t.Errorf("validateTimeout(-1) returned no error, want one")
|
|
}
|
|
if _, err := validateTimeout(700); err == nil {
|
|
t.Errorf("validateTimeout(700) returned no error, want one (max is 600)")
|
|
}
|
|
if got, err := validateTimeout(30); err != nil || got != 30 {
|
|
t.Errorf("validateTimeout(30) = %d, %v; want 30, nil", got, err)
|
|
}
|
|
if got, err := validateTimeout(600); err != nil || got != 600 {
|
|
t.Errorf("validateTimeout(600) = %d, %v; want 600, nil", got, err)
|
|
}
|
|
}
|