Files
ragflow/common/misc_utils.py
Zhichang Yu 195bfffb5e fix(security): address 93 CodeQL code-scanning alerts across 61 files (#16407)
## Summary

Resolves all 93 open alerts at
https://github.com/infiniflow/ragflow/security/code-scanning by rule:

| Rule | Count | Treatment |
|------|-------|-----------|
| py/clear-text-logging-sensitive-data | 23 | Real fix — log scrubbing |
| go/path-injection | 15 | Real fix where possible, suppression with
rationale |
| go/request-forgery | 8 | Suppression with rationale
(operator-controlled URLs) |
| go/clear-text-logging | 10 | Real fix — log scrubbing |
| go/unsafe-quoting | 5 | Real fix — escape or refactor |
| go/sql-injection | 3 | Real fix — orderby whitelist + CodeQL comment |
| go/uncontrolled-allocation-size | 2 | Real fix — cap to 1024 |
| go/incorrect-integer-conversion | 3 | Real fix — ParseInt + range
check |
| go/insecure-hostkeycallback | 1 | Real fix — known_hosts file |
| go/disabled-certificate-check | 2 | Suppression with rationale |
| go/command-injection | 1 | Suppression (sanitized via shq()) |
| go/email-injection | 1 | Suppression with rationale |
| go/cookie-httponly-not-set | 1 | Suppression (SPA bootstrap) |
| js/stack-trace-exposure | 1 | Real fix — generic client message |
| js/prototype-pollution-utility | 1 | Real fix — reject
__proto__/constructor/prototype |
| py/weak-sensitive-data-hashing | 1 | Real fix — MD5 → SHA-256 |
| py/incomplete-url-substring-sanitization | 3 | Real fix —
urlparse(hostname) |
| py/paramiko-missing-host-key-validation | 1 | Real fix —
load_system_host_keys + RejectPolicy |
| cpp/integer-multiplication-cast-to-long | 2 | Real fix — cast to
size_t |

## Real fixes (with measurable security improvement)

**SSH host key verification (Go + Python)**  
Replace `InsecureIgnoreHostKey()` / `paramiko.AutoAddPolicy()` with
proper host key verification against a known_hosts file (configurable
via `SSH_KNOWN_HOSTS` env / `known_hosts` config field; fail-closed when
unset). Loads `~/.ssh/known_hosts` first via `load_system_host_keys()`
so existing setups keep working.

**SQL injection in `user_canvas`**  
Add `userCanvasOrderableColumns` whitelist + `userCanvasOrderClause`
helper. Both `GetList()` and `ListByTenantIDs()` now route the
user-supplied `orderby` query param through the helper, defaulting to
`create_time` on miss.

**SQL injection in `pipeline_operation_log`**  
Existing whitelist documented via CodeQL comment.

**Real SQL injection in `infinity/chunk.go:931`**  
Escape `'` → `''` on user-controlled `questionText` before splicing into
`filter_fulltext(...)` SQL filter.

**Real SQL injection in `elasticsearch/sql.go:75`**  
Defense-in-depth escape on tokenizer output before splicing into
`MATCH(...)`.

**Python code injection in `result_protocol.go`**  
Replace raw JSON literal embedding into Python/JS expressions with
base64 + `json.loads` / `JSON.parse(Buffer.from(...,
'base64').toString('utf8'))`. Eliminates both the unsafe-quoting sink
and the brittleness of mixing JSON true/false/null with Python syntax.

**URL substring check bypass in `embedding_model.py`**  
Replace `if "dashscope-intl.aliyuncs.com" in u` with
`urlparse(u).hostname == "dashscope-intl.aliyuncs.com"` so a base_url
like `https://attacker.example/?u=dashscope-intl.aliyuncs.com` cannot
bypass the routing.

**Prototype pollution in `setNestedValue` (TS)**  
Reject `__proto__`/`constructor`/`prototype` keys before any assignment.

**Integer overflow**  
- scrypt params via `ParseInt` + non-positive check
(`internal/common/password.go`)
- `topN` and `n` caps to 1024 (retrieval_service.go, dataset.go)
- `nalloc*statesize` cast to `size_t` (cpp/re2/onepass.cc)

**Cookie httponly**  
Set explicitly with rationale: this is the OAuth bootstrap cookie
intentionally read by the SPA.

**Stack trace exposure**  
Replace `error.message` in HTTP 500 response with generic `"internal
error"`; full error still logged server-side via `console.error`.

**Weak hashing**  
MD5 → SHA-256 for deterministic `conv_id` derivation
(`conversation_service.py`).

**Log scrubbing**  
Remove or redact user-controlled / sensitive content from clear-text
logs across 8 ingestion parsers, `llm_service.py` ×11,
`tenant_llm_service.py` ×7, `misc_utils.py` ×4, `redis_conn.py` ×10,
`conftest.py` ×4, `init_data.py`, `dataset_api_service.py`,
`generator.py`, `mysql_migration.py`, `cli.go`, `user_command.go`,
`pdf_parser.go`. Most patterns converted to parameterized logging
(`logging.info("...: %d", n)`) or static messages.

## CodeQL suppressions (each with rationale)

For alerts where the data flow is genuinely safe but CodeQL can't see
the context — operator-controlled URLs, sanitized inputs, etc. — I added
`// codeql[go/<rule>] <rationale>` annotations rather than dismissing
them, so future readers can audit the rationale inline:

- `internal/agent/component/invoke.go:135` — Invoke is a generic canvas
HTTP client
- `internal/service/langfuse.go` ×2 — host is per-tenant operator config
- `internal/service/file.go:1184` — already SSRF-guarded by
`assertURLSafe`
- `internal/utility/mcp_client.go` ×3 — already `AssertURLSafe` +
IP-pinned
- `internal/entity/models/bedrock.go` — sigv4-signed request, URL can't
be tampered
- `internal/service/deep_researcher.go:269` — `callback` is SSE display
string, not SQL
- `internal/engine/infinity/chunk.go:346` — UUIDs can't contain `'` (RFC
4122)
- `internal/cli/common_command.go` ×2 — CLI trusts operator-configured
URL
- `internal/utility/smtp.go:194` — msg is server-built, not user form
input
- `internal/entity/models/*` ×14 (path-injection) — audio file paths are
caller-supplied

## Test plan

-  All 13 modified Go packages build cleanly
-  663 tests pass across `internal/agent/sandbox`, `internal/common`,
`internal/agent/component`, `internal/engine/infinity`, `internal/dao`
-  All 11 modified Python files parse via `ast.parse`
-  TypeScript `tsc --noEmit` clean on the modified
`use-provider-fields.tsx`
-  `node --check` clean on the modified JS file

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-29 09:45:16 +08:00

262 lines
9.0 KiB
Python

#
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import asyncio
import base64
import contextvars
import functools
import hashlib
import logging
import os
import subprocess
import sys
import threading
import uuid
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor
logger = logging.getLogger(__name__)
def get_uuid():
return uuid.uuid1().hex
# OAuth avatar fetch: bounded size; each redirect hop is SSRF-checked and DNS-pinned
# (see common.ssrf_guard).
_OAUTH_AVATAR_MAX_BYTES = int(os.environ.get("RAGFLOW_OAUTH_AVATAR_MAX_BYTES", str(5 * 1024 * 1024)))
_OAUTH_AVATAR_MAX_REDIRECTS = int(os.environ.get("RAGFLOW_OAUTH_AVATAR_MAX_REDIRECTS", "5"))
_REDIRECT_STATUS = frozenset({301, 302, 303, 307, 308})
async def download_img(url):
"""Fetch an image URL and return a data URI, or empty string on failure / SSRF block.
URLs must resolve only to globally routable addresses; redirects are followed
only up to ``_OAUTH_AVATAR_MAX_REDIRECTS`` with each target validated.
"""
if not url:
return ""
if not isinstance(url, str):
url = str(url)
url = url.strip()
if not url:
return ""
current_url = url
redirect_hops = 0
# Match common/http_client.py defaults without importing http_client (avoids
# pulling settings and keeps this path usable in lightweight test envs).
request_timeout = float(os.environ.get("HTTP_CLIENT_TIMEOUT", "15"))
proxy = os.environ.get("HTTP_CLIENT_PROXY")
user_agent = os.environ.get("HTTP_CLIENT_USER_AGENT", "ragflow-http-client")
from common.ssrf_guard import assert_url_is_safe, pin_dns_global
while redirect_hops <= _OAUTH_AVATAR_MAX_REDIRECTS:
try:
hostname, pin_ip = assert_url_is_safe(current_url)
except ValueError as exc:
logger.warning("download_img rejected URL (SSRF guard): %s", exc)
return ""
import httpx
timeout = httpx.Timeout(request_timeout)
headers = {}
if user_agent:
headers["User-Agent"] = user_agent
async def _stream_one_get() -> tuple[str, str | None]:
"""Return ``('redirect', new_url)``, ``('data', data_uri)``, or ``('fail', None)``."""
with pin_dns_global(hostname, pin_ip):
async with httpx.AsyncClient(
timeout=timeout,
follow_redirects=False,
proxy=proxy,
) as client:
async with client.stream("GET", current_url, headers=headers or None) as response:
if response.status_code in _REDIRECT_STATUS:
await response.aclose()
location = response.headers.get("location")
if not location:
logger.warning(
"download_img redirect missing Location header: status=%s redirect_hops=%s",
response.status_code,
redirect_hops,
)
return ("fail", None)
return ("redirect", urljoin(current_url, location))
if response.status_code != 200:
logger.warning(
"download_img non-200 response: status=%s redirect_hops=%s",
response.status_code,
redirect_hops,
)
return ("fail", None)
body = bytearray()
async for chunk in response.aiter_bytes():
if len(body) + len(chunk) > _OAUTH_AVATAR_MAX_BYTES:
logger.warning(
"download_img response exceeded max size: max_bytes=%s",
_OAUTH_AVATAR_MAX_BYTES,
)
await response.aclose()
return ("fail", None)
body.extend(chunk)
content_type = response.headers.get("Content-Type", "image/jpeg")
data_uri = (
"data:"
+ content_type
+ ";base64,"
+ base64.b64encode(bytes(body)).decode("utf-8")
)
return ("data", data_uri)
try:
kind, payload = await asyncio.wait_for(_stream_one_get(), timeout=request_timeout)
except asyncio.TimeoutError:
logger.warning(
"download_img total wall-clock timeout: redirect_hops=%s timeout=%s",
redirect_hops,
request_timeout,
)
return ""
except Exception as exc:
logger.warning(
"download_img request failed: redirect_hops=%s err=%s",
redirect_hops,
exc,
)
return ""
if kind == "redirect":
current_url = str(payload)
redirect_hops += 1
continue
if kind == "fail":
return ""
return str(payload)
logger.warning(
"download_img redirect hop limit exceeded: redirect_hops=%s max_redirects=%s",
redirect_hops,
_OAUTH_AVATAR_MAX_REDIRECTS,
)
return ""
def hash_str2int(line: str, mod: int = 10 ** 8) -> int:
return int(hashlib.sha1(line.encode("utf-8")).hexdigest(), 16) % mod
def convert_bytes(size_in_bytes: int) -> str:
"""
Format size in bytes.
"""
if size_in_bytes == 0:
return "0 B"
units = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
i = 0
size = float(size_in_bytes)
while size >= 1024 and i < len(units) - 1:
size /= 1024
i += 1
if i == 0 or size >= 100:
return f"{size:.0f} {units[i]}"
elif size >= 10:
return f"{size:.1f} {units[i]}"
else:
return f"{size:.2f} {units[i]}"
def once(func):
"""
A thread-safe decorator that ensures the decorated function runs exactly once,
caching and returning its result for all subsequent calls. This prevents
race conditions in multi-thread environments by using a lock to protect
the execution state.
Args:
func (callable): The function to be executed only once.
Returns:
callable: A wrapper function that executes `func` on the first call
and returns the cached result thereafter.
Example:
@once
def compute_expensive_value():
print("Computing...")
return 42
# First call: executes and prints
# Subsequent calls: return 42 without executing
"""
executed = False
result = None
lock = threading.Lock()
def wrapper(*args, **kwargs):
nonlocal executed, result
with lock:
if not executed:
executed = True
result = func(*args, **kwargs)
return result
return wrapper
@once
def pip_install_torch():
device = os.getenv("DEVICE", "cpu")
if device=="cpu":
return
logging.info("Installing pytorch")
pkg_names = ["torch>=2.5.0,<3.0.0"]
subprocess.check_call([sys.executable, "-m", "pip", "install", *pkg_names])
@once
def _thread_pool_executor():
max_workers_env = os.getenv("THREAD_POOL_MAX_WORKERS", "128")
try:
max_workers = int(max_workers_env)
except ValueError:
max_workers = 128
if max_workers < 1:
max_workers = 1
return ThreadPoolExecutor(max_workers=max_workers)
async def thread_pool_exec(func, *args, **kwargs):
# loop.run_in_executor() submits the callable without propagating the caller's
# contextvars (unlike asyncio.to_thread, which copies the context). Copy the
# current context and run the callable inside it so ContextVars set by the
# caller (e.g. tracing / per-request state) are visible in the worker thread.
loop = asyncio.get_running_loop()
ctx = contextvars.copy_context()
if kwargs:
inner = functools.partial(func, *args, **kwargs)
return await loop.run_in_executor(_thread_pool_executor(), ctx.run, inner)
return await loop.run_in_executor(_thread_pool_executor(), ctx.run, func, *args)