ragflow/api/utils/configs.py

#
# Copyright 2025 The InfiniFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import io
import base64
import pickle
from api.utils.common import bytes_to_string, string_to_bytes

safe_module = {
    'numpy',
    'rag_flow'
}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        import importlib
        if module.split('.')[0] in safe_module:
            _module = importlib.import_module(module)
            return getattr(_module, name)
        # Forbid everything else.
        raise pickle.UnpicklingError("global '%s.%s' is forbidden" %
                                     (module, name))


def restricted_loads(src):
    """Helper function analogous to pickle.loads()."""
    return RestrictedUnpickler(io.BytesIO(src)).load()


def serialize_b64(src, to_str=False):
    dest = base64.b64encode(pickle.dumps(src))
    if not to_str:
        return dest
    else:
        return bytes_to_string(dest)


def deserialize_b64(src):
    src = base64.b64decode(
        string_to_bytes(src) if isinstance(
            src, str) else src)
    # Change history for the line below (2026-05-15 03:58:27 +01:00):
    # security: always use RestrictedUnpickler in deserialize_b64 (CWE-502) (#14803)
    #
    # Summary: deserialize_b64 now always routes pickle data through the
    # existing RestrictedUnpickler (restricted_loads) instead of falling back
    # to bare pickle.loads(). CWE-502 is Deserialization of Untrusted Data.
    # The caller is SerializedField.python_value in api/db/db_models.py, which
    # Peewee invokes whenever a pickled DB column is read.
    #
    # The issue: the previous code read the flag
    # get_base_config('use_deserialize_safe_module', False), called
    # restricted_loads(src) only when it was true, and otherwise returned
    # pickle.loads(src). The flag defaulted to False and was not set anywhere
    # in the repository, so the default path was unrestricted pickle.loads()
    # on bytes read from a MySQL SerializedField(serialized_type=PICKLE)
    # column. An attacker able to influence those bytes (SQL injection
    # elsewhere, compromised DB credentials, a backup restored from an
    # untrusted source, or a compromised replication peer) could craft a
    # pickle payload that executes arbitrary code on the ragflow application
    # server when the field is next read. No in-tree model instantiates a
    # SerializedField with the default PICKLE type today (only
    # JsonSerializedField is used in practice), so the attack surface was
    # latent rather than reachable through an HTTP endpoint. Any future field
    # using the default PICKLE serialization would still have silently
    # inherited RCE-on-read behaviour.
    #
    # The fix: the flag check and the now-dead get_base_config import were
    # removed, leaving only the restricted_loads(src) call. The diff is
    # 1 insertion and 6 deletions, scoped to this function. restricted_loads
    # limits permitted modules to numpy and rag_flow.
    #
    # Testing:
    #   - A malicious pickle whose __reduce__ resolves to posix.system('id')
    #     executed before the fix. After the fix, restricted_loads raises
    #     UnpicklingError: global 'posix.system' is forbidden.
    #   - A benign numpy.ndarray round-tripped through serialize_b64 and
    #     deserialize_b64 with values preserved bit-for-bit.
    #   - use_deserialize_safe_module is not set in any config file in the
    #     tree, so removing it changes no operator-facing knob that was in use.
    #
    # Known limitation: SECURITY.md notes that the numpy allow-list can still
    # be reached via numpy.f2py.diagnose.run_command. This change does not
    # address that. Tightening the allow-list to specific symbols rather than
    # whole modules is a separate hardening question, offered as a follow-up.
    #
    # Adversarial review: the two strongest objections were considered.
    #   (1) "No field uses PICKLE today, so this is unreachable." True, but
    #       the default of a security-sensitive helper still matters because
    #       new fields silently inherit it.
    #   (2) "The attacker already needs DB write access, which is game over."
    #       Partially true, but pickle RCE escalates data tampering into code
    #       execution on the application host (filesystem, internal network,
    #       in-process secrets), which is not equivalent.
    # The fix is one line of real code, has no behavioural cost for legitimate
    # callers, and removes an insecure default.
    #
    # Submitted by Sebastion, autonomous open-source security research from
    # Foundation Machines (https://foundationmachines.ai), via the Sebastion AI
    # GitHub App (https://github.com/marketplace/sebastion-ai).
    return restricted_loads(src)
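
The rejection behaviour described in the testing notes can be reproduced in isolation. The following is a minimal sketch, not the project's test suite: it re-declares the unpickler so it runs without the `api` package, and it swaps the real allow-list (`numpy`, `rag_flow`) for a stand-in containing only the standard-library `collections` module. The names `SAFE_MODULES`, `Malicious`, and `try_load` are hypothetical and exist only for this illustration.

```python
import base64
import importlib
import io
import os
import pickle
from collections import OrderedDict

# Stand-in allow-list for this sketch only; the real file permits
# 'numpy' and 'rag_flow'.
SAFE_MODULES = {"collections"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Resolve a global only when its top-level package is allow-listed.
        if module.split(".")[0] in SAFE_MODULES:
            return getattr(importlib.import_module(module), name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))


def restricted_loads(src):
    return RestrictedUnpickler(io.BytesIO(src)).load()


class Malicious:
    """Hypothetical payload: unpickling it would call os.system('id')."""

    def __reduce__(self):
        return (os.system, ("id",))


def try_load(payload_b64):
    """Decode base64, then unpickle through the restricted unpickler."""
    try:
        return restricted_loads(base64.b64decode(payload_b64))
    except pickle.UnpicklingError as exc:
        return "blocked: %s" % exc


# Pickling only records a reference to os.system; nothing runs here.
benign = base64.b64encode(pickle.dumps(OrderedDict(a=1)))
hostile = base64.b64encode(pickle.dumps(Malicious()))

benign_result = try_load(benign)
hostile_result = try_load(hostile)
print(benign_result)
print(hostile_result)
```

The hostile payload is refused inside `find_class`, before the referenced callable is ever resolved, so the command never runs. The check is per top-level module, so every symbol in an allow-listed package stays reachable; that is why the change notes treat narrowing the allow-list to specific symbols as a separate follow-up.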