
Database Scripts

This directory contains database-related utility scripts for RAGFlow.

  • mysql_migration.py: Data migration between tables with stage-based execution
  • db_schema_sync.py: Database schema synchronization using peewee-migrate

mysql_migration.py

A flexible MySQL data migration tool for migrating data between tables with stage-based execution.

Overview

This script provides stage-based data migration between MySQL tables. It currently supports the following stages:

  • tenant_model_provider
  • tenant_model_instance
  • tenant_model

Migration Stages

| Stage | Source Table | Target Table | Description |
|-------|--------------|--------------|-------------|
| tenant_model_provider | tenant_llm | tenant_model_provider | Extracts distinct (tenant_id, llm_factory) pairs |
| tenant_model_instance | tenant_llm + tenant_model_provider | tenant_model_instance | Creates instances with distinct (tenant_id, llm_factory, api_key) |
| tenant_model | tenant_llm + tenant_model_provider + tenant_model_instance | tenant_model | Migrates model configurations (only status='0' records) |

Stage Dependencies

tenant_model_provider (no dependencies)
        ↓
tenant_model_instance (depends on tenant_model_provider)
        ↓
tenant_model (depends on tenant_model_provider and tenant_model_instance)
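The dependency chain above implies that requested stages have to run in a fixed relative order. A minimal sketch of such an ordering step follows; `STAGE_DEPENDENCIES` and `order_stages` are hypothetical names used for illustration, and the actual script may organize this differently.

```python
# Hypothetical dependency map mirroring the diagram above.
STAGE_DEPENDENCIES = {
    "tenant_model_provider": [],
    "tenant_model_instance": ["tenant_model_provider"],
    "tenant_model": ["tenant_model_provider", "tenant_model_instance"],
}


def order_stages(requested):
    """Return the requested stages sorted so that dependencies come first.

    Only stages that were explicitly requested are returned; missing
    dependencies are not added automatically.
    """
    unknown = [stage for stage in requested if stage not in STAGE_DEPENDENCIES]
    if unknown:
        raise ValueError(f"unknown stages: {unknown}")

    ordered = []

    def visit(stage):
        if stage in ordered:
            return
        for dependency in STAGE_DEPENDENCIES[stage]:
            if dependency in requested:
                visit(dependency)
        ordered.append(stage)

    for stage in requested:
        visit(stage)
    return ordered
```

In this sketch a stage whose dependency was not requested still runs, on the assumption that the dependency table already exists from an earlier run.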

Field Mapping Rules

tenant_model_provider

| Target Field | Source | Rule |
|--------------|--------|------|
| id | - | Generated 32-character UUID1 hex string |
| provider_name | tenant_llm.llm_factory | Direct mapping |
| tenant_id | tenant_llm.tenant_id | Direct mapping |
  • Deduplication: Groups by (tenant_id, llm_factory) and takes distinct pairs
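The mapping and deduplication rule for tenant_model_provider can be illustrated on in-memory rows. This is a sketch only: `build_provider_rows` is a hypothetical helper, the input is assumed to be a list of dicts with `tenant_id` and `llm_factory` keys, and the real script presumably performs the grouping in SQL.

```python
import uuid


def build_provider_rows(tenant_llm_rows):
    """Build one provider row per distinct (tenant_id, llm_factory) pair."""
    seen = set()
    providers = []
    for row in tenant_llm_rows:
        key = (row["tenant_id"], row["llm_factory"])
        if key in seen:
            continue  # this pair has already produced a provider row
        seen.add(key)
        providers.append(
            {
                "id": uuid.uuid1().hex,  # 32 hexadecimal characters
                "tenant_id": row["tenant_id"],
                "provider_name": row["llm_factory"],
            }
        )
    return providers
```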

tenant_model_instance

| Target Field | Source | Rule |
|--------------|--------|------|
| id | - | Generated 32-character UUID1 hex string |
| instance_name | tenant_llm.llm_factory | Direct mapping |
| provider_id | tenant_model_provider.id | JOIN on tenant_id and provider_name=llm_factory |
| api_key | tenant_llm.api_key | Direct mapping |
| status | tenant_llm.status | Direct mapping |
  • Deduplication: Groups by (tenant_id, llm_factory, api_key) and takes distinct records

tenant_model

| Target Field | Source | Rule |
|--------------|--------|------|
| id | - | Generated 32-character UUID1 hex string |
| model_name | tenant_llm.llm_name | Direct mapping |
| provider_id | tenant_model_provider.id | JOIN on tenant_id and provider_name=llm_factory |
| instance_id | tenant_model_instance.id | JOIN on provider_id and api_key |
| model_type | tenant_llm.model_type | Direct mapping |
| status | tenant_llm.status | Direct mapping |
  • Filter: Only migrates records where tenant_llm.status='0'

Usage

Command Line Arguments

python mysql_migration.py [OPTIONS]

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| --host | - | MySQL host | localhost |
| --port | - | MySQL port | 3306 |
| --user | - | MySQL user | root |
| --password | - | MySQL password | (empty) |
| --database | - | MySQL database name | rag_flow |
| --config | -c | Path to YAML config file | - |
| --stages | -s | Comma-separated list of stages to run | - |
| --list-stages | -l | List available stages and exit | - |
| --execute | -e | Execute full migration (create tables and migrate data) | False |
| --create-table-only | - | Only create target tables, skip data migration | False |

Note: MySQL connection can be configured via command line arguments (--host, --port, --user, --password, --database) or via a YAML config file (--config). Command line arguments take precedence over config file values.
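The precedence rule can be sketched as a layered merge: defaults first, then config file values, then command line values. `resolve_connection` is a hypothetical helper, and the sketch assumes that options not given on the command line arrive as `None`; the script's actual handling may differ.

```python
# Defaults taken from the options table above.
DEFAULTS = {
    "host": "localhost",
    "port": 3306,
    "user": "root",
    "password": "",
    "database": "rag_flow",
}


def resolve_connection(cli_args, config_values):
    """Merge connection settings; later layers override earlier ones."""
    resolved = dict(DEFAULTS)
    # Config file values override the built-in defaults.
    resolved.update({k: v for k, v in config_values.items() if v is not None})
    # Command line values override config file values.
    resolved.update({k: v for k, v in cli_args.items() if v is not None})
    return resolved
```

For example, a password passed with --password would replace the one in the config file, while a host set only in the config file would still be used.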

Execution Modes

The script has three mutually exclusive modes:

  1. Dry-Run Mode (default): Check only, no database writes

    # Using config file
    python mysql_migration.py --stages tenant_model_provider --config config.yaml
    
    # Using command line MySQL connection
    python mysql_migration.py --stages tenant_model_provider --host localhost --port 3306 --user root
    
  2. Create Table Only Mode: Create target tables without migrating data

    python mysql_migration.py --stages tenant_model_provider --config config.yaml --create-table-only
    
  3. Execute Mode: Create tables and migrate data

    python mysql_migration.py --stages tenant_model_provider --config config.yaml --execute
    

Configuration File

Create a YAML configuration file with MySQL connection settings:

database:
  host: localhost
  port: 3306
  user: root
  password: your_password
  name: rag_flow

Alternative keys are also supported:

mysql:
  host: localhost
  port: 3306
  user: root
  password: your_password
  database: rag_flow
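Both layouts can be normalized into one set of connection settings once the YAML file has been parsed into a dict. The sketch below assumes that parsing has already happened; `extract_connection` is a hypothetical name, and preferring the `database` section when both are present is an assumption rather than documented behavior.

```python
def extract_connection(config):
    """Read connection settings from a parsed config mapping.

    Accepts either a top-level `database` section (database name under
    `name`) or a top-level `mysql` section (database name under `database`).
    """
    section = config.get("database") or config.get("mysql") or {}
    return {
        "host": section.get("host"),
        "port": section.get("port"),
        "user": section.get("user"),
        "password": section.get("password"),
        # The two layouts use different keys for the database name.
        "database": section.get("name", section.get("database")),
    }
```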

Examples

# List all available stages
python mysql_migration.py --list-stages

# Dry run single stage using command line MySQL connection
python mysql_migration.py --stages tenant_model_provider --host localhost --port 3306 --user root --password secret

# Dry run single stage using config file
python mysql_migration.py --stages tenant_model_provider --config /path/to/config.yaml

# Create tables only for multiple stages
python mysql_migration.py --stages tenant_model_provider,tenant_model_instance --config /path/to/config.yaml --create-table-only

# Execute full migration for all stages (in dependency order)
python mysql_migration.py --stages tenant_model_provider,tenant_model_instance,tenant_model --config /path/to/config.yaml --execute

# Use config file with command line password override
python mysql_migration.py --stages tenant_model_provider --config /path/to/config.yaml --password mypassword --execute

Output Interpretation

Stage Execution Log

Each stage displays a header showing progress:

============================================================
Stage [1/3]: tenant_model_provider
============================================================

The stage then performs:

  1. Check phase: Verifies source/target tables exist and counts records to migrate
  2. Execute phase: Creates tables (if needed) and migrates data in batches

Dry-Run Output

In dry-run mode, the script outputs what it would do without writing:

[DRY RUN] Would insert 150 records
  instance_name=OpenAI, provider_id=abc123, api_key=***
  ... and 145 more records
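The shape of this output can be reproduced with a small formatting helper. This is a sketch that infers a preview of five records from the sample above (150 records, 145 not shown); `dry_run_preview` and `preview_count` are hypothetical names, and masking only the `api_key` field is an assumption based on the sample line.

```python
def dry_run_preview(records, preview_count=5):
    """Format a dry-run report without writing anything to the database."""
    lines = [f"[DRY RUN] Would insert {len(records)} records"]
    for record in records[:preview_count]:
        fields = ", ".join(
            # Never print the real key; show a fixed mask instead.
            f"{key}={'***' if key == 'api_key' else value}"
            for key, value in record.items()
        )
        lines.append(f"  {fields}")
    remaining = len(records) - preview_count
    if remaining > 0:
        lines.append(f"  ... and {remaining} more records")
    return lines
```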

Migration Summary

After all stages complete, a summary is printed:

============================================================
Migration Summary
============================================================
Total Duration: 2.45s
Total Rows Processed: 350
Tables Operated: tenant_model_provider, tenant_model_instance
------------------------------------------------------------
Stage Details:
  [tenant_model_provider] Tables: tenant_model_provider, Rows: 50, Duration: 0.82s
  [tenant_model_instance] Tables: tenant_model_instance, Rows: 300, Duration: 1.63s
============================================================

Common Messages

| Message | Meaning |
|---------|---------|
| No new data to migrate | All records already exist in the target table |
| [DRY RUN] Target table does not exist | Target table is missing; use --execute or --create-table-only to create it |
| Dependency table does not exist | A required table from a previous stage is missing |
| Inserted batch X: Y records | A batch of records was inserted successfully |

db_schema_sync.py

A database schema synchronization tool that uses peewee-migrate to detect and manage schema changes.

Overview

This script:

  1. Reads model definitions from api/db/db_models.py
  2. Compares with existing database tables specified via command line
  3. Generates migration files in tools/migrate/{version}/

Detected Change Types

| Change Type | Description | Auto-included? |
|-------------|-------------|----------------|
| New table | Model class with no corresponding DB table | Yes |
| New field | Model field not present in DB table | Yes |
| Field type change | Model field type differs from DB column type | Yes |
| Removed field | DB column not present in model definition | No (requires --drop) |

Warning: Removed fields are not included in migrations by default. You must explicitly use --drop to generate DROP COLUMN statements, as this operation permanently deletes data.

Prerequisites

Install peewee-migrate:

pip install peewee-migrate

Usage

Command Line Arguments

python db_schema_sync.py [OPTIONS]

| Option | Short | Description |
|--------|-------|-------------|
| --host | - | MySQL host (required) |
| --port | - | MySQL port (default: 3306) |
| --user | - | MySQL user (required) |
| --password | - | MySQL password (required) |
| --database | - | MySQL database name (required) |
| --version | -v | Version number in format vxx.xx.xx (required) |
| --list | -l | List all migrations |
| --create | - | Create a new migration (auto-detect changes) |
| --migrate | -m | Run pending migrations |
| --diff | -d | Show schema differences |
| --name | -n | Migration name (default: auto) |
| --drop | - | Include DROP COLUMN for fields removed from models (destructive: permanently deletes data!) |

Version Format

Version must be in the format vxx.xx.xx, where each xx is one or more digits:

  • Valid: v0.26.1, v1.0.0, v10.20.30
  • Invalid: 0.26.1, v0.25, v0.26.1.1

Migration File Location

Migration files are stored in:

tools/migrate/{version_dir}/

Where {version_dir} is the version with . replaced by _.

Example: Version v0.26.1 → Directory tools/migrate/v0_26_1/
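The version rule and the directory mapping can be sketched together. The regular expression below is an assumption derived from the valid and invalid examples listed earlier, and `version_to_dir` is a hypothetical helper name, not necessarily what the script uses.

```python
import re

# "v" followed by exactly three dot-separated groups of digits.
VERSION_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def version_to_dir(version):
    """Validate a version string and return its migration directory."""
    if not VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"invalid version {version!r}, expected vxx.xx.xx")
    return "tools/migrate/" + version.replace(".", "_")
```

Using `fullmatch` rather than `match` is what rejects trailing segments such as the extra `.1` in v0.26.1.1.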

Examples

# List all migrations
python db_schema_sync.py --list \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Create a new auto-detected migration (new tables, new fields, type changes only)
python db_schema_sync.py --create \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Create a migration including dropped fields (destructive!)
python db_schema_sync.py --create --drop \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Create a named migration
python db_schema_sync.py --create --name add_user_table \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Run all pending migrations
python db_schema_sync.py --migrate \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Show schema differences (including removed fields)
python db_schema_sync.py --diff \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

How It Works

  1. Load Models: Imports all model classes from api/db/db_models.py
  2. Connect Database: Creates MySQL connection from command line arguments
  3. Detect Changes: Compares model definitions with actual database schema:
    • New tables → create_model
    • New fields → ALTER TABLE ADD COLUMN
    • Field type changes → ALTER TABLE MODIFY COLUMN
    • Removed fields → ALTER TABLE DROP COLUMN (only with --drop)
  4. Generate Migration: Creates Python migration file with migrate() and rollback() functions
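The change detection in step 3 can be illustrated with plain dictionaries standing in for the model definitions and the database schema. This sketch does not use peewee or peewee-migrate; `detect_changes` and its tuple format are hypothetical, and each schema is assumed to map table names to `{column_name: type_name}`.

```python
def detect_changes(model_tables, db_tables, include_drops=False):
    """Compare model definitions with the database schema.

    Returns a list of (operation, table, column) tuples. Dropped columns
    are reported only when include_drops is True, mirroring --drop.
    """
    changes = []
    for table, fields in model_tables.items():
        if table not in db_tables:
            changes.append(("create_model", table, None))
            continue
        columns = db_tables[table]
        for name, field_type in fields.items():
            if name not in columns:
                changes.append(("add_column", table, name))
            elif columns[name] != field_type:
                changes.append(("modify_column", table, name))
        if include_drops:
            for name in columns:
                if name not in fields:
                    changes.append(("drop_column", table, name))
    return changes
```

Keeping drops behind an explicit flag means the default result contains only additive or type-changing operations.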

Rollback Behavior

| Forward Operation | Rollback Operation |
|-------------------|--------------------|
| CREATE TABLE | remove_model |
| ADD COLUMN | DROP COLUMN |
| MODIFY COLUMN | MODIFY COLUMN (restore original type) |
| DROP COLUMN | ADD COLUMN (restore column definition; data is lost) |

Note: Rolling back a DROP COLUMN will re-add the column structure, but the data that was in it cannot be recovered.