


Database Scripts

This directory contains database-related utility scripts for RAGFlow.

mysql_migration.py: Data migration between tables with stage-based execution
db_schema_sync.py: Database schema synchronization using peewee-migrate

mysql_migration.py

A flexible MySQL data migration tool for migrating data between tables with stage-based execution.

Overview

This script provides stage-based data migration between MySQL tables. Currently supports:

tenant_model_provider
tenant_model_instance
tenant_model

Migration Stages

Stage	Source Table	Target Table	Description
`tenant_model_provider`	`tenant_llm`	`tenant_model_provider`	Extracts distinct `(tenant_id, llm_factory)` pairs
`tenant_model_instance`	`tenant_llm` + `tenant_model_provider`	`tenant_model_instance`	Creates instances with distinct `(tenant_id, llm_factory, api_key)`
`tenant_model`	`tenant_llm` + `tenant_model_provider` + `tenant_model_instance`	`tenant_model`	Migrates model configurations (only `status='0'` records)

Stage Dependencies

tenant_model_provider (no dependencies)
        ↓
tenant_model_instance (depends on tenant_model_provider)
        ↓
tenant_model (depends on tenant_model_provider and tenant_model_instance)
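The dependency chain above can be expressed as a small graph and sorted so that prerequisites always run first. This is a minimal sketch, not the script's actual code: `STAGE_DEPENDENCIES` and `order_stages` are hypothetical names, and only the three stage names and their dependencies come from this README.

```python
from graphlib import TopologicalSorter

# Stage -> stages it depends on, copied from the dependency diagram above.
STAGE_DEPENDENCIES = {
    "tenant_model_provider": set(),
    "tenant_model_instance": {"tenant_model_provider"},
    "tenant_model": {"tenant_model_provider", "tenant_model_instance"},
}


def order_stages(requested):
    """Return the requested stages sorted so that dependencies come first.

    Hypothetical helper: the real script may order stages differently.
    """
    unknown = [stage for stage in requested if stage not in STAGE_DEPENDENCIES]
    if unknown:
        raise ValueError(f"Unknown stage(s): {', '.join(unknown)}")
    # static_order() yields every node after all of its predecessors.
    full_order = list(TopologicalSorter(STAGE_DEPENDENCIES).static_order())
    return [stage for stage in full_order if stage in requested]
```

With this sketch, passing the stages in any order (for example `tenant_model,tenant_model_provider`) still yields `tenant_model_provider` before `tenant_model`.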

Field Mapping Rules

tenant_model_provider

Target Field	Source	Rule
`id`	-	Generated 32-character UUID1 (hex, no hyphens)
`provider_name`	`tenant_llm.llm_factory`	Direct mapping
`tenant_id`	`tenant_llm.tenant_id`	Direct mapping

Deduplication: Groups by (tenant_id, llm_factory) and takes distinct pairs
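The mapping and deduplication rule for this stage can be illustrated in plain Python. This is a sketch under assumptions: `build_provider_rows` is a hypothetical helper that works on rows already loaded as dictionaries, whereas the real script presumably does this in SQL.

```python
import uuid


def build_provider_rows(tenant_llm_rows):
    """Build one tenant_model_provider row per distinct (tenant_id, llm_factory).

    Hypothetical in-memory illustration of the field mapping above.
    """
    seen = set()
    providers = []
    for row in tenant_llm_rows:
        key = (row["tenant_id"], row["llm_factory"])
        if key in seen:
            continue  # duplicate pair, already emitted
        seen.add(key)
        providers.append(
            {
                "id": uuid.uuid1().hex,  # 32 hex characters, no hyphens
                "provider_name": row["llm_factory"],
                "tenant_id": row["tenant_id"],
            }
        )
    return providers
```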

tenant_model_instance

Target Field	Source	Rule
`id`	-	Generated 32-character UUID1 (hex, no hyphens)
`instance_name`	`tenant_llm.llm_factory`	Direct mapping
`provider_id`	`tenant_model_provider.id`	JOIN on `tenant_id` and `provider_name=llm_factory`
`api_key`	`tenant_llm.api_key`	Direct mapping
`status`	`tenant_llm.status`	Direct mapping

Deduplication: Groups by (tenant_id, llm_factory, api_key) and takes distinct records

tenant_model

Target Field	Source	Rule
`id`	-	Generated 32-character UUID1 (hex, no hyphens)
`model_name`	`tenant_llm.llm_name`	Direct mapping
`provider_id`	`tenant_model_provider.id`	JOIN on `tenant_id` and `provider_name=llm_factory`
`instance_id`	`tenant_model_instance.id`	JOIN on `provider_id` and `api_key`
`model_type`	`tenant_llm.model_type`	Direct mapping
`status`	`tenant_llm.status`	Direct mapping

Filter: Only migrates records where tenant_llm.status='0'
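The two JOINs and the status filter for this stage can be sketched with dictionary lookups. `build_model_rows` is a hypothetical helper operating on in-memory rows, and skipping rows that have no matching provider or instance is an assumption, not documented behavior.

```python
import uuid


def build_model_rows(tenant_llm_rows, provider_rows, instance_rows):
    """Illustrate the tenant_model mapping: two lookups plus the status filter."""
    # JOIN key for providers: tenant_id and provider_name = llm_factory.
    provider_ids = {
        (p["tenant_id"], p["provider_name"]): p["id"] for p in provider_rows
    }
    # JOIN key for instances: provider_id and api_key.
    instance_ids = {
        (i["provider_id"], i["api_key"]): i["id"] for i in instance_rows
    }
    models = []
    for row in tenant_llm_rows:
        if row["status"] != "0":
            continue  # only status='0' records are migrated
        provider_id = provider_ids.get((row["tenant_id"], row["llm_factory"]))
        instance_id = instance_ids.get((provider_id, row["api_key"]))
        if provider_id is None or instance_id is None:
            continue  # assumption: rows without a match are skipped
        models.append(
            {
                "id": uuid.uuid1().hex,
                "model_name": row["llm_name"],
                "provider_id": provider_id,
                "instance_id": instance_id,
                "model_type": row["model_type"],
                "status": row["status"],
            }
        )
    return models
```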

Usage

Command Line Arguments

python mysql_migration.py [OPTIONS]

Option	Short	Description	Default
`--host`	-	MySQL host	`localhost`
`--port`	-	MySQL port	`3306`
`--user`	-	MySQL user	`root`
`--password`	-	MySQL password	(empty)
`--database`	-	MySQL database name	`rag_flow`
`--config`	`-c`	Path to YAML config file	-
`--stages`	`-s`	Comma-separated list of stages to run	-
`--list-stages`	`-l`	List available stages and exit	-
`--execute`	`-e`	Execute full migration (create tables and migrate data)	`False`
`--create-table-only`	-	Only create target tables, skip data migration	`False`

Note: MySQL connection can be configured via command line arguments (--host, --port, --user, --password, --database) or via a YAML config file (--config). Command line arguments take precedence over config file values.

Execution Modes

The script has three mutually exclusive modes:

Dry-Run Mode (default): Check only, no database writes

# Using config file
python mysql_migration.py --stages tenant_model_provider --config config.yaml

# Using command line MySQL connection
python mysql_migration.py --stages tenant_model_provider --host localhost --port 3306 --user root

Create Table Only Mode: Create target tables without migrating data

python mysql_migration.py --stages tenant_model_provider --config config.yaml --create-table-only

Execute Mode: Create tables and migrate data

python mysql_migration.py --stages tenant_model_provider --config config.yaml --execute
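One way to model the three modes with `argparse` is a mutually exclusive group whose absence means dry-run. This is a sketch only: `build_parser` and `resolve_mode` are hypothetical names, and rejecting `--execute` together with `--create-table-only` is an assumption about how the script enforces exclusivity.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the mode-related options."""
    parser = argparse.ArgumentParser(prog="mysql_migration.py")
    parser.add_argument("--stages", "-s")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--execute", "-e", action="store_true")
    mode.add_argument("--create-table-only", action="store_true")
    return parser


def resolve_mode(args):
    """Map parsed flags to a mode name; dry-run is the default."""
    if args.execute:
        return "execute"
    if args.create_table_only:
        return "create-table-only"
    return "dry-run"
```

Making dry-run the fallback means a forgotten flag can never cause a write.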

Configuration File

Create a YAML configuration file with MySQL connection settings:

database:
  host: localhost
  port: 3306
  user: root
  password: your_password
  name: rag_flow

Alternative keys are also supported:

mysql:
  host: localhost
  port: 3306
  user: root
  password: your_password
  database: rag_flow
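The two accepted config layouts and the rule that command line arguments win over config values could be combined roughly as follows. This is a sketch: `resolve_connection` is a hypothetical helper that takes an already parsed YAML dictionary, and it assumes unset command line options arrive as `None` so they can be told apart from explicit values. The fallback defaults are the ones listed in the options table.

```python
def resolve_connection(config, cli):
    """Merge a parsed config dict with CLI values; CLI values take precedence.

    config: dict parsed from the YAML file (either layout shown above).
    cli:    dict of CLI options, with None for options the user did not pass.
    """
    section = config.get("database") or config.get("mysql") or {}
    settings = {
        "host": section.get("host", "localhost"),
        "port": int(section.get("port", 3306)),
        "user": section.get("user", "root"),
        "password": section.get("password", ""),
        # The database name is `name` in one layout and `database` in the other.
        "database": section.get("name") or section.get("database") or "rag_flow",
    }
    for key, value in cli.items():
        if key in settings and value is not None:
            settings[key] = value  # explicit CLI value overrides the config
    return settings
```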

Examples

# List all available stages
python mysql_migration.py --list-stages

# Dry run single stage using command line MySQL connection
python mysql_migration.py --stages tenant_model_provider --host localhost --port 3306 --user root --password secret

# Dry run single stage using config file
python mysql_migration.py --stages tenant_model_provider --config /path/to/config.yaml

# Create tables only for multiple stages
python mysql_migration.py --stages tenant_model_provider,tenant_model_instance --config /path/to/config.yaml --create-table-only

# Execute full migration for all stages (in dependency order)
python mysql_migration.py --stages tenant_model_provider,tenant_model_instance,tenant_model --config /path/to/config.yaml --execute

# Use config file with command line password override
python mysql_migration.py --stages tenant_model_provider --config /path/to/config.yaml --password mypassword --execute

Output Interpretation

Stage Execution Log

Each stage displays a header showing progress:

============================================================
Stage [1/3]: tenant_model_provider
============================================================

The stage then performs:

Check phase: Verifies source/target tables exist and counts records to migrate
Execute phase: Creates tables (if needed) and migrates data in batches

Dry-Run Output

In dry-run mode, the script outputs what it would do without writing:

[DRY RUN] Would insert 150 records
  instance_name=OpenAI, provider_id=abc123, api_key=***
  ... and 145 more records

Migration Summary

After all stages complete, a summary is printed:

============================================================
Migration Summary
============================================================
Total Duration: 2.45s
Total Rows Processed: 350
Tables Operated: tenant_model_provider, tenant_model_instance
------------------------------------------------------------
Stage Details:
  [tenant_model_provider] Tables: tenant_model_provider, Rows: 50, Duration: 0.82s
  [tenant_model_instance] Tables: tenant_model_instance, Rows: 300, Duration: 1.63s
============================================================

Common Messages

Message	Meaning
`No new data to migrate`	All records already exist in target table
`[DRY RUN] Target table does not exist`	Target table missing; use `--execute` or `--create-table-only` to create it
`Dependency table does not exist`	Required table from previous stage missing
`Inserted batch X: Y records`	Successfully inserted batch of records
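The batched execute phase and the messages above can be pictured with a short sketch. `iter_batches` and `migrate_in_batches` are hypothetical helpers, `insert_batch` stands in for whatever performs the actual INSERT, and the batch size of 1000 is an assumption; only the message formats come from this README.

```python
from itertools import islice


def iter_batches(records, batch_size=1000):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def migrate_in_batches(records, insert_batch, batch_size=1000, dry_run=True):
    """Insert records batch by batch, or only report the count in dry-run mode."""
    records = list(records)
    if dry_run:
        print(f"[DRY RUN] Would insert {len(records)} records")
        return 0
    total = 0
    for number, batch in enumerate(iter_batches(records, batch_size), start=1):
        insert_batch(batch)  # caller-supplied function that writes one batch
        total += len(batch)
        print(f"Inserted batch {number}: {len(batch)} records")
    return total
```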

db_schema_sync.py

A database schema synchronization tool that uses peewee-migrate to detect and manage schema changes.

Overview

This script:

Reads model definitions from api/db/db_models.py
Compares with existing database tables specified via command line
Generates migration files in tools/migrate/{version_dir}/ (the version with . replaced by _)

Detected Change Types

Change Type	Description	Auto-included?
New table	Model class with no corresponding DB table	Yes
New field	Model field not present in DB table	Yes
Field type change	Model field type differs from DB column type	Yes
Removed field	DB column not present in model definition	No (requires `--drop`)

Warning: Removed fields are not included in migrations by default. You must explicitly use --drop to generate DROP COLUMN statements, as this operation permanently deletes data.

Prerequisites

Install peewee-migrate:

pip install peewee-migrate

Usage

Command Line Arguments

python db_schema_sync.py [OPTIONS]

Option	Short	Description
`--host`	-	MySQL host (required)
`--port`	-	MySQL port (default: 3306)
`--user`	-	MySQL user (required)
`--password`	-	MySQL password (required)
`--database`	-	MySQL database name (required)
`--version`	`-v`	Version number in format `vxx.xx.xx` (required)
`--list`	`-l`	List all migrations
`--create`	-	Create a new migration (auto-detect changes)
`--migrate`	`-m`	Run pending migrations
`--diff`	`-d`	Show schema differences
`--name`	`-n`	Migration name (default: auto)
`--drop`	-	Include `DROP COLUMN` for fields removed from models (destructive - permanently deletes data!)

Version Format

Version must be in the format vxx.xx.xx, where each xx is one or more digits:

Valid: v0.26.1, v1.0.0, v10.20.30
Invalid: 0.26.1, v0.25, v0.26.1.1
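A regular expression consistent with the valid and invalid examples above might look like this. The pattern and the name `is_valid_version` are illustrative assumptions; the script's own check may differ.

```python
import re

# "v" followed by exactly three dot-separated groups of one or more digits.
VERSION_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def is_valid_version(version):
    """Return True if the whole string matches the vxx.xx.xx format."""
    return VERSION_PATTERN.fullmatch(version) is not None
```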

Migration File Location

Migration files are stored in:

tools/migrate/{version_dir}/

Where {version_dir} is the version with . replaced by _.

Example: Version v0.26.1 → Directory tools/migrate/v0_26_1/
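The directory rule amounts to a single string replacement. In this sketch `migration_dir` is a hypothetical helper and the `tools/migrate` root is taken from the path shown above.

```python
from pathlib import PurePosixPath


def migration_dir(version, root="tools/migrate"):
    """Return the migration directory for a version, replacing '.' with '_'."""
    return PurePosixPath(root) / version.replace(".", "_")
```

For example, `str(migration_dir("v0.26.1"))` gives `tools/migrate/v0_26_1`.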

Examples

# List all migrations
python db_schema_sync.py --list \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Create a new auto-detected migration (new tables, new fields, type changes only)
python db_schema_sync.py --create \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Create a migration including dropped fields (destructive!)
python db_schema_sync.py --create --drop \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Create a named migration
python db_schema_sync.py --create --name add_user_table \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Run all pending migrations
python db_schema_sync.py --migrate \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

# Show schema differences (including removed fields)
python db_schema_sync.py --diff \
    --host localhost --port 3306 --user root --password xxx --database rag_flow \
    --version v0.26.1

How It Works

Load Models: Imports all model classes from api/db/db_models.py
Connect Database: Creates MySQL connection from command line arguments
Detect Changes: Compares model definitions with actual database schema:
- New tables → create_model
- New fields → ALTER TABLE ADD COLUMN
- Field type changes → ALTER TABLE MODIFY COLUMN
- Removed fields → ALTER TABLE DROP COLUMN (only with --drop)
Generate Migration: Creates Python migration file with migrate() and rollback() functions
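The change-detection step can be illustrated with a comparison of two plain dictionaries. This is a simplified sketch, not the tool's implementation: `detect_changes` is a hypothetical function, and both schemas are represented as `{table: {column: type_name}}` rather than peewee models and live database metadata.

```python
def detect_changes(model_tables, db_tables, drop=False):
    """Compare model definitions with the database schema.

    Returns a list of (change_type, table, column) tuples. Removed columns are
    reported only when drop=True, mirroring the --drop flag.
    """
    changes = []
    for table, fields in model_tables.items():
        if table not in db_tables:
            changes.append(("create_table", table, None))
            continue
        columns = db_tables[table]
        for name, field_type in fields.items():
            if name not in columns:
                changes.append(("add_column", table, name))
            elif columns[name] != field_type:
                changes.append(("modify_column", table, name))
        if drop:
            for name in columns:
                if name not in fields:
                    changes.append(("drop_column", table, name))
    return changes
```

Keeping `drop=False` as the default matches the tool's stance that destructive changes must be requested explicitly.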

Rollback Behavior

Forward Operation	Rollback Operation
`CREATE TABLE`	`remove_model`
`ADD COLUMN`	`DROP COLUMN`
`MODIFY COLUMN`	`MODIFY COLUMN` (restore original type)
`DROP COLUMN`	`ADD COLUMN` (restore column definition; data is lost)

Note: Rolling back a DROP COLUMN will re-add the column structure, but the data that was in it cannot be recovered.
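The rollback table above can be read as a lookup from each forward operation to its inverse. The sketch below is an illustration, not generated migration code: `ROLLBACK_FOR` and `plan_rollback` are hypothetical names, and undoing operations in reverse order is an assumption rather than documented behavior.

```python
# Forward operation -> rollback operation, copied from the table above.
ROLLBACK_FOR = {
    "CREATE TABLE": "remove_model",
    "ADD COLUMN": "DROP COLUMN",
    "MODIFY COLUMN": "MODIFY COLUMN",  # restores the original type
    "DROP COLUMN": "ADD COLUMN",  # restores the definition only; data is lost
}


def plan_rollback(forward_ops):
    """Return rollback steps for a list of (operation, target) pairs.

    Steps are emitted in reverse order (assumption) so the last change made
    is the first one undone.
    """
    return [(ROLLBACK_FOR[op], target) for op, target in reversed(forward_ops)]
```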