Files
ragflow/internal/engine
Zhichang Yu 730f33b1f9 fix(security): address 93 CodeQL code-scanning alerts across 61 files (#16407)
## Summary

Resolves all 93 open alerts at
https://github.com/infiniflow/ragflow/security/code-scanning by rule:

| Rule | Count | Treatment |
|------|-------|-----------|
| py/clear-text-logging-sensitive-data | 23 | Real fix — log scrubbing |
| go/path-injection | 15 | Real fix where possible, suppression with
rationale |
| go/request-forgery | 8 | Suppression with rationale
(operator-controlled URLs) |
| go/clear-text-logging | 10 | Real fix — log scrubbing |
| go/unsafe-quoting | 5 | Real fix — escape or refactor |
| go/sql-injection | 3 | Real fix — orderby whitelist + CodeQL comment |
| go/uncontrolled-allocation-size | 2 | Real fix — cap to 1024 |
| go/incorrect-integer-conversion | 3 | Real fix — ParseInt + range
check |
| go/insecure-hostkeycallback | 1 | Real fix — known_hosts file |
| go/disabled-certificate-check | 2 | Suppression with rationale |
| go/command-injection | 1 | Suppression (sanitized via shq()) |
| go/email-injection | 1 | Suppression with rationale |
| go/cookie-httponly-not-set | 1 | Suppression (SPA bootstrap) |
| js/stack-trace-exposure | 1 | Real fix — generic client message |
| js/prototype-pollution-utility | 1 | Real fix — reject
__proto__/constructor/prototype |
| py/weak-sensitive-data-hashing | 1 | Real fix — MD5 → SHA-256 |
| py/incomplete-url-substring-sanitization | 3 | Real fix —
urlparse(hostname) |
| py/paramiko-missing-host-key-validation | 1 | Real fix —
load_system_host_keys + RejectPolicy |
| cpp/integer-multiplication-cast-to-long | 2 | Real fix — cast to
size_t |

## Real fixes (with measurable security improvement)

**SSH host key verification (Go + Python)**  
Replace `InsecureIgnoreHostKey()` / `paramiko.AutoAddPolicy()` with
proper host key verification against a known_hosts file (configurable
via `SSH_KNOWN_HOSTS` env / `known_hosts` config field; fail-closed when
unset). Loads `~/.ssh/known_hosts` first via `load_system_host_keys()`
so existing setups keep working.

**SQL injection in `user_canvas`**  
Add `userCanvasOrderableColumns` whitelist + `userCanvasOrderClause`
helper. Both `GetList()` and `ListByTenantIDs()` now route the
user-supplied `orderby` query param through the helper, defaulting to
`create_time` on miss.

**SQL injection in `pipeline_operation_log`**  
Existing whitelist documented via CodeQL comment.

**Real SQL injection in `infinity/chunk.go:931`**  
Escape `'` → `''` on user-controlled `questionText` before splicing into
`filter_fulltext(...)` SQL filter.

**Real SQL injection in `elasticsearch/sql.go:75`**  
Defense-in-depth escape on tokenizer output before splicing into
`MATCH(...)`.

**Python code injection in `result_protocol.go`**  
Replace raw JSON literal embedding into Python/JS expressions with
base64 + `json.loads` / `JSON.parse(Buffer.from(...,
'base64').toString('utf8'))`. Eliminates both the unsafe-quoting sink
and the brittleness of mixing JSON true/false/null with Python syntax.

**URL substring check bypass in `embedding_model.py`**  
Replace `if "dashscope-intl.aliyuncs.com" in u` with
`urlparse(u).hostname == "dashscope-intl.aliyuncs.com"` so a base_url
like `https://attacker.example/?u=dashscope-intl.aliyuncs.com` cannot
bypass the routing.

**Prototype pollution in `setNestedValue` (TS)**  
Reject `__proto__`/`constructor`/`prototype` keys before any assignment.

**Integer overflow**  
- scrypt params via `ParseInt` + non-positive check
(`internal/common/password.go`)
- `topN` and `n` caps to 1024 (retrieval_service.go, dataset.go)
- `nalloc*statesize` cast to `size_t` (cpp/re2/onepass.cc)

**Cookie httponly**  
Set explicitly with rationale: this is the OAuth bootstrap cookie
intentionally read by the SPA.

**Stack trace exposure**  
Replace `error.message` in HTTP 500 response with generic `"internal
error"`; full error still logged server-side via `console.error`.

**Weak hashing**  
MD5 → SHA-256 for deterministic `conv_id` derivation
(`conversation_service.py`).

**Log scrubbing**  
Remove or redact user-controlled / sensitive content from clear-text
logs across 8 ingestion parsers, `llm_service.py` ×11,
`tenant_llm_service.py` ×7, `misc_utils.py` ×4, `redis_conn.py` ×10,
`conftest.py` ×4, `init_data.py`, `dataset_api_service.py`,
`generator.py`, `mysql_migration.py`, `cli.go`, `user_command.go`,
`pdf_parser.go`. Most patterns converted to parameterized logging
(`logging.info("...: %d", n)`) or static messages.

## CodeQL suppressions (each with rationale)

For alerts where the data flow is genuinely safe but CodeQL can't see
the context — operator-controlled URLs, sanitized inputs, etc. — I added
`// codeql[go/<rule>] <rationale>` annotations rather than dismissing
them, so future readers can audit the rationale inline:

- `internal/agent/component/invoke.go:135` — Invoke is a generic canvas
HTTP client
- `internal/service/langfuse.go` ×2 — host is per-tenant operator config
- `internal/service/file.go:1184` — already SSRF-guarded by
`assertURLSafe`
- `internal/utility/mcp_client.go` ×3 — already `AssertURLSafe` +
IP-pinned
- `internal/entity/models/bedrock.go` — sigv4-signed request, URL can't
be tampered
- `internal/service/deep_researcher.go:269` — `callback` is SSE display
string, not SQL
- `internal/engine/infinity/chunk.go:346` — UUIDs can't contain `'` (RFC
4122)
- `internal/cli/common_command.go` ×2 — CLI trusts operator-configured
URL
- `internal/utility/smtp.go:194` — msg is server-built, not user form
input
- `internal/entity/models/*` ×14 (path-injection) — audio file paths are
caller-supplied

## Test plan

-  All 13 modified Go packages build cleanly
-  663 tests pass across `internal/agent/sandbox`, `internal/common`,
`internal/agent/component`, `internal/engine/infinity`, `internal/dao`
-  All 11 modified Python files parse via `ast.parse`
-  TypeScript `tsc --noEmit` clean on the modified
`use-provider-fields.tsx`
-  `node --check` clean on the modified JS file

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-27 19:48:29 +08:00
..
2026-03-04 19:17:16 +08:00

Doc Engine Implementation

RAGFlow Go document engine implementation, supporting Elasticsearch and Infinity storage engines.

Directory Structure

internal/engine/
├── engine.go              # DocEngine interface definition
├── engine_factory.go      # Factory function
├── global.go              # Global engine instance management
├── elasticsearch/         # Elasticsearch implementation
│   ├── client.go          # ES client initialization
│   ├── search.go          # Search implementation
│   ├── index.go           # Index operations
│   └── document.go        # Document operations
└── infinity/              # Infinity implementation
    ├── client.go          # Infinity client initialization (placeholder)
    ├── search.go          # Search implementation (placeholder)
    ├── index.go           # Table operations (placeholder)
    └── document.go        # Document operations (placeholder)

Configuration

Using Elasticsearch

Add to conf/service_conf.yaml:

doc_engine:
  type: elasticsearch
  es:
    hosts: "http://localhost:9200"
    username: "elastic"
    password: "infini_rag_flow"

Using Infinity

doc_engine:
  type: infinity
  infinity:
    uri: "localhost:23817"
    postgres_port: 5432
    db_name: "default_db"

Note: Infinity implementation is a placeholder waiting for the official Infinity Go SDK. Only Elasticsearch is fully functional at this time.

Usage

1. Initialize Engine

The engine is automatically initialized on service startup (see cmd/server_main.go):

// Initialize doc engine
if err := engine.Init(&cfg.DocEngine); err != nil {
    log.Fatalf("Failed to initialize doc engine: %v", err)
}
defer engine.Close()

2. Use in Service

In ChunkService:

type ChunkService struct {
    docEngine engine.DocEngine
    engineType config.EngineType
}

func NewChunkService() *ChunkService {
    cfg := config.Get()
    return &ChunkService{
        docEngine:  engine.Get(),
        engineType: cfg.DocEngine.Type,
    }
}

// Search
func (s *ChunkService) RetrievalTest(req *RetrievalTestRequest) (*RetrievalTestResponse, error) {
    ctx := context.Background()

    switch s.engineType {
    case config.EngineElasticsearch:
        // Use Elasticsearch retrieval
        searchReq := &elasticsearch.SearchRequest{
            IndexNames: []string{"chunks"},
            Query:      elasticsearch.BuildMatchTextQuery([]string{"content"}, req.Question, "AUTO"),
            Size:       10,
        }
        result, _ := s.docEngine.Search(ctx, searchReq)
        esResp := result.(*elasticsearch.SearchResponse)
        // Process result...

    case config.EngineInfinity:
        // Infinity not implemented yet
        return nil, fmt.Errorf("infinity not yet implemented")
    }
}

3. Direct Use of Global Engine

import "ragflow/internal/engine"

// Get engine instance
docEngine := engine.Get()

// Search
searchReq := &elasticsearch.SearchRequest{
    IndexNames: []string{"my_index"},
    Query:      elasticsearch.BuildTermQuery("status", "active"),
}
result, err := docEngine.Search(ctx, searchReq)

// Index operations
err = docEngine.CreateIndex(ctx, "my_index", mapping)
err = docEngine.DeleteIndex(ctx, "my_index")
exists, _ := docEngine.IndexExists(ctx, "my_index")

// Document operations
err = docEngine.IndexDocument(ctx, "my_index", "doc_id", docData)
bulkResp, _ := docEngine.BulkIndex(ctx, "my_index", docs)
doc, _ := docEngine.GetDocument(ctx, "my_index", "doc_id")
err = docEngine.DeleteDocument(ctx, "my_index", "doc_id")

API Documentation

DocEngine Interface

type DocEngine interface {
    // Search
    Search(ctx context.Context, req interface{}) (interface{}, error)

    // Index operations
    CreateIndex(ctx context.Context, indexName string, mapping interface{}) error
    DeleteIndex(ctx context.Context, indexName string) error
    IndexExists(ctx context.Context, indexName string) (bool, error)

    // Document operations
    IndexDocument(ctx context.Context, indexName, docID string, doc interface{}) error
    BulkIndex(ctx context.Context, indexName string, docs []interface{}) (interface{}, error)
    GetDocument(ctx context.Context, indexName, docID string) (interface{}, error)
    DeleteDocument(ctx context.Context, indexName, docID string) error

    // Health check
    Ping(ctx context.Context) error
    Close() error
}

Dependencies

Elasticsearch

  • github.com/elastic/go-elasticsearch/v8

Infinity

  • Not available yet - Waiting for official Infinity Go SDK

Notes

  1. Type Conversion: The Search method returns interface{}, requiring type assertion based on engine type
  2. Model Definitions: Each engine has its own request/response models defined in their respective packages
  3. Error Handling: It's recommended to handle errors uniformly in the service layer and return user-friendly error messages
  4. Performance Optimization: For large volumes of documents, prefer using BulkIndex for batch operations
  5. Connection Management: The engine is automatically closed when the program exits, no manual management needed
  6. Infinity Status: Infinity implementation is currently a placeholder. Only Elasticsearch is fully functional.

Extending with New Engines

To add a new document engine (e.g., Milvus, Qdrant):

  1. Create a new directory under internal/engine/, e.g., milvus/
  2. Implement four files: client.go, search.go, index.go, document.go
  3. Add corresponding creation logic in engine_factory.go
  4. Add configuration structure in config.go
  5. Update service layer code to support the new engine

Correspondence with Python Project

Python Module Go Module
common/doc_store/doc_store_base.py internal/engine/engine.go
rag/utils/es_conn.py internal/engine/elasticsearch/
rag/utils/infinity_conn.py internal/engine/infinity/ (placeholder)
common/settings.py internal/config/config.go

Current Status

  • Elasticsearch: Fully implemented and functional
  • Infinity: Placeholder implementation, waiting for official Go SDK
  • 📋 OceanBase: Not implemented (removed from requirements)