From 9280c64518209a91bf39afb974305ea256ff03e7 Mon Sep 17 00:00:00 2001
From: writinwaters <93570324+writinwaters@users.noreply.github.com>
Date: Wed, 29 Apr 2026 19:37:24 +0800
Subject: [PATCH] Docs: Updated Title chunker references (#14483)

### What problem does this PR solve?

Updated Title chunker references

### Type of change

- [x] Documentation Update
---
 .../database_schema_and_migration.md | 56 +++++++++++++++++++
 .../chunker_title.md                 | 28 +++++++++-
 web/src/locales/en.ts                | 14 ++---
 web/src/locales/zh.ts                |  4 +-
 4 files changed, 91 insertions(+), 11 deletions(-)
 create mode 100644 docs/administrator/migration/database_schema_and_migration.md

diff --git a/docs/administrator/migration/database_schema_and_migration.md b/docs/administrator/migration/database_schema_and_migration.md
new file mode 100644
index 0000000000..32ae48c285
--- /dev/null
+++ b/docs/administrator/migration/database_schema_and_migration.md
@@ -0,0 +1,56 @@
+---
+sidebar_position: 1
+slug: /database_schema_and_migration
+sidebar_custom_props: {
+  categoryIcon: LucideLocateFixed
+}
+---
+
+# Database schema and migration
+
+Sync schemas and migrate data using official RAGFlow scripts.
+
+---
+
+RAGFlow handles schema updates and migrations automatically at startup. However, for high-volume environments like Kubernetes, massive datasets can cause initialization to exceed 10 minutes, potentially triggering container timeouts or health check failures. To avoid this, you can disable the built-in auto-initialization and manually run these provided scripts to complete database upgrades before launching the service:
+
+- [mysql_migration.py](#mysql_migrationpy): Migrates data between MySQL tables.
+- [db_schema_sync.py](#db_schema_syncpy): Syncs database schemas and manages changes using peewee-migrate.
+
+## mysql_migration.py
+
+The [mysql_migration.py](https://github.com/infiniflow/ragflow/blob/main/tools/scripts/mysql_migration.py) script is a specialized tool for re-organizing RAGFlow’s model-related data. It transitions data from older unified tables into a modern, multi-table structure to support advanced model management.
+
+### Key functions
+
+- **Sequential migration**: Moves data through three distinct stages—Provider, Instance, and Model—to maintain database integrity and satisfy dependencies.
+- **Flexible setup**: Connects to MySQL using either a YAML configuration file or direct command-line arguments.
+- **Execution control**: Offers three specific modes: dry-run (preview), table-only (structural setup), and execute (full data move).
+- **Automated mapping**: Generates unique IDs and handles complex joins between legacy records and new table structures.
+- **Batch logging**: Processes records in sets of 100 and provides a final summary of total duration and row counts.
+
+### When to use
+
+- **Version upgrades**: Essential when moving to RAGFlow v0.25 or later to ensure your models are correctly categorized in the new schema.
+- **Data normalization**: Necessary when consolidating multiple API keys or LLM providers into the updated system format.
+- **Kubernetes deployments**: Useful for setting up the database structure independently using the `--create-table-only` flag before main services start.
+- **Migration verification**: Used in dry-run mode to identify any legacy records that still need to be moved to the new tables.
+
+## db_schema_sync.py
+
+The [db_schema_sync.py](https://github.com/infiniflow/ragflow/blob/main/tools/scripts/db_schema_sync.py) script is a synchronization utility that ensures your MySQL database structure matches the Peewee ORM models defined in the RAGFlow source code.
+
+### Key functions
+
+- **Change detection**: Compares Python model definitions in `api/db/db_models.py` against the live database to identify new tables, added fields, or type mismatches.
+- **Migration generation**: Automatically creates Python migration files (containing `migrate()` and `rollback()` logic) in version-specific directories (e.g., `tools/migrate/v0_25_0/`).
+- **Schema auditing**: Provides a `--diff` command to view structural discrepancies without applying changes.
+- **Execution management**: Applies pending migrations to the database to bring it up to date with the current software version.
+- **Safety controls**: Prevents accidental data loss by requiring an explicit `--drop` flag to generate `DROP COLUMN` statements for removed fields.
+
+### When to use
+
+- **Version upgrades**: When moving to a new version of RAGFlow that introduces structural database changes.
+- **Development**: When modifying `db_models.py` and needing to update your local database without manual SQL.
+- **CI/CD pipelines**: To automatically prepare or apply database updates during deployment.
+- **Troubleshooting**: When the application fails due to "Unknown column" or "Table not found" errors, indicating a desynchronized schema.
\ No newline at end of file
diff --git a/docs/guides/agent/agent_component_reference/chunker_title.md b/docs/guides/agent/agent_component_reference/chunker_title.md
index 787f660280..8350f3e992 100644
--- a/docs/guides/agent/agent_component_reference/chunker_title.md
+++ b/docs/guides/agent/agent_component_reference/chunker_title.md
@@ -23,7 +23,30 @@ Placing a **Title chunker** after a **Token chunker** is invalid and will cause
 
 ## Configurations
 
-### Hierarchy
+### Hierarchy or Group
+
+Select how a document is split:
+
+- Hierarchy: Construct a heading tree and produce self-contained chunks, each carrying its full ancestral path (e.g. Part 1 › Chapter 3 › Section 2 + body text).
Best for highly structured texts — such as legal statutes, regulations, contracts, and technical specs — where each chunk must be identifiable by its position in the hierarchy.
+- Group: Split the document flat at a chosen heading level, merging adjacent small sections to ensure semantic flow. Chunks exclude ancestral path. Best for documents with flowing, contextually connected content — such as books, manuals, reports, and articles — where narrative coherence depends on keeping adjacent paragraphs together.
+
+#### Separate parent-heading content
+
+:::tip NOTE
+Available only when **Hierarchy** is selected.
+:::
+
+When enabled, chunks include only their heading path and content; content immediately following a parent heading is kept as a separate chunk.
+
+#### Set first chunk as global context
+
+:::tip NOTE
+Available only when **Hierarchy** is selected.
+:::
+
+Treats the first split as a global heading to maintain consistent context across the document hierarchy. Ideal for resumes where the first section identifies the subject.
+
+#### H3
 
 Specifies the heading level to define chunk boundaries:
 
@@ -31,8 +54,9 @@ Specifies the heading level to define chunk boundaries:
 - H2
 - H3 (Default)
 - H4
+- H5
 
-Click **+ Add** to add heading levels here or update the corresponding **Regular Expressions** fields for custom heading patterns.
+Click **+ Add regular expressions** to add heading levels here or update the corresponding **Regular Expressions** fields for custom heading patterns.
 
 ### Output
 
diff --git a/web/src/locales/en.ts b/web/src/locales/en.ts
index 88d70fe358..bb2875cc58 100644
--- a/web/src/locales/en.ts
+++ b/web/src/locales/en.ts
@@ -1510,16 +1510,16 @@ Example: Virtual Hosted Style`,
       author: 'Author',
       sectionTitle: 'Section title',
     },
-    includeHeadingContent: 'Include heading content',
+    includeHeadingContent: 'Separate parent-heading content',
     includeHeadingContentTip:
-      'When enabled, content directly under a heading is kept as its own chunk.
Child chunks keep only the heading path.',
+      'When enabled, chunks include only their heading path and content; content immediately following a parent heading is kept as a separate chunk.',
     rootAsHeading: 'Set first chunk as global context',
     rootAsHeadingTip:
-      'Treats the initial split as a global heading to maintain consistent context across the document hierarchy. Ideal for resumes where the first section identifies the subject.',
-    hierarchyTip: `Build a heading tree and produce self-contained chunks, each carrying its full ancestor heading path (e.g. Part 1 › Chapter 3 › Section 2 + body text).\n
-Best for: Documents with independent, structurally significant sections — such as legal statutes, regulations, contracts, and technical specifications — where each chunk must be identifiable by its structural position even without surrounding context.`,
-    groupTip: `Split the document flat at a chosen heading level and automatically merge adjacent small sections to preserve content continuity. No parent-heading path is injected.\n
-Best for: Documents with flowing, contextually connected content — such as books, manuals, reports, and articles — where adjacent paragraphs should stay together to maintain narrative coherence.`,
+      'Treats the first split as a global heading to maintain consistent context across the document hierarchy. Ideal for resumes where the first section identifies the subject.',
+    hierarchyTip: `Construct a heading tree and produce self-contained chunks, each carrying its full ancestral path (e.g. Part 1 › Chapter 3 › Section 2 + body text).\n
+Best for: Highly structured texts — such as legal statutes, regulations, contracts, and technical specs — where each chunk must be identifiable by its position in the hierarchy.`,
+    groupTip: `Split the document flat at a chosen heading level, merging adjacent small sections to ensure semantic flow.
Chunks exclude ancestral path.\n
+Best for: Documents with flowing, contextually connected content — such as books, manuals, reports, and articles — where narrative coherence depends on keeping adjacent paragraphs together.`,
     enableMultiColumn: 'Detect multi-column layout',
     enableMultiColumnTip:
       'Detect and parse multi-column page layouts to preserve the correct reading order. Turn this on for PDFs or documents with two-column or newspaper-style layouts.',
diff --git a/web/src/locales/zh.ts b/web/src/locales/zh.ts
index 9d62b1b6bc..1b3eebf5e7 100644
--- a/web/src/locales/zh.ts
+++ b/web/src/locales/zh.ts
@@ -1261,9 +1261,9 @@ General:实体和关系提取提示来自 GitHub - microsoft/graphrag:基于
       author: '作者',
       sectionTitle: '章节标题',
     },
-    includeHeadingContent: '包含标题内容',
+    includeHeadingContent: '分离上级标题正文',
     includeHeadingContentTip:
-      '启用后,标题下的直接内容将作为一个独立的块保留。子块仅保留标题路径。',
+      '启用后,每个分块仅保留标题路径和自身内容,与上级标题紧挨着的内容将作为一个独立的块保留。',
     rootAsHeading: '将首个切片设为 H0 标题',
     rootAsHeadingTip:
       '将首个切片设为全局标题,以确保整个文档层级结构中拥有一致的上下文信息。该功能尤其适用于首段包含关键信息的简历。',