web/src/interfaces/request/document.ts

export interface IChangeParserConfigRequestBody {
  pages?: number[][];
  chunk_token_num?: number;
  layout_recognize?: string;
  task_page_size?: number;
  delimiter?: string;
  auto_keywords?: number;
  auto_questions?: number;
  html4excel?: boolean;
  toc_extraction?: boolean;
  image_table_context_window?: number;
  image_context_size?: number;
  table_context_size?: number;
  raptor?: {
    use_raptor?: boolean;
    prompt?: string;
    max_token?: number;
    threshold?: number;
    max_cluster?: number;
    random_seed?: number;
    scope?: string;
    clustering_method?: 'gmm' | 'ahc';
    tree_builder?: 'raptor' | 'psi';
  };
  // Metadata fields
  metadata?: Array<{
    key?: string;
    description?: string;
    enum?: string[];
  }>;
  built_in_metadata?: Array<{
    key?: string;
    description?: string;
    enum?: string[];
  }>;
  enable_metadata?: boolean;
}

export interface IChangeParserRequestBody {
  parser_id: string;
  pipeline_id?: string;
  doc_id?: string;
  parser_config: IChangeParserConfigRequestBody;
}

export interface IDocumentMetaRequestBody {
  documentId: string;
  meta: string; // JSON-formatted string
}
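
The following is a minimal sketch of how request bodies matching these interfaces might be assembled. It assumes trimmed local copies of the interfaces so that it compiles on its own. The helper names (`buildChangeParserBody`, `buildDocumentMetaBody`) and all sample values (`'naive'`, `'doc-1'`, the page range) are hypothetical and do not come from the repository.

```typescript
// Trimmed local copies of the interfaces above (only a few fields kept),
// declared here so the sketch is self-contained.
interface ParserConfigSketch {
  pages?: number[][];
  chunk_token_num?: number;
  delimiter?: string;
}

interface ChangeParserSketch {
  parser_id: string;
  pipeline_id?: string;
  doc_id?: string;
  parser_config: ParserConfigSketch;
}

interface DocumentMetaSketch {
  documentId: string;
  meta: string; // JSON-formatted string
}

// Hypothetical helper: assembles a change-parser body for one document.
function buildChangeParserBody(
  docId: string,
  parserId: string,
  config: ParserConfigSketch,
): ChangeParserSketch {
  return { parser_id: parserId, doc_id: docId, parser_config: config };
}

// Hypothetical helper: `meta` is typed as a string, so the caller's
// object is serialized before it is placed in the body.
function buildDocumentMetaBody(
  documentId: string,
  meta: Record<string, unknown>,
): DocumentMetaSketch {
  return { documentId, meta: JSON.stringify(meta) };
}

// Sample values are illustrative only; the meaning of each inner array in
// `pages` (assumed here to be a [start, end] pair) is an assumption.
const parserBody = buildChangeParserBody('doc-1', 'naive', {
  pages: [[1, 10]],
  chunk_token_num: 512,
});
const metaBody = buildDocumentMetaBody('doc-1', { author: 'alice' });
console.log(JSON.stringify(parserBody), metaBody.meta);
```

Serializing inside the helper keeps callers from passing a raw object where the interface expects a string.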
Blame (commit, date, and the lines it last changed):

feat: add pages to ChunkMethodModal (#143)
2024-03-22 16:57:09 +08:00
Lines: `export interface IChangeParserConfigRequestBody {` and its closing `}`; `export interface IChangeParserRequestBody {`, `parser_id`, `parser_config`, and its closing `}`.

Refa: only support MinerU-API now (#11977)
2025-12-17 12:58:48 +08:00
Lines: `pages` through `toc_extraction`.

### What problem does this PR solve?

Only support MinerU-API now, still need to complete frontend for pipeline to allow the configuration of MinerU options.

### Type of change

- [x] Refactoring

Refa: improve image table context (#12244)
2025-12-26 17:55:32 +08:00
Lines: `image_table_context_window`, `image_context_size`, `table_context_size`.

### What problem does this PR solve?

Improve image table context.

Current strategy in attach_media_context:

- Order by position when possible: if any chunk has page/position info, sort by (page, top, left), otherwise keep original order.
- Apply only to media chunks: images use image_context_size, tables use table_context_size.
- Primary matching: on the same page, choose a text chunk whose vertical span overlaps the media, then pick the one with the closest vertical midpoint.
- Fallback matching: if no overlap on that page, choose the nearest text chunk on the same page (page-head uses the next text; page-tail uses the previous text).
- Context extraction: inside the chosen text chunk, find a mid-sentence boundary near the text midpoint, then take context_size tokens split before/after (total budget).
- No multi-chunk stitching: context comes from a single text chunk to avoid mixing unrelated segments.

### Type of change

- [x] Refactoring

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>

feat(raptor): add Psi tree builder with original-space ranking and safe migration (#14679)
2026-05-11 15:42:31 -10:00
Lines: the `raptor?: { ... };` block (`use_raptor` through `tree_builder`); `pipeline_id` and `doc_id` in `IChangeParserRequestBody`.

### What problem does this PR solve?

Closes #14674.

This PR improves RAPTOR configuration and tree construction while preserving the existing RAPTOR behavior as the default.

RAPTOR currently builds summary layers with the original UMAP + GMM clustering path. This PR keeps that default path, and adds:

- A hidden backend tree-builder option:
  - `tree_builder="raptor"`: default, existing RAPTOR behavior.
  - `tree_builder="psi"`: rank-aware Psi-style tree builder using original embedding-space cosine ranking.
- A user-facing clustering method option for the default RAPTOR builder:
  - `clustering_method="gmm"`: existing default.
  - `clustering_method="ahc"`: agglomerative hierarchical clustering path.
- A RAPTOR UI setting for `Clustering method` and `Max cluster`.

### What changed

#### Backend

- Added `tree_builder` support for RAPTOR/Psi.
- Added `clustering_method` support for GMM/AHC.
- Kept existing RAPTOR + GMM as the default.
- Added Psi tree building from original-space cosine similarity.
- Added bucketed Psi building controls for large inputs:
  - `raptor.ext.psi_exact_max_leaves`
  - `raptor.ext.psi_bucket_size`
- Added method-aware RAPTOR summary metadata using existing `extra.raptor_method`.
- Avoided adding a dedicated DB schema field for experimental method tracking.
- Added cleanup/migration logic to avoid mixing stale RAPTOR summary trees.
- Added defensive checks for Psi tree construction and summary failures.

#### Frontend/UI

- Added `Clustering method` in RAPTOR settings with `GMM` and `AHC`.
- Added/kept `Max cluster` in RAPTOR settings.
- Enlarged max cluster UI limit to `1024`, matching backend validation.
- Kept AHC editable even when a RAPTOR task has already finished.
- Fixed the UI save payload so `clustering_method` and `tree_builder` are serialized through `parser_config.raptor.ext`, avoiding backend validation errors for extra top-level RAPTOR fields.

Example saved RAPTOR config:

```json
{
  "raptor": {
    "max_cluster": 317,
    "ext": {
      "clustering_method": "ahc",
      "tree_builder": "raptor"
    }
  }
}
```

Co-authored-by: CaptainTimon <CaptainTimon@users.noreply.github.com>

Refactor: Doc change parser (#14327)
2026-04-27 23:42:57 +08:00
Lines: `// Metadata fields` through `enable_metadata`.

### What problem does this PR solve?

Before migration:

- Web API: POST /v1/document/change_parser
- HTTP API: PATCH /api/v1/datasets/<dataset_id>/documents

After consolidation:

- Restful API: PATCH /api/v1/datasets/<dataset_id>/documents

### Type of change

- [x] Refactoring

Feat: Metadata in documents for improve the prompt #3690 (#4462)
2025-01-13 17:13:37 +08:00
Lines: the `IDocumentMetaRequestBody` interface (`documentId`, `meta`).

### What problem does this PR solve?

Feat: Metadata in documents for improve the prompt #3690

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
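
The #14679 commit message says the UI save payload serializes `clustering_method` and `tree_builder` through `parser_config.raptor.ext`, while the interface above declares both fields directly on `raptor`. A minimal sketch of such a mapping might look like the following. The function `toRaptorPayload` and the two local types are hypothetical, and only a subset of the RAPTOR fields is modeled; the actual frontend code may differ.

```typescript
type ClusteringMethod = 'gmm' | 'ahc';
type TreeBuilder = 'raptor' | 'psi';

// Hypothetical form-side shape: the method fields sit at the top level,
// as in the `raptor` block of IChangeParserConfigRequestBody.
interface RaptorFormValues {
  use_raptor?: boolean;
  max_cluster?: number;
  clustering_method?: ClusteringMethod;
  tree_builder?: TreeBuilder;
}

// Hypothetical saved shape: the method fields are nested under `ext`,
// following the example config in the commit message.
interface RaptorPayload {
  use_raptor?: boolean;
  max_cluster?: number;
  ext?: {
    clustering_method?: ClusteringMethod;
    tree_builder?: TreeBuilder;
  };
}

// Moves the two method fields under `ext` and leaves every other field
// in place. `ext` is omitted when neither field is set.
function toRaptorPayload(form: RaptorFormValues): RaptorPayload {
  const { clustering_method, tree_builder, ...rest } = form;
  const ext: NonNullable<RaptorPayload['ext']> = {};
  if (clustering_method !== undefined) ext.clustering_method = clustering_method;
  if (tree_builder !== undefined) ext.tree_builder = tree_builder;
  return Object.keys(ext).length > 0 ? { ...rest, ext } : rest;
}

// Reproduces the inner `raptor` object of the example saved config.
const saved = toRaptorPayload({
  max_cluster: 317,
  clustering_method: 'ahc',
  tree_builder: 'raptor',
});
console.log(JSON.stringify({ raptor: saved }));
```

Keeping `ext` absent when it would be empty avoids sending an empty object for configurations that never set either field.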