---
sidebar_position: 30
slug: /parser_component
sidebar_custom_props: {
  categoryIcon: LucideFilePlay
}
---
# Parser component

A component that sets the parsing rules for your dataset.

---

A **Parser** component is autopopulated on the ingestion pipeline canvas and required in all ingestion pipeline workflows. Just like the **Extract** stage in the traditional ETL process, a **Parser** component in an ingestion pipeline defines how various file types are parsed into structured data. Click the component to display its configuration panel. In this configuration panel, you set the parsing rules for various file types.

## Configurations

Within the configuration panel, you can add multiple parsers and set their parsing rules, or remove unwanted parsers. Ensure that your set of parsers covers all required file types; otherwise, an error occurs when you select this ingestion pipeline on your dataset's **Files** page.

The **Parser** component supports parsing the following file types:

| File type     | File format              |
|---------------|--------------------------|
| PDF           | PDF                      |
| Spreadsheet   | XLSX, XLS, CSV           |
| Image         | PNG, JPG, JPEG, GIF, TIF |
| Email         | EML                      |
| Text & Markup | TXT, MD, MDX, HTML, JSON |
| Word          | DOCX                     |
| PowerPoint    | PPTX, PPT                |
| Audio         | MP3, WAV                 |
| Video         | MP4, AVI, MKV            |
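
To check ahead of time whether a set of parsers covers the files you plan to upload, the table above can be expressed as a lookup. The following is a minimal sketch and not part of RAGFlow; `PARSER_FORMATS`, `parser_for`, and `missing_parsers` are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import List, Optional, Set

# Mirrors the table above: parser (file type) -> supported file extensions.
PARSER_FORMATS = {
    "PDF": {"pdf"},
    "Spreadsheet": {"xlsx", "xls", "csv"},
    "Image": {"png", "jpg", "jpeg", "gif", "tif"},
    "Email": {"eml"},
    "Text & Markup": {"txt", "md", "mdx", "html", "json"},
    "Word": {"docx"},
    "PowerPoint": {"pptx", "ppt"},
    "Audio": {"mp3", "wav"},
    "Video": {"mp4", "avi", "mkv"},
}


def parser_for(filename: str) -> Optional[str]:
    """Return the parser type that handles this file, or None if unsupported."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for parser, extensions in PARSER_FORMATS.items():
        if ext in extensions:
            return parser
    return None


def missing_parsers(filenames: List[str], enabled: Set[str]) -> Set[str]:
    """Return the parser types required by the files but absent from `enabled`."""
    required = {parser_for(name) for name in filenames}
    required.discard(None)  # unsupported files need separate handling
    return required - enabled
```

For example, `missing_parsers(["a.pdf", "b.xlsx"], {"PDF"})` reports that a Spreadsheet parser still needs to be added.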

### Detect multi-column layout

Optimizes the parser to detect and reorder multi-column pages into a logical sequence. Ideal for PDF documents with two-column or newspaper-style layouts.

### Remove original table of contents

Strips the original table of contents from PDF files. Once enabled, the table of contents is not chunked or parsed for retrieval.

### PDF parser

The output of a PDF parser is `json`. In the PDF parser, you select the parsing method that works best with your PDFs.

- DeepDoc: (Default) The default visual model, which performs OCR, TSR (table structure recognition), and DLR (document layout recognition) on complex PDFs. It can be time-consuming.
- Naive: Skip OCR, TSR, and DLR tasks if *all* your PDFs are plain text.
- [MinerU](https://github.com/opendatalab/MinerU): (Experimental) An open-source tool that converts PDF into machine-readable formats.
- [Docling](https://github.com/docling-project/docling): (Experimental) An open-source document processing tool for gen AI.
- A third-party visual model from a specific model provider.

:::danger IMPORTANT
Starting from v0.22.0, RAGFlow includes MinerU (&ge; 2.6.3) as an optional PDF parser that supports multiple backends. Note that RAGFlow acts only as a *remote client* for MinerU: it calls the MinerU API to parse documents and reads the returned files. To use this feature:
:::

1. Prepare a reachable MinerU API service (FastAPI server).
2. In the **.env** file or from the **Model providers** page in the UI, configure RAGFlow as a remote client to MinerU:
   - `MINERU_APISERVER`: The MinerU API endpoint (e.g., `http://mineru-host:8886`).
   - `MINERU_BACKEND`: The MinerU backend:
      - `"pipeline"` (default)
      - `"vlm-http-client"`
      - `"vlm-transformers"`
      - `"vlm-vllm-engine"`
      - `"vlm-mlx-engine"`
      - `"vlm-vllm-async-engine"`
      - `"vlm-lmdeploy-engine"`
   - `MINERU_SERVER_URL`: (optional) The downstream vLLM HTTP server (e.g., `http://vllm-host:30000`). Applicable when `MINERU_BACKEND` is set to `"vlm-http-client"`. 
   - `MINERU_OUTPUT_DIR`: (optional) The local directory for holding the outputs of the MinerU API service (zip/JSON) before ingestion.
   - `MINERU_DELETE_OUTPUT`: Whether to delete temporary output when a temporary directory is used:
     - `1`: Delete.
     - `0`: Retain.
3. In the web UI, navigate to your dataset's **Configuration** page and find the **Ingestion pipeline** section:  
   - If you decide to use a chunking method from the **Built-in** dropdown, ensure it supports PDF parsing, then select **MinerU** from the **PDF parser** dropdown.
   - If you use a custom ingestion pipeline instead, select **MinerU** in the **PDF parser** section of the **Parser** component.
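
Before starting RAGFlow, it can help to sanity-check the MinerU variables from step 2. The sketch below is illustrative only and is not RAGFlow's own loading code; `read_mineru_settings` is a hypothetical helper, and the checks it applies (rejecting unknown backends, ignoring `MINERU_SERVER_URL` for other backends) are assumptions drawn from the descriptions above.

```python
import os
from typing import Mapping, Optional

# Backend names as listed in step 2.
MINERU_BACKENDS = {
    "pipeline",
    "vlm-http-client",
    "vlm-transformers",
    "vlm-vllm-engine",
    "vlm-mlx-engine",
    "vlm-vllm-async-engine",
    "vlm-lmdeploy-engine",
}


def read_mineru_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect MinerU settings from an environment mapping and sanity-check them."""
    env = os.environ if env is None else env

    backend = env.get("MINERU_BACKEND", "pipeline")  # "pipeline" is the default
    if backend not in MINERU_BACKENDS:
        raise ValueError(f"Unknown MINERU_BACKEND: {backend!r}")

    raw_delete = env.get("MINERU_DELETE_OUTPUT")
    if raw_delete not in (None, "0", "1"):
        raise ValueError("MINERU_DELETE_OUTPUT must be 0 or 1")

    return {
        "apiserver": env.get("MINERU_APISERVER"),
        "backend": backend,
        # The downstream vLLM server only applies to the vlm-http-client backend.
        "server_url": env.get("MINERU_SERVER_URL") if backend == "vlm-http-client" else None,
        "output_dir": env.get("MINERU_OUTPUT_DIR"),
        # None means the variable was not set.
        "delete_output": None if raw_delete is None else raw_delete == "1",
    }
```

Calling `read_mineru_settings()` with no argument reads from the current process environment; passing a dictionary makes the check easy to run against a candidate **.env** file's values.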

To use an external Docling Serve instance (instead of local in-process Docling), set:

- `DOCLING_SERVER_URL`: The Docling Serve API endpoint (for example, `http://docling-host:5001`).

When `DOCLING_SERVER_URL` is set, RAGFlow sends PDF content to Docling Serve (`/v1/convert/source`, falling back to `/v1alpha/convert/source`) and ingests the returned markdown/text. If the variable is not set, RAGFlow continues to use local, in-process Docling, which requires `USE_DOCLING=true` and the Docling package to be installed.
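
The endpoint fallback described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not RAGFlow's actual implementation: `docling_candidate_urls` and `convert_with_fallback` are hypothetical names, and the `send` callable stands in for the real HTTP request, whose payload is not covered here.

```python
from typing import Callable, List, Optional

# Endpoint paths in the order described above: /v1 first, then /v1alpha.
DOCLING_CONVERT_PATHS = ("/v1/convert/source", "/v1alpha/convert/source")


def docling_candidate_urls(server_url: str) -> List[str]:
    """Build the endpoint URLs to try, in order, from DOCLING_SERVER_URL."""
    base = server_url.rstrip("/")
    return [base + path for path in DOCLING_CONVERT_PATHS]


def convert_with_fallback(
    server_url: str,
    send: Callable[[str], Optional[str]],
) -> str:
    """Try each endpoint in order.

    `send(url)` is assumed to return the converted text, or None if that
    endpoint is unavailable.
    """
    for url in docling_candidate_urls(server_url):
        result = send(url)
        if result is not None:
            return result
    raise RuntimeError("No Docling Serve endpoint accepted the request")
```

Injecting `send` keeps the fallback order testable without a running Docling Serve instance.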

:::note
All MinerU environment variables are optional. When set, these values are used to auto-provision a MinerU OCR model for the tenant on first use. To avoid auto-provisioning, skip the environment variable settings and only configure MinerU from the **Model providers** page in the UI.
:::

:::caution WARNING
Third-party visual models are marked **Experimental**, because we have not fully tested these models for the aforementioned data extraction tasks.
:::

### Spreadsheet parser

A spreadsheet parser outputs `html`, preserving the original layout and table structure. You may remove this parser if your dataset contains no spreadsheets.

### Image parser

An Image parser uses a native OCR model for text extraction by default. You may select a VLM instead, provided that you have properly configured it on the **Model providers** page.

### Email parser

With the Email parser, you select the fields to parse from emails, such as **subject** and **body**. The parser then extracts text from the selected fields.

### Text&Markup parser

A Text&Markup parser automatically removes all formatting tags (e.g., those from HTML and Markdown files) to output clean, plain text only.

### Word parser

A Word parser outputs `json`, preserving the original document structure information, including titles, paragraphs, tables, headers, and footers.

### PowerPoint (PPT) parser

A PowerPoint parser extracts content from PowerPoint files into `json`, processing each slide individually and distinguishing between its title, body text, and notes.

### Audio parser

An Audio parser transcribes audio files to text. To use this parser, you must first configure an ASR model on the **Model providers** page.

### Video parser

A Video parser transcribes video files to text. To use this parser, you must first configure a VLM on the **Model providers** page.

## Output

The following global variables hold the output of the **Parser** component and can be referenced by subsequent components in the ingestion pipeline.

| Variable name | Type            |
|---------------|-----------------|
| `markdown`    | `string`        |
| `text`        | `string`        |
| `html`        | `string`        |
| `json`        | `Array<Object>` |
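
For readers who handle these outputs in their own code, the variable types in the table can be written down as a typed structure. This is only a sketch: `ParserOutput` and `first_available` are hypothetical names, and treating every variable as optional (only some are populated for a given file) is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List, Optional, Tuple, TypedDict


class ParserOutput(TypedDict, total=False):
    """Output variables from the table above; each key is assumed optional."""

    markdown: str
    text: str
    html: str
    json: List[Dict[str, Any]]


def first_available(
    output: ParserOutput,
    preferred: Tuple[str, ...] = ("json", "markdown", "html", "text"),
) -> Optional[Tuple[str, Any]]:
    """Return (name, value) for the first variable present in `output`."""
    for name in preferred:
        if name in output:
            return name, output[name]
    return None
```

The preference order in `first_available` is arbitrary and can be changed to suit the downstream component.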