Fix: MinerU vlm-http-client backend output file detection (#14240)

## Problem
When using MinerU with the `vlm-http-client` backend, the parser fails to
find the output files because they are written to a `vlm/` subdirectory,
which the `_read_output` method does not check.

## Error Message
[ERROR]MinerU not found.
[MinerU] Missing output file, tried: ...

## Root Cause
The MinerU API with the `vlm-http-client` backend returns output files in
the following structure:
  output_dir/
    vlm/
      filename_content_list.json
      filename.md
      images/

However, the `_read_output` method in `mineru_parser.py` only checks:
1. `output_dir/filename_content_list.json`
2. `output_dir/sanitized_filename_content_list.json`
3. `output_dir/sanitized_filename/sanitized_filename_content_list.json`

It does not check the `vlm/` subdirectory.
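
The miss can be reproduced outside the parser. The following is a minimal sketch, not the code in `mineru_parser.py`: the helper `original_candidates` and the file stem `report` are hypothetical names chosen for illustration, and the sanitized stem is assumed to equal the original stem.

```python
import tempfile
from pathlib import Path
from typing import List


def original_candidates(output_dir: Path, file_stem: str, safe_stem: str) -> List[Path]:
    """Hypothetical helper: the three locations listed above, in order."""
    return [
        output_dir / f"{file_stem}_content_list.json",
        output_dir / f"{safe_stem}_content_list.json",
        output_dir / safe_stem / f"{safe_stem}_content_list.json",
    ]


with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp)
    # Recreate the vlm-http-client layout: the JSON lives under vlm/.
    vlm_dir = output_dir / "vlm"
    vlm_dir.mkdir()
    (vlm_dir / "report_content_list.json").write_text("[]", encoding="utf-8")

    # None of the three original locations exists, so the list stays empty.
    found = [p for p in original_candidates(output_dir, "report", "report") if p.exists()]
    print(found)
```

With every candidate missing, the lookup has nothing to return, which is consistent with the "Missing output file" error shown earlier.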

## Solution
Added two fallback paths that check the `vlm/` subdirectory:
- `output_dir/vlm/filename_content_list.json`
- `output_dir/vlm/sanitized_filename_content_list.json`
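
The resulting lookup order can be sketched as a flat loop. This is an illustrative restructuring, not the committed code, which uses nested `if`/`else` branches; `find_content_list` and the `report` stem are hypothetical names, and the sanitized stem is assumed to equal the original stem.

```python
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def find_content_list(
    output_dir: Path, file_stem: str, safe_stem: str
) -> Tuple[Optional[Path], List[Path]]:
    """Return the first existing candidate and every path tried so far."""
    candidates = [
        output_dir / f"{file_stem}_content_list.json",
        output_dir / f"{safe_stem}_content_list.json",
        output_dir / safe_stem / f"{safe_stem}_content_list.json",
        # Fallbacks for the vlm-http-client layout.
        output_dir / "vlm" / f"{file_stem}_content_list.json",
        output_dir / "vlm" / f"{safe_stem}_content_list.json",
    ]
    attempted: List[Path] = []
    for candidate in candidates:
        attempted.append(candidate)
        if candidate.exists():
            return candidate, attempted
    return None, attempted


with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp)
    (output_dir / "vlm").mkdir()
    (output_dir / "vlm" / "report_content_list.json").write_text("[]", encoding="utf-8")

    json_file, attempted = find_content_list(output_dir, "report", "report")
    found_name = json_file.name if json_file else None
    tried_count = len(attempted)
```

Returning the attempted paths alongside the result keeps the information needed for a "tried: ..." error message when nothing is found.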

## Testing
Tested against the MinerU API with the `vlm-http-client` backend. The parser
now finds and processes the output files.

## Related
This issue occurs specifically when using:
- MinerU backend: `vlm-http-client`
- MinerU server URL configured for remote vLLM inference

刘康伟
2026-05-19 12:28:31 +08:00
committed by GitHub
parent 87d22a4415
commit c6e3a2e713

@@ -539,6 +539,21 @@ class MinerUParser(RAGFlowPdfParser):
                if nested_alt.exists():
                    subdir = nested_alt.parent
                    json_file = nested_alt
                else:
                    # Try vlm subdirectory (for vlm-http-client backend)
                    vlm_path = output_dir / "vlm" / f"{file_stem}_content_list.json"
                    self.logger.info(f"[MinerU] Trying vlm subdirectory: {vlm_path}")
                    attempted.append(vlm_path)
                    if vlm_path.exists():
                        subdir = vlm_path.parent
                        json_file = vlm_path
                    else:
                        vlm_safe = output_dir / "vlm" / f"{safe_stem}_content_list.json"
                        self.logger.info(f"[MinerU] Trying vlm subdirectory with sanitized name: {vlm_safe}")
                        attempted.append(vlm_safe)
                        if vlm_safe.exists():
                            subdir = vlm_safe.parent
                            json_file = vlm_safe
        if not json_file:
            parse_subdir = None