ragflow

mirror of https://github.com/infiniflow/ragflow.git synced 2026-06-29 23:41:12 +08:00

Files

galuis116 6bfaa3f21e Fix: SSRF in markdown parser remote image fetch (#15438)

### What problem does this PR solve?

`rag/app/naive.py` `Markdown.load_images_from_urls` fetched image URLs parsed
straight out of an untrusted uploaded markdown document via a raw `requests.get`,
with no SSRF validation. Markdown chunking always reaches this path
(`return_section_images=True`), so any authenticated user who uploads a
`.md`/`.markdown`/`.mdx` file to a knowledge base could make the server issue
requests to internal services or cloud-metadata endpoints, e.g.
`![x](http://169.254.169.254/latest/meta-data/...)`. The `image/` Content-Type
check only gates decoding; the outbound request (the SSRF) always fires.
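
To make the exposure concrete, here is a minimal sketch of how an image URL in an uploaded markdown file can point at a non-public address. It is not the code in `rag/app/naive.py`; the regex and the helper names `extract_image_urls` and `targets_non_public_ip_literal` are assumptions made for illustration, and the check only covers IP literals (hostnames need DNS resolution to judge).

```python
import ipaddress
import re
from typing import List
from urllib.parse import urlsplit

# Hypothetical pattern: matches ![alt](url) and captures an http(s) URL.
IMAGE_URL_RE = re.compile(r"!\[[^\]]*\]\((https?://[^\s)]+)")


def extract_image_urls(markdown_text: str) -> List[str]:
    """Return every http(s) image URL referenced in the markdown text."""
    return IMAGE_URL_RE.findall(markdown_text)


def targets_non_public_ip_literal(url: str) -> bool:
    """True if the URL's host is an IP literal outside the public address space."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # no host at all: treat as unsafe
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A hostname, not an IP literal; it must be resolved before judging.
        return False
    return not addr.is_global


if __name__ == "__main__":
    doc = "![x](http://169.254.169.254/latest/meta-data/)"
    for url in extract_image_urls(doc):
        print(url, "blocked" if targets_non_public_ip_literal(url) else "allowed")
```

The link-local metadata address is flagged because `ipaddress` does not classify it as globally routable; a fetch that skips this kind of check sends the request regardless of what the response's Content-Type turns out to be.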

This was the one user-controlled fetch site missed by the project's existing
SSRF-hardening (`common/ssrf_guard.py`, already applied to the crawler, SearXNG,
RSS connector, MCP/document APIs, and OAuth avatar download).

The fix validates and DNS-pins every hop with
`common.ssrf_guard.assert_url_is_safe` before connecting, and follows redirects
manually so each redirect target is re-validated (closing the DNS-rebinding /
redirect-bypass window), mirroring `common/data_source/rss_connector.py`.
Blocked URLs are skipped and logged like any other unreachable image, so
legitimate public images are unaffected. Adds a regression test at
`test/unit_test/rag/app/test_markdown_image_ssrf.py`.
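
The validate-every-hop pattern described above can be sketched roughly as follows. This is not the project's implementation: `assert_public_url` is a simplified stand-in for `common.ssrf_guard.assert_url_is_safe` (whose real signature is not shown here), and `UnsafeURLError`, `fetch_once`, and `fetch_with_validated_redirects` are hypothetical names. `fetch_once` is assumed to perform a single request with automatic redirects disabled, connecting to the already-validated IP.

```python
import ipaddress
import socket
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin, urlsplit

# (status code, Location header or None, response body)
Response = Tuple[int, Optional[str], bytes]
REDIRECT_CODES = (301, 302, 303, 307, 308)


class UnsafeURLError(ValueError):
    """Raised when a URL is malformed or resolves to a non-public address."""


def assert_public_url(url: str) -> str:
    """Resolve the host once and return an IP to pin the connection to."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURLError(f"unsupported URL: {url}")
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    except (ValueError, socket.gaierror) as exc:
        raise UnsafeURLError(f"cannot resolve: {url}") from exc
    # Drop any IPv6 scope id ("%eth0") before parsing.
    addresses = [info[4][0].split("%")[0] for info in infos]
    for addr in addresses:
        if not ipaddress.ip_address(addr).is_global:
            raise UnsafeURLError(f"{url} resolves to non-public address {addr}")
    return addresses[0]


def fetch_with_validated_redirects(
    url: str,
    fetch_once: Callable[[str, str], Response],
    validate: Callable[[str], str] = assert_public_url,
    max_redirects: int = 5,
) -> bytes:
    """Follow redirects manually, re-validating the target of every hop."""
    for _ in range(max_redirects + 1):
        pinned_ip = validate(url)  # raises before any connection is made
        status, location, body = fetch_once(url, pinned_ip)
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # the next hop is validated on the next pass
            continue
        return body
    raise UnsafeURLError("too many redirects")
```

A caller would catch `UnsafeURLError` alongside ordinary network errors and skip the image. Passing the validated IP into `fetch_once` is what keeps a second DNS lookup from returning a different address between the check and the connection.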

Closes #15437 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Co-authored-by: Ubuntu <ubuntu@ubuntu-2204.linuxvmimages.local>
Co-authored-by: galuis116 <galuis116@users.noreply.github.com>

2026-06-16 18:54:55 +08:00

__init__.py

Update comments (#4569)

2025-01-21 20:52:28 +08:00

audio.py

fix: remove duplicate .wav and .aac in audio supported extensions list (#14791)

2026-05-13 09:42:31 +08:00

book.py

Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)

2026-04-27 14:57:20 +08:00

email.py

Fix: Remove hardcoded page limits causing parsing failures on large PDFs (>300 pages) (#14382)