agent/tools/crawler.py

#
#  Copyright 2024 The InfiniFlow Authors. All Rights Reserved.
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
#
from abc import ABC
import asyncio
from crawl4ai import AsyncWebCrawler
from agent.tools.base import ToolParamBase, ToolBase


class CrawlerParam(ToolParamBase):
    """
    Define the Crawler component parameters.
    """

    def __init__(self):
        super().__init__()
        self.proxy = None
        self.extract_type = "markdown"

    def check(self):
        self.check_valid_value(self.extract_type, "Type of content from the crawler", ["html", "markdown", "content"])


class Crawler(ToolBase, ABC):
    """Crawl a single URL with crawl4ai after it passes the SSRF guard."""

    component_name = "Crawler"

    def _run(self, history, **kwargs):
        from common.ssrf_guard import assert_url_is_safe, pin_dns_global

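        # Join the upstream "content" values into one string; it is treated as the URL to crawl.
        ans = self.get_input()
        ans = " - ".join(ans["content"]) if "content" in ans else ""
        # Validate the URL before any request is made; the guard raises ValueError on rejection.
        try:
            _ssrf_hostname, _ssrf_ip = assert_url_is_safe(ans)
        except ValueError:
            return Crawler.be_output("URL not valid")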
        try:
            # pin_dns_global is used (not thread-local) because crawl4ai resolves
            # DNS in asyncio executor threads that don't share thread-local state.
            with pin_dns_global(_ssrf_hostname, _ssrf_ip):
                result = asyncio.run(self.get_web(ans))

            return Crawler.be_output(result)

        except Exception as e:
            return Crawler.be_output(f"An unexpected error occurred: {e}")

    async def get_web(self, url):
        """Crawl `url` and return the page in the configured `extract_type`, or None if canceled."""
        if self.check_if_canceled("Crawler async operation"):
            return

        proxy = self._param.proxy or None
        async with AsyncWebCrawler(verbose=True, proxy=proxy) as crawler:
            result = await crawler.arun(url=url, bypass_cache=True)

            if self.check_if_canceled("Crawler async operation"):
                return

            if self._param.extract_type == "html":
                return result.cleaned_html
            elif self._param.extract_type == "markdown":
                return result.markdown
            elif self._param.extract_type == "content":
                return result.extracted_content
            return result.markdown