Initial commit with translated description

227 README.md Normal file
@@ -0,0 +1,227 @@
# 🌐 Web Pilot — OpenClaw Skill

[Support on Ko-fi](https://ko-fi.com/liranudi)

A web search, page reading, and browser automation skill for [OpenClaw](https://github.com/openclaw/openclaw). No API keys required.

## ♿ Accessibility

This skill enables AI agents to **read, navigate, and interact with the web on behalf of users** — making it a powerful accessibility tool for people with visual impairments, motor disabilities, or cognitive challenges.

- **Screen reading on steroids** — extracts clean, structured text from any webpage, stripping away visual clutter, ads, and navigation noise
- **Voice-driven browsing** — when paired with an AI assistant, users can browse the web entirely through natural language ("scroll down", "click Sign In", "read me the Overview section")
- **Targeted content extraction** — grab specific sections, search for text, or screenshot regions without needing to visually scan a page
- **Form interaction** — fill inputs and submit forms via commands, removing the need for precise mouse/keyboard control
- **Cookie banner removal** — automatically dismisses consent popups that are notoriously difficult for screen readers

## Features

- **Web Search** — Multi-engine (DuckDuckGo, Brave, Google) with pagination
- **Page Reader** — Extract clean text from any URL with JS rendering
- **Persistent Browser** — Visible or headless browser with 20+ actions
- **Cookie Auto-Dismiss** — Automatically clears cookie consent banners
- **File Download** — Download files with auto-detection, PDF text extraction
- **Output Formats** — JSON, markdown, or plain text
- **Zero API Keys** — Everything runs locally
- **Partial Screenshots** — Capture viewport, full page, single elements, or ranges between two elements

## Requirements

- Python 3.8+
- `pip install requests beautifulsoup4 playwright Pillow`
- `playwright install chromium`
- Optional: `pip install pdfplumber` for PDF text extraction

## Installation

### As an OpenClaw Skill

```bash
cp -r web-pilot/ $(dirname $(which openclaw))/../lib/node_modules/openclaw/skills/web-pilot
```

### Standalone

```bash
git clone https://github.com/LiranUdi/web-pilot.git
cd web-pilot
pip install requests beautifulsoup4 playwright Pillow
playwright install chromium
```

## Usage

### 1. Search the Web

```bash
python3 scripts/google_search.py "search term" --pages 3 --engine brave
```

| Flag | Description | Default |
|------|-------------|---------|
| `--pages N` | Result pages (~10 results each) | 1 |
| `--engine` | `duckduckgo`, `brave`, or `google` | duckduckgo |

**Engine notes:**
- **duckduckgo** — Most reliable, no CAPTCHA
- **brave** — More results per page, broader sources
- **google** — Often blocked by CAPTCHA; last resort

### 2. Read a Page

```bash
python3 scripts/read_page.py "https://example.com" --max-chars 10000 --format markdown
```

| Flag | Description | Default |
|------|-------------|---------|
| `--max-chars N` | Max characters to extract | 50000 |
| `--visible` | Show browser window | off |
| `--format` | `json`, `markdown`, or `text` | json |
| `--no-dismiss` | Skip cookie consent auto-dismiss | off |

### 3. Persistent Browser Session

The browser session is a long-running process that stays open between commands, enabling stateful multi-step browsing.

```bash
# Open a page (flags: --headless, --proxy <url>, --user-agent <string>)
python3 scripts/browser_session.py open "https://example.com"
python3 scripts/browser_session.py open "https://example.com" --headless --user-agent "MyBot/1.0"

# Check current state
python3 scripts/browser_session.py status

# Navigate (returns response status, final URL, load time)
python3 scripts/browser_session.py navigate "https://other-site.com"

# Extract content in different formats
python3 scripts/browser_session.py extract --format markdown

# Scroll
python3 scripts/browser_session.py scroll down
python3 scripts/browser_session.py scroll up
python3 scripts/browser_session.py scroll "#section-id"   # scroll to element

# Wait
python3 scripts/browser_session.py wait 2                 # wait 2 seconds
python3 scripts/browser_session.py wait ".loading-done"   # wait for element

# Fill forms
python3 scripts/browser_session.py fill "input[name=q]" "search term"
python3 scripts/browser_session.py fill "input[name=q]" "search term" --submit

# Navigation history
python3 scripts/browser_session.py back
python3 scripts/browser_session.py forward
python3 scripts/browser_session.py reload

# Execute JavaScript
python3 scripts/browser_session.py eval "document.title"

# Extract all links
python3 scripts/browser_session.py links

# Screenshots
python3 scripts/browser_session.py screenshot /tmp/page.png                                  # viewport
python3 scripts/browser_session.py screenshot /tmp/full.png --full                           # full page
python3 scripts/browser_session.py screenshot /tmp/el.png --element "h1"                     # single element
python3 scripts/browser_session.py screenshot /tmp/range.png --from "#Overview" --to "#end"  # range

# Export page as PDF (headless only)
python3 scripts/browser_session.py pdf /tmp/page.pdf

# Click elements
python3 scripts/browser_session.py click "Sign In"
python3 scripts/browser_session.py click "#submit-btn"

# Search for text in the page
python3 scripts/browser_session.py search "pricing"

# Tab management
python3 scripts/browser_session.py tab new "https://docs.example.com"
python3 scripts/browser_session.py tab list
python3 scripts/browser_session.py tab switch 0
python3 scripts/browser_session.py tab close 1

# Dismiss cookie banners
python3 scripts/browser_session.py dismiss-cookies

# Close
python3 scripts/browser_session.py close
```

### 4. Download Files

```bash
python3 scripts/download_file.py "https://example.com/report.pdf" --output ~/docs
```

| Flag | Description | Default |
|------|-------------|---------|
| `--output DIR` | Save directory | /tmp/downloads |
| `--filename` | Override filename | auto-detected |

For PDFs, returns `extracted_text` if `pdfplumber` or `PyPDF2` is installed.
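
An agent can shell out to the downloader and read the result fields from the table above; here is a minimal Python sketch (the URL is a placeholder):

```python
# Minimal sketch: run the downloader and read the PDF text from its JSON output.
import json
import subprocess

proc = subprocess.run(
    ["python3", "scripts/download_file.py",
     "https://example.com/report.pdf", "--output", "/tmp/downloads"],
    capture_output=True, text=True, check=True,
)
result = json.loads(proc.stdout)
print(result["path"], result["size_bytes"])
# "extracted_text" is present for PDFs only when pdfplumber or PyPDF2 is installed
print(result.get("extracted_text", "")[:500])
```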

## Architecture

- **Search** — HTTP requests to DuckDuckGo/Brave/Google HTML endpoints
- **Page reading** — Playwright + Chromium with read-only DOM TreeWalker
- **Browser sessions** — Unix socket server with 4-byte length-prefix framing; forked child keeps browser alive, commands return immediately
- **Screenshots** — Range mode uses full-page capture + PIL crop for pixel-perfect section captures
- **Cookie dismiss** — Tries common selectors and button text patterns (Accept All, Got It, etc.)
- **Downloads** — Streams to disk with auto filename detection from headers/URL
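
The socket protocol is easy to speak directly; this minimal sketch mirrors the `_send_msg`/`_recv_msg` helpers in `scripts/browser_session.py` and assumes a session is already open:

```python
# Minimal sketch of the 4-byte big-endian length-prefix framing used by the
# session server on /tmp/web-pilot-browser.sock.
import json
import socket
import struct

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, or fail if the peer hangs up early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed while reading")
        buf += chunk
    return buf

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/tmp/web-pilot-browser.sock")
payload = json.dumps({"action": "status"}).encode()
sock.sendall(struct.pack(">I", len(payload)) + payload)  # length header + body
length = struct.unpack(">I", recv_exact(sock, 4))[0]     # read reply header
print(recv_exact(sock, length).decode())                 # then the JSON reply
sock.close()
```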

## Browser Session Reference

| Action | Description |
|--------|-------------|
| `open <url>` | Launch browser (flags: `--headless`, `--proxy`, `--user-agent`) |
| `navigate <url>` | Go to URL (returns status code, final URL, load time) |
| `extract` | Extract page content (`--format json\|markdown\|text`) |
| `screenshot <path>` | Capture (`--full`, `--element <sel>`, `--from <sel> --to <sel>`) |
| `click <target>` | Click by CSS selector, text, or button/link role |
| `scroll <dir\|sel>` | Scroll down/up or to a CSS selector |
| `wait <sec\|sel>` | Wait seconds or for element to appear |
| `fill <sel> <val>` | Fill input field (optional `--submit`) |
| `back` / `forward` / `reload` | Navigation history |
| `eval <js>` | Execute JavaScript, return result |
| `links` | Extract all links (href + text) |
| `search <text>` | Find text in page content |
| `pdf <path>` | Export as PDF (headless only) |
| `status` | Current URL, title, tab count |
| `tab new\|list\|switch\|close` | Multi-tab management |
| `dismiss-cookies` | Clear cookie consent banners |
| `close` | Shut down browser |

---

## For AI Agents (OpenClaw / LLM Integration)

### Workflow Pattern

1. **Search** → get URLs
2. **Read** or **Open** → extract content
3. **Scroll/Click/Navigate/Tab** → interact as needed
4. **Search** → find specific info in page
5. **Screenshot** → capture visual state (viewport, element, or range)
6. **Download** → grab linked files
7. **Close** → clean up
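
A minimal Python sketch of steps 1, 2, and 7 via `subprocess` (query and error handling simplified; assumes the scripts are run from the repo root):

```python
# Minimal sketch of the workflow: search, open the top hit, extract, clean up.
import json
import subprocess

def run(*args: str) -> str:
    """Run one of the skill scripts and return its stdout (JSON by default)."""
    return subprocess.run(["python3", *args], capture_output=True,
                          text=True, check=True).stdout

hits = json.loads(run("scripts/google_search.py", "playwright docs"))
top_url = hits[0]["url"]                                  # 1. Search -> get URLs
run("scripts/browser_session.py", "open", top_url)        # 2. Open -> extract content
page = json.loads(run("scripts/browser_session.py", "extract"))
print(page["title"])
run("scripts/browser_session.py", "close")                # 7. Close -> clean up
```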

### Important Notes

- All output defaults to **JSON to stdout**; use `--format` for alternatives
- `browser_session.py` is **stateful** — one session at a time, persists between commands
- `read_page.py` is **stateless** — opens/closes browser each call
- Cookie consent is **auto-dismissed** on open/navigate
- Always **close** browser sessions when done
- `Pillow` is required for range screenshots (`--from`/`--to`)

## Support

If this project is useful to you, consider [buying me a coffee](https://ko-fi.com/liranudi) ☕

## License

MIT

63 SKILL.md Normal file
@@ -0,0 +1,63 @@
---
name: web-pilot
description: "Search the web and read page content without API keys."
---

# Web Pilot

Four scripts, zero API keys. All output is JSON by default.

**Dependencies:** `requests`, `beautifulsoup4`, `playwright` (with Chromium).
**Optional:** `pdfplumber` or `PyPDF2` for PDF text extraction; `Pillow` for range screenshots.

Install: `pip install requests beautifulsoup4 playwright && playwright install chromium`

## 1. Search the Web

```bash
python3 scripts/google_search.py "query" --pages N --engine ENGINE
```

- `--engine` — `duckduckgo` (default), `brave`, or `google`
- Returns `[{title, url, snippet}, ...]`

## 2. Read a Page (one-shot)

```bash
python3 scripts/read_page.py "https://url" [--max-chars N] [--visible] [--format json|markdown|text] [--no-dismiss]
```

- `--format` — `json` (default), `markdown`, or `text`
- Auto-dismisses cookie consent banners (skip with `--no-dismiss`)

## 3. Persistent Browser Session

```bash
python3 scripts/browser_session.py open "https://url"           # Open + extract
python3 scripts/browser_session.py navigate "https://other"     # Go to new URL
python3 scripts/browser_session.py extract [--format FMT]       # Re-read page
python3 scripts/browser_session.py screenshot [path] [--full]   # Save screenshot
python3 scripts/browser_session.py click "Submit"               # Click by text/selector
python3 scripts/browser_session.py search "keyword"             # Search text in page
python3 scripts/browser_session.py tab new "https://url"        # Open new tab
python3 scripts/browser_session.py tab list                     # List all tabs
python3 scripts/browser_session.py tab switch 1                 # Switch to tab index
python3 scripts/browser_session.py tab close [index]            # Close tab
python3 scripts/browser_session.py dismiss-cookies              # Manually dismiss cookies
python3 scripts/browser_session.py close                        # Close browser
```

- Cookie consent auto-dismissed on open/navigate
- Multiple tabs supported — open, switch, close independently
- Search returns matching lines with line numbers
- Extract supports json/markdown/text output
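
For example, the `search` output can be consumed programmatically (a minimal sketch; assumes a session is open):

```python
# Minimal sketch: find where "pricing" appears on the current page.
import json
import subprocess

out = subprocess.run(
    ["python3", "scripts/browser_session.py", "search", "pricing"],
    capture_output=True, text=True, check=True,
).stdout
hits = json.loads(out)
for match in hits["results"]:
    print(match["line"], match["text"])  # matching lines with line numbers
```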

## 4. Download Files

```bash
python3 scripts/download_file.py "https://example.com/doc.pdf" [--output DIR] [--filename NAME]
```

- Auto-detects filename from URL/headers
- PDFs: extracts text if pdfplumber/PyPDF2 installed
- Returns `{status, path, filename, size_bytes, content_type, extracted_text}`

6 _meta.json Normal file
@@ -0,0 +1,6 @@
{
  "ownerId": "kn72vgg7f9v52jr01p0yamfz1n81b8n5",
  "slug": "web-pilot",
  "version": "1.0.0",
  "publishedAt": 1771349856982
}

775 scripts/browser_session.py Normal file
@@ -0,0 +1,775 @@
#!/usr/bin/env python3
"""Persistent browser session that stays open until told to close.

Usage:
    python3 browser_session.py open <url>                   Open URL in visible browser, extract content
    python3 browser_session.py navigate <url>                Go to new URL, extract content
    python3 browser_session.py extract [--format FMT]        Re-extract content from current page
    python3 browser_session.py screenshot [path] [--full]    Save screenshot
    python3 browser_session.py click <selector_or_text>      Click an element
    python3 browser_session.py scroll down|up|<selector>     Scroll the page or to an element
    python3 browser_session.py wait <seconds_or_selector>    Wait for time or for an element
    python3 browser_session.py fill <selector> <value>       Fill an input field (--submit to press Enter)
    python3 browser_session.py back|forward|reload           Navigation history
    python3 browser_session.py eval "<js_code>"              Execute JavaScript, return result
    python3 browser_session.py links                         Extract all links
    python3 browser_session.py search <text>                 Search for text in page content
    python3 browser_session.py pdf [path]                    Export page as PDF (headless only)
    python3 browser_session.py tab new <url>                 Open URL in new tab
    python3 browser_session.py tab list                      List all open tabs
    python3 browser_session.py tab switch <index>            Switch to tab by index
    python3 browser_session.py tab close [index]             Close tab (current if no index)
    python3 browser_session.py dismiss-cookies               Dismiss cookie consent banners
    python3 browser_session.py status                        Current URL, title, tab count
    python3 browser_session.py close                         Close browser

Formats for extract: json (default), markdown, text
"""

import json
import os
import re
import socket
import struct
import sys
import time

SOCKET_PATH = "/tmp/web-pilot-browser.sock"
PID_FILE = "/tmp/web-pilot-browser.pid"

EXTRACT_JS = """() => {
  const SKIP = new Set(['SCRIPT','STYLE','NOSCRIPT','IFRAME','SVG','NAV','FOOTER','HEADER','ASIDE']);
  const title = document.title || '';
  const mainEl = document.querySelector('article')
    || document.querySelector('main')
    || document.querySelector('[role="main"]')
    || document.querySelector('#content, .content, .post-content, .entry-content')
    || document.body;

  const lines = [];
  const walker = document.createTreeWalker(mainEl, NodeFilter.SHOW_ELEMENT, {
    acceptNode(node) {
      if (SKIP.has(node.tagName)) return NodeFilter.FILTER_REJECT;
      const tag = node.tagName.toLowerCase();
      if (['h1','h2','h3','h4','h5','h6','p','li','td','th','pre','blockquote'].includes(tag))
        return NodeFilter.FILTER_ACCEPT;
      return NodeFilter.FILTER_SKIP;
    }
  });
  let node;
  while (node = walker.nextNode()) {
    const text = node.innerText?.trim();
    if (!text) continue;
    const tag = node.tagName.toLowerCase();
    if (tag.startsWith('h')) lines.push('\\n' + '#'.repeat(parseInt(tag[1])) + ' ' + text + '\\n');
    else if (tag === 'li') lines.push('- ' + text);
    else if (tag === 'blockquote') lines.push('> ' + text);
    else lines.push(text);
  }
  let content = lines.join('\\n').trim();
  if (content.length < 200) content = mainEl.innerText || '';
  return { title, content };
}"""

# Common cookie consent selectors and text patterns
COOKIE_DISMISS_JS = """() => {
  const selectors = [
    'button[id*="accept" i]', 'button[id*="consent" i]', 'button[id*="agree" i]',
    'button[class*="accept" i]', 'button[class*="consent" i]', 'button[class*="agree" i]',
    'a[id*="accept" i]', 'a[class*="accept" i]',
    '[data-testid*="accept" i]', '[data-testid*="consent" i]',
    '.cookie-banner button', '.cookie-notice button', '.cookie-popup button',
    '#cookie-banner button', '#cookie-notice button', '#cookie-popup button',
    '.cc-btn.cc-dismiss', '.cc-accept', '#onetrust-accept-btn-handler',
    '.js-cookie-consent-agree', '[aria-label*="accept" i][aria-label*="cookie" i]',
    '[aria-label*="Accept all" i]', '[aria-label*="Accept cookies" i]',
  ];

  // Try selectors first
  for (const sel of selectors) {
    try {
      const el = document.querySelector(sel);
      if (el && el.offsetParent !== null) { el.click(); return { dismissed: true, method: 'selector', selector: sel }; }
    } catch(e) {}
  }

  // Try matching button text
  const patterns = [
    /^accept all$/i, /accept all cookies/i, /accept cookies/i, /accept & close/i,
    /^agree$/i, /agree and continue/i, /agree & continue/i,
    /consent and continue/i, /consent & continue/i,
    /got it/i, /i understand/i, /i agree/i,
    /allow all/i, /allow cookies/i, /allow all cookies/i,
    /^ok$/i, /^okay$/i, /^continue$/i, /^dismiss$/i,
    /accept and close/i, /accept and continue/i,
    /nur notwendige/i, /alle akzeptieren/i, /akzeptieren/i,
    /tout accepter/i, /accepter/i, /accepter et continuer/i,
  ];
  for (const btn of document.querySelectorAll('button, a[role="button"], [role="button"]')) {
    const text = btn.innerText?.trim();
    if (!text || text.length > 50) continue;
    for (const pat of patterns) {
      if (pat.test(text) && btn.offsetParent !== null) {
        btn.click();
        return { dismissed: true, method: 'text', matched: text };
      }
    }
  }

  return { dismissed: false };
}"""


def format_output(result: dict, fmt: str) -> str:
    """Format extraction result based on requested format."""
    if fmt == "text":
        # Strip markdown-ish formatting
        content = result.get("content", "")
        content = re.sub(r'^#+\s+', '', content, flags=re.MULTILINE)
        content = re.sub(r'^- ', ' ', content, flags=re.MULTILINE)
        content = re.sub(r'^> ', '', content, flags=re.MULTILINE)
        return content.strip()
    elif fmt == "markdown":
        return f"# {result.get('title', '')}\n\n{result.get('content', '')}"
    else:  # json
        return json.dumps(result, indent=2, ensure_ascii=False)


def dismiss_cookies(page):
    """Try to dismiss cookie consent in main frame and all iframes."""
    result = page.evaluate(COOKIE_DISMISS_JS)
    if result.get("dismissed"):
        page.wait_for_timeout(500)
        return result
    # Check iframes (many EU sites put consent in an iframe)
    for frame in page.frames:
        if frame == page.main_frame:
            continue
        try:
            result = frame.evaluate(COOKIE_DISMISS_JS)
            if result.get("dismissed"):
                page.wait_for_timeout(500)
                return result
        except Exception:
            pass
    return {"dismissed": False}


def run_server(url: str, headless: bool = False, proxy: str = None, user_agent: str = None):
    from playwright.sync_api import sync_playwright

    if os.path.exists(SOCKET_PATH):
        os.remove(SOCKET_PATH)

    pw = sync_playwright().start()
    launch_opts = {"headless": headless}
    if proxy:
        launch_opts["proxy"] = {"server": proxy}
    browser = pw.chromium.launch(**launch_opts)
    ua = user_agent or "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ctx = browser.new_context(
        user_agent=ua,
        locale="en-US",
        viewport={"width": 1280, "height": 900},
    )

    # Track pages (tabs)
    pages = [ctx.new_page()]
    active_idx = 0

    def active_page():
        return pages[active_idx]

    active_page().goto(url, timeout=30000, wait_until="domcontentloaded")
    active_page().wait_for_timeout(1500)

    # Auto-dismiss cookie consent on first load (main frame + iframes)
    dismiss_cookies(active_page())

    result = active_page().evaluate(EXTRACT_JS)
    with open("/tmp/web-pilot-initial.json", "w") as f:
        json.dump(result, f, ensure_ascii=False)

    with open(PID_FILE, "w") as f:
        f.write(str(os.getpid()))

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(SOCKET_PATH)
    sock.listen(1)
    sock.settimeout(1.0)

    running = True
    while running:
        try:
            conn, _ = sock.accept()
            raw = _recv_msg(conn)
            cmd = json.loads(raw.decode())
            action = cmd.get("action")

            if action == "close":
                _send_msg(conn, json.dumps({"status": "closing"}).encode())
                conn.close()
                running = False

            elif action == "navigate":
                t0 = time.time()
                response = None
                try:
                    response = active_page().goto(cmd["url"], timeout=30000, wait_until="domcontentloaded")
                except Exception:
                    # Playwright throws on HTTP error codes (4xx/5xx) — still extract what we can
                    pass
                active_page().wait_for_timeout(1500)
                load_time = round(time.time() - t0, 3)
                dismiss_cookies(active_page())
                result = active_page().evaluate(EXTRACT_JS)
                result["response_status"] = response.status if response else None
                result["final_url"] = active_page().url
                result["load_time_s"] = load_time
                mc = cmd.get("max_chars")
                if mc and len(result["content"]) > mc:
                    result["content"] = result["content"][:mc] + "\n\n[...truncated]"
                _send_msg(conn, json.dumps(result, ensure_ascii=False).encode())
                conn.close()

            elif action == "extract":
                result = active_page().evaluate(EXTRACT_JS)
                mc = cmd.get("max_chars")
                if mc and len(result["content"]) > mc:
                    result["content"] = result["content"][:mc] + "\n\n[...truncated]"
                fmt = cmd.get("format", "json")
                output = format_output(result, fmt) if fmt != "json" else json.dumps(result, ensure_ascii=False)
                _send_msg(conn, output.encode())
                conn.close()

            elif action == "screenshot":
                path = cmd.get("path", "/tmp/screenshot.png")
                full_page = cmd.get("full_page", False)
                element_sel = cmd.get("element")
                from_sel = cmd.get("from_sel")
                to_sel = cmd.get("to_sel")

                if element_sel:
                    # Screenshot a single element
                    el = active_page().query_selector(element_sel)
                    if el:
                        el.screenshot(path=path)
                        _send_msg(conn, json.dumps({
                            "status": "saved", "path": path, "mode": "element",
                            "selector": element_sel,
                            "url": active_page().url, "title": active_page().title(),
                            "tab": active_idx,
                        }).encode())
                    else:
                        _send_msg(conn, json.dumps({
                            "error": f"Element not found: {element_sel}"
                        }).encode())
                    conn.close()
                elif from_sel and to_sel:
                    # Screenshot a range between two elements using full-page screenshot + crop
                    bounds = active_page().evaluate("""([fromSel, toSel]) => {
                        const elFrom = document.querySelector(fromSel);
                        const elTo = document.querySelector(toSel);
                        if (!elFrom || !elTo) return null;
                        const r1 = elFrom.getBoundingClientRect();
                        const r2 = elTo.getBoundingClientRect();
                        return {
                            y: r1.top + window.scrollY,
                            y2: r2.bottom + window.scrollY,
                            pageWidth: document.documentElement.scrollWidth
                        };
                    }""", [from_sel, to_sel])
                    if bounds:
                        import tempfile
                        # Take full-page screenshot to a temp file
                        tmp = tempfile.mktemp(suffix=".png")
                        active_page().screenshot(path=tmp, full_page=True)
                        # Crop using PIL
                        try:
                            from PIL import Image
                            im = Image.open(tmp)
                            # Playwright full_page screenshots use device pixel ratio
                            scale = im.width / bounds["pageWidth"] if bounds["pageWidth"] else 1
                            top = int(bounds["y"] * scale)
                            bottom = int(bounds["y2"] * scale)
                            cropped = im.crop((0, top, im.width, bottom))
                            cropped.save(path)
                            os.remove(tmp)
                            _send_msg(conn, json.dumps({
                                "status": "saved", "path": path, "mode": "range",
                                "from": from_sel, "to": to_sel,
                                "url": active_page().url, "title": active_page().title(),
                                "tab": active_idx,
                            }).encode())
                        except Exception as e:
                            try:
                                os.remove(tmp)
                            except OSError:
                                pass
                            _send_msg(conn, json.dumps({"error": f"Crop failed: {str(e)}"}).encode())
                    else:
                        _send_msg(conn, json.dumps({"error": f"One or both selectors not found: {from_sel}, {to_sel}"}).encode())
                    conn.close()
                else:
                    active_page().screenshot(path=path, full_page=full_page)
                    _send_msg(conn, json.dumps({
                        "status": "saved", "path": path, "mode": "full_page" if full_page else "viewport",
                        "url": active_page().url, "title": active_page().title(),
                        "tab": active_idx,
                    }).encode())
                    conn.close()

            elif action == "click":
                target = cmd.get("target", "")
                clicked = False
                try:
                    el = active_page().query_selector(target)
                    if el:
                        el.click()
                        clicked = True
                except Exception:
                    pass
                if not clicked:
                    try:
                        active_page().get_by_text(target, exact=False).first.click()
                        clicked = True
                    except Exception:
                        pass
                if not clicked:
                    try:
                        active_page().get_by_role("button", name=target).or_(
                            active_page().get_by_role("link", name=target)
                        ).first.click()
                        clicked = True
                    except Exception:
                        pass
                active_page().wait_for_timeout(1000)
                result = {"status": "clicked" if clicked else "not_found", "target": target, "url": active_page().url}
                _send_msg(conn, json.dumps(result, ensure_ascii=False).encode())
                conn.close()

            elif action == "dismiss_cookies":
                result = dismiss_cookies(active_page())
                _send_msg(conn, json.dumps(result, ensure_ascii=False).encode())
                conn.close()

            elif action == "search":
                query = cmd.get("query", "").lower()
                result = active_page().evaluate(EXTRACT_JS)
                content = result.get("content", "")
                lines = content.split("\n")
                matches = []
                for i, line in enumerate(lines):
                    if query in line.lower():
                        matches.append({"line": i + 1, "text": line.strip()})
                _send_msg(conn, json.dumps({
                    "query": query,
                    "matches": len(matches),
                    "results": matches[:50],  # cap at 50
                    "url": active_page().url,
                }, ensure_ascii=False).encode())
                conn.close()

            elif action == "tab_new":
                new_page = ctx.new_page()
                pages.append(new_page)
                active_idx = len(pages) - 1
                new_page.goto(cmd["url"], timeout=30000, wait_until="domcontentloaded")
                new_page.wait_for_timeout(1500)
                dismiss_cookies(new_page)
                result = new_page.evaluate(EXTRACT_JS)
                result["tab"] = active_idx
                result["total_tabs"] = len(pages)
                _send_msg(conn, json.dumps(result, ensure_ascii=False).encode())
                conn.close()

            elif action == "tab_list":
                tab_info = []
                for i, p in enumerate(pages):
                    try:
                        tab_info.append({
                            "index": i,
                            "title": p.title(),
                            "url": p.url,
                            "active": i == active_idx,
                        })
                    except Exception:
                        tab_info.append({"index": i, "title": "(closed)", "url": "", "active": i == active_idx})
                _send_msg(conn, json.dumps({"tabs": tab_info, "active": active_idx}, ensure_ascii=False).encode())
                conn.close()

            elif action == "tab_switch":
                idx = cmd.get("index", 0)
                if 0 <= idx < len(pages):
                    active_idx = idx
                    pages[active_idx].bring_to_front()
                    _send_msg(conn, json.dumps({
                        "status": "switched", "tab": active_idx,
                        "title": pages[active_idx].title(),
                        "url": pages[active_idx].url,
                    }, ensure_ascii=False).encode())
                else:
                    _send_msg(conn, json.dumps({"error": f"Invalid tab index {idx}. Have {len(pages)} tabs."}).encode())
                conn.close()

            elif action == "tab_close":
                idx = cmd.get("index", active_idx)
                if len(pages) <= 1:
                    _send_msg(conn, json.dumps({"error": "Cannot close the last tab. Use 'close' to close the browser."}).encode())
                elif 0 <= idx < len(pages):
                    pages[idx].close()
                    pages.pop(idx)
                    if active_idx >= len(pages):
                        active_idx = len(pages) - 1
                    elif active_idx > idx:
                        active_idx -= 1
                    pages[active_idx].bring_to_front()
                    _send_msg(conn, json.dumps({
                        "status": "tab_closed", "closed_index": idx,
                        "active": active_idx, "total_tabs": len(pages),
                    }, ensure_ascii=False).encode())
                else:
                    _send_msg(conn, json.dumps({"error": f"Invalid tab index {idx}"}).encode())
                conn.close()

            elif action == "scroll":
                direction = cmd.get("direction", "down")
                if direction == "down":
                    active_page().evaluate("window.scrollBy(0, window.innerHeight)")
                elif direction == "up":
                    active_page().evaluate("window.scrollBy(0, -window.innerHeight)")
                else:
                    # Treat as CSS selector
                    active_page().evaluate(f"document.querySelector({json.dumps(direction)})?.scrollIntoView({{behavior:'smooth',block:'center'}})")
                active_page().wait_for_timeout(300)
                _send_msg(conn, json.dumps({"status": "scrolled", "direction": direction, "url": active_page().url}).encode())
                conn.close()

            elif action == "wait":
                target = cmd.get("target", "1")
                try:
                    seconds = float(target)
                    active_page().wait_for_timeout(int(seconds * 1000))
                    _send_msg(conn, json.dumps({"status": "waited", "seconds": seconds}).encode())
                except ValueError:
                    # CSS selector
                    try:
                        active_page().wait_for_selector(target, timeout=30000)
                        _send_msg(conn, json.dumps({"status": "found", "selector": target}).encode())
                    except Exception as e:
                        _send_msg(conn, json.dumps({"status": "timeout", "selector": target, "error": str(e)}).encode())
                conn.close()

            elif action == "fill":
                selector = cmd.get("selector", "")
                value = cmd.get("value", "")
                submit = cmd.get("submit", False)
                try:
                    active_page().fill(selector, value)
                    if submit:
                        active_page().press(selector, "Enter")
                        active_page().wait_for_timeout(1000)
                    _send_msg(conn, json.dumps({"status": "filled", "selector": selector, "submitted": submit, "url": active_page().url}).encode())
                except Exception as e:
                    _send_msg(conn, json.dumps({"error": str(e)}).encode())
                conn.close()

            elif action in ("back", "forward", "reload"):
                if action == "back":
                    active_page().go_back(timeout=30000, wait_until="domcontentloaded")
                elif action == "forward":
                    active_page().go_forward(timeout=30000, wait_until="domcontentloaded")
                else:
                    active_page().reload(timeout=30000, wait_until="domcontentloaded")
                active_page().wait_for_timeout(500)
                _send_msg(conn, json.dumps({"status": action, "url": active_page().url, "title": active_page().title()}).encode())
                conn.close()

            elif action == "eval":
                js_code = cmd.get("code", "")
                try:
                    result = active_page().evaluate(js_code)
                    _send_msg(conn, json.dumps({"status": "ok", "result": result}, ensure_ascii=False, default=str).encode())
                except Exception as e:
                    _send_msg(conn, json.dumps({"status": "error", "error": str(e)}).encode())
                conn.close()

            elif action == "links":
                links_js = """() => {
                    return Array.from(document.querySelectorAll('a[href]')).map(a => ({
                        href: a.href, text: (a.innerText || '').trim().substring(0, 200)
                    })).filter(l => l.href && !l.href.startsWith('javascript:'))
                }"""
                result = active_page().evaluate(links_js)
                _send_msg(conn, json.dumps({"links": result, "count": len(result), "url": active_page().url}, ensure_ascii=False).encode())
                conn.close()

            elif action == "pdf":
                path = cmd.get("path", "/tmp/page.pdf")
                try:
                    active_page().pdf(path=path)
                    _send_msg(conn, json.dumps({"status": "saved", "path": path}).encode())
                except Exception as e:
                    _send_msg(conn, json.dumps({"error": str(e)}).encode())
                conn.close()

            elif action == "status":
                _send_msg(conn, json.dumps({
                    "url": active_page().url,
                    "title": active_page().title(),
                    "active_tab": active_idx,
                    "total_tabs": len(pages),
                }).encode())
                conn.close()

            else:
                _send_msg(conn, json.dumps({"error": f"unknown action: {action}"}).encode())
                conn.close()

        except socket.timeout:
            continue
        except Exception as e:
            try:
                _send_msg(conn, json.dumps({"error": str(e)}).encode())
                conn.close()
            except Exception:
                pass

    sock.close()
    for f in [SOCKET_PATH, PID_FILE]:
        if os.path.exists(f):
            os.remove(f)
    browser.close()
    pw.stop()


def _recv_exact(sock, n):
    """Read exactly n bytes from socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("Socket closed while reading")
        buf += chunk
    return buf


def _send_msg(sock, data: bytes):
    """Send a length-prefixed message."""
    sock.sendall(struct.pack('>I', len(data)) + data)


def _recv_msg(sock) -> bytes:
    """Receive a length-prefixed message."""
    header = _recv_exact(sock, 4)
    length = struct.unpack('>I', header)[0]
    return _recv_exact(sock, length)


def send_command(cmd: dict) -> str:
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(60)
    sock.connect(SOCKET_PATH)
    _send_msg(sock, json.dumps(cmd).encode())
    result = _recv_msg(sock)
    sock.close()
    return result.decode("utf-8", errors="replace")


def main():
    if len(sys.argv) < 2:
        print("Usage: browser_session.py <open|navigate|extract|screenshot|click|search|tab|close> [args]")
        sys.exit(1)

    action = sys.argv[1]

    if action == "open":
        headless = "--headless" in sys.argv
        # Parse --proxy and --user-agent
        proxy = None
        user_agent = None
        i = 2
        while i < len(sys.argv):
            if sys.argv[i] == "--proxy" and i + 1 < len(sys.argv):
                proxy = sys.argv[i + 1]; i += 2
            elif sys.argv[i] == "--user-agent" and i + 1 < len(sys.argv):
                user_agent = sys.argv[i + 1]; i += 2
            else:
                i += 1
        args = [a for a in sys.argv[2:] if not a.startswith("--") and a != proxy and a != user_agent]
        if not args:
            print("Usage: browser_session.py open <url> [--headless] [--proxy <url>] [--user-agent <string>]")
            sys.exit(1)
        url = args[0]

        # Stale PID/socket cleanup
        if os.path.exists(SOCKET_PATH):
            stale = True
            if os.path.exists(PID_FILE):
                try:
                    old_pid = int(open(PID_FILE).read().strip())
                    os.kill(old_pid, 0)  # check if alive
                    stale = False
                except (OSError, ValueError):
                    pass
            if not stale:
                print(json.dumps({"error": "Browser session already open. Use 'navigate', 'extract', or 'close'."}))
                sys.exit(1)
            # Clean up stale files
            try:
                os.remove(SOCKET_PATH)
            except OSError:
                pass
            try:
                os.remove(PID_FILE)
            except OSError:
                pass

        # Fork: the child detaches and runs the browser server; the parent waits
        # for the handshake file the server writes after the first extraction.
        pid = os.fork()
        if pid == 0:
            os.setsid()
            sys.stdout = open(os.devnull, "w")
            sys.stderr = open(os.devnull, "w")
            run_server(url, headless=headless, proxy=proxy, user_agent=user_agent)
            sys.exit(0)
        else:
            for _ in range(30):
                if os.path.exists("/tmp/web-pilot-initial.json"):
                    time.sleep(0.2)
                    with open("/tmp/web-pilot-initial.json") as f:
                        result = json.load(f)
                    os.remove("/tmp/web-pilot-initial.json")
                    result["status"] = "browser open"
                    result["note"] = "Commands: navigate, extract, screenshot, click, search, tab, close"
                    print(json.dumps(result, indent=2, ensure_ascii=False))
                    sys.exit(0)
                time.sleep(0.5)
            print(json.dumps({"error": "Timeout waiting for browser to start"}))
            sys.exit(1)

    elif action == "navigate":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py navigate <url>")
            sys.exit(1)
        print(send_command({"action": "navigate", "url": sys.argv[2], "max_chars": 50000}))

    elif action == "extract":
        fmt = "json"
        if "--format" in sys.argv:
            idx = sys.argv.index("--format")
            if idx + 1 < len(sys.argv):
                fmt = sys.argv[idx + 1]
        print(send_command({"action": "extract", "max_chars": 50000, "format": fmt}))

    elif action == "screenshot":
        path = "/tmp/screenshot.png"
        full_page = "--full" in sys.argv
        element_sel = None
        from_sel = None
        to_sel = None
        # Parse flags
        args = sys.argv[2:]
        i = 0
        positional = []
        while i < len(args):
            if args[i] == "--element" and i + 1 < len(args):
                element_sel = args[i + 1]; i += 2
            elif args[i] == "--from" and i + 1 < len(args):
                from_sel = args[i + 1]; i += 2
            elif args[i] == "--to" and i + 1 < len(args):
                to_sel = args[i + 1]; i += 2
            elif args[i] == "--full":
                i += 1
            elif not args[i].startswith("--"):
                positional.append(args[i]); i += 1
            else:
                i += 1
        if positional:
            path = positional[0]
        cmd = {"action": "screenshot", "path": path, "full_page": full_page}
        if element_sel:
            cmd["element"] = element_sel
        if from_sel:
            cmd["from_sel"] = from_sel
        if to_sel:
            cmd["to_sel"] = to_sel
        print(send_command(cmd))

    elif action == "click":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py click <selector_or_text>")
            sys.exit(1)
        target = " ".join(a for a in sys.argv[2:] if not a.startswith("--"))
        print(send_command({"action": "click", "target": target}))

    elif action == "search":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py search <text>")
            sys.exit(1)
        query = " ".join(sys.argv[2:])
        print(send_command({"action": "search", "query": query}))

    elif action == "tab":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py tab <new|list|switch|close> [args]")
            sys.exit(1)
        sub = sys.argv[2]
        if sub == "new":
            if len(sys.argv) < 4:
                print("Usage: browser_session.py tab new <url>")
                sys.exit(1)
            print(send_command({"action": "tab_new", "url": sys.argv[3]}))
        elif sub == "list":
            print(send_command({"action": "tab_list"}))
        elif sub == "switch":
            if len(sys.argv) < 4:
                print("Usage: browser_session.py tab switch <index>")
                sys.exit(1)
            print(send_command({"action": "tab_switch", "index": int(sys.argv[3])}))
        elif sub == "close":
            idx = int(sys.argv[3]) if len(sys.argv) > 3 else -1
            cmd = {"action": "tab_close"}
            if idx >= 0:
                cmd["index"] = idx
            print(send_command(cmd))
        else:
            print(f"Unknown tab command: {sub}")
            sys.exit(1)

    elif action == "dismiss-cookies":
        print(send_command({"action": "dismiss_cookies"}))

    elif action == "scroll":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py scroll down|up|<selector>")
            sys.exit(1)
        print(send_command({"action": "scroll", "direction": sys.argv[2]}))

    elif action == "wait":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py wait <seconds_or_selector>")
            sys.exit(1)
        print(send_command({"action": "wait", "target": sys.argv[2]}))

    elif action == "fill":
        if len(sys.argv) < 4:
            print("Usage: browser_session.py fill <selector> <value> [--submit]")
            sys.exit(1)
        submit = "--submit" in sys.argv
        print(send_command({"action": "fill", "selector": sys.argv[2], "value": sys.argv[3], "submit": submit}))

    elif action in ("back", "forward", "reload"):
        print(send_command({"action": action}))

    elif action == "eval":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py eval \"<js_code>\"")
            sys.exit(1)
        print(send_command({"action": "eval", "code": " ".join(sys.argv[2:])}))

    elif action == "links":
        print(send_command({"action": "links"}))

    elif action == "pdf":
        path = sys.argv[2] if len(sys.argv) > 2 else "/tmp/page.pdf"
        print(send_command({"action": "pdf", "path": path}))

    elif action == "status":
        print(send_command({"action": "status"}))

    elif action == "close":
        print(send_command({"action": "close"}))

    else:
        print(f"Unknown action: {action}")
        sys.exit(1)


if __name__ == "__main__":
    main()

190 scripts/download_file.py Normal file
@@ -0,0 +1,190 @@
#!/usr/bin/env python3
"""Download files from URLs. Handles PDFs, images, documents, and any binary content.

Usage:
    python3 download_file.py <url> [--output DIR] [--filename NAME]

Flags:
    --output DIR      Directory to save to (default: /tmp/downloads)
    --filename NAME   Override filename (auto-detected from URL/headers if omitted)

Outputs JSON {status, path, filename, size_bytes, content_type}.
Detects file type from Content-Type header and URL. For PDFs, also extracts
text if possible (requires pdfplumber or falls back to basic extraction).
"""

import argparse
import json
import os
import re
import urllib.parse

import requests


def json_error(message: str) -> str:
    """Return standardized JSON error format."""
    return json.dumps({"error": message}, indent=2, ensure_ascii=False)


HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}

# File types we handle specially
TEXT_EXTRACTABLE = {
    "application/pdf": "pdf",
}


def guess_filename(url: str, resp: requests.Response) -> str:
    """Determine filename from Content-Disposition, URL, or Content-Type."""
    # Check Content-Disposition header
    cd = resp.headers.get("Content-Disposition", "")
    if "filename=" in cd:
        match = re.search(r'filename[*]?=["\']?([^"\';]+)', cd)
        if match:
            return match.group(1).strip()

    # Extract from URL path
    parsed = urllib.parse.urlparse(url)
    path_name = os.path.basename(parsed.path)
    if path_name and "." in path_name:
        return urllib.parse.unquote(path_name)

    # Fall back to content type
    ct = resp.headers.get("Content-Type", "")
    ext_map = {
        "application/pdf": "download.pdf",
        "image/png": "download.png",
        "image/jpeg": "download.jpg",
        "image/gif": "download.gif",
        "image/webp": "download.webp",
        "application/zip": "download.zip",
        "text/html": "download.html",
        "text/plain": "download.txt",
        "application/json": "download.json",
    }
    for mime, name in ext_map.items():
        if mime in ct:
            return name

    return "download.bin"


def extract_pdf_text(filepath: str) -> str:
    """Try to extract text from a PDF. Returns empty string on failure."""
    # Try pdfplumber first
    try:
        import pdfplumber
        text_parts = []
        with pdfplumber.open(filepath) as pdf:
            for page in pdf.pages:
                t = page.extract_text()
                if t:
                    text_parts.append(t)
        return "\n\n".join(text_parts)
    except ImportError:
        pass

    # Try PyPDF2
    try:
        from PyPDF2 import PdfReader
        reader = PdfReader(filepath)
        text_parts = []
        for page in reader.pages:
            t = page.extract_text()
            if t:
                text_parts.append(t)
        return "\n\n".join(text_parts)
    except ImportError:
        pass

    return ""


def download(url: str, output_dir: str = "/tmp/downloads", filename: str = None,
             proxy: str = None, user_agent: str = None) -> dict:
    os.makedirs(output_dir, exist_ok=True)

    headers = HEADERS.copy()
    if user_agent:
        headers["User-Agent"] = user_agent

    proxies = {}
    if proxy:
        proxies = {"http": proxy, "https": proxy}

    try:
        resp = requests.get(url, headers=headers, timeout=30, stream=True,
                            allow_redirects=True, proxies=proxies)
    except requests.exceptions.SSLError:
        # Retry without SSL verification if certs are broken
        resp = requests.get(url, headers=headers, timeout=30, stream=True,
                            allow_redirects=True, proxies=proxies, verify=False)
    resp.raise_for_status()

    if not filename:
        filename = guess_filename(url, resp)

    filepath = os.path.join(output_dir, filename)

    # Avoid overwriting — add suffix if exists
    base, ext = os.path.splitext(filepath)
    counter = 1
    while os.path.exists(filepath):
        filepath = f"{base}_{counter}{ext}"
        counter += 1

    # Stream to disk
    total = 0
    with open(filepath, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
            total += len(chunk)

    content_type = resp.headers.get("Content-Type", "unknown")
    result = {
        "status": "downloaded",
        "path": filepath,
        "filename": os.path.basename(filepath),
        "size_bytes": total,
        "content_type": content_type,
        "url": url,
    }

    # Add redirect URL if redirected
    if resp.url != url:
        result["redirect_url"] = resp.url

    # Extract text from PDFs
    if "pdf" in content_type.lower() or filepath.lower().endswith(".pdf"):
        text = extract_pdf_text(filepath)
        if text:
            result["extracted_text"] = text
            result["extracted_chars"] = len(text)
        else:
            result["extracted_text"] = ""
            result["note"] = "PDF text extraction failed. Install pdfplumber or PyPDF2 for text extraction."

    return result


def main():
    parser = argparse.ArgumentParser(description="Download files from URLs")
    parser.add_argument("url", help="URL to download")
    parser.add_argument("--output", default="/tmp/downloads", help="Output directory (default: /tmp/downloads)")
    parser.add_argument("--filename", default=None, help="Override filename")
    parser.add_argument("--proxy", help="Proxy URL (e.g., http://proxy:8080)")
    parser.add_argument("--user-agent", help="Override User-Agent string")
    args = parser.parse_args()

    try:
        result = download(args.url, args.output, args.filename, args.proxy, args.user_agent)
        print(json.dumps(result, indent=2, ensure_ascii=False))
    except Exception as e:
        print(json_error(f"Download failed: {str(e)}"))


if __name__ == "__main__":
    main()

182 scripts/google_search.py Normal file
@@ -0,0 +1,182 @@
#!/usr/bin/env python3
"""Web search via multiple engines. No API key required.

Usage:
    python3 google_search.py "search term" [--pages N] [--engine ENGINE]

Flags:
    --pages N         Number of result pages (default: 1, ~10 results each)
    --engine ENGINE   Search engine: duckduckgo (default), brave, google
                      Note: google often blocks with CAPTCHA

Outputs JSON array of {title, url, snippet} per result.
"""

from __future__ import annotations  # allow list[dict] annotations on Python 3.8

import argparse
import json
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup


def json_error(message: str) -> str:
    """Return standardized JSON error format."""
    return json.dumps({"error": message}, indent=2, ensure_ascii=False)


HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def search_duckduckgo(query: str, pages: int = 1) -> list[dict]:
    """DuckDuckGo HTML endpoint — most reliable, no CAPTCHA."""
    results = []
    form_data = {"q": query}

    for page in range(pages):
        resp = requests.post("https://html.duckduckgo.com/html/", data=form_data, headers=HEADERS, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        for res in soup.select(".result"):
            title_el = res.select_one(".result__title a, a.result__a")
            snippet_el = res.select_one(".result__snippet")
            if not title_el:
                continue
            href = title_el.get("href", "")
            if "uddg=" in href:
                href = urllib.parse.unquote(
                    urllib.parse.parse_qs(urllib.parse.urlparse(href).query).get("uddg", [href])[0]
                )
            if href.startswith("http"):
                results.append({
                    "title": title_el.get_text(strip=True),
                    "url": href,
                    "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
                })

        if page < pages - 1:
            next_form = None
            for btn in soup.find_all("input", {"value": "Next"}):
                if btn.parent and btn.parent.name == "form":
                    next_form = btn.parent
                    break
            if not next_form:
                break
            form_data = {}
            for inp in next_form.find_all("input"):
                name = inp.get("name")
                if name:
                    form_data[name] = inp.get("value", "")
            time.sleep(1)

    return results


def search_brave(query: str, pages: int = 1) -> list[dict]:
    """Brave Search HTML — good alternative, sometimes more results."""
    results = []

    for page in range(pages):
        offset = page * 10
        params = {"q": query, "offset": str(offset)}
        resp = requests.get("https://search.brave.com/search", params=params, headers=HEADERS, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        for item in soup.select('div[data-type="web"]'):
            # Title: dedicated title span, or first link text
            title_el = item.select_one(".title.search-snippet-title, .search-snippet-title")
            link_el = item.select_one("a[href^='http']")
            # Description/snippet
            snippet_el = item.select_one(".generic-snippet .content, .generic-snippet, .snippet-description")

            if not link_el:
                continue
            href = link_el.get("href", "")
            title = title_el.get_text(strip=True) if title_el else link_el.get_text(strip=True)
            if href.startswith("http") and title:
                results.append({
                    "title": title,
                    "url": href,
                    "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
                })

        if page < pages - 1:
            time.sleep(1)

    return results


def search_google(query: str, pages: int = 1) -> list[dict]:
    """Google HTML — often blocked by CAPTCHA. Use as fallback."""
    results = []

    for page in range(pages):
        start = page * 10
        params = {"q": query, "start": str(start), "hl": "en"}
        resp = requests.get("https://www.google.com/search", params=params, headers=HEADERS, timeout=15)
        resp.raise_for_status()

        if "sorry" in resp.url or "unusual traffic" in resp.text.lower():
            if not results:
                raise RuntimeError("Google blocked the request (CAPTCHA). Try --engine duckduckgo or brave.")
            break

        soup = BeautifulSoup(resp.text, "html.parser")
        for h3 in soup.find_all("h3"):
            parent_a = h3.find_parent("a")
            if parent_a and parent_a.get("href", "").startswith("http"):
                # Find snippet near the h3
                container = h3.find_parent("div", class_="g") or h3.parent
                snippet_el = container.select_one("div[data-sncf], div.VwiC3b, span.st") if container else None
                results.append({
                    "title": h3.get_text(strip=True),
                    "url": parent_a["href"],
                    "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
                })

        if page < pages - 1:
            time.sleep(1.5)

    return results


ENGINES = {
    "duckduckgo": search_duckduckgo,
    "ddg": search_duckduckgo,
    "brave": search_brave,
    "google": search_google,
}


def main():
    parser = argparse.ArgumentParser(description="Web search (multi-engine, no API key)")
    parser.add_argument("query", help="Search query")
    parser.add_argument("--pages", type=int, default=1, help="Number of result pages (default: 1)")
    parser.add_argument("--engine", choices=["duckduckgo", "ddg", "brave", "google"],
                        default="duckduckgo", help="Search engine (default: duckduckgo)")
    args = parser.parse_args()

    try:
        search_fn = ENGINES[args.engine]
        results = search_fn(args.query, args.pages)

        # Deduplicate by URL, keeping first occurrence
        seen = set()
        deduped = []
        for r in results:
            if r["url"] not in seen:
                seen.add(r["url"])
                deduped.append(r)

        print(json.dumps(deduped, indent=2, ensure_ascii=False))
    except Exception as e:
        print(json_error(f"Search failed: {str(e)}"))


if __name__ == "__main__":
    main()

153 scripts/read_page.py Normal file
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""Extract readable content from a web page using Playwright + Chromium.

Usage:
    python3 read_page.py <url> [--max-chars N] [--visible] [--format FMT] [--no-dismiss]

Flags:
    --max-chars N   Max characters to output (default: 50000)
    --visible       Show browser window (non-headless)
    --format FMT    Output format: json (default), markdown, text
    --no-dismiss    Skip cookie consent auto-dismiss

Outputs content in the requested format.
"""

import argparse
import json
import re

from playwright.sync_api import sync_playwright

EXTRACT_JS = """() => {
  const SKIP = new Set(['SCRIPT','STYLE','NOSCRIPT','IFRAME','SVG','NAV','FOOTER','HEADER','ASIDE']);
  const title = document.title || '';
  const mainEl = document.querySelector('article')
    || document.querySelector('main')
    || document.querySelector('[role="main"]')
    || document.querySelector('#content, .content, .post-content, .entry-content')
    || document.body;

  const lines = [];
  const walker = document.createTreeWalker(mainEl, NodeFilter.SHOW_ELEMENT, {
    acceptNode(node) {
      if (SKIP.has(node.tagName)) return NodeFilter.FILTER_REJECT;
      const tag = node.tagName.toLowerCase();
      if (['h1','h2','h3','h4','h5','h6','p','li','td','th','pre','blockquote'].includes(tag))
        return NodeFilter.FILTER_ACCEPT;
      return NodeFilter.FILTER_SKIP;
    }
  });
  let node;
  while (node = walker.nextNode()) {
    const text = node.innerText?.trim();
    if (!text) continue;
    const tag = node.tagName.toLowerCase();
    if (tag.startsWith('h')) lines.push('\\n' + '#'.repeat(parseInt(tag[1])) + ' ' + text + '\\n');
    else if (tag === 'li') lines.push('- ' + text);
    else if (tag === 'blockquote') lines.push('> ' + text);
    else lines.push(text);
  }
  let content = lines.join('\\n').trim();
  if (content.length < 200) content = mainEl.innerText || '';
  return { title, content };
}"""

COOKIE_DISMISS_JS = """() => {
  const selectors = [
    'button[id*="accept" i]', 'button[id*="consent" i]', 'button[id*="agree" i]',
    'button[class*="accept" i]', 'button[class*="consent" i]', 'button[class*="agree" i]',
    'a[id*="accept" i]', 'a[class*="accept" i]',
    '[data-testid*="accept" i]', '[data-testid*="consent" i]',
    '.cookie-banner button', '.cookie-notice button', '.cookie-popup button',
    '#cookie-banner button', '#cookie-notice button', '#cookie-popup button',
    '.cc-btn.cc-dismiss', '.cc-accept', '#onetrust-accept-btn-handler',
    '.js-cookie-consent-agree', '[aria-label*="accept" i][aria-label*="cookie" i]',
    '[aria-label*="Accept all" i]', '[aria-label*="Accept cookies" i]',
  ];
  for (const sel of selectors) {
    try {
      const el = document.querySelector(sel);
      if (el && el.offsetParent !== null) { el.click(); return { dismissed: true }; }
    } catch(e) {}
  }
  const patterns = [
    /^accept all$/i, /accept all cookies/i, /accept cookies/i, /accept & close/i,
    /^agree$/i, /agree and continue/i, /agree & continue/i,
    /consent and continue/i, /consent & continue/i,
    /got it/i, /i understand/i, /i agree/i,
    /allow all/i, /allow cookies/i, /allow all cookies/i,
    /^ok$/i, /^okay$/i, /^continue$/i, /^dismiss$/i,
    /accept and close/i, /accept and continue/i,
    /nur notwendige/i, /alle akzeptieren/i, /akzeptieren/i,
    /tout accepter/i, /accepter/i, /accepter et continuer/i,
  ];
  for (const btn of document.querySelectorAll('button, a[role="button"], [role="button"]')) {
    const text = btn.innerText?.trim();
    if (!text || text.length > 50) continue;
    for (const pat of patterns) {
      if (pat.test(text) && btn.offsetParent !== null) { btn.click(); return { dismissed: true }; }
    }
  }
  return { dismissed: false };
}"""


def format_output(result: dict, fmt: str) -> str:
    if fmt == "text":
        content = result.get("content", "")
        content = re.sub(r'^#+\s+', '', content, flags=re.MULTILINE)
        content = re.sub(r'^- ', ' ', content, flags=re.MULTILINE)
        content = re.sub(r'^> ', '', content, flags=re.MULTILINE)
        return content.strip()
    elif fmt == "markdown":
        return f"# {result.get('title', '')}\n\n{result.get('content', '')}"
    else:
        return json.dumps(result, indent=2, ensure_ascii=False)


def main():
    parser = argparse.ArgumentParser(description="Web page reader (Playwright + Chromium)")
    parser.add_argument("url", help="URL to read")
    parser.add_argument("--max-chars", type=int, default=50000, help="Max characters (default: 50000)")
    parser.add_argument("--visible", action="store_true", help="Run in visible (non-headless) mode")
    parser.add_argument("--format", choices=["json", "markdown", "text"], default="json", help="Output format")
    parser.add_argument("--no-dismiss", action="store_true", help="Skip cookie consent auto-dismiss")
    args = parser.parse_args()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=not args.visible)
        ctx = browser.new_context(
            user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            locale="en-US",
            viewport={"width": 1280, "height": 900},
        )
        page = ctx.new_page()
        page.goto(args.url, timeout=30000, wait_until="domcontentloaded")
        page.wait_for_timeout(1500)

        if not args.no_dismiss:
            # Try main frame first, then iframes (EU sites often use iframe consent)
            dismissed = page.evaluate(COOKIE_DISMISS_JS)
            if not dismissed.get("dismissed"):
                for frame in page.frames:
                    if frame == page.main_frame:
                        continue
                    try:
                        r = frame.evaluate(COOKIE_DISMISS_JS)
                        if r.get("dismissed"):
                            break
                    except Exception:
                        pass
            page.wait_for_timeout(500)

        result = page.evaluate(EXTRACT_JS)
        if len(result["content"]) > args.max_chars:
            result["content"] = result["content"][:args.max_chars] + "\n\n[...truncated]"

        print(format_output(result, args.format))
        browser.close()


if __name__ == "__main__":
    main()