Initial commit with translated description

README.md (new file, 227 lines)
# 🌐 Web Pilot — OpenClaw Skill

[Support on Ko-fi](https://ko-fi.com/liranudi)

A web search, page reading, and browser automation skill for [OpenClaw](https://github.com/openclaw/openclaw). No API keys required.
## ♿ Accessibility

This skill enables AI agents to **read, navigate, and interact with the web on behalf of users** — making it a powerful accessibility tool for people with visual impairments, motor disabilities, or cognitive challenges.

- **Screen reading on steroids** — extracts clean, structured text from any webpage, stripping away visual clutter, ads, and navigation noise
- **Voice-driven browsing** — when paired with an AI assistant, users can browse the web entirely through natural language ("scroll down", "click Sign In", "read me the Overview section")
- **Targeted content extraction** — grab specific sections, search for text, or screenshot regions without needing to visually scan a page
- **Form interaction** — fill inputs and submit forms via commands, removing the need for precise mouse/keyboard control
- **Cookie banner removal** — automatically dismisses consent popups that are notoriously difficult for screen readers
## Features

- **Web Search** — Multi-engine (DuckDuckGo, Brave, Google) with pagination
- **Page Reader** — Extract clean text from any URL with JS rendering
- **Persistent Browser** — Visible or headless browser with 20+ actions
- **Cookie Auto-Dismiss** — Automatically clears cookie consent banners
- **File Download** — Download files with auto-detection, PDF text extraction
- **Output Formats** — JSON, markdown, or plain text
- **Zero API Keys** — Everything runs locally
- **Partial Screenshots** — Capture viewport, full page, single elements, or ranges between two elements
## Requirements

- Python 3.8+
- `pip install requests beautifulsoup4 playwright Pillow`
- `playwright install chromium`
- Optional: `pip install pdfplumber` for PDF text extraction
## Installation

### As an OpenClaw Skill

```bash
cp -r web-pilot/ $(dirname $(which openclaw))/../lib/node_modules/openclaw/skills/web-pilot
```

### Standalone

```bash
git clone https://github.com/LiranUdi/web-pilot.git
cd web-pilot
pip install requests beautifulsoup4 playwright Pillow
playwright install chromium
```
## Usage

### 1. Search the Web

```bash
python3 scripts/google_search.py "search term" --pages 3 --engine brave
```

| Flag | Description | Default |
|------|-------------|---------|
| `--pages N` | Result pages (~10 results each) | 1 |
| `--engine` | `duckduckgo`, `brave`, or `google` | duckduckgo |

**Engine notes:**

- **duckduckgo** — Most reliable, no CAPTCHA
- **brave** — More results per page, broader sources
- **google** — Often blocked by CAPTCHA; last resort
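The search script prints its results as JSON on stdout: a list of `{title, url, snippet}` objects (the shape documented in SKILL.md). A minimal consumer sketch; the hard-coded record below is illustrative only, not real search output:

```python
import json

# Illustrative record standing in for real google_search.py stdout.
raw = '[{"title": "Example Domain", "url": "https://example.com", "snippet": "..."}]'

results = json.loads(raw)
# Pull out just the URLs, e.g. to feed into read_page.py or a browser session.
urls = [r["url"] for r in results]
```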
### 2. Read a Page

```bash
python3 scripts/read_page.py "https://example.com" --max-chars 10000 --format markdown
```

| Flag | Description | Default |
|------|-------------|---------|
| `--max-chars N` | Max characters to extract | 50000 |
| `--visible` | Show browser window | off |
| `--format` | `json`, `markdown`, or `text` | json |
| `--no-dismiss` | Skip cookie consent auto-dismiss | off |
### 3. Persistent Browser Session

The browser session is a long-running process that stays open between commands, enabling stateful multi-step browsing.

```bash
# Open a page (flags: --headless, --proxy <url>, --user-agent <string>)
python3 scripts/browser_session.py open "https://example.com"
python3 scripts/browser_session.py open "https://example.com" --headless --user-agent "MyBot/1.0"

# Check current state
python3 scripts/browser_session.py status

# Navigate (returns response status, final URL, load time)
python3 scripts/browser_session.py navigate "https://other-site.com"

# Extract content in different formats
python3 scripts/browser_session.py extract --format markdown

# Scroll
python3 scripts/browser_session.py scroll down
python3 scripts/browser_session.py scroll up
python3 scripts/browser_session.py scroll "#section-id"    # scroll to element

# Wait
python3 scripts/browser_session.py wait 2                  # wait 2 seconds
python3 scripts/browser_session.py wait ".loading-done"    # wait for element

# Fill forms
python3 scripts/browser_session.py fill "input[name=q]" "search term"
python3 scripts/browser_session.py fill "input[name=q]" "search term" --submit

# Navigation history
python3 scripts/browser_session.py back
python3 scripts/browser_session.py forward
python3 scripts/browser_session.py reload

# Execute JavaScript
python3 scripts/browser_session.py eval "document.title"

# Extract all links
python3 scripts/browser_session.py links

# Screenshots
python3 scripts/browser_session.py screenshot /tmp/page.png                                   # viewport
python3 scripts/browser_session.py screenshot /tmp/full.png --full                            # full page
python3 scripts/browser_session.py screenshot /tmp/el.png --element "h1"                      # single element
python3 scripts/browser_session.py screenshot /tmp/range.png --from "#Overview" --to "#end"   # range

# Export page as PDF (headless only)
python3 scripts/browser_session.py pdf /tmp/page.pdf

# Click elements
python3 scripts/browser_session.py click "Sign In"
python3 scripts/browser_session.py click "#submit-btn"

# Search for text in the page
python3 scripts/browser_session.py search "pricing"

# Tab management
python3 scripts/browser_session.py tab new "https://docs.example.com"
python3 scripts/browser_session.py tab list
python3 scripts/browser_session.py tab switch 0
python3 scripts/browser_session.py tab close 1

# Dismiss cookie banners
python3 scripts/browser_session.py dismiss-cookies

# Close
python3 scripts/browser_session.py close
```
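Range screenshots (`--from`/`--to`) are produced by cropping a full-page capture, and the script scales CSS-pixel coordinates into image pixels because Playwright renders at device-pixel resolution. A minimal sketch of that coordinate conversion (the helper name is hypothetical; the scale formula mirrors the one in `browser_session.py`):

```python
def css_y_to_image_px(y_css: float, image_width: int, page_width_css: int) -> int:
    # A 2x-DPR capture of a 1280px-wide page yields a 2560px-wide PNG,
    # so every CSS-pixel coordinate must be scaled by the width ratio.
    scale = image_width / page_width_css if page_width_css else 1
    return int(y_css * scale)

# Crop box for a section spanning CSS y=100..450 on a 2x-DPR capture:
top = css_y_to_image_px(100, 2560, 1280)
bottom = css_y_to_image_px(450, 2560, 1280)
```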
### 4. Download Files

```bash
python3 scripts/download_file.py "https://example.com/report.pdf" --output ~/docs
```

| Flag | Description | Default |
|------|-------------|---------|
| `--output DIR` | Save directory | /tmp/downloads |
| `--filename` | Override filename | auto-detected |

For PDFs, the JSON result includes `extracted_text` if `pdfplumber` or `PyPDF2` is installed.
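Filename auto-detection usually means preferring the name the server declares in a `Content-Disposition` header and falling back to the last segment of the URL path. A minimal sketch under that assumption (`guess_filename` is a hypothetical helper, not necessarily the script's actual function):

```python
import re
from urllib.parse import unquote, urlsplit

def guess_filename(url, content_disposition=None):
    # Hypothetical helper: prefer the server-declared Content-Disposition name.
    if content_disposition:
        m = re.search(r'filename="?([^";]+)"?', content_disposition)
        if m:
            return m.group(1)
    # Otherwise fall back to the last segment of the URL path.
    name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    return name or "download.bin"
```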
## Architecture

- **Search** — HTTP requests to DuckDuckGo/Brave/Google HTML endpoints
- **Page reading** — Playwright + Chromium with a read-only DOM TreeWalker
- **Browser sessions** — Unix socket server with 4-byte length-prefix framing; a forked child keeps the browser alive, so commands return immediately
- **Screenshots** — Range mode uses full-page capture + PIL crop for pixel-perfect section captures
- **Cookie dismiss** — Tries common selectors and button text patterns (Accept All, Got It, etc.)
- **Downloads** — Streams to disk with automatic filename detection from headers/URL
## Browser Session Reference

| Action | Description |
|--------|-------------|
| `open <url>` | Launch browser (flags: `--headless`, `--proxy`, `--user-agent`) |
| `navigate <url>` | Go to URL (returns status code, final URL, load time) |
| `extract` | Extract page content (`--format json\|markdown\|text`) |
| `screenshot <path>` | Capture (`--full`, `--element <sel>`, `--from <sel> --to <sel>`) |
| `click <target>` | Click by CSS selector, text, or button/link role |
| `scroll <dir\|sel>` | Scroll down/up or to a CSS selector |
| `wait <sec\|sel>` | Wait for seconds or for an element to appear |
| `fill <sel> <val>` | Fill input field (optional `--submit`) |
| `back` / `forward` / `reload` | Navigation history |
| `eval <js>` | Execute JavaScript, return result |
| `links` | Extract all links (href + text) |
| `search <text>` | Find text in page content |
| `pdf <path>` | Export as PDF (headless only) |
| `status` | Current URL, title, tab count |
| `tab new\|list\|switch\|close` | Multi-tab management |
| `dismiss-cookies` | Clear cookie consent banners |
| `close` | Shut down browser |
---

## For AI Agents (OpenClaw / LLM Integration)

### Workflow Pattern

1. **Search** → get URLs
2. **Read** or **Open** → extract content
3. **Scroll/Click/Navigate/Tab** → interact as needed
4. **Search** → find specific info in the page
5. **Screenshot** → capture visual state (viewport, element, or range)
6. **Download** → grab linked files
7. **Close** → clean up

### Important Notes

- All output defaults to **JSON on stdout**; use `--format` for alternatives
- `browser_session.py` is **stateful** — one session at a time, persisting between commands
- `read_page.py` is **stateless** — it opens and closes a browser on each call
- Cookie consent is **auto-dismissed** on open/navigate
- Always **close** browser sessions when done
- `Pillow` is required for range screenshots (`--from`/`--to`)
## Support

If this project is useful to you, consider [buying me a coffee](https://ko-fi.com/liranudi) ☕

## License

MIT
SKILL.md (new file, 63 lines)
---
name: web-pilot
description: "Search the web and read page content without API keys."
---
# Web Pilot

Four scripts, zero API keys. All output is JSON by default.

**Dependencies:** `requests`, `beautifulsoup4`, `playwright` (with Chromium).
**Optional:** `pdfplumber` or `PyPDF2` for PDF text extraction.

Install: `pip install requests beautifulsoup4 playwright && playwright install chromium`
## 1. Search the Web

```bash
python3 scripts/google_search.py "query" --pages N --engine ENGINE
```

- `--engine` — `duckduckgo` (default), `brave`, or `google`
- Returns `[{title, url, snippet}, ...]`
## 2. Read a Page (one-shot)

```bash
python3 scripts/read_page.py "https://url" [--max-chars N] [--visible] [--format json|markdown|text] [--no-dismiss]
```

- `--format` — `json` (default), `markdown`, or `text`
- Auto-dismisses cookie consent banners (skip with `--no-dismiss`)
## 3. Persistent Browser Session

```bash
python3 scripts/browser_session.py open "https://url"          # Open + extract
python3 scripts/browser_session.py navigate "https://other"    # Go to new URL
python3 scripts/browser_session.py extract [--format FMT]      # Re-read page
python3 scripts/browser_session.py screenshot [path] [--full]  # Save screenshot
python3 scripts/browser_session.py click "Submit"              # Click by text/selector
python3 scripts/browser_session.py search "keyword"            # Search text in page
python3 scripts/browser_session.py tab new "https://url"       # Open new tab
python3 scripts/browser_session.py tab list                    # List all tabs
python3 scripts/browser_session.py tab switch 1                # Switch to tab index
python3 scripts/browser_session.py tab close [index]           # Close tab
python3 scripts/browser_session.py dismiss-cookies             # Manually dismiss cookies
python3 scripts/browser_session.py close                       # Close browser
```

- Cookie consent auto-dismissed on open/navigate
- Multiple tabs supported — open, switch, close independently
- Search returns matching lines with line numbers
- Extract supports json/markdown/text output
## 4. Download Files

```bash
python3 scripts/download_file.py "https://example.com/doc.pdf" [--output DIR] [--filename NAME]
```

- Auto-detects filename from URL/headers
- PDFs: extracts text if `pdfplumber`/`PyPDF2` is installed
- Returns `{status, path, filename, size_bytes, content_type, extracted_text}`
_meta.json (new file, 6 lines)
{
  "ownerId": "kn72vgg7f9v52jr01p0yamfz1n81b8n5",
  "slug": "web-pilot",
  "version": "1.0.0",
  "publishedAt": 1771349856982
}
scripts/browser_session.py (new file, 775 lines)
#!/usr/bin/env python3
"""Persistent browser session that stays open until told to close.

Usage:
    python3 browser_session.py open <url>                 Open URL in visible browser, extract content
    python3 browser_session.py navigate <url>             Go to new URL, extract content
    python3 browser_session.py extract [--format FMT]     Re-extract content from current page
    python3 browser_session.py screenshot [path] [--full] Save screenshot
    python3 browser_session.py click <selector_or_text>   Click an element
    python3 browser_session.py search <text>              Search for text in page content
    python3 browser_session.py tab new <url>              Open URL in new tab
    python3 browser_session.py tab list                   List all open tabs
    python3 browser_session.py tab switch <index>         Switch to tab by index
    python3 browser_session.py tab close [index]          Close tab (current if no index)
    python3 browser_session.py close                      Close browser

Formats for extract: json (default), markdown, text
"""
import json
import os
import re
import signal
import socket
import struct
import sys
import time

SOCKET_PATH = "/tmp/web-pilot-browser.sock"
PID_FILE = "/tmp/web-pilot-browser.pid"
EXTRACT_JS = """() => {
  const SKIP = new Set(['SCRIPT','STYLE','NOSCRIPT','IFRAME','SVG','NAV','FOOTER','HEADER','ASIDE']);
  const title = document.title || '';
  const mainEl = document.querySelector('article')
    || document.querySelector('main')
    || document.querySelector('[role="main"]')
    || document.querySelector('#content, .content, .post-content, .entry-content')
    || document.body;

  const lines = [];
  const walker = document.createTreeWalker(mainEl, NodeFilter.SHOW_ELEMENT, {
    acceptNode(node) {
      if (SKIP.has(node.tagName)) return NodeFilter.FILTER_REJECT;
      const tag = node.tagName.toLowerCase();
      if (['h1','h2','h3','h4','h5','h6','p','li','td','th','pre','blockquote'].includes(tag))
        return NodeFilter.FILTER_ACCEPT;
      return NodeFilter.FILTER_SKIP;
    }
  });
  let node;
  while (node = walker.nextNode()) {
    const text = node.innerText?.trim();
    if (!text) continue;
    const tag = node.tagName.toLowerCase();
    if (tag.startsWith('h')) lines.push('\\n' + '#'.repeat(parseInt(tag[1])) + ' ' + text + '\\n');
    else if (tag === 'li') lines.push('- ' + text);
    else if (tag === 'blockquote') lines.push('> ' + text);
    else lines.push(text);
  }
  let content = lines.join('\\n').trim();
  if (content.length < 200) content = mainEl.innerText || '';
  return { title, content };
}"""
# Common cookie consent selectors and text patterns
COOKIE_DISMISS_JS = """() => {
  const selectors = [
    'button[id*="accept" i]', 'button[id*="consent" i]', 'button[id*="agree" i]',
    'button[class*="accept" i]', 'button[class*="consent" i]', 'button[class*="agree" i]',
    'a[id*="accept" i]', 'a[class*="accept" i]',
    '[data-testid*="accept" i]', '[data-testid*="consent" i]',
    '.cookie-banner button', '.cookie-notice button', '.cookie-popup button',
    '#cookie-banner button', '#cookie-notice button', '#cookie-popup button',
    '.cc-btn.cc-dismiss', '.cc-accept', '#onetrust-accept-btn-handler',
    '.js-cookie-consent-agree', '[aria-label*="accept" i][aria-label*="cookie" i]',
    '[aria-label*="Accept all" i]', '[aria-label*="Accept cookies" i]',
  ];

  // Try selectors first
  for (const sel of selectors) {
    try {
      const el = document.querySelector(sel);
      if (el && el.offsetParent !== null) { el.click(); return { dismissed: true, method: 'selector', selector: sel }; }
    } catch(e) {}
  }

  // Try matching button text
  const patterns = [
    /^accept all$/i, /accept all cookies/i, /accept cookies/i, /accept & close/i,
    /^agree$/i, /agree and continue/i, /agree & continue/i,
    /consent and continue/i, /consent & continue/i,
    /got it/i, /i understand/i, /i agree/i,
    /allow all/i, /allow cookies/i, /allow all cookies/i,
    /^ok$/i, /^okay$/i, /^continue$/i, /^dismiss$/i,
    /accept and close/i, /accept and continue/i,
    /nur notwendige/i, /alle akzeptieren/i, /akzeptieren/i,
    /tout accepter/i, /accepter/i, /accepter et continuer/i,
  ];
  for (const btn of document.querySelectorAll('button, a[role="button"], [role="button"]')) {
    const text = btn.innerText?.trim();
    if (!text || text.length > 50) continue;
    for (const pat of patterns) {
      if (pat.test(text) && btn.offsetParent !== null) {
        btn.click();
        return { dismissed: true, method: 'text', matched: text };
      }
    }
  }

  return { dismissed: false };
}"""
def format_output(result: dict, fmt: str) -> str:
    """Format extraction result based on requested format."""
    if fmt == "text":
        # Strip markdown-ish formatting
        content = result.get("content", "")
        content = re.sub(r'^#+\s+', '', content, flags=re.MULTILINE)
        content = re.sub(r'^- ', '  ', content, flags=re.MULTILINE)
        content = re.sub(r'^> ', '', content, flags=re.MULTILINE)
        return content.strip()
    elif fmt == "markdown":
        return f"# {result.get('title', '')}\n\n{result.get('content', '')}"
    else:  # json
        return json.dumps(result, indent=2, ensure_ascii=False)


def dismiss_cookies(page):
    """Try to dismiss cookie consent in main frame and all iframes."""
    result = page.evaluate(COOKIE_DISMISS_JS)
    if result.get("dismissed"):
        page.wait_for_timeout(500)
        return result
    # Check iframes (many EU sites put consent in an iframe)
    for frame in page.frames:
        if frame == page.main_frame:
            continue
        try:
            result = frame.evaluate(COOKIE_DISMISS_JS)
            if result.get("dismissed"):
                page.wait_for_timeout(500)
                return result
        except Exception:
            pass
    return {"dismissed": False}
def run_server(url: str, headless: bool = False, proxy: str = None, user_agent: str = None):
    from playwright.sync_api import sync_playwright

    if os.path.exists(SOCKET_PATH):
        os.remove(SOCKET_PATH)

    pw = sync_playwright().start()
    launch_opts = {"headless": headless}
    if proxy:
        launch_opts["proxy"] = {"server": proxy}
    browser = pw.chromium.launch(**launch_opts)
    ua = user_agent or "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ctx = browser.new_context(
        user_agent=ua,
        locale="en-US",
        viewport={"width": 1280, "height": 900},
    )

    # Track pages (tabs)
    pages = [ctx.new_page()]
    active_idx = 0

    def active_page():
        return pages[active_idx]

    active_page().goto(url, timeout=30000, wait_until="domcontentloaded")
    active_page().wait_for_timeout(1500)

    # Auto-dismiss cookie consent on first load (main frame + iframes)
    dismiss_cookies(active_page())

    result = active_page().evaluate(EXTRACT_JS)
    with open("/tmp/web-pilot-initial.json", "w") as f:
        json.dump(result, f, ensure_ascii=False)

    with open(PID_FILE, "w") as f:
        f.write(str(os.getpid()))

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(SOCKET_PATH)
    sock.listen(1)
    sock.settimeout(1.0)

    running = True
    while running:
        try:
            conn, _ = sock.accept()
            raw = _recv_msg(conn)
            cmd = json.loads(raw.decode())
            action = cmd.get("action")

            if action == "close":
                _send_msg(conn, json.dumps({"status": "closing"}).encode())
                conn.close()
                running = False
            elif action == "navigate":
                t0 = time.time()
                response = None
                try:
                    response = active_page().goto(cmd["url"], timeout=30000, wait_until="domcontentloaded")
                except Exception:
                    # Playwright throws on HTTP error codes (4xx/5xx) — still extract what we can
                    pass
                active_page().wait_for_timeout(1500)
                load_time = round(time.time() - t0, 3)
                dismiss_cookies(active_page())
                result = active_page().evaluate(EXTRACT_JS)
                result["response_status"] = response.status if response else None
                result["final_url"] = active_page().url
                result["load_time_s"] = load_time
                mc = cmd.get("max_chars")
                if mc and len(result["content"]) > mc:
                    result["content"] = result["content"][:mc] + "\n\n[...truncated]"
                _send_msg(conn, json.dumps(result, ensure_ascii=False).encode())
                conn.close()

            elif action == "extract":
                result = active_page().evaluate(EXTRACT_JS)
                mc = cmd.get("max_chars")
                if mc and len(result["content"]) > mc:
                    result["content"] = result["content"][:mc] + "\n\n[...truncated]"
                fmt = cmd.get("format", "json")
                output = format_output(result, fmt) if fmt != "json" else json.dumps(result, ensure_ascii=False)
                _send_msg(conn, output.encode())
                conn.close()
            elif action == "screenshot":
                path = cmd.get("path", "/tmp/screenshot.png")
                full_page = cmd.get("full_page", False)
                element_sel = cmd.get("element")
                from_sel = cmd.get("from_sel")
                to_sel = cmd.get("to_sel")

                if element_sel:
                    # Screenshot a single element
                    el = active_page().query_selector(element_sel)
                    if el:
                        el.screenshot(path=path)
                        _send_msg(conn, json.dumps({
                            "status": "saved", "path": path, "mode": "element",
                            "selector": element_sel,
                            "url": active_page().url, "title": active_page().title(),
                            "tab": active_idx,
                        }).encode())
                    else:
                        _send_msg(conn, json.dumps({
                            "error": f"Element not found: {element_sel}"
                        }).encode())
                    conn.close()
                elif from_sel and to_sel:
                    # Screenshot a range between two elements using full-page screenshot + crop
                    bounds = active_page().evaluate("""([fromSel, toSel]) => {
                        const elFrom = document.querySelector(fromSel);
                        const elTo = document.querySelector(toSel);
                        if (!elFrom || !elTo) return null;
                        const r1 = elFrom.getBoundingClientRect();
                        const r2 = elTo.getBoundingClientRect();
                        return {
                            y: r1.top + window.scrollY,
                            y2: r2.bottom + window.scrollY,
                            pageWidth: document.documentElement.scrollWidth
                        };
                    }""", [from_sel, to_sel])
                    if bounds:
                        import tempfile
                        # Take full-page screenshot to a temp file
                        tmp = tempfile.mktemp(suffix=".png")
                        active_page().screenshot(path=tmp, full_page=True)
                        # Crop using PIL
                        try:
                            from PIL import Image
                            im = Image.open(tmp)
                            # Playwright full_page screenshots use device pixel ratio
                            scale = im.width / bounds["pageWidth"] if bounds["pageWidth"] else 1
                            top = int(bounds["y"] * scale)
                            bottom = int(bounds["y2"] * scale)
                            cropped = im.crop((0, top, im.width, bottom))
                            cropped.save(path)
                            os.remove(tmp)
                            _send_msg(conn, json.dumps({
                                "status": "saved", "path": path, "mode": "range",
                                "from": from_sel, "to": to_sel,
                                "url": active_page().url, "title": active_page().title(),
                                "tab": active_idx,
                            }).encode())
                        except Exception as e:
                            try:
                                os.remove(tmp)
                            except OSError:
                                pass
                            _send_msg(conn, json.dumps({"error": f"Crop failed: {str(e)}"}).encode())
                    else:
                        _send_msg(conn, json.dumps({"error": f"One or both selectors not found: {from_sel}, {to_sel}"}).encode())
                    conn.close()
                else:
                    active_page().screenshot(path=path, full_page=full_page)
                    _send_msg(conn, json.dumps({
                        "status": "saved", "path": path, "mode": "full_page" if full_page else "viewport",
                        "url": active_page().url, "title": active_page().title(),
                        "tab": active_idx,
                    }).encode())
                    conn.close()
            elif action == "click":
                target = cmd.get("target", "")
                clicked = False
                try:
                    el = active_page().query_selector(target)
                    if el:
                        el.click()
                        clicked = True
                except Exception:
                    pass
                if not clicked:
                    try:
                        active_page().get_by_text(target, exact=False).first.click()
                        clicked = True
                    except Exception:
                        pass
                if not clicked:
                    try:
                        active_page().get_by_role("button", name=target).or_(
                            active_page().get_by_role("link", name=target)
                        ).first.click()
                        clicked = True
                    except Exception:
                        pass
                active_page().wait_for_timeout(1000)
                result = {"status": "clicked" if clicked else "not_found", "target": target, "url": active_page().url}
                _send_msg(conn, json.dumps(result, ensure_ascii=False).encode())
                conn.close()

            elif action == "dismiss_cookies":
                result = dismiss_cookies(active_page())
                _send_msg(conn, json.dumps(result, ensure_ascii=False).encode())
                conn.close()
            elif action == "search":
                query = cmd.get("query", "").lower()
                result = active_page().evaluate(EXTRACT_JS)
                content = result.get("content", "")
                lines = content.split("\n")
                matches = []
                for i, line in enumerate(lines):
                    if query in line.lower():
                        matches.append({"line": i + 1, "text": line.strip()})
                _send_msg(conn, json.dumps({
                    "query": query,
                    "matches": len(matches),
                    "results": matches[:50],  # cap at 50
                    "url": active_page().url,
                }, ensure_ascii=False).encode())
                conn.close()
            elif action == "tab_new":
                new_page = ctx.new_page()
                pages.append(new_page)
                active_idx = len(pages) - 1
                new_page.goto(cmd["url"], timeout=30000, wait_until="domcontentloaded")
                new_page.wait_for_timeout(1500)
                dismiss_cookies(new_page)
                result = new_page.evaluate(EXTRACT_JS)
                result["tab"] = active_idx
                result["total_tabs"] = len(pages)
                _send_msg(conn, json.dumps(result, ensure_ascii=False).encode())
                conn.close()

            elif action == "tab_list":
                tab_info = []
                for i, p in enumerate(pages):
                    try:
                        tab_info.append({
                            "index": i,
                            "title": p.title(),
                            "url": p.url,
                            "active": i == active_idx,
                        })
                    except Exception:
                        tab_info.append({"index": i, "title": "(closed)", "url": "", "active": i == active_idx})
                _send_msg(conn, json.dumps({"tabs": tab_info, "active": active_idx}, ensure_ascii=False).encode())
                conn.close()
            elif action == "tab_switch":
                idx = cmd.get("index", 0)
                if 0 <= idx < len(pages):
                    active_idx = idx
                    pages[active_idx].bring_to_front()
                    _send_msg(conn, json.dumps({
                        "status": "switched", "tab": active_idx,
                        "title": pages[active_idx].title(),
                        "url": pages[active_idx].url,
                    }, ensure_ascii=False).encode())
                else:
                    _send_msg(conn, json.dumps({"error": f"Invalid tab index {idx}. Have {len(pages)} tabs."}).encode())
                conn.close()

            elif action == "tab_close":
                idx = cmd.get("index", active_idx)
                if len(pages) <= 1:
                    _send_msg(conn, json.dumps({"error": "Cannot close the last tab. Use 'close' to close the browser."}).encode())
                elif 0 <= idx < len(pages):
                    pages[idx].close()
                    pages.pop(idx)
                    if active_idx >= len(pages):
                        active_idx = len(pages) - 1
                    elif active_idx > idx:
                        active_idx -= 1
                    pages[active_idx].bring_to_front()
                    _send_msg(conn, json.dumps({
                        "status": "tab_closed", "closed_index": idx,
                        "active": active_idx, "total_tabs": len(pages),
                    }, ensure_ascii=False).encode())
                else:
                    _send_msg(conn, json.dumps({"error": f"Invalid tab index {idx}"}).encode())
                conn.close()
elif action == "scroll":
|
||||||
|
direction = cmd.get("direction", "down")
|
||||||
|
if direction == "down":
|
||||||
|
active_page().evaluate("window.scrollBy(0, window.innerHeight)")
|
||||||
|
elif direction == "up":
|
||||||
|
active_page().evaluate("window.scrollBy(0, -window.innerHeight)")
|
||||||
|
else:
|
||||||
|
# Treat as CSS selector
|
||||||
|
active_page().evaluate(f"document.querySelector({json.dumps(direction)})?.scrollIntoView({{behavior:'smooth',block:'center'}})")
|
||||||
|
active_page().wait_for_timeout(300)
|
||||||
|
_send_msg(conn, json.dumps({"status": "scrolled", "direction": direction, "url": active_page().url}).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
elif action == "wait":
|
||||||
|
target = cmd.get("target", "1")
|
||||||
|
try:
|
||||||
|
seconds = float(target)
|
||||||
|
active_page().wait_for_timeout(int(seconds * 1000))
|
||||||
|
_send_msg(conn, json.dumps({"status": "waited", "seconds": seconds}).encode())
|
||||||
|
except ValueError:
|
||||||
|
# CSS selector
|
||||||
|
try:
|
||||||
|
active_page().wait_for_selector(target, timeout=30000)
|
||||||
|
_send_msg(conn, json.dumps({"status": "found", "selector": target}).encode())
|
||||||
|
except Exception as e:
|
||||||
|
_send_msg(conn, json.dumps({"status": "timeout", "selector": target, "error": str(e)}).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
elif action == "fill":
|
||||||
|
selector = cmd.get("selector", "")
|
||||||
|
value = cmd.get("value", "")
|
||||||
|
submit = cmd.get("submit", False)
|
||||||
|
try:
|
||||||
|
active_page().fill(selector, value)
|
||||||
|
if submit:
|
||||||
|
active_page().press(selector, "Enter")
|
||||||
|
active_page().wait_for_timeout(1000)
|
||||||
|
_send_msg(conn, json.dumps({"status": "filled", "selector": selector, "submitted": submit, "url": active_page().url}).encode())
|
||||||
|
except Exception as e:
|
||||||
|
_send_msg(conn, json.dumps({"error": str(e)}).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
elif action in ("back", "forward", "reload"):
|
||||||
|
if action == "back":
|
||||||
|
active_page().go_back(timeout=30000, wait_until="domcontentloaded")
|
||||||
|
elif action == "forward":
|
||||||
|
active_page().go_forward(timeout=30000, wait_until="domcontentloaded")
|
||||||
|
else:
|
||||||
|
active_page().reload(timeout=30000, wait_until="domcontentloaded")
|
||||||
|
active_page().wait_for_timeout(500)
|
||||||
|
_send_msg(conn, json.dumps({"status": action, "url": active_page().url, "title": active_page().title()}).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
elif action == "eval":
|
||||||
|
js_code = cmd.get("code", "")
|
||||||
|
try:
|
||||||
|
result = active_page().evaluate(js_code)
|
||||||
|
_send_msg(conn, json.dumps({"status": "ok", "result": result}, ensure_ascii=False, default=str).encode())
|
||||||
|
except Exception as e:
|
||||||
|
_send_msg(conn, json.dumps({"status": "error", "error": str(e)}).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
elif action == "links":
|
||||||
|
links_js = """() => {
|
||||||
|
return Array.from(document.querySelectorAll('a[href]')).map(a => ({
|
||||||
|
href: a.href, text: (a.innerText || '').trim().substring(0, 200)
|
||||||
|
})).filter(l => l.href && !l.href.startsWith('javascript:'))
|
||||||
|
}"""
|
||||||
|
result = active_page().evaluate(links_js)
|
||||||
|
_send_msg(conn, json.dumps({"links": result, "count": len(result), "url": active_page().url}, ensure_ascii=False).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
elif action == "pdf":
|
||||||
|
path = cmd.get("path", "/tmp/page.pdf")
|
||||||
|
try:
|
||||||
|
active_page().pdf(path=path)
|
||||||
|
_send_msg(conn, json.dumps({"status": "saved", "path": path}).encode())
|
||||||
|
except Exception as e:
|
||||||
|
_send_msg(conn, json.dumps({"error": str(e)}).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
elif action == "status":
|
||||||
|
_send_msg(conn, json.dumps({
|
||||||
|
"url": active_page().url,
|
||||||
|
"title": active_page().title(),
|
||||||
|
"active_tab": active_idx,
|
||||||
|
"total_tabs": len(pages),
|
||||||
|
}).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
else:
|
||||||
|
_send_msg(conn, json.dumps({"error": f"unknown action: {action}"}).encode())
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
except socket.timeout:
|
||||||
|
continue
|
||||||
|
except Exception as e:
|
||||||
|
try:
|
||||||
|
_send_msg(conn, json.dumps({"error": str(e)}).encode())
|
||||||
|
conn.close()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
sock.close()
|
||||||
|
for f in [SOCKET_PATH, PID_FILE]:
|
||||||
|
if os.path.exists(f):
|
||||||
|
os.remove(f)
|
||||||
|
browser.close()
|
||||||
|
pw.stop()
|
||||||
|
|
||||||
|
|
||||||
|
def _recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("Socket closed while reading")
        buf += chunk
    return buf


def _send_msg(sock, data: bytes):
    """Send a length-prefixed message."""
    sock.sendall(struct.pack('>I', len(data)) + data)


def _recv_msg(sock) -> bytes:
    """Receive a length-prefixed message."""
    header = _recv_exact(sock, 4)
    length = struct.unpack('>I', header)[0]
    return _recv_exact(sock, length)
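These helpers implement a simple length-prefixed framing protocol: a 4-byte big-endian length header, then the payload. A minimal sketch of the round trip over a `socketpair`, using the same `struct` format (helper names here mirror the script's but are standalone):

```python
import json
import socket
import struct

def send_msg(sock, data: bytes):
    # 4-byte big-endian length prefix, then the payload
    sock.sendall(struct.pack('>I', len(data)) + data)

def recv_msg(sock) -> bytes:
    # Read the 4-byte header, then exactly that many payload bytes
    def recv_exact(n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("Socket closed while reading")
            buf += chunk
        return buf
    length = struct.unpack('>I', recv_exact(4))[0]
    return recv_exact(length)

a, b = socket.socketpair()
send_msg(a, json.dumps({"action": "status"}).encode())
cmd = json.loads(recv_msg(b).decode())
print(cmd["action"])  # -> status
```

The explicit framing is what lets `send_command` read one complete JSON reply per request, regardless of how the kernel splits the stream into `recv` chunks.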


def send_command(cmd: dict) -> str:
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(60)
    sock.connect(SOCKET_PATH)
    _send_msg(sock, json.dumps(cmd).encode())
    result = _recv_msg(sock)
    sock.close()
    return result.decode("utf-8", errors="replace")


def main():
    if len(sys.argv) < 2:
        print("Usage: browser_session.py <open|navigate|extract|screenshot|click|search|tab|close> [args]")
        sys.exit(1)

    action = sys.argv[1]

    if action == "open":
        headless = "--headless" in sys.argv
        # Parse --proxy and --user-agent
        proxy = None
        user_agent = None
        i = 2
        while i < len(sys.argv):
            if sys.argv[i] == "--proxy" and i + 1 < len(sys.argv):
                proxy = sys.argv[i + 1]
                i += 2
            elif sys.argv[i] == "--user-agent" and i + 1 < len(sys.argv):
                user_agent = sys.argv[i + 1]
                i += 2
            else:
                i += 1
        args = [a for a in sys.argv[2:] if not a.startswith("--") and a != proxy and a != user_agent]
        if not args:
            print("Usage: browser_session.py open <url> [--headless] [--proxy <url>] [--user-agent <string>]")
            sys.exit(1)
        url = args[0]

        # Stale PID/socket cleanup
        if os.path.exists(SOCKET_PATH):
            stale = True
            if os.path.exists(PID_FILE):
                try:
                    old_pid = int(open(PID_FILE).read().strip())
                    os.kill(old_pid, 0)  # check if alive
                    stale = False
                except (OSError, ValueError):
                    pass
            if not stale:
                print(json.dumps({"error": "Browser session already open. Use 'navigate', 'extract', or 'close'."}))
                sys.exit(1)
            # Clean up stale files
            try:
                os.remove(SOCKET_PATH)
            except OSError:
                pass
            try:
                os.remove(PID_FILE)
            except OSError:
                pass

        pid = os.fork()
        if pid == 0:
            os.setsid()
            sys.stdout = open(os.devnull, "w")
            sys.stderr = open(os.devnull, "w")
            run_server(url, headless=headless, proxy=proxy, user_agent=user_agent)
            sys.exit(0)
        else:
            for _ in range(30):
                if os.path.exists("/tmp/web-pilot-initial.json"):
                    time.sleep(0.2)
                    with open("/tmp/web-pilot-initial.json") as f:
                        result = json.load(f)
                    os.remove("/tmp/web-pilot-initial.json")
                    result["status"] = "browser open"
                    result["note"] = "Commands: navigate, extract, screenshot, click, search, tab, close"
                    print(json.dumps(result, indent=2, ensure_ascii=False))
                    sys.exit(0)
                time.sleep(0.5)
            print(json.dumps({"error": "Timeout waiting for browser to start"}))
            sys.exit(1)

    elif action == "navigate":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py navigate <url>")
            sys.exit(1)
        print(send_command({"action": "navigate", "url": sys.argv[2], "max_chars": 50000}))

    elif action == "extract":
        fmt = "json"
        if "--format" in sys.argv:
            idx = sys.argv.index("--format")
            if idx + 1 < len(sys.argv):
                fmt = sys.argv[idx + 1]
        print(send_command({"action": "extract", "max_chars": 50000, "format": fmt}))

    elif action == "screenshot":
        path = "/tmp/screenshot.png"
        full_page = "--full" in sys.argv
        element_sel = None
        from_sel = None
        to_sel = None
        # Parse flags
        args = sys.argv[2:]
        i = 0
        positional = []
        while i < len(args):
            if args[i] == "--element" and i + 1 < len(args):
                element_sel = args[i + 1]
                i += 2
            elif args[i] == "--from" and i + 1 < len(args):
                from_sel = args[i + 1]
                i += 2
            elif args[i] == "--to" and i + 1 < len(args):
                to_sel = args[i + 1]
                i += 2
            elif args[i] == "--full":
                i += 1
            elif not args[i].startswith("--"):
                positional.append(args[i])
                i += 1
            else:
                i += 1
        if positional:
            path = positional[0]
        cmd = {"action": "screenshot", "path": path, "full_page": full_page}
        if element_sel:
            cmd["element"] = element_sel
        if from_sel:
            cmd["from_sel"] = from_sel
        if to_sel:
            cmd["to_sel"] = to_sel
        print(send_command(cmd))

    elif action == "click":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py click <selector_or_text>")
            sys.exit(1)
        target = " ".join(a for a in sys.argv[2:] if not a.startswith("--"))
        print(send_command({"action": "click", "target": target}))

    elif action == "search":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py search <text>")
            sys.exit(1)
        query = " ".join(sys.argv[2:])
        print(send_command({"action": "search", "query": query}))

    elif action == "tab":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py tab <new|list|switch|close> [args]")
            sys.exit(1)
        sub = sys.argv[2]
        if sub == "new":
            if len(sys.argv) < 4:
                print("Usage: browser_session.py tab new <url>")
                sys.exit(1)
            print(send_command({"action": "tab_new", "url": sys.argv[3]}))
        elif sub == "list":
            print(send_command({"action": "tab_list"}))
        elif sub == "switch":
            if len(sys.argv) < 4:
                print("Usage: browser_session.py tab switch <index>")
                sys.exit(1)
            print(send_command({"action": "tab_switch", "index": int(sys.argv[3])}))
        elif sub == "close":
            idx = int(sys.argv[3]) if len(sys.argv) > 3 else -1
            cmd = {"action": "tab_close"}
            if idx >= 0:
                cmd["index"] = idx
            print(send_command(cmd))
        else:
            print(f"Unknown tab command: {sub}")
            sys.exit(1)

    elif action == "dismiss-cookies":
        print(send_command({"action": "dismiss_cookies"}))

    elif action == "scroll":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py scroll down|up|<selector>")
            sys.exit(1)
        print(send_command({"action": "scroll", "direction": sys.argv[2]}))

    elif action == "wait":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py wait <seconds_or_selector>")
            sys.exit(1)
        print(send_command({"action": "wait", "target": sys.argv[2]}))

    elif action == "fill":
        if len(sys.argv) < 4:
            print("Usage: browser_session.py fill <selector> <value> [--submit]")
            sys.exit(1)
        submit = "--submit" in sys.argv
        print(send_command({"action": "fill", "selector": sys.argv[2], "value": sys.argv[3], "submit": submit}))

    elif action in ("back", "forward", "reload"):
        print(send_command({"action": action}))

    elif action == "eval":
        if len(sys.argv) < 3:
            print("Usage: browser_session.py eval \"<js_code>\"")
            sys.exit(1)
        print(send_command({"action": "eval", "code": " ".join(sys.argv[2:])}))

    elif action == "links":
        print(send_command({"action": "links"}))

    elif action == "pdf":
        path = sys.argv[2] if len(sys.argv) > 2 else "/tmp/page.pdf"
        print(send_command({"action": "pdf", "path": path}))

    elif action == "status":
        print(send_command({"action": "status"}))

    elif action == "close":
        print(send_command({"action": "close"}))

    else:
        print(f"Unknown action: {action}")
        sys.exit(1)


if __name__ == "__main__":
    main()
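The stale-session check in `open` relies on `os.kill(pid, 0)`: signal 0 delivers nothing, but raises if no process with that PID exists. A standalone sketch of the same liveness probe (the `pid_alive` helper name is mine; the script inlines this logic and catches `OSError`/`ValueError` together):

```python
import os

def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user
        return True

print(pid_alive(os.getpid()))  # -> True
```

This is why a leftover socket file alone does not block a new session: the PID file is consulted first, and the socket is treated as stale if its owner is gone.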
190
scripts/download_file.py
Normal file
@@ -0,0 +1,190 @@
#!/usr/bin/env python3
"""Download files from URLs. Handles PDFs, images, documents, and any binary content.

Usage:
    python3 download_file.py <url> [--output DIR] [--filename NAME]

Flags:
    --output DIR     Directory to save to (default: /tmp/downloads)
    --filename NAME  Override filename (auto-detected from URL/headers if omitted)

Outputs JSON {status, path, filename, size_bytes, content_type}.
Detects file type from the Content-Type header and URL. For PDFs, also extracts
text if possible (requires pdfplumber or falls back to basic extraction).
"""

import argparse
import json
import os
import re
import sys
import urllib.parse

import requests


def json_error(message: str) -> str:
    """Return a standardized JSON error payload."""
    return json.dumps({"error": message}, indent=2, ensure_ascii=False)


HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}

# File types we handle specially
TEXT_EXTRACTABLE = {
    "application/pdf": "pdf",
}


def guess_filename(url: str, resp: requests.Response) -> str:
    """Determine filename from Content-Disposition, URL, or Content-Type."""
    # Check the Content-Disposition header
    cd = resp.headers.get("Content-Disposition", "")
    if "filename=" in cd:
        match = re.search(r'filename[*]?=["\']?([^"\';]+)', cd)
        if match:
            return match.group(1).strip()

    # Extract from the URL path
    parsed = urllib.parse.urlparse(url)
    path_name = os.path.basename(parsed.path)
    if path_name and "." in path_name:
        return urllib.parse.unquote(path_name)

    # Fall back to the content type
    ct = resp.headers.get("Content-Type", "")
    ext_map = {
        "application/pdf": "download.pdf",
        "image/png": "download.png",
        "image/jpeg": "download.jpg",
        "image/gif": "download.gif",
        "image/webp": "download.webp",
        "application/zip": "download.zip",
        "text/html": "download.html",
        "text/plain": "download.txt",
        "application/json": "download.json",
    }
    for mime, name in ext_map.items():
        if mime in ct:
            return name

    return "download.bin"
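`guess_filename` falls through three sources in order: `Content-Disposition`, the URL path, then `Content-Type`. The first two branches can be exercised without a live response; a sketch using the same regex and URL handling (the `filename_from_headers` wrapper is mine, for illustration only):

```python
import os
import re
import urllib.parse

def filename_from_headers(cd: str, url: str) -> str:
    # Same Content-Disposition pattern as guess_filename above
    match = re.search(r'filename[*]?=["\']?([^"\';]+)', cd)
    if match:
        return match.group(1).strip()
    # Fall back to the last (percent-decoded) URL path segment
    return urllib.parse.unquote(os.path.basename(urllib.parse.urlparse(url).path))

print(filename_from_headers('attachment; filename="report.pdf"', ""))
# -> report.pdf
print(filename_from_headers("", "https://example.com/docs/paper%202.pdf"))
# -> paper 2.pdf
```

Note the regex stops at quotes and semicolons, so quoted filenames come out clean; RFC 5987 `filename*=` values with charset prefixes are matched only loosely.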


def extract_pdf_text(filepath: str) -> str:
    """Try to extract text from a PDF. Returns an empty string on failure."""
    # Try pdfplumber first
    try:
        import pdfplumber
        text_parts = []
        with pdfplumber.open(filepath) as pdf:
            for page in pdf.pages:
                t = page.extract_text()
                if t:
                    text_parts.append(t)
        return "\n\n".join(text_parts)
    except ImportError:
        pass

    # Try PyPDF2
    try:
        from PyPDF2 import PdfReader
        reader = PdfReader(filepath)
        text_parts = []
        for page in reader.pages:
            t = page.extract_text()
            if t:
                text_parts.append(t)
        return "\n\n".join(text_parts)
    except ImportError:
        pass

    return ""


def download(url: str, output_dir: str = "/tmp/downloads", filename: str = None,
             proxy: str = None, user_agent: str = None) -> dict:
    os.makedirs(output_dir, exist_ok=True)

    headers = HEADERS.copy()
    if user_agent:
        headers["User-Agent"] = user_agent

    proxies = {}
    if proxy:
        proxies = {"http": proxy, "https": proxy}

    try:
        resp = requests.get(url, headers=headers, timeout=30, stream=True,
                            allow_redirects=True, proxies=proxies)
    except requests.exceptions.SSLError:
        # Retry without SSL verification if certs are broken
        resp = requests.get(url, headers=headers, timeout=30, stream=True,
                            allow_redirects=True, proxies=proxies, verify=False)
    resp.raise_for_status()

    if not filename:
        filename = guess_filename(url, resp)

    filepath = os.path.join(output_dir, filename)

    # Avoid overwriting: add a numeric suffix if the file exists
    base, ext = os.path.splitext(filepath)
    counter = 1
    while os.path.exists(filepath):
        filepath = f"{base}_{counter}{ext}"
        counter += 1

    # Stream to disk
    total = 0
    with open(filepath, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
            total += len(chunk)

    content_type = resp.headers.get("Content-Type", "unknown")
    result = {
        "status": "downloaded",
        "path": filepath,
        "filename": os.path.basename(filepath),
        "size_bytes": total,
        "content_type": content_type,
        "url": url,
    }

    # Add the final URL if redirected
    if resp.url != url:
        result["redirect_url"] = resp.url

    # Extract text from PDFs
    if "pdf" in content_type.lower() or filepath.lower().endswith(".pdf"):
        text = extract_pdf_text(filepath)
        if text:
            result["extracted_text"] = text
            result["extracted_chars"] = len(text)
        else:
            result["extracted_text"] = ""
            result["note"] = "PDF text extraction failed. Install pdfplumber or PyPDF2 for text extraction."

    return result


def main():
    parser = argparse.ArgumentParser(description="Download files from URLs")
    parser.add_argument("url", help="URL to download")
    parser.add_argument("--output", default="/tmp/downloads", help="Output directory (default: /tmp/downloads)")
    parser.add_argument("--filename", default=None, help="Override filename")
    parser.add_argument("--proxy", help="Proxy URL (e.g., http://proxy:8080)")
    parser.add_argument("--user-agent", help="Override User-Agent string")
    args = parser.parse_args()

    try:
        result = download(args.url, args.output, args.filename, args.proxy, args.user_agent)
        print(json.dumps(result, indent=2, ensure_ascii=False))
    except Exception as e:
        print(json_error(f"Download failed: {str(e)}"))


if __name__ == "__main__":
    main()
182
scripts/google_search.py
Normal file
@@ -0,0 +1,182 @@
#!/usr/bin/env python3
"""Web search via multiple engines. No API key required.

Usage:
    python3 google_search.py "search term" [--pages N] [--engine ENGINE]

Flags:
    --pages N        Number of result pages (default: 1, ~10 results each)
    --engine ENGINE  Search engine: duckduckgo (default), brave, google
                     Note: google often blocks with CAPTCHA

Outputs a JSON array of {title, url, snippet} per result.
"""

import argparse
import json
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup


def json_error(message: str) -> str:
    """Return a standardized JSON error payload."""
    return json.dumps({"error": message}, indent=2, ensure_ascii=False)


HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def search_duckduckgo(query: str, pages: int = 1) -> list[dict]:
    """DuckDuckGo HTML endpoint: most reliable, no CAPTCHA."""
    results = []
    form_data = {"q": query}

    for page in range(pages):
        resp = requests.post("https://html.duckduckgo.com/html/", data=form_data, headers=HEADERS, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        for res in soup.select(".result"):
            title_el = res.select_one(".result__title a, a.result__a")
            snippet_el = res.select_one(".result__snippet")
            if not title_el:
                continue
            href = title_el.get("href", "")
            if "uddg=" in href:
                href = urllib.parse.unquote(
                    urllib.parse.parse_qs(urllib.parse.urlparse(href).query).get("uddg", [href])[0]
                )
            if href.startswith("http"):
                results.append({
                    "title": title_el.get_text(strip=True),
                    "url": href,
                    "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
                })

        if page < pages - 1:
            next_form = None
            for btn in soup.find_all("input", {"value": "Next"}):
                if btn.parent and btn.parent.name == "form":
                    next_form = btn.parent
                    break
            if not next_form:
                break
            form_data = {}
            for inp in next_form.find_all("input"):
                name = inp.get("name")
                if name:
                    form_data[name] = inp.get("value", "")
            time.sleep(1)

    return results
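DuckDuckGo's HTML results wrap each target URL in a `/l/?uddg=...` redirect, which the loop above unwraps with `urlparse` + `parse_qs` + `unquote`. The same step in isolation (the `unwrap_uddg` helper name is mine):

```python
import urllib.parse

def unwrap_uddg(href: str) -> str:
    # Pull the real target out of DuckDuckGo's redirect wrapper, if present
    if "uddg=" not in href:
        return href
    qs = urllib.parse.parse_qs(urllib.parse.urlparse(href).query)
    return urllib.parse.unquote(qs.get("uddg", [href])[0])

wrapped = "//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fpage&rut=abc"
print(unwrap_uddg(wrapped))  # -> https://example.com/page
```

`parse_qs` already percent-decodes the query value, so the trailing `unquote` is a no-op for typical links; it only matters for doubly-encoded targets.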


def search_brave(query: str, pages: int = 1) -> list[dict]:
    """Brave Search HTML: a good alternative, sometimes more results."""
    results = []

    for page in range(pages):
        offset = page * 10
        params = {"q": query, "offset": str(offset)}
        resp = requests.get("https://search.brave.com/search", params=params, headers=HEADERS, timeout=15)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")

        for item in soup.select('div[data-type="web"]'):
            # Title: dedicated title span, or first link text
            title_el = item.select_one(".title.search-snippet-title, .search-snippet-title")
            link_el = item.select_one("a[href^='http']")
            # Description/snippet
            snippet_el = item.select_one(".generic-snippet .content, .generic-snippet, .snippet-description")

            if not link_el:
                continue
            href = link_el.get("href", "")
            title = title_el.get_text(strip=True) if title_el else link_el.get_text(strip=True)
            if href.startswith("http") and title:
                results.append({
                    "title": title,
                    "url": href,
                    "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
                })

        if page < pages - 1:
            time.sleep(1)

    return results


def search_google(query: str, pages: int = 1) -> list[dict]:
    """Google HTML: often blocked by CAPTCHA. Use as a fallback."""
    results = []

    for page in range(pages):
        start = page * 10
        params = {"q": query, "start": str(start), "hl": "en"}
        resp = requests.get("https://www.google.com/search", params=params, headers=HEADERS, timeout=15)
        resp.raise_for_status()

        if "sorry" in resp.url or "unusual traffic" in resp.text.lower():
            if not results:
                raise RuntimeError("Google blocked the request (CAPTCHA). Try --engine duckduckgo or brave.")
            break

        soup = BeautifulSoup(resp.text, "html.parser")
        for h3 in soup.find_all("h3"):
            parent_a = h3.find_parent("a")
            if parent_a and parent_a.get("href", "").startswith("http"):
                # Find the snippet near the h3
                container = h3.find_parent("div", class_="g") or h3.parent
                snippet_el = container.select_one("div[data-sncf], div.VwiC3b, span.st") if container else None
                results.append({
                    "title": h3.get_text(strip=True),
                    "url": parent_a["href"],
                    "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
                })

        if page < pages - 1:
            time.sleep(1.5)

    return results


ENGINES = {
    "duckduckgo": search_duckduckgo,
    "ddg": search_duckduckgo,
    "brave": search_brave,
    "google": search_google,
}


def main():
    parser = argparse.ArgumentParser(description="Web search (multi-engine, no API key)")
    parser.add_argument("query", help="Search query")
    parser.add_argument("--pages", type=int, default=1, help="Number of result pages (default: 1)")
    parser.add_argument("--engine", choices=["duckduckgo", "ddg", "brave", "google"],
                        default="duckduckgo", help="Search engine (default: duckduckgo)")
    args = parser.parse_args()

    try:
        search_fn = ENGINES[args.engine]
        results = search_fn(args.query, args.pages)

        # Deduplicate by URL, preserving order
        seen = set()
        deduped = []
        for r in results:
            if r["url"] not in seen:
                seen.add(r["url"])
                deduped.append(r)

        print(json.dumps(deduped, indent=2, ensure_ascii=False))
    except Exception as e:
        print(json_error(f"Search failed: {str(e)}"))


if __name__ == "__main__":
    main()
153
scripts/read_page.py
Normal file
@@ -0,0 +1,153 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Extract readable content from a web page using Playwright + Chromium.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python3 read_page.py <url> [--max-chars N] [--visible] [--format FMT] [--no-dismiss]
|
||||||
|
|
||||||
|
Flags:
|
||||||
|
--max-chars N Max characters to output (default: 50000)
|
||||||
|
--visible Show browser window (non-headless)
|
||||||
|
--format FMT Output format: json (default), markdown, text
|
||||||
|
--no-dismiss Skip cookie consent auto-dismiss
|
||||||
|
|
||||||
|
Outputs content in the requested format.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
|
||||||
|
from playwright.sync_api import sync_playwright
|
||||||
|
|
||||||
|
EXTRACT_JS = """() => {
|
||||||
|
const SKIP = new Set(['SCRIPT','STYLE','NOSCRIPT','IFRAME','SVG','NAV','FOOTER','HEADER','ASIDE']);
|
||||||
|
const title = document.title || '';
|
||||||
|
const mainEl = document.querySelector('article')
|
||||||
|
|| document.querySelector('main')
|
||||||
|
|| document.querySelector('[role="main"]')
|
||||||
|
|| document.querySelector('#content, .content, .post-content, .entry-content')
|
||||||
|
|| document.body;
|
||||||
|
|
||||||
|
const lines = [];
|
||||||
|
const walker = document.createTreeWalker(mainEl, NodeFilter.SHOW_ELEMENT, {
|
||||||
|
acceptNode(node) {
|
||||||
|
if (SKIP.has(node.tagName)) return NodeFilter.FILTER_REJECT;
|
||||||
|
const tag = node.tagName.toLowerCase();
|
||||||
|
if (['h1','h2','h3','h4','h5','h6','p','li','td','th','pre','blockquote'].includes(tag))
|
||||||
|
return NodeFilter.FILTER_ACCEPT;
|
||||||
|
return NodeFilter.FILTER_SKIP;
|
||||||
|
}
|
||||||
|
});
|
||||||
|
let node;
|
||||||
|
while (node = walker.nextNode()) {
|
||||||
|
const text = node.innerText?.trim();
|
||||||
|
if (!text) continue;
|
||||||
|
const tag = node.tagName.toLowerCase();
|
||||||
|
if (tag.startsWith('h')) lines.push('\\n' + '#'.repeat(parseInt(tag[1])) + ' ' + text + '\\n');
|
||||||
|
else if (tag === 'li') lines.push('- ' + text);
|
||||||
|
else if (tag === 'blockquote') lines.push('> ' + text);
|
||||||
|
else lines.push(text);
|
||||||
|
}
|
||||||
|
let content = lines.join('\\n').trim();
|
||||||
|
if (content.length < 200) content = mainEl.innerText || '';
|
||||||
|
return { title, content };
|
||||||
|
}"""
|
||||||
|
|
||||||
|
# Clicks the first visible cookie-consent button it can find: first via
# well-known id/class/aria selectors, then by matching short button labels
# against multilingual accept/agree phrases.
COOKIE_DISMISS_JS = """() => {
  const selectors = [
    'button[id*="accept" i]', 'button[id*="consent" i]', 'button[id*="agree" i]',
    'button[class*="accept" i]', 'button[class*="consent" i]', 'button[class*="agree" i]',
    'a[id*="accept" i]', 'a[class*="accept" i]',
    '[data-testid*="accept" i]', '[data-testid*="consent" i]',
    '.cookie-banner button', '.cookie-notice button', '.cookie-popup button',
    '#cookie-banner button', '#cookie-notice button', '#cookie-popup button',
    '.cc-btn.cc-dismiss', '.cc-accept', '#onetrust-accept-btn-handler',
    '.js-cookie-consent-agree', '[aria-label*="accept" i][aria-label*="cookie" i]',
    '[aria-label*="Accept all" i]', '[aria-label*="Accept cookies" i]',
  ];

  // Pass 1: known selectors. offsetParent !== null filters out hidden elements.
  for (const sel of selectors) {
    try {
      const el = document.querySelector(sel);
      if (el && el.offsetParent !== null) { el.click(); return { dismissed: true }; }
    } catch(e) {}
  }

  // Pass 2: match visible button text (English, German, French).
  const patterns = [
    /^accept all$/i, /accept all cookies/i, /accept cookies/i, /accept & close/i,
    /^agree$/i, /agree and continue/i, /agree & continue/i,
    /consent and continue/i, /consent & continue/i,
    /got it/i, /i understand/i, /i agree/i,
    /allow all/i, /allow cookies/i, /allow all cookies/i,
    /^ok$/i, /^okay$/i, /^continue$/i, /^dismiss$/i,
    /accept and close/i, /accept and continue/i,
    /nur notwendige/i, /alle akzeptieren/i, /akzeptieren/i,
    /tout accepter/i, /accepter/i, /accepter et continuer/i,
  ];

  for (const btn of document.querySelectorAll('button, a[role="button"], [role="button"]')) {
    const text = btn.innerText?.trim();
    if (!text || text.length > 50) continue;
    for (const pat of patterns) {
      if (pat.test(text) && btn.offsetParent !== null) { btn.click(); return { dismissed: true }; }
    }
  }

  return { dismissed: false };
}"""


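The second pass above only considers short, visible button labels and tests them against multilingual phrases. The matching logic (not the clicking) can be sketched in pure Python; `would_dismiss` and the trimmed pattern list are illustrative:

```python
import re

# Sketch of the label-matching fallback in COOKIE_DISMISS_JS: labels over
# 50 characters are skipped, the rest are tested case-insensitively.
PATTERNS = [
    re.compile(p, re.I) for p in (
        r"^accept all$", r"accept cookies", r"^agree$",
        r"got it", r"allow all", r"^ok$",
        r"alle akzeptieren",   # German
        r"tout accepter",      # French
    )
]

def would_dismiss(label: str) -> bool:
    label = label.strip()
    if not label or len(label) > 50:
        return False
    return any(p.search(label) for p in PATTERNS)
```

Matching on label text rather than selectors alone is what lets the dismissal work on bespoke banners that use no recognizable ids or classes.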
def format_output(result: dict, fmt: str) -> str:
    if fmt == "text":
        # Strip the markdown markers added during extraction.
        content = result.get("content", "")
        content = re.sub(r'^#+\s+', '', content, flags=re.MULTILINE)
        content = re.sub(r'^- ', ' ', content, flags=re.MULTILINE)
        content = re.sub(r'^> ', '', content, flags=re.MULTILINE)
        return content.strip()
    elif fmt == "markdown":
        return f"# {result.get('title', '')}\n\n{result.get('content', '')}"
    else:
        return json.dumps(result, indent=2, ensure_ascii=False)


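The "text" branch above undoes the markdown markers that EXTRACT_JS inserted. Isolated as a standalone sketch (`strip_markdown` is an illustrative name, not part of the script):

```python
import re

# Sketch of format_output's "text" mode: heading hashes are removed,
# list dashes become a single leading space, quote markers are dropped.
def strip_markdown(content: str) -> str:
    content = re.sub(r"^#+\s+", "", content, flags=re.MULTILINE)
    content = re.sub(r"^- ", " ", content, flags=re.MULTILINE)
    content = re.sub(r"^> ", "", content, flags=re.MULTILINE)
    return content.strip()
```

Note that list items keep one leading space, mirroring the `re.sub(r'^- ', ' ', …)` call, so list structure remains faintly visible in plain-text output.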
def main():
    parser = argparse.ArgumentParser(description="Web page reader (Playwright + Chromium)")
    parser.add_argument("url", help="URL to read")
    parser.add_argument("--max-chars", type=int, default=50000, help="Max characters (default: 50000)")
    parser.add_argument("--visible", action="store_true", help="Run in visible (non-headless) mode")
    parser.add_argument("--format", choices=["json", "markdown", "text"], default="json", help="Output format")
    parser.add_argument("--no-dismiss", action="store_true", help="Skip cookie consent auto-dismiss")
    args = parser.parse_args()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=not args.visible)
        ctx = browser.new_context(
            user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            locale="en-US",
            viewport={"width": 1280, "height": 900},
        )
        page = ctx.new_page()
        page.goto(args.url, timeout=30000, wait_until="domcontentloaded")
        page.wait_for_timeout(1500)  # give JS-rendered pages a moment to settle

        if not args.no_dismiss:
            # Try the main frame first, then iframes (EU sites often load
            # the consent dialog in an iframe).
            dismissed = page.evaluate(COOKIE_DISMISS_JS)
            if not dismissed.get("dismissed"):
                for frame in page.frames:
                    if frame == page.main_frame:
                        continue
                    try:
                        r = frame.evaluate(COOKIE_DISMISS_JS)
                        if r.get("dismissed"):
                            break
                    except Exception:
                        pass
            page.wait_for_timeout(500)  # let the page reflow after dismissal

        result = page.evaluate(EXTRACT_JS)
        if len(result["content"]) > args.max_chars:
            result["content"] = result["content"][:args.max_chars] + "\n\n[...truncated]"

        print(format_output(result, args.format))
        browser.close()


if __name__ == "__main__":
    main()