235 lines
5.9 KiB
Markdown
235 lines
5.9 KiB
Markdown
|
|
---
|
|||
|
|
name: playwright-scraper-skill
|
|||
|
|
description: "基于Playwright的网页抓取OpenClaw技能。"
|
|||
|
|
version: 1.2.0
|
|||
|
|
author: Simon Chan
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
# Playwright Scraper Skill
|
|||
|
|
|
|||
|
|
A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 Use Case Matrix
|
|||
|
|
|
|||
|
|
| Target Website | Anti-Bot Level | Recommended Method | Script |
|
|||
|
|
|---------------|----------------|-------------------|--------|
|
|||
|
|
| **Regular Sites** | Low | web_fetch tool | N/A (built-in) |
|
|||
|
|
| **Dynamic Sites** | Medium | Playwright Simple | `scripts/playwright-simple.js` |
|
|||
|
|
| **Cloudflare Protected** | High | **Playwright Stealth** ⭐ | `scripts/playwright-stealth.js` |
|
|||
|
|
| **YouTube** | Special | deep-scraper | Install separately |
|
|||
|
|
| **Reddit** | Special | reddit-scraper | Install separately |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📦 Installation
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
cd playwright-scraper-skill
|
|||
|
|
npm install
|
|||
|
|
npx playwright install chromium
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🚀 Quick Start
|
|||
|
|
|
|||
|
|
### 1️⃣ Simple Sites (No Anti-Bot)
|
|||
|
|
|
|||
|
|
Use OpenClaw's built-in `web_fetch` tool:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
# Invoke directly in OpenClaw
|
|||
|
|
Hey, fetch me the content from https://example.com
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 2️⃣ Dynamic Sites (Requires JavaScript)
|
|||
|
|
|
|||
|
|
Use **Playwright Simple**:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
node scripts/playwright-simple.js "https://example.com"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Example output:**
|
|||
|
|
```json
|
|||
|
|
{
|
|||
|
|
"url": "https://example.com",
|
|||
|
|
"title": "Example Domain",
|
|||
|
|
"content": "...",
|
|||
|
|
"elapsedSeconds": "3.45"
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)
|
|||
|
|
|
|||
|
|
Use **Playwright Stealth**:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Features:**
|
|||
|
|
- Hide automation markers (`navigator.webdriver = false`)
|
|||
|
|
- Realistic User-Agent (iPhone, Android)
|
|||
|
|
- Random delays to mimic human behavior
|
|||
|
|
- Screenshot and HTML saving support
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 4️⃣ YouTube Video Transcripts
|
|||
|
|
|
|||
|
|
Use **deep-scraper** (install separately):
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
# Install deep-scraper skill
|
|||
|
|
npx clawhub install deep-scraper
|
|||
|
|
|
|||
|
|
# Use it
|
|||
|
|
cd skills/deep-scraper
|
|||
|
|
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📖 Script Descriptions
|
|||
|
|
|
|||
|
|
### `scripts/playwright-simple.js`
|
|||
|
|
- **Use Case:** Regular dynamic websites
|
|||
|
|
- **Speed:** Fast (3-5 seconds)
|
|||
|
|
- **Anti-Bot:** None
|
|||
|
|
- **Output:** JSON (title, content, URL)
|
|||
|
|
|
|||
|
|
### `scripts/playwright-stealth.js` ⭐
|
|||
|
|
- **Use Case:** Sites with Cloudflare or anti-bot protection
|
|||
|
|
- **Speed:** Medium (5-20 seconds)
|
|||
|
|
- **Anti-Bot:** Medium-High (hides automation, realistic UA)
|
|||
|
|
- **Output:** JSON + Screenshot + HTML file
|
|||
|
|
- **Verified:** 100% success on Discuss.com.hk
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎓 Best Practices
|
|||
|
|
|
|||
|
|
### 1. Try web_fetch First
|
|||
|
|
If the site doesn't have dynamic loading, use OpenClaw's `web_fetch` tool—it's fastest.
|
|||
|
|
|
|||
|
|
### 2. Need JavaScript? Use Playwright Simple
|
|||
|
|
If you need to wait for JavaScript rendering, use `playwright-simple.js`.
|
|||
|
|
|
|||
|
|
### 3. Getting Blocked? Use Stealth
|
|||
|
|
If you encounter 403 or Cloudflare challenges, use `playwright-stealth.js`.
|
|||
|
|
|
|||
|
|
### 4. Special Sites Need Specialized Skills
|
|||
|
|
- YouTube → deep-scraper
|
|||
|
|
- Reddit → reddit-scraper
|
|||
|
|
- Twitter → bird skill
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🔧 Customization
|
|||
|
|
|
|||
|
|
All scripts support environment variables:
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
# Set screenshot path
|
|||
|
|
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL
|
|||
|
|
|
|||
|
|
# Set wait time (milliseconds)
|
|||
|
|
WAIT_TIME=10000 node scripts/playwright-simple.js URL
|
|||
|
|
|
|||
|
|
# Enable headful mode (show browser)
|
|||
|
|
HEADLESS=false node scripts/playwright-stealth.js URL
|
|||
|
|
|
|||
|
|
# Save HTML
|
|||
|
|
SAVE_HTML=true node scripts/playwright-stealth.js URL
|
|||
|
|
|
|||
|
|
# Custom User-Agent
|
|||
|
|
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📊 Performance Comparison
|
|||
|
|
|
|||
|
|
| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
|
|||
|
|
|--------|-------|----------|-------------------------------|
|
|||
|
|
| web_fetch | ⚡ Fastest | ❌ None | 0% |
|
|||
|
|
| Playwright Simple | 🚀 Fast | ⚠️ Low | 20% |
|
|||
|
|
| **Playwright Stealth** | ⏱️ Medium | ✅ Medium | **100%** ✅ |
|
|||
|
|
| Puppeteer Stealth | ⏱️ Medium | ✅ Medium-High | ~80% |
|
|||
|
|
| Crawlee (deep-scraper) | 🐢 Slow | ❌ Detected | 0% |
|
|||
|
|
| Chaser (Rust) | ⏱️ Medium | ❌ Detected | 0% |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🛡️ Anti-Bot Techniques Summary
|
|||
|
|
|
|||
|
|
Lessons learned from our testing:
|
|||
|
|
|
|||
|
|
### ✅ Effective Anti-Bot Measures
|
|||
|
|
1. **Hide `navigator.webdriver`** — Essential
|
|||
|
|
2. **Realistic User-Agent** — Use real devices (iPhone, Android)
|
|||
|
|
3. **Mimic Human Behavior** — Random delays, scrolling
|
|||
|
|
4. **Avoid Framework Signatures** — Crawlee, Selenium are easily detected
|
|||
|
|
5. **Use `addInitScript` (Playwright)** — Inject before page load
|
|||
|
|
|
|||
|
|
### ❌ Ineffective Anti-Bot Measures
|
|||
|
|
1. **Only changing User-Agent** — Not enough
|
|||
|
|
2. **Using high-level frameworks (Crawlee)** — More easily detected
|
|||
|
|
3. **Docker isolation** — Doesn't help with Cloudflare
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🔍 Troubleshooting
|
|||
|
|
|
|||
|
|
### Issue: 403 Forbidden
|
|||
|
|
**Solution:** Use `playwright-stealth.js`
|
|||
|
|
|
|||
|
|
### Issue: Cloudflare Challenge Page
|
|||
|
|
**Solution:**
|
|||
|
|
1. Increase wait time (10-15 seconds)
|
|||
|
|
2. Try `headless: false` (headful mode sometimes has higher success rate)
|
|||
|
|
3. Consider using proxy IPs
|
|||
|
|
|
|||
|
|
### Issue: Blank Page
|
|||
|
|
**Solution:**
|
|||
|
|
1. Increase `waitForTimeout`
|
|||
|
|
2. Use `waitUntil: 'networkidle'` or `'domcontentloaded'`
|
|||
|
|
3. Check if login is required
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📝 Memory & Experience
|
|||
|
|
|
|||
|
|
### 2026-02-07 Discuss.com.hk Test Conclusions
|
|||
|
|
- ✅ **Pure Playwright + Stealth** succeeded (5s, 200 OK)
|
|||
|
|
- ❌ Crawlee (deep-scraper) failed (403)
|
|||
|
|
- ❌ Chaser (Rust) failed (Cloudflare)
|
|||
|
|
- ❌ Puppeteer standard failed (403)
|
|||
|
|
|
|||
|
|
**Best Solution:** Pure Playwright + anti-bot techniques (framework-independent)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🚧 Future Improvements
|
|||
|
|
|
|||
|
|
- [ ] Add proxy IP rotation
|
|||
|
|
- [ ] Implement cookie management (maintain login state)
|
|||
|
|
- [ ] Add CAPTCHA handling (2captcha / Anti-Captcha)
|
|||
|
|
- [ ] Batch scraping (parallel URLs)
|
|||
|
|
- [ ] Integration with OpenClaw's `browser` tool
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📚 References
|
|||
|
|
|
|||
|
|
- [Playwright Official Docs](https://playwright.dev/)
|
|||
|
|
- [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)
|
|||
|
|
- [deep-scraper skill](https://clawhub.com/opsun/deep-scraper)
|