Initial commit with translated description
This commit is contained in:
187
README.md
Normal file
187
README.md
Normal file
@@ -0,0 +1,187 @@
|
||||
# Playwright Scraper Skill 🕷️
|
||||
|
||||
[](https://opensource.org/licenses/MIT)
|
||||
[](https://nodejs.org/)
|
||||
[](https://playwright.dev/)
|
||||
|
||||
**[中文文檔](README_ZH.md)** | English
|
||||
|
||||
A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Successfully tested on complex websites like Discuss.com.hk.
|
||||
|
||||
> 📦 **Installation:** See [INSTALL.md](INSTALL.md)
|
||||
> 📚 **Full Documentation:** See [SKILL.md](SKILL.md)
|
||||
> 💡 **Examples:** See [examples/README.md](examples/README.md)
|
||||
|
||||
---
|
||||
|
||||
## ✨ Features
|
||||
|
||||
- ✅ **Pure Playwright** — Modern, powerful, easy to use
|
||||
- ✅ **Anti-Bot Protection** — Hides automation, realistic UA
|
||||
- ✅ **Verified** — 100% success on Discuss.com.hk
|
||||
- ✅ **Simple to Use** — One-line commands
|
||||
- ✅ **Customizable** — Environment variable support
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Quick Start
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
npm install
|
||||
npx playwright install chromium
|
||||
```
|
||||
|
||||
### Usage
|
||||
|
||||
```bash
|
||||
# Quick scraping
|
||||
node scripts/playwright-simple.js https://example.com
|
||||
|
||||
# Stealth mode (recommended)
|
||||
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📖 Two Modes
|
||||
|
||||
| Mode | Use Case | Speed | Anti-Bot |
|
||||
|------|----------|-------|----------|
|
||||
| **Simple** | Regular dynamic sites | Fast (3-5s) | None |
|
||||
| **Stealth** ⭐ | Sites with anti-bot | Medium (5-20s) | Medium-High |
|
||||
|
||||
### Simple Mode
|
||||
|
||||
For sites without anti-bot protection:
|
||||
|
||||
```bash
|
||||
node scripts/playwright-simple.js <URL>
|
||||
```
|
||||
|
||||
### Stealth Mode (Recommended)
|
||||
|
||||
For sites with Cloudflare or anti-bot protection:
|
||||
|
||||
```bash
|
||||
node scripts/playwright-stealth.js <URL>
|
||||
```
|
||||
|
||||
**Anti-Bot Techniques:**
|
||||
- Hide `navigator.webdriver`
|
||||
- Realistic User-Agent (iPhone)
|
||||
- Human-like behavior simulation
|
||||
- Screenshot and HTML saving support
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Customization
|
||||
|
||||
All scripts support environment variables:
|
||||
|
||||
```bash
|
||||
# Show browser
|
||||
HEADLESS=false node scripts/playwright-stealth.js <URL>
|
||||
|
||||
# Custom wait time (milliseconds)
|
||||
WAIT_TIME=10000 node scripts/playwright-stealth.js <URL>
|
||||
|
||||
# Save screenshot
|
||||
SCREENSHOT_PATH=/tmp/page.png node scripts/playwright-stealth.js <URL>
|
||||
|
||||
# Save HTML
|
||||
SAVE_HTML=true node scripts/playwright-stealth.js <URL>
|
||||
|
||||
# Custom User-Agent
|
||||
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js <URL>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 📊 Test Results
|
||||
|
||||
| Website | Result | Time |
|
||||
|---------|--------|------|
|
||||
| **Discuss.com.hk** | ✅ 200 OK | 5-20s |
|
||||
| **Example.com** | ✅ 200 OK | 3-5s |
|
||||
| **Cloudflare Protected** | ✅ Mostly successful | 10-30s |
|
||||
|
||||
---
|
||||
|
||||
## 📁 File Structure
|
||||
|
||||
```
|
||||
playwright-scraper-skill/
|
||||
├── scripts/
|
||||
│ ├── playwright-simple.js # Simple mode
|
||||
│ └── playwright-stealth.js # Stealth mode ⭐
|
||||
├── examples/
|
||||
│ ├── discuss-hk.sh # Discuss.com.hk example
|
||||
│ └── README.md # More examples
|
||||
├── SKILL.md # Full documentation
|
||||
├── INSTALL.md # Installation guide
|
||||
├── README.md # This file
|
||||
├── README_ZH.md # Chinese documentation
|
||||
├── CONTRIBUTING.md # Contribution guide
|
||||
├── CHANGELOG.md # Version history
|
||||
└── package.json # npm config
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 💡 Best Practices
|
||||
|
||||
1. **Try web_fetch first** — OpenClaw's built-in tool is fastest
|
||||
2. **Use Simple for dynamic sites** — When no anti-bot protection
|
||||
3. **Use Stealth for protected sites** ⭐ — Main workhorse
|
||||
4. **Use specialized skills** — For YouTube, Reddit, etc.
|
||||
|
||||
---
|
||||
|
||||
## 🐛 Troubleshooting
|
||||
|
||||
### Getting 403 blocked?
|
||||
|
||||
Use Stealth mode:
|
||||
```bash
|
||||
node scripts/playwright-stealth.js <URL>
|
||||
```
|
||||
|
||||
### Cloudflare challenge?
|
||||
|
||||
Increase wait time + headful mode:
|
||||
```bash
|
||||
HEADLESS=false WAIT_TIME=30000 node scripts/playwright-stealth.js <URL>
|
||||
```
|
||||
|
||||
### Playwright not found?
|
||||
|
||||
Reinstall:
|
||||
```bash
|
||||
npm install
|
||||
npx playwright install chromium
|
||||
```
|
||||
|
||||
More issues? See [INSTALL.md](INSTALL.md)
|
||||
|
||||
---
|
||||
|
||||
## 🤝 Contributing
|
||||
|
||||
Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md)
|
||||
|
||||
---
|
||||
|
||||
## 📄 License
|
||||
|
||||
MIT License - See [LICENSE](LICENSE)
|
||||
|
||||
---
|
||||
|
||||
## 🔗 Links
|
||||
|
||||
- [Playwright Official Docs](https://playwright.dev/)
|
||||
- [Full Documentation (SKILL.md)](SKILL.md)
|
||||
- [Installation Guide (INSTALL.md)](INSTALL.md)
|
||||
- [Examples (examples/)](examples/)
|
||||
Reference in New Issue
Block a user