Initial commit with translated description

2026-03-29 13:02:57 +08:00
commit d97055d190
7 changed files with 2719 additions and 0 deletions

AI_AGENT_GUIDE.md Normal file

@@ -0,0 +1,448 @@
# AI Desktop Agent - Cognitive Automation Guide
## 🤖 What Is This?
The **AI Desktop Agent** is an intelligent layer on top of the basic desktop control that **understands** what you want and figures out how to do it autonomously.
Unlike basic automation that requires exact instructions, the AI Agent:
- **Understands natural language** ("Draw a cat in Paint")
- **Plans the steps** automatically
- **Executes autonomously**
- **Adapts** based on what it sees
---
## 🎯 What Can It Do?
### ✅ Autonomous Drawing
```python
from skills.desktop_control.ai_agent import AIDesktopAgent
agent = AIDesktopAgent()
# Just describe what you want!
agent.execute_task("Draw a circle in Paint")
agent.execute_task("Draw a star in MS Paint")
agent.execute_task("Draw a house with a sun")
```
**What it does:**
1. Opens MS Paint
2. Selects pencil tool
3. Figures out how to draw the requested shape
4. Draws it autonomously
5. Takes a screenshot of the result
### ✅ Autonomous Text Entry
```python
# It figures out where to type
agent.execute_task("Type 'Hello World' in Notepad")
agent.execute_task("Write an email saying thank you")
```
**What it does:**
1. Opens Notepad (or finds active text editor)
2. Types the text naturally
3. Formats if needed
### ✅ Autonomous Application Control
```python
# It knows how to open apps
agent.execute_task("Open Calculator")
agent.execute_task("Launch Microsoft Paint")
agent.execute_task("Open File Explorer")
```
### ✅ Autonomous Game Playing (Advanced)
```python
# It will try to play the game!
agent.execute_task("Play Solitaire for me")
agent.execute_task("Play Minesweeper")
```
**What it does:**
1. Analyzes the game screen
2. Detects game state (cards, mines, etc.)
3. Decides best move
4. Executes the move
5. Repeats until win/lose
---
## 🏗️ How It Works
### Architecture
```
User Request ("Draw a cat")
        ↓
Natural Language Understanding
        ↓
Task Planning (step-by-step plan)
        ↓
Step Execution Loop:
    - Observe Screen (Computer Vision)
    - Decide Action (AI Reasoning)
    - Execute Action (Desktop Control)
    - Verify Result
        ↓
Task Complete!
```
### Key Components
1. **Task Planner** - Breaks down high-level tasks into steps
2. **Vision System** - Understands what's on screen (screenshots, OCR, object detection)
3. **Reasoning Engine** - Decides what to do next
4. **Action Executor** - Performs the actual mouse/keyboard actions
5. **Feedback Loop** - Verifies actions succeeded
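Components 2–5 form a feedback loop. A minimal sketch of that observe–decide–act cycle (the function names and step format here are illustrative placeholders, not the agent's real API):

```python
# Minimal observe-decide-act loop sketch. The step dicts and the None
# "done" sentinel are illustrative, not the agent's actual internals.
def run_loop(observe, decide, act, max_steps=50):
    """Run observe -> decide -> act until decide() signals completion."""
    history = []
    for _ in range(max_steps):
        state = observe()        # e.g. screenshot + OCR summary
        action = decide(state)   # e.g. next planned step
        if action is None:       # planner says the task is done
            break
        result = act(action)     # mouse/keyboard execution
        history.append((action, result))
    return history

# Toy usage: "press" three keys, then stop.
queue = [{"type": "press", "key": k} for k in "abc"]
trace = run_loop(
    observe=lambda: "screen",
    decide=lambda state: queue.pop(0) if queue else None,
    act=lambda action: True,
)
```

The `max_steps` cap mirrors the `max_steps` parameter of `execute_task`, preventing runaway loops when a task never reaches a "done" state.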
---
## 📋 Supported Tasks (Current)
### Tier 1: Fully Automated ✅
| Task Pattern | Example | Status |
|-------------|---------|---------|
| Draw shapes in Paint | "Draw a circle" | ✅ Working |
| Basic text entry | "Type Hello" | ✅ Working |
| Launch applications | "Open Paint" | ✅ Working |
### Tier 2: Partially Automated 🔨
| Task Pattern | Example | Status |
|-------------|---------|---------|
| Form filling | "Fill out this form" | 🔨 In Progress |
| File operations | "Copy these files" | 🔨 In Progress |
| Web navigation | "Find on Google" | 🔨 Planned |
### Tier 3: Experimental 🧪
| Task Pattern | Example | Status |
|-------------|---------|---------|
| Game playing | "Play Solitaire" | 🧪 Experimental |
| Image editing | "Resize this photo" | 🧪 Planned |
| Code editing | "Fix this bug" | 🧪 Research |
---
## 🎨 Example: Drawing in Paint
### Simple Request
```python
agent = AIDesktopAgent()
result = agent.execute_task("Draw a circle in Paint")
# Check result
print(f"Status: {result['status']}")
print(f"Steps taken: {len(result['steps'])}")
```
### What Happens Behind the Scenes
**1. Planning Phase:**
```
Plan generated:
Step 1: Launch MS Paint
Step 2: Wait 2s for Paint to load
Step 3: Activate Paint window
Step 4: Select pencil tool (press 'P')
Step 5: Draw circle at canvas center
Step 6: Screenshot the result
```
**2. Execution Phase:**
```
[✓] Launched Paint via Win+R → mspaint
[✓] Waited 2.0s
[✓] Activated window "Paint"
[✓] Pressed 'P' to select pencil
[✓] Drew circle with 72 points
[✓] Screenshot saved: drawing_result.png
```
**3. Result:**
```python
{
"task": "Draw a circle in Paint",
"status": "completed",
"success": True,
"steps": [... 6 steps ...],
"screenshots": [... 6 screenshots ...],
}
```
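The "72 points" in the execution log suggests one point every 5° around a parametric circle. A sketch of how such points could be generated (the canvas center and radius below are assumed values, not taken from the agent):

```python
import math

def circle_points(cx, cy, radius, n=72):
    """Return n evenly spaced (x, y) points around center (cx, cy)."""
    return [
        (round(cx + radius * math.cos(2 * math.pi * i / n)),
         round(cy + radius * math.sin(2 * math.pi * i / n)))
        for i in range(n)
    ]

# Assumed canvas center and radius; the agent could then drag the
# pencil tool through these points in order to trace the circle.
points = circle_points(640, 400, 150)
```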
---
## 🎮 Example: Game Playing
```python
agent = AIDesktopAgent()
# Play a simple game
result = agent.execute_task("Play Solitaire for me")
```
### Game Playing Loop
```
1. Analyze screen → Detect cards, positions
2. Identify valid moves → Find legal plays
3. Evaluate moves → Which is best?
4. Execute move → Click and drag card
5. Repeat until game ends
```
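Steps 2–3 of the loop amount to scoring candidate moves and picking the best one. A minimal sketch (the move format and scoring heuristic are illustrative, not the agent's real move evaluator):

```python
def best_move(moves, score):
    """Pick the highest-scoring legal move, or None if there are none."""
    return max(moves, key=score, default=None)

# Toy Solitaire-style example: prefer moves that uncover a face-down card.
candidate_moves = [
    {"kind": "to_foundation", "uncovers": 0},
    {"kind": "stack_move", "uncovers": 1},
]
choice = best_move(candidate_moves, score=lambda m: m["uncovers"])
```

Returning `None` when no legal move exists gives the outer loop a natural termination signal ("repeat until game ends").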
### Game-Specific Intelligence
The agent can learn patterns for:
- **Solitaire**: Card stacking rules, suit matching
- **Minesweeper**: Probability calculations, safe clicks
- **2048**: Tile merging strategy
- **Chess** (if integrated with engine): Move evaluation
---
## 🧠 Enhancing the AI
### Adding Application Knowledge
```python
# In ai_agent.py, add to app_knowledge:
self.app_knowledge = {
"photoshop": {
"name": "Adobe Photoshop",
"launch_command": "photoshop",
"common_actions": {
"new_layer": {"hotkey": ["ctrl", "shift", "n"]},
"brush_tool": {"hotkey": ["b"]},
"eraser": {"hotkey": ["e"]},
}
}
}
```
### Adding Custom Task Patterns
```python
# Add a custom planning method
def _plan_photo_edit(self, task: str) -> List[Dict]:
"""Plan for photo editing tasks."""
return [
{"type": "launch_app", "app": "photoshop"},
{"type": "wait", "duration": 3.0},
{"type": "open_file", "path": extracted_path},
{"type": "apply_filter", "filter": extracted_filter},
{"type": "save_file"},
]
```
---
## 🔥 Advanced: Vision + Reasoning
### Screen Analysis
The agent can analyze screenshots to:
- **Detect UI elements** (buttons, text fields, menus)
- **Read text** (OCR for labels, instructions)
- **Identify objects** (icons, images, game pieces)
- **Understand layout** (where things are)
```python
# Analyze what's on screen
analysis = agent._analyze_screen()
print(analysis)
# Output:
# {
# "active_window": "Untitled - Paint",
# "mouse_position": (640, 480),
# "detected_elements": [...],
# "text_found": [...],
# }
```
### Integration with OpenClaw LLM
```python
# Future: Use OpenClaw's LLM for reasoning
agent = AIDesktopAgent(llm_client=openclaw_llm)
# The agent can now:
# - Reason about complex tasks
# - Understand context better
# - Plan more sophisticated workflows
# - Learn from feedback
```
---
## 🛠️ Extending for Your Needs
### Add Support for New Apps
1. **Identify the app**
2. **Document common actions**
3. **Add to knowledge base**
4. **Create planning method**
Example: Adding Excel support
```python
# Step 1: Add to app_knowledge
"excel": {
"name": "Microsoft Excel",
"launch_command": "excel",
"common_actions": {
"new_sheet": {"hotkey": ["shift", "f11"]},
"sum_formula": {"action": "type", "text": "=SUM()"},
}
}
# Step 2: Create planner
def _plan_excel_task(self, task: str) -> List[Dict]:
return [
{"type": "launch_app", "app": "excel"},
{"type": "wait", "duration": 2.0},
# ... specific Excel steps
]
# Step 3: Hook into main planner
if "excel" in task_lower or "spreadsheet" in task_lower:
return self._plan_excel_task(task)
```
---
## 🎯 Real-World Use Cases
### 1. Automated Form Filling
```python
agent.execute_task("Fill out the job application with my resume data")
```
### 2. Batch Image Processing
```python
agent.execute_task("Resize all images in this folder to 800x600")
```
### 3. Social Media Posting
```python
agent.execute_task("Post this image to Instagram with caption 'Beautiful sunset'")
```
### 4. Data Entry
```python
agent.execute_task("Copy data from this PDF to Excel spreadsheet")
```
### 5. Testing
```python
agent.execute_task("Test the login form with invalid credentials")
```
---
## ⚙️ Configuration
### Enable/Disable Failsafe
```python
# Safe mode (default)
agent = AIDesktopAgent(failsafe=True)
# Fast mode (no failsafe)
agent = AIDesktopAgent(failsafe=False)
```
### Set Max Steps
```python
# Prevent infinite loops
result = agent.execute_task("Play game", max_steps=100)
```
### Access Action History
```python
# Review what the agent did
print(agent.action_history)
```
---
## 🐛 Debugging
### View Step-by-Step Execution
```python
result = agent.execute_task("Draw a star in Paint")
for i, step in enumerate(result['steps'], 1):
print(f"Step {i}: {step['step']['description']}")
print(f" Success: {step['success']}")
if 'error' in step:
print(f" Error: {step['error']}")
```
### View Screenshots
```python
# Each step captures before/after screenshots
for screenshot_pair in result['screenshots']:
before = screenshot_pair['before']
after = screenshot_pair['after']
# Display or save for analysis
before.save(f"step_{screenshot_pair['step']}_before.png")
after.save(f"step_{screenshot_pair['step']}_after.png")
```
---
## 🚀 Future Enhancements
Planned features:
- [ ] **Computer Vision**: OCR, object detection, UI element recognition
- [ ] **LLM Integration**: Natural language understanding with OpenClaw LLM
- [ ] **Learning**: Remember successful patterns, improve over time
- [ ] **Multi-App Workflows**: "Get data from Chrome and put in Excel"
- [ ] **Voice Control**: "Alexa, draw a cat in Paint"
- [ ] **Autonomous Debugging**: Fix errors automatically
- [ ] **Game AI**: Reinforcement learning for game playing
- [ ] **Web Automation**: Full browser control with understanding
---
## 📚 Full API
### Main Methods
```python
# Execute a task
result = agent.execute_task(task: str, max_steps: int = 50)
# Analyze screen
analysis = agent._analyze_screen()
# Manual mode: Execute individual steps
step = {"type": "launch_app", "app": "paint"}
result = agent._execute_step(step)
```
### Result Structure
```python
{
"task": str, # Original task
"status": str, # "completed", "failed", "error"
"success": bool, # Overall success
"steps": List[Dict], # All steps executed
"screenshots": List[Dict], # Before/after screenshots
"failed_at_step": int, # If failed, which step
"error": str, # Error message if failed
}
```
---
**🦞 Built for OpenClaw - The future of desktop automation!**

QUICK_REFERENCE.md Normal file

@@ -0,0 +1,269 @@
# Desktop Control - Quick Reference Card
## 🚀 Instant Start
```python
from skills.desktop_control import DesktopController
dc = DesktopController()
```
## 🖱️ Mouse Control (Top 10)
```python
# 1. Move mouse
dc.move_mouse(500, 300, duration=0.5)
# 2. Click
dc.click(500, 300) # Left click at position
dc.click() # Click at current position
# 3. Right click
dc.right_click(500, 300)
# 4. Double click
dc.double_click(500, 300)
# 5. Drag & drop
dc.drag(100, 100, 500, 500, duration=1.0)
# 6. Scroll
dc.scroll(-5) # Scroll down 5 clicks
# 7. Get position
x, y = dc.get_mouse_position()
# 8. Move relative
dc.move_relative(100, 50) # Move 100px right, 50px down
# 9. Smooth movement
dc.move_mouse(1000, 500, duration=1.0, smooth=True)
# 10. Middle click
dc.middle_click()
```
## ⌨️ Keyboard Control (Top 10)
```python
# 1. Type text (instant)
dc.type_text("Hello World")
# 2. Type text (human-like, 60 WPM)
dc.type_text("Hello World", wpm=60)
# 3. Press key
dc.press('enter')
dc.press('tab')
dc.press('escape')
# 4. Hotkeys (shortcuts)
dc.hotkey('ctrl', 'c') # Copy
dc.hotkey('ctrl', 'v') # Paste
dc.hotkey('ctrl', 's') # Save
dc.hotkey('win', 'r') # Run dialog
dc.hotkey('alt', 'tab') # Switch window
# 5. Hold & release
dc.key_down('shift')
dc.type_text("hello") # Types "HELLO"
dc.key_up('shift')
# 6. Arrow keys
dc.press('up')
dc.press('down')
dc.press('left')
dc.press('right')
# 7. Function keys
dc.press('f5') # Refresh
# 8. Multiple presses
dc.press('backspace', presses=5)
# 9. Special keys
dc.press('home')
dc.press('end')
dc.press('pagedown')
dc.press('delete')
# 10. Fast combo
dc.hotkey('ctrl', 'alt', 'delete')
```
## 📸 Screen Operations (Top 5)
```python
# 1. Screenshot (full screen)
img = dc.screenshot()
dc.screenshot(filename="screen.png")
# 2. Screenshot (region)
img = dc.screenshot(region=(100, 100, 800, 600))
# 3. Get pixel color
r, g, b = dc.get_pixel_color(500, 300)
# 4. Find image on screen
location = dc.find_on_screen("button.png")
# 5. Get screen size
width, height = dc.get_screen_size()
```
## 🪟 Window Management (Top 5)
```python
# 1. Get all windows
windows = dc.get_all_windows()
# 2. Activate window
dc.activate_window("Chrome")
# 3. Get active window
active = dc.get_active_window()
# 4. List windows
for title in dc.get_all_windows():
print(title)
# 5. Switch to app
dc.activate_window("Visual Studio Code")
```
## 📋 Clipboard (Top 2)
```python
# 1. Copy to clipboard
dc.copy_to_clipboard("Hello!")
# 2. Get from clipboard
text = dc.get_from_clipboard()
```
## 🔥 Real-World Examples
### Example 1: Auto-fill Form
```python
dc.click(300, 200) # Name field
dc.type_text("John Doe", wpm=80)
dc.press('tab')
dc.type_text("john@email.com", wpm=80)
dc.press('tab')
dc.type_text("Password123", wpm=60)
dc.press('enter')
```
### Example 2: Copy-Paste Automation
```python
# Select all
dc.hotkey('ctrl', 'a')
# Copy
dc.hotkey('ctrl', 'c')
# Wait
dc.pause(0.5)
# Switch window
dc.hotkey('alt', 'tab')
# Paste
dc.hotkey('ctrl', 'v')
```
### Example 3: File Operations
```python
# Select multiple files
dc.key_down('ctrl')
dc.click(100, 200)
dc.click(100, 250)
dc.click(100, 300)
dc.key_up('ctrl')
# Copy
dc.hotkey('ctrl', 'c')
```
### Example 4: Screenshot Workflow
```python
# Take screenshot with a timestamped filename
import time
dc.screenshot(filename=f"capture_{time.time()}.png")
# Open in Paint
dc.hotkey('win', 'r')
dc.pause(0.5)
dc.type_text('mspaint')
dc.press('enter')
```
### Example 5: Search & Replace
```python
# Open Find & Replace
dc.hotkey('ctrl', 'h')
dc.pause(0.3)
# Type find text
dc.type_text("old_text")
dc.press('tab')
# Type replace text
dc.type_text("new_text")
# Replace all
dc.hotkey('alt', 'a')
```
## ⚙️ Configuration
```python
# With failsafe (move to corner to abort)
dc = DesktopController(failsafe=True)
# With approval mode (ask before each action)
dc = DesktopController(require_approval=True)
# Maximum speed (no safety checks)
dc = DesktopController(failsafe=False)
```
## 🛡️ Safety
```python
# Check if safe to continue
if dc.is_safe():
dc.click(500, 500)
# Pause execution
dc.pause(2.0) # Wait 2 seconds
# Emergency abort: Move mouse to any screen corner
```
## 🎯 Pro Tips
1. **Instant typing**: `interval=0` or `wpm=None`
2. **Human typing**: `wpm=60` (60 words/min)
3. **Smooth mouse**: `duration=0.5, smooth=True`
4. **Instant mouse**: `duration=0`
5. **Wait for UI**: `dc.pause(0.5)` between actions
6. **Failsafe**: Always enable for safety
7. **Test first**: Use `demo.py` to test features
8. **Coordinates**: Use `get_mouse_position()` to find them
9. **Screenshots**: Capture before/after for verification
10. **Hotkeys > Menus**: Faster and more reliable
## 📦 Dependencies
```bash
pip install pyautogui pillow opencv-python pygetwindow pyperclip
```
## 🚨 Common Issues
**Mouse not moving correctly?**
- Check DPI scaling in Windows settings
- Verify coordinates with `get_mouse_position()`
**Keyboard not working?**
- Ensure target app has focus
- Some apps block automation (games, secure apps)
**Failsafe triggering?**
- Keep mouse away from screen corners
- Disable if needed: `failsafe=False`
---
**Built for OpenClaw** 🦞 - Desktop automation made easy!

SKILL.md Normal file

@@ -0,0 +1,623 @@
---
description: "Advanced desktop automation with mouse, keyboard, and screen control."
---
# Desktop Control Skill
**The most advanced desktop automation skill for OpenClaw.** Provides pixel-perfect mouse control, lightning-fast keyboard input, screen capture, window management, and clipboard operations.
## 🎯 Features
### Mouse Control
- **Absolute positioning** - Move to exact coordinates
- **Relative movement** - Move from current position
- **Smooth movement** - Natural, human-like mouse paths
- **Click types** - Left, right, middle, double, triple clicks
- **Drag & drop** - Drag from point A to point B
- **Scroll** - Vertical and horizontal scrolling
- **Position tracking** - Get current mouse coordinates
### Keyboard Control
- **Text typing** - Fast, accurate text input
- **Hotkeys** - Execute keyboard shortcuts (Ctrl+C, Win+R, etc.)
- **Special keys** - Enter, Tab, Escape, Arrow keys, F-keys
- **Key combinations** - Multi-key press combinations
- **Hold & release** - Manual key state control
- **Typing speed** - Configurable WPM (instant to human-like)
### Screen Operations
- **Screenshot** - Capture entire screen or regions
- **Image recognition** - Find elements on screen (via OpenCV)
- **Color detection** - Get pixel colors at coordinates
- **Multi-monitor** - Support for multiple displays
### Window Management
- **Window list** - Get all open windows
- **Activate window** - Bring window to front
- **Window info** - Get position, size, title
- **Minimize/Maximize** - Control window states
### Safety Features
- **Failsafe** - Move mouse to corner to abort
- **Pause control** - Emergency stop mechanism
- **Approval mode** - Require confirmation for actions
- **Bounds checking** - Prevent out-of-screen operations
- **Logging** - Track all automation actions
---
## 🚀 Quick Start
### Installation
First, install required dependencies:
```bash
pip install pyautogui pillow opencv-python pygetwindow
```
### Basic Usage
```python
from skills.desktop_control import DesktopController
# Initialize controller
dc = DesktopController(failsafe=True)
# Mouse operations
dc.move_mouse(500, 300) # Move to coordinates
dc.click() # Left click at current position
dc.click(100, 200, button="right") # Right click at position
# Keyboard operations
dc.type_text("Hello from OpenClaw!")
dc.hotkey("ctrl", "c") # Copy
dc.press("enter")
# Screen operations
screenshot = dc.screenshot()
position = dc.get_mouse_position()
```
---
## 📋 Complete API Reference
### Mouse Functions
#### `move_mouse(x, y, duration=0, smooth=True)`
Move mouse to absolute screen coordinates.
**Parameters:**
- `x` (int): X coordinate (pixels from left)
- `y` (int): Y coordinate (pixels from top)
- `duration` (float): Movement time in seconds (0 = instant, 0.5 = smooth)
- `smooth` (bool): Use bezier curve for natural movement
**Example:**
```python
# Instant movement
dc.move_mouse(1000, 500)
# Smooth 1-second movement
dc.move_mouse(1000, 500, duration=1.0)
```
#### `move_relative(x_offset, y_offset, duration=0)`
Move mouse relative to current position.
**Parameters:**
- `x_offset` (int): Pixels to move horizontally (positive = right)
- `y_offset` (int): Pixels to move vertically (positive = down)
- `duration` (float): Movement time in seconds
**Example:**
```python
# Move 100px right, 50px down
dc.move_relative(100, 50, duration=0.3)
```
#### `click(x=None, y=None, button='left', clicks=1, interval=0.1)`
Perform mouse click.
**Parameters:**
- `x, y` (int, optional): Coordinates to click (None = current position)
- `button` (str): 'left', 'right', 'middle'
- `clicks` (int): Number of clicks (1 = single, 2 = double)
- `interval` (float): Delay between multiple clicks
**Example:**
```python
# Simple left click
dc.click()
# Double-click at specific position
dc.click(500, 300, clicks=2)
# Right-click
dc.click(button='right')
```
#### `drag(start_x, start_y, end_x, end_y, duration=0.5, button='left')`
Drag and drop operation.
**Parameters:**
- `start_x, start_y` (int): Starting coordinates
- `end_x, end_y` (int): Ending coordinates
- `duration` (float): Drag duration
- `button` (str): Mouse button to use
**Example:**
```python
# Drag file from desktop to folder
dc.drag(100, 100, 500, 500, duration=1.0)
```
#### `scroll(clicks, direction='vertical', x=None, y=None)`
Scroll mouse wheel.
**Parameters:**
- `clicks` (int): Scroll amount (positive = up/left, negative = down/right)
- `direction` (str): 'vertical' or 'horizontal'
- `x, y` (int, optional): Position to scroll at
**Example:**
```python
# Scroll down 5 clicks
dc.scroll(-5)
# Scroll up 10 clicks
dc.scroll(10)
# Horizontal scroll
dc.scroll(5, direction='horizontal')
```
#### `get_mouse_position()`
Get current mouse coordinates.
**Returns:** `(x, y)` tuple
**Example:**
```python
x, y = dc.get_mouse_position()
print(f"Mouse is at: {x}, {y}")
```
---
### Keyboard Functions
#### `type_text(text, interval=0, wpm=None)`
Type text with configurable speed.
**Parameters:**
- `text` (str): Text to type
- `interval` (float): Delay between keystrokes (0 = instant)
- `wpm` (int, optional): Words per minute (overrides interval)
**Example:**
```python
# Instant typing
dc.type_text("Hello World")
# Human-like typing at 60 WPM
dc.type_text("Hello World", wpm=60)
# Slow typing with 0.1s between keys
dc.type_text("Hello World", interval=0.1)
```
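The `wpm` value is converted to a per-keystroke interval assuming an average of 5 characters per word, so 60 WPM works out to 5 characters per second, i.e. a 0.2 s interval. The arithmetic, as a standalone helper:

```python
def wpm_to_interval(wpm):
    """Convert words per minute to seconds per keystroke (5 chars/word)."""
    chars_per_second = (wpm * 5) / 60
    return 1.0 / chars_per_second

wpm_to_interval(60)   # 0.2 s between keystrokes
wpm_to_interval(120)  # 0.1 s between keystrokes
```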
#### `press(key, presses=1, interval=0.1)`
Press and release a key.
**Parameters:**
- `key` (str): Key name (see Key Names section)
- `presses` (int): Number of times to press
- `interval` (float): Delay between presses
**Example:**
```python
# Press Enter
dc.press('enter')
# Press Space 3 times
dc.press('space', presses=3)
# Press Down arrow
dc.press('down')
```
#### `hotkey(*keys, interval=0.05)`
Execute keyboard shortcut.
**Parameters:**
- `*keys` (str): Keys to press together
- `interval` (float): Delay between key presses
**Example:**
```python
# Copy (Ctrl+C)
dc.hotkey('ctrl', 'c')
# Paste (Ctrl+V)
dc.hotkey('ctrl', 'v')
# Open Run dialog (Win+R)
dc.hotkey('win', 'r')
# Save (Ctrl+S)
dc.hotkey('ctrl', 's')
# Select All (Ctrl+A)
dc.hotkey('ctrl', 'a')
```
#### `key_down(key)` / `key_up(key)`
Manually control key state.
**Example:**
```python
# Hold Shift
dc.key_down('shift')
dc.type_text("hello") # Types "HELLO"
dc.key_up('shift')
# Hold Ctrl and click (for multi-select)
dc.key_down('ctrl')
dc.click(100, 100)
dc.click(200, 100)
dc.key_up('ctrl')
```
---
### Screen Functions
#### `screenshot(region=None, filename=None)`
Capture screen or region.
**Parameters:**
- `region` (tuple, optional): (left, top, width, height) for partial capture
- `filename` (str, optional): Path to save image
**Returns:** PIL Image object
**Example:**
```python
# Full screen
img = dc.screenshot()
# Save to file
dc.screenshot(filename="screenshot.png")
# Capture specific region
img = dc.screenshot(region=(100, 100, 500, 300))
```
#### `get_pixel_color(x, y)`
Get color of pixel at coordinates.
**Returns:** RGB tuple `(r, g, b)`
**Example:**
```python
r, g, b = dc.get_pixel_color(500, 300)
print(f"Color at (500, 300): RGB({r}, {g}, {b})")
```
#### `find_on_screen(image_path, confidence=0.8)`
Find image on screen (requires OpenCV).
**Parameters:**
- `image_path` (str): Path to template image
- `confidence` (float): Match threshold (0-1)
**Returns:** `(x, y, width, height)` or None
**Example:**
```python
# Find button on screen
location = dc.find_on_screen("button.png")
if location:
x, y, w, h = location
# Click center of found image
dc.click(x + w//2, y + h//2)
```
#### `get_screen_size()`
Get screen resolution.
**Returns:** `(width, height)` tuple
**Example:**
```python
width, height = dc.get_screen_size()
print(f"Screen: {width}x{height}")
```
---
### Window Functions
#### `get_all_windows()`
List all open windows.
**Returns:** List of window titles
**Example:**
```python
windows = dc.get_all_windows()
for title in windows:
print(f"Window: {title}")
```
#### `activate_window(title_substring)`
Bring window to front by title.
**Parameters:**
- `title_substring` (str): Part of window title to match
**Example:**
```python
# Activate Chrome
dc.activate_window("Chrome")
# Activate VS Code
dc.activate_window("Visual Studio Code")
```
#### `get_active_window()`
Get currently focused window.
**Returns:** Window title (str)
**Example:**
```python
active = dc.get_active_window()
print(f"Active window: {active}")
```
---
### Clipboard Functions
#### `copy_to_clipboard(text)`
Copy text to clipboard.
**Example:**
```python
dc.copy_to_clipboard("Hello from OpenClaw!")
```
#### `get_from_clipboard()`
Get text from clipboard.
**Returns:** str
**Example:**
```python
text = dc.get_from_clipboard()
print(f"Clipboard: {text}")
```
---
## ⌨️ Key Names Reference
### Alphabet Keys
`'a'` through `'z'`
### Number Keys
`'0'` through `'9'`
### Function Keys
`'f1'` through `'f24'`
### Special Keys
- `'enter'` / `'return'`
- `'esc'` / `'escape'`
- `'space'` / `'spacebar'`
- `'tab'`
- `'backspace'`
- `'delete'` / `'del'`
- `'insert'`
- `'home'`
- `'end'`
- `'pageup'` / `'pgup'`
- `'pagedown'` / `'pgdn'`
### Arrow Keys
- `'up'` / `'down'` / `'left'` / `'right'`
### Modifier Keys
- `'ctrl'` / `'control'`
- `'shift'`
- `'alt'`
- `'win'` / `'winleft'` / `'winright'`
- `'cmd'` / `'command'` (Mac)
### Lock Keys
- `'capslock'`
- `'numlock'`
- `'scrolllock'`
### Punctuation
- `'.'` / `','` / `'?'` / `'!'` / `';'` / `':'`
- `'['` / `']'` / `'{'` / `'}'`
- `'('` / `')'`
- `'+'` / `'-'` / `'*'` / `'/'` / `'='`
---
## 🛡️ Safety Features
### Failsafe Mode
Move mouse to **any corner** of the screen to abort all automation.
```python
# Enable failsafe (enabled by default)
dc = DesktopController(failsafe=True)
```
### Pause Control
```python
# Pause all automation for 2 seconds
dc.pause(2.0)
# Check if automation is safe to proceed
if dc.is_safe():
dc.click(500, 500)
```
### Approval Mode
Require user confirmation before actions:
```python
dc = DesktopController(require_approval=True)
# This will ask for confirmation
dc.click(500, 500) # Prompt: "Allow click at (500, 500)? [y/n]"
```
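A minimal sketch of what such an approval gate might look like internally (a hypothetical standalone helper, not the controller's actual implementation; `prompt` is injected here so the logic is testable):

```python
def check_approval(description, require_approval, prompt=input):
    """Return True if the action may proceed, asking the user if required."""
    if not require_approval:
        return True
    answer = prompt(f"Allow {description}? [y/n] ")
    return answer.strip().lower() in ("y", "yes")

# With approval disabled, everything passes through.
check_approval("click at (500, 500)", require_approval=False)
# With approval enabled, a "y" answer is required.
check_approval("click at (500, 500)", True, prompt=lambda _: "y")
```

Every mouse and keyboard method would call a gate like this before acting, which is why approval mode slows automation to interactive speed.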
---
## 🎨 Advanced Examples
### Example 1: Automated Form Filling
```python
dc = DesktopController()
# Click name field
dc.click(300, 200)
dc.type_text("John Doe", wpm=80)
# Tab to next field
dc.press('tab')
dc.type_text("john@example.com", wpm=80)
# Tab to password
dc.press('tab')
dc.type_text("SecurePassword123", wpm=60)
# Submit form
dc.press('enter')
```
### Example 2: Screenshot Region and Save
```python
# Capture specific area
region = (100, 100, 800, 600) # left, top, width, height
img = dc.screenshot(region=region)
# Save with timestamp
import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
img.save(f"capture_{timestamp}.png")
```
### Example 3: Multi-File Selection
```python
# Hold Ctrl and click multiple files
dc.key_down('ctrl')
dc.click(100, 200) # First file
dc.click(100, 250) # Second file
dc.click(100, 300) # Third file
dc.key_up('ctrl')
# Copy selected files
dc.hotkey('ctrl', 'c')
```
### Example 4: Window Automation
```python
import time
# Activate Calculator
dc.activate_window("Calculator")
time.sleep(0.5)
# Type calculation
dc.type_text("5+3=", interval=0.2)
time.sleep(0.5)
# Take screenshot of result
dc.screenshot(filename="calculation_result.png")
```
### Example 5: Drag & Drop File
```python
# Drag file from source to destination
dc.drag(
start_x=200, start_y=300, # File location
end_x=800, end_y=500, # Folder location
duration=1.0 # Smooth 1-second drag
)
```
---
## ⚡ Performance Tips
1. **Use instant movements** for speed: `duration=0`
2. **Batch operations** instead of individual calls
3. **Cache screen positions** instead of recalculating
4. **Disable failsafe** for maximum performance (use with caution)
5. **Use hotkeys** instead of menu navigation
---
## ⚠️ Important Notes
- **Screen coordinates** start at (0, 0) in top-left corner
- **Multi-monitor setups** may have negative coordinates for secondary displays
- **Windows DPI scaling** may affect coordinate accuracy
- **Failsafe corners** are: (0,0), (width-1, 0), (0, height-1), (width-1, height-1)
- **Some applications** may block simulated input (games, secure apps)
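The four failsafe corners listed above can be derived from the screen size. A tiny helper for reference (`failsafe_corners` is not part of the controller's API, just an illustration of the coordinates):

```python
def failsafe_corners(width, height):
    """Return the four corner pixels this document lists as abort zones."""
    return [(0, 0), (width - 1, 0), (0, height - 1), (width - 1, height - 1)]

failsafe_corners(1920, 1080)
# [(0, 0), (1919, 0), (0, 1079), (1919, 1079)]
```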
---
## 🔧 Troubleshooting
### Mouse not moving to correct position
- Check DPI scaling settings
- Verify screen resolution matches expectations
- Use `get_screen_size()` to confirm dimensions
### Keyboard input not working
- Ensure target application has focus
- Some apps require admin privileges
- Try increasing `interval` for reliability
### Failsafe triggering accidentally
- Increase screen border tolerance
- Move mouse away from corners during normal use
- Disable if needed: `DesktopController(failsafe=False)`
### Permission errors
- Run Python with administrator privileges for some operations
- Some secure applications block automation
---
## 📦 Dependencies
- **PyAutoGUI** - Core automation engine
- **Pillow** - Image processing
- **OpenCV** (optional) - Image recognition
- **PyGetWindow** - Window management
Install all:
```bash
pip install pyautogui pillow opencv-python pygetwindow
```
---
**Built for OpenClaw** - The ultimate desktop automation companion 🦞

__init__.py Normal file

@@ -0,0 +1,522 @@
"""
Desktop Control - Advanced Mouse, Keyboard, and Screen Automation
Fast, responsive desktop control for OpenClaw
"""
import pyautogui
import time
import sys
from typing import Tuple, Optional, List, Union
from pathlib import Path
import logging
# Configure PyAutoGUI
pyautogui.MINIMUM_DURATION = 0 # Allow instant movements
pyautogui.MINIMUM_SLEEP = 0 # No forced delays
pyautogui.PAUSE = 0 # No pause between function calls
# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class DesktopController:
"""
Advanced desktop automation controller with mouse, keyboard, and screen operations.
Designed for maximum responsiveness and reliability.
"""
def __init__(self, failsafe: bool = True, require_approval: bool = False):
"""
Initialize desktop controller.
Args:
failsafe: Enable failsafe (move mouse to corner to abort)
require_approval: Require user confirmation for actions
"""
self.failsafe = failsafe
self.require_approval = require_approval
pyautogui.FAILSAFE = failsafe
# Get screen info
self.screen_width, self.screen_height = pyautogui.size()
logger.info(f"Desktop Controller initialized. Screen: {self.screen_width}x{self.screen_height}")
logger.info(f"Failsafe: {failsafe}, Require Approval: {require_approval}")
# ========== MOUSE OPERATIONS ==========
def move_mouse(self, x: int, y: int, duration: float = 0, smooth: bool = True) -> None:
"""
Move mouse to absolute screen coordinates.
Args:
x: X coordinate (pixels from left)
y: Y coordinate (pixels from top)
duration: Movement time in seconds (0 = instant)
smooth: Use smooth movement (cubic bezier)
"""
if self._check_approval(f"move mouse to ({x}, {y})"):
if smooth and duration > 0:
pyautogui.moveTo(x, y, duration=duration, tween=pyautogui.easeInOutQuad)
else:
pyautogui.moveTo(x, y, duration=duration)
logger.debug(f"Moved mouse to ({x}, {y}) in {duration}s")
def move_relative(self, x_offset: int, y_offset: int, duration: float = 0) -> None:
"""
Move mouse relative to current position.
Args:
x_offset: Pixels to move horizontally (+ = right, - = left)
y_offset: Pixels to move vertically (+ = down, - = up)
duration: Movement time in seconds
"""
if self._check_approval(f"move mouse relative ({x_offset}, {y_offset})"):
pyautogui.move(x_offset, y_offset, duration=duration)
logger.debug(f"Moved mouse relative ({x_offset}, {y_offset})")
def click(self, x: Optional[int] = None, y: Optional[int] = None,
button: str = 'left', clicks: int = 1, interval: float = 0.1) -> None:
"""
Perform mouse click.
Args:
x, y: Coordinates to click (None = current position)
button: 'left', 'right', 'middle'
clicks: Number of clicks (1 = single, 2 = double, etc.)
interval: Delay between multiple clicks
"""
position_str = f"at ({x}, {y})" if x is not None else "at current position"
if self._check_approval(f"{button} click {position_str}"):
pyautogui.click(x=x, y=y, clicks=clicks, interval=interval, button=button)
logger.info(f"{button.capitalize()} click {position_str} (x{clicks})")
def double_click(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Convenience method for double-click."""
self.click(x, y, clicks=2)
def right_click(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Convenience method for right-click."""
self.click(x, y, button='right')
def middle_click(self, x: Optional[int] = None, y: Optional[int] = None) -> None:
"""Convenience method for middle-click."""
self.click(x, y, button='middle')
def drag(self, start_x: int, start_y: int, end_x: int, end_y: int,
duration: float = 0.5, button: str = 'left') -> None:
"""
Drag and drop operation.
Args:
start_x, start_y: Starting coordinates
end_x, end_y: Ending coordinates
duration: Drag duration in seconds
button: Mouse button to use ('left', 'right', 'middle')
"""
if self._check_approval(f"drag from ({start_x}, {start_y}) to ({end_x}, {end_y})"):
pyautogui.moveTo(start_x, start_y)
time.sleep(0.05) # Small delay to ensure position
pyautogui.drag(end_x - start_x, end_y - start_y, duration=duration, button=button)
logger.info(f"Dragged from ({start_x}, {start_y}) to ({end_x}, {end_y})")
def scroll(self, clicks: int, direction: str = 'vertical',
x: Optional[int] = None, y: Optional[int] = None) -> None:
"""
Scroll mouse wheel.
Args:
clicks: Scroll amount (+ = up/left, - = down/right)
direction: 'vertical' or 'horizontal'
x, y: Position to scroll at (None = current position)
"""
if x is not None and y is not None:
pyautogui.moveTo(x, y)
if direction == 'vertical':
pyautogui.scroll(clicks)
else:
pyautogui.hscroll(clicks)
logger.debug(f"Scrolled {direction} {clicks} clicks")
def get_mouse_position(self) -> Tuple[int, int]:
"""
Get current mouse coordinates.
Returns:
(x, y) tuple
"""
pos = pyautogui.position()
return (pos.x, pos.y)
# ========== KEYBOARD OPERATIONS ==========
def type_text(self, text: str, interval: float = 0, wpm: Optional[int] = None) -> None:
"""
Type text with configurable speed.
Args:
text: Text to type
interval: Delay between keystrokes (0 = instant)
wpm: Words per minute (overrides interval, typical human: 40-80 WPM)
"""
if wpm is not None:
# Convert WPM to interval (assuming avg 5 chars per word)
chars_per_second = (wpm * 5) / 60
interval = 1.0 / chars_per_second
        if self._check_approval(f"type text: '{text[:50]}{'...' if len(text) > 50 else ''}'"):
pyautogui.write(text, interval=interval)
logger.info(f"Typed text: '{text[:50]}{'...' if len(text) > 50 else ''}' (interval={interval:.3f}s)")
def press(self, key: str, presses: int = 1, interval: float = 0.1) -> None:
"""
Press and release a key.
Args:
key: Key name (e.g., 'enter', 'space', 'a', 'f1')
presses: Number of times to press
interval: Delay between presses
"""
if self._check_approval(f"press '{key}' {presses}x"):
pyautogui.press(key, presses=presses, interval=interval)
logger.info(f"Pressed '{key}' {presses}x")
def hotkey(self, *keys, interval: float = 0.05) -> None:
"""
Execute keyboard shortcut (e.g., Ctrl+C, Alt+Tab).
Args:
*keys: Keys to press together (e.g., 'ctrl', 'c')
interval: Delay between key presses
"""
keys_str = '+'.join(keys)
if self._check_approval(f"hotkey: {keys_str}"):
pyautogui.hotkey(*keys, interval=interval)
logger.info(f"Executed hotkey: {keys_str}")
def key_down(self, key: str) -> None:
"""Press and hold a key without releasing."""
pyautogui.keyDown(key)
logger.debug(f"Key down: '{key}'")
def key_up(self, key: str) -> None:
"""Release a held key."""
pyautogui.keyUp(key)
logger.debug(f"Key up: '{key}'")
# ========== SCREEN OPERATIONS ==========
def screenshot(self, region: Optional[Tuple[int, int, int, int]] = None,
filename: Optional[str] = None):
"""
Capture screen or region.
Args:
region: (left, top, width, height) for partial capture
filename: Path to save image (None = return PIL Image)
Returns:
PIL Image object (if filename is None)
"""
img = pyautogui.screenshot(region=region)
if filename:
img.save(filename)
            logger.info(f"Screenshot saved to: {filename}")
else:
logger.debug(f"Screenshot captured (region={region})")
return img
def get_pixel_color(self, x: int, y: int) -> Tuple[int, int, int]:
"""
Get RGB color of pixel at coordinates.
Args:
x, y: Screen coordinates
Returns:
(r, g, b) tuple
"""
color = pyautogui.pixel(x, y)
return color
def find_on_screen(self, image_path: str, confidence: float = 0.8,
region: Optional[Tuple[int, int, int, int]] = None):
"""
Find image on screen using template matching.
Requires OpenCV (opencv-python).
Args:
image_path: Path to template image
confidence: Match threshold 0-1 (0.8 = 80% match)
region: Search region (left, top, width, height)
Returns:
(x, y, width, height) of match, or None if not found
"""
try:
location = pyautogui.locateOnScreen(image_path, confidence=confidence, region=region)
if location:
logger.info(f"Found '{image_path}' at {location}")
return location
else:
logger.debug(f"'{image_path}' not found on screen")
return None
except Exception as e:
logger.error(f"Error finding image: {e}")
return None
def get_screen_size(self) -> Tuple[int, int]:
"""
Get screen resolution.
Returns:
(width, height) tuple
"""
return (self.screen_width, self.screen_height)
# ========== WINDOW OPERATIONS ==========
def get_all_windows(self) -> List[str]:
"""
Get list of all open window titles.
Returns:
List of window title strings
"""
try:
import pygetwindow as gw
windows = gw.getAllTitles()
# Filter out empty titles
windows = [w for w in windows if w.strip()]
return windows
except ImportError:
logger.error("pygetwindow not installed. Run: pip install pygetwindow")
return []
except Exception as e:
logger.error(f"Error getting windows: {e}")
return []
def activate_window(self, title_substring: str) -> bool:
"""
Bring window to front by title (partial match).
Args:
title_substring: Part of window title to match
Returns:
True if window was activated, False otherwise
"""
try:
import pygetwindow as gw
windows = gw.getWindowsWithTitle(title_substring)
if windows:
windows[0].activate()
logger.info(f"Activated window: '{windows[0].title}'")
return True
else:
logger.warning(f"No window found with title containing: '{title_substring}'")
return False
except ImportError:
logger.error("pygetwindow not installed")
return False
except Exception as e:
logger.error(f"Error activating window: {e}")
return False
def get_active_window(self) -> Optional[str]:
"""
Get title of currently focused window.
Returns:
Window title string, or None if error
"""
try:
import pygetwindow as gw
active = gw.getActiveWindow()
return active.title if active else None
except ImportError:
logger.error("pygetwindow not installed")
return None
except Exception as e:
logger.error(f"Error getting active window: {e}")
return None
# ========== CLIPBOARD OPERATIONS ==========
def copy_to_clipboard(self, text: str) -> None:
"""
Copy text to clipboard.
Args:
text: Text to copy
"""
try:
import pyperclip
pyperclip.copy(text)
logger.info(f"Copied to clipboard: '{text[:50]}...'")
except ImportError:
logger.error("pyperclip not installed. Run: pip install pyperclip")
except Exception as e:
logger.error(f"Error copying to clipboard: {e}")
def get_from_clipboard(self) -> Optional[str]:
"""
Get text from clipboard.
Returns:
Clipboard text, or None if error
"""
try:
import pyperclip
text = pyperclip.paste()
logger.debug(f"Got from clipboard: '{text[:50]}...'")
return text
except ImportError:
logger.error("pyperclip not installed. Run: pip install pyperclip")
return None
except Exception as e:
logger.error(f"Error getting clipboard: {e}")
return None
# ========== UTILITY METHODS ==========
def pause(self, seconds: float) -> None:
"""
Pause automation for specified duration.
Args:
seconds: Time to pause
"""
logger.info(f"Pausing for {seconds}s...")
time.sleep(seconds)
def is_safe(self) -> bool:
"""
Check if it's safe to continue automation.
Returns False if mouse is in a corner (failsafe position).
Returns:
True if safe to continue
"""
if not self.failsafe:
return True
x, y = self.get_mouse_position()
corner_tolerance = 5
# Check corners
corners = [
(0, 0), # Top-left
(self.screen_width - 1, 0), # Top-right
(0, self.screen_height - 1), # Bottom-left
(self.screen_width - 1, self.screen_height - 1) # Bottom-right
]
for cx, cy in corners:
if abs(x - cx) <= corner_tolerance and abs(y - cy) <= corner_tolerance:
logger.warning(f"Mouse in corner ({x}, {y}) - FAILSAFE TRIGGERED")
return False
return True
def _check_approval(self, action: str) -> bool:
"""
Check if user approves action (if approval mode is enabled).
Args:
action: Description of action
Returns:
True if approved (or approval not required)
"""
if not self.require_approval:
return True
response = input(f"Allow: {action}? [y/n]: ").strip().lower()
approved = response in ['y', 'yes']
if not approved:
logger.warning(f"Action declined: {action}")
return approved
# ========== CONVENIENCE METHODS ==========
def alert(self, text: str = '', title: str = 'Alert', button: str = 'OK') -> None:
"""Show alert dialog box."""
pyautogui.alert(text=text, title=title, button=button)
    def confirm(self, text: str = '', title: str = 'Confirm', buttons: Optional[List[str]] = None) -> str:
"""Show confirmation dialog with buttons."""
if buttons is None:
buttons = ['OK', 'Cancel']
return pyautogui.confirm(text=text, title=title, buttons=buttons)
def prompt(self, text: str = '', title: str = 'Input', default: str = '') -> Optional[str]:
"""Show input prompt dialog."""
return pyautogui.prompt(text=text, title=title, default=default)
# ========== QUICK ACCESS FUNCTIONS ==========
# Global controller instance for quick access
_controller = None
def get_controller(**kwargs) -> DesktopController:
"""Get or create global controller instance."""
global _controller
if _controller is None:
_controller = DesktopController(**kwargs)
return _controller
# Convenience function exports
def move_mouse(x: int, y: int, duration: float = 0) -> None:
"""Quick mouse move."""
get_controller().move_mouse(x, y, duration)
def click(x: Optional[int] = None, y: Optional[int] = None, button: str = 'left') -> None:
"""Quick click."""
get_controller().click(x, y, button=button)
def type_text(text: str, wpm: Optional[int] = None) -> None:
"""Quick text typing."""
get_controller().type_text(text, wpm=wpm)
def hotkey(*keys) -> None:
"""Quick hotkey."""
get_controller().hotkey(*keys)
def screenshot(filename: Optional[str] = None):
"""Quick screenshot."""
return get_controller().screenshot(filename=filename)
# ========== DEMONSTRATION ==========
if __name__ == "__main__":
print("🖱️ Desktop Control Skill - Test Mode")
print("=" * 50)
# Initialize controller
dc = DesktopController(failsafe=True)
# Display info
print(f"\n📺 Screen Size: {dc.get_screen_size()}")
print(f"🖱️ Current Mouse Position: {dc.get_mouse_position()}")
# Test window operations
print(f"\n🪟 Active Window: {dc.get_active_window()}")
windows = dc.get_all_windows()
print(f"\n📋 Open Windows ({len(windows)}):")
for i, title in enumerate(windows[:10], 1): # Show first 10
print(f" {i}. {title}")
print("\n✅ Desktop Control ready!")
print("⚠️ Move mouse to any corner to trigger failsafe")
# Keep running to allow testing
print("\nController is ready. Import this module to use it in your OpenClaw skills!")
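For reference, the WPM handling in `type_text` reduces to the arithmetic below (a minimal standalone sketch; `wpm_to_interval` is an illustrative helper, not part of the module):

```python
def wpm_to_interval(wpm: int) -> float:
    # Assume ~5 characters per average word: chars/sec = wpm * 5 / 60
    chars_per_second = (wpm * 5) / 60
    return 1.0 / chars_per_second

# 60 WPM -> 5 chars/sec -> 0.2 s between keystrokes
print(round(wpm_to_interval(60), 3))
```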

_meta.json
{
"ownerId": "kn7ag28ra4hhta8bx2k2j1kpv180kqbk",
"slug": "desktop-control",
"version": "1.0.0",
"publishedAt": 1770255200863
}

ai_agent.py
"""
AI Desktop Agent - Cognitive Desktop Automation
Combines vision, reasoning, and control for autonomous task execution
"""
import time
from typing import Dict, List, Optional, Any
import logging
from desktop_control import DesktopController
logger = logging.getLogger(__name__)
class AIDesktopAgent:
"""
Intelligent desktop agent that combines computer vision, LLM reasoning,
and desktop control for autonomous task execution.
Can understand screen content, plan actions, and execute complex workflows.
"""
def __init__(self, llm_client=None, failsafe: bool = True):
"""
Initialize AI Desktop Agent.
Args:
            llm_client: OpenClaw LLM client for reasoning (optional; falls back to rule-based planning when absent)
failsafe: Enable failsafe mode
"""
self.dc = DesktopController(failsafe=failsafe)
self.llm_client = llm_client
self.screen_width, self.screen_height = self.dc.get_screen_size()
# Action history for learning
self.action_history = []
# Application knowledge base
self.app_knowledge = self._load_app_knowledge()
logger.info("AI Desktop Agent initialized")
def _load_app_knowledge(self) -> Dict[str, Dict]:
"""
Load application-specific knowledge.
This can be extended with learned patterns.
"""
return {
"mspaint": {
"name": "Microsoft Paint",
"launch_command": "mspaint",
"common_actions": {
"select_pencil": {"menu": "Tools", "position": "toolbar_left"},
"select_brush": {"menu": "Tools", "position": "toolbar"},
"select_color": {"menu": "Colors", "action": "click_palette"},
"draw_line": {"action": "drag", "tool_required": "line"},
}
},
"notepad": {
"name": "Notepad",
"launch_command": "notepad",
"common_actions": {
"type_text": {"action": "type"},
"save": {"hotkey": ["ctrl", "s"]},
"new_file": {"hotkey": ["ctrl", "n"]},
}
},
"calculator": {
"name": "Calculator",
"launch_command": "calc",
"common_actions": {
"calculate": {"action": "type_numbers"},
}
}
}
def execute_task(self, task: str, max_steps: int = 50) -> Dict[str, Any]:
"""
Execute a high-level task autonomously.
Args:
task: Natural language task description
max_steps: Maximum number of steps to attempt
Returns:
Execution result with status and details
"""
logger.info(f"Executing task: {task}")
# Initialize result
result = {
"task": task,
"status": "in_progress",
"steps": [],
"screenshots": [],
"success": False
}
try:
# Step 1: Analyze task and plan
plan = self._plan_task(task)
logger.info(f"Generated plan with {len(plan)} steps")
# Step 2: Execute plan step by step
for step_num, step in enumerate(plan, 1):
if step_num > max_steps:
logger.warning(f"Reached max steps ({max_steps})")
break
logger.info(f"Step {step_num}/{len(plan)}: {step['description']}")
# Capture screen before action
screenshot_before = self.dc.screenshot()
# Execute step
step_result = self._execute_step(step)
result["steps"].append(step_result)
# Capture screen after action
screenshot_after = self.dc.screenshot()
result["screenshots"].append({
"step": step_num,
"before": screenshot_before,
"after": screenshot_after
})
# Verify step success
if not step_result.get("success", False):
logger.error(f"Step {step_num} failed: {step_result.get('error')}")
result["status"] = "failed"
result["failed_at_step"] = step_num
return result
# Small delay between steps
time.sleep(0.5)
result["status"] = "completed"
result["success"] = True
logger.info(f"Task completed successfully in {len(result['steps'])} steps")
except Exception as e:
logger.error(f"Task execution error: {e}")
result["status"] = "error"
result["error"] = str(e)
return result
def _plan_task(self, task: str) -> List[Dict[str, Any]]:
"""
Plan task execution using LLM reasoning.
Args:
task: Task description
Returns:
List of execution steps
"""
# For now, use rule-based planning
# TODO: Integrate with OpenClaw LLM for intelligent planning
# Parse task intent
task_lower = task.lower()
# Pattern matching for common tasks
if "draw" in task_lower and "paint" in task_lower:
return self._plan_paint_drawing(task)
elif "type" in task_lower or "write" in task_lower:
return self._plan_text_entry(task)
elif "play" in task_lower and "game" in task_lower:
return self._plan_game_play(task)
elif "open" in task_lower or "launch" in task_lower:
return self._plan_app_launch(task)
else:
# Generic plan - analyze and improvise
return self._plan_generic(task)
def _plan_paint_drawing(self, task: str) -> List[Dict]:
"""Plan for drawing in MS Paint."""
# Extract what to draw
drawing_subject = self._extract_subject(task)
return [
{
"type": "launch_app",
"app": "mspaint",
"description": "Launch Microsoft Paint"
},
{
"type": "wait",
"duration": 2.0,
"description": "Wait for Paint to load"
},
{
"type": "activate_window",
"title": "Paint",
"description": "Ensure Paint window is active"
},
{
"type": "select_tool",
"tool": "pencil",
"description": "Select pencil tool"
},
{
"type": "draw",
"subject": drawing_subject,
"description": f"Draw {drawing_subject}"
},
{
"type": "screenshot",
"save_as": "drawing_result.png",
"description": "Capture the drawing"
}
]
def _plan_text_entry(self, task: str) -> List[Dict]:
"""Plan for text entry task."""
# Extract text to type
text_content = self._extract_text_content(task)
return [
{
"type": "launch_app",
"app": "notepad",
"description": "Launch Notepad"
},
{
"type": "wait",
"duration": 1.0,
"description": "Wait for Notepad to load"
},
{
"type": "type_text",
"text": text_content,
"wpm": 80,
"description": f"Type: {text_content[:50]}..."
}
]
def _plan_game_play(self, task: str) -> List[Dict]:
"""Plan for playing a game."""
game_name = self._extract_game_name(task)
return [
{
"type": "analyze_screen",
"description": "Analyze game screen"
},
{
"type": "detect_game_state",
"game": game_name,
"description": f"Detect {game_name} state"
},
{
"type": "execute_game_loop",
"game": game_name,
"max_iterations": 100,
"description": f"Play {game_name}"
}
]
def _plan_app_launch(self, task: str) -> List[Dict]:
"""Plan for launching an application."""
app_name = self._extract_app_name(task)
return [
{
"type": "launch_app",
"app": app_name,
"description": f"Launch {app_name}"
},
{
"type": "wait",
"duration": 2.0,
"description": f"Wait for {app_name} to load"
}
]
def _plan_generic(self, task: str) -> List[Dict]:
"""Generic planning fallback."""
return [
{
"type": "analyze_screen",
"description": "Analyze current screen state"
},
{
"type": "infer_action",
"task": task,
"description": f"Infer action for: {task}"
}
]
def _execute_step(self, step: Dict[str, Any]) -> Dict[str, Any]:
"""
Execute a single step.
Args:
step: Step definition
Returns:
Execution result
"""
step_type = step.get("type")
result = {"step": step, "success": False}
try:
if step_type == "launch_app":
self._do_launch_app(step["app"])
result["success"] = True
elif step_type == "wait":
time.sleep(step["duration"])
result["success"] = True
elif step_type == "activate_window":
success = self.dc.activate_window(step["title"])
result["success"] = success
elif step_type == "select_tool":
self._do_select_tool(step["tool"])
result["success"] = True
elif step_type == "draw":
self._do_draw(step["subject"])
result["success"] = True
elif step_type == "type_text":
self.dc.type_text(step["text"], wpm=step.get("wpm", 80))
result["success"] = True
elif step_type == "screenshot":
filename = step.get("save_as", "screenshot.png")
self.dc.screenshot(filename=filename)
result["success"] = True
result["saved_to"] = filename
elif step_type == "analyze_screen":
analysis = self._analyze_screen()
result["analysis"] = analysis
result["success"] = True
elif step_type == "execute_game_loop":
game_result = self._execute_game_loop(step)
result["game_result"] = game_result
result["success"] = True
else:
result["error"] = f"Unknown step type: {step_type}"
except Exception as e:
logger.error(f"Step execution error: {e}")
result["error"] = str(e)
return result
def _do_launch_app(self, app: str) -> None:
"""Launch an application."""
# Get launch command from knowledge base
app_info = self.app_knowledge.get(app, {})
launch_cmd = app_info.get("launch_command", app)
# Open Run dialog
self.dc.hotkey('win', 'r')
time.sleep(0.5)
# Type and execute command
self.dc.type_text(launch_cmd, wpm=100)
self.dc.press('enter')
logger.info(f"Launched: {app}")
def _do_select_tool(self, tool: str) -> None:
"""Select a tool (e.g., in Paint)."""
# This is simplified - in reality would use computer vision
# to find and click the tool button
# For Paint, tools are typically in the ribbon
# We'll use hotkeys where possible
if tool == "pencil":
# In Paint, press 'P' for pencil
self.dc.press('p')
elif tool == "brush":
self.dc.press('b')
elif tool == "eraser":
self.dc.press('e')
logger.info(f"Selected tool: {tool}")
def _do_draw(self, subject: str) -> None:
"""
Draw something on screen.
This is a simplified implementation - would be enhanced with:
- Image generation (use wan2gp to generate reference)
- Trace generation (convert image to draw commands)
        - Drawing execution (execute the commands)
"""
logger.info(f"Drawing: {subject}")
# Get canvas center (simplified - would detect canvas)
canvas_x = self.screen_width // 2
canvas_y = self.screen_height // 2
# Simple drawing pattern (example: draw a simple shape)
if "circle" in subject.lower():
self._draw_circle(canvas_x, canvas_y, radius=100)
elif "square" in subject.lower():
self._draw_square(canvas_x, canvas_y, size=200)
elif "star" in subject.lower():
self._draw_star(canvas_x, canvas_y, size=100)
else:
# Generic: draw a simple pattern
self._draw_simple_pattern(canvas_x, canvas_y)
logger.info(f"Completed drawing: {subject}")
def _draw_circle(self, cx: int, cy: int, radius: int) -> None:
"""Draw a circle."""
import math
points = []
for angle in range(0, 360, 5):
rad = math.radians(angle)
x = int(cx + radius * math.cos(rad))
y = int(cy + radius * math.sin(rad))
points.append((x, y))
# Draw by connecting points
for i in range(len(points) - 1):
self.dc.drag(points[i][0], points[i][1],
points[i+1][0], points[i+1][1],
duration=0.01)
# Close the circle
self.dc.drag(points[-1][0], points[-1][1],
points[0][0], points[0][1],
duration=0.01)
def _draw_square(self, cx: int, cy: int, size: int) -> None:
"""Draw a square."""
half = size // 2
corners = [
(cx - half, cy - half), # Top-left
(cx + half, cy - half), # Top-right
(cx + half, cy + half), # Bottom-right
(cx - half, cy + half), # Bottom-left
]
# Draw sides
for i in range(4):
start = corners[i]
end = corners[(i + 1) % 4]
self.dc.drag(start[0], start[1], end[0], end[1], duration=0.2)
def _draw_star(self, cx: int, cy: int, size: int) -> None:
"""Draw a 5-pointed star."""
import math
points = []
for i in range(10):
angle = math.radians(i * 36 - 90)
radius = size if i % 2 == 0 else size // 2
x = int(cx + radius * math.cos(angle))
y = int(cy + radius * math.sin(angle))
points.append((x, y))
# Draw by connecting points
for i in range(len(points)):
start = points[i]
end = points[(i + 1) % len(points)]
self.dc.drag(start[0], start[1], end[0], end[1], duration=0.1)
def _draw_simple_pattern(self, cx: int, cy: int) -> None:
"""Draw a simple decorative pattern."""
# Draw a few curved lines
for offset in [-50, 0, 50]:
self.dc.drag(cx - 100, cy + offset,
cx + 100, cy + offset,
duration=0.3)
def _analyze_screen(self) -> Dict[str, Any]:
"""
Analyze current screen state.
Would use OCR, object detection in full implementation.
"""
screenshot = self.dc.screenshot()
active_window = self.dc.get_active_window()
mouse_pos = self.dc.get_mouse_position()
analysis = {
"active_window": active_window,
"mouse_position": mouse_pos,
"screen_size": (self.screen_width, self.screen_height),
"timestamp": time.time()
}
# TODO: Add OCR, object detection, UI element detection
return analysis
def _execute_game_loop(self, step: Dict) -> Dict:
"""
Execute game playing loop.
Would use reinforcement learning in full implementation.
"""
game = step.get("game", "unknown")
max_iter = step.get("max_iterations", 100)
logger.info(f"Starting game loop for: {game}")
result = {
"game": game,
"iterations": 0,
"actions_taken": []
}
# Simple game loop - would be much more sophisticated
for i in range(max_iter):
# Analyze game state
state = self._analyze_screen()
# Decide action (simplified - would use ML model)
action = self._decide_game_action(state, game)
# Execute action
self._execute_game_action(action)
result["iterations"] += 1
result["actions_taken"].append(action)
# Check win/lose condition
# (would detect from screen)
time.sleep(0.1)
return result
def _decide_game_action(self, state: Dict, game: str) -> str:
"""Decide next game action based on state."""
# Simplified - would use game-specific AI
return "continue"
def _execute_game_action(self, action: str) -> None:
"""Execute a game action."""
# Simplified - would translate to specific inputs
pass
# Helper methods for parsing
def _extract_subject(self, text: str) -> str:
"""Extract subject from drawing request."""
# Simple extraction - would use NLP
if "draw" in text.lower():
parts = text.lower().split("draw")
if len(parts) > 1:
return parts[1].strip()
return "unknown"
    def _extract_text_content(self, text: str) -> str:
        """Extract text content from typing request."""
        # Simple case-insensitive extraction of everything after "type"
        lowered = text.lower()
        if "type" in lowered:
            idx = lowered.index("type") + len("type")
            return text[idx:].strip().strip('"').strip("'")
        return text
def _extract_game_name(self, text: str) -> str:
"""Extract game name from request."""
# Would use NER for better extraction
return "unknown_game"
def _extract_app_name(self, text: str) -> str:
"""Extract application name from request."""
# Simple extraction - would use NER
for app in self.app_knowledge.keys():
if app in text.lower():
return app
return "notepad" # Default fallback
# Quick access function
def create_agent(**kwargs) -> AIDesktopAgent:
"""Create an AI Desktop Agent instance."""
return AIDesktopAgent(**kwargs)
if __name__ == "__main__":
print("🤖 AI Desktop Agent - Cognitive Automation")
print("=" * 60)
# Create agent
agent = AIDesktopAgent(failsafe=True)
print("\n✨ Examples of what you can ask:")
print(" - 'Draw a circle in Paint'")
print(" - 'Type Hello World in Notepad'")
print(" - 'Open Calculator'")
print(" - 'Play Solitaire for me'")
print("\n🎯 Try it:")
task = input("\nWhat would you like me to do? ")
if task.strip():
result = agent.execute_task(task)
print(f"\n{'='* 60}")
print(f"Task Status: {result['status']}")
print(f"Steps Executed: {len(result['steps'])}")
print(f"Success: {result['success']}")
if result.get('screenshots'):
print(f"Screenshots captured: {len(result['screenshots'])}")
else:
print("\nNo task entered. Exiting.")
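The circumference sampling used by `_draw_circle` can be checked in isolation (a sketch; `circle_points` is a hypothetical standalone version of that loop, without the drag calls):

```python
import math

def circle_points(cx: int, cy: int, radius: int, step_deg: int = 5):
    # Sample the circumference every step_deg degrees, as _draw_circle does
    points = []
    for angle in range(0, 360, step_deg):
        rad = math.radians(angle)
        points.append((int(cx + radius * math.cos(rad)),
                       int(cy + radius * math.sin(rad))))
    return points

pts = circle_points(500, 400, 100)
print(len(pts), pts[0])  # 72 samples at 5-degree spacing; first point (600, 400)
```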

demo.py
"""
Desktop Control Demo - Quick examples and tests
"""
import sys
import time
from pathlib import Path
# Add skills to path
sys.path.insert(0, str(Path(__file__).parent))
from desktop_control import DesktopController
def demo_mouse_control():
"""Demo: Mouse movement and clicking"""
print("\n🖱️ === MOUSE CONTROL DEMO ===")
dc = DesktopController(failsafe=True)
print(f"Current position: {dc.get_mouse_position()}")
# Smooth movement
print("\n1. Moving mouse smoothly to center of screen...")
screen_w, screen_h = dc.get_screen_size()
center_x, center_y = screen_w // 2, screen_h // 2
dc.move_mouse(center_x, center_y, duration=1.0)
# Relative movement
print("2. Moving 100px right...")
dc.move_relative(100, 0, duration=0.5)
print(f"Final position: {dc.get_mouse_position()}")
print("✅ Mouse demo complete!")
def demo_keyboard_control():
"""Demo: Keyboard typing"""
print("\n⌨️ === KEYBOARD CONTROL DEMO ===")
dc = DesktopController()
print("\n⚠️ In 3 seconds, I'll type 'Hello from OpenClaw!' in the active window")
print("Switch to Notepad or any text editor NOW!")
time.sleep(3)
# Type with human-like speed
dc.type_text("Hello from OpenClaw! ", wpm=60)
dc.type_text("This is desktop automation in action. ", wpm=80)
# Press Enter
dc.press('enter')
dc.press('enter')
# Type instant
dc.type_text("This was typed instantly!", interval=0)
print("\n✅ Keyboard demo complete!")
def demo_screen_capture():
"""Demo: Screenshot functionality"""
print("\n📸 === SCREEN CAPTURE DEMO ===")
dc = DesktopController()
# Full screenshot
print("\n1. Capturing full screen...")
dc.screenshot(filename="demo_fullscreen.png")
print(" Saved: demo_fullscreen.png")
# Region screenshot (center 800x600)
print("\n2. Capturing center region (800x600)...")
screen_w, screen_h = dc.get_screen_size()
region = (
(screen_w - 800) // 2, # left
(screen_h - 600) // 2, # top
800, # width
600 # height
)
dc.screenshot(region=region, filename="demo_region.png")
print(" Saved: demo_region.png")
# Get pixel color
print("\n3. Getting pixel color at center...")
center_x, center_y = screen_w // 2, screen_h // 2
r, g, b = dc.get_pixel_color(center_x, center_y)
print(f" Color at ({center_x}, {center_y}): RGB({r}, {g}, {b})")
print("\n✅ Screen capture demo complete!")
def demo_window_management():
"""Demo: Window operations"""
print("\n🪟 === WINDOW MANAGEMENT DEMO ===")
dc = DesktopController()
# Get current window
print(f"\n1. Active window: {dc.get_active_window()}")
# List all windows
windows = dc.get_all_windows()
print(f"\n2. Found {len(windows)} open windows:")
for i, title in enumerate(windows[:15], 1): # Show first 15
print(f" {i}. {title}")
print("\n✅ Window management demo complete!")
def demo_hotkeys():
"""Demo: Keyboard shortcuts"""
print("\n🔥 === HOTKEY DEMO ===")
dc = DesktopController()
print("\n⚠️ This demo will:")
print(" 1. Open Windows Run dialog (Win+R)")
print(" 2. Type 'notepad'")
print(" 3. Press Enter to open Notepad")
print(" 4. Type a message")
print("\nPress Enter to continue...")
input()
# Open Run dialog
print("\n1. Opening Run dialog...")
dc.hotkey('win', 'r')
time.sleep(0.5)
# Type notepad command
print("2. Typing 'notepad'...")
dc.type_text('notepad', wpm=80)
time.sleep(0.3)
# Press Enter
print("3. Launching Notepad...")
dc.press('enter')
time.sleep(1)
# Type message
print("4. Typing message in Notepad...")
dc.type_text("Desktop Control Skill Test\n\n", wpm=60)
dc.type_text("This was automated by OpenClaw!\n", wpm=60)
dc.type_text("- Mouse control ✓\n", wpm=60)
dc.type_text("- Keyboard control ✓\n", wpm=60)
dc.type_text("- Hotkeys ✓\n", wpm=60)
print("\n✅ Hotkey demo complete!")
def demo_advanced_automation():
"""Demo: Complete automation workflow"""
print("\n🚀 === ADVANCED AUTOMATION DEMO ===")
dc = DesktopController()
print("\nThis demo will:")
print("1. Get your clipboard content")
print("2. Copy a new string to clipboard")
print("3. Show the changes")
print("\nPress Enter to continue...")
input()
# Get current clipboard
original = dc.get_from_clipboard()
print(f"\n1. Original clipboard: '{original}'")
# Copy new content
test_text = "Hello from OpenClaw Desktop Control!"
dc.copy_to_clipboard(test_text)
print(f"2. Copied to clipboard: '{test_text}'")
# Verify
new_clipboard = dc.get_from_clipboard()
print(f"3. Verified clipboard: '{new_clipboard}'")
# Restore original
if original:
dc.copy_to_clipboard(original)
print("4. Restored original clipboard")
print("\n✅ Advanced automation demo complete!")
def main():
"""Run all demos"""
print("=" * 60)
print("🎮 DESKTOP CONTROL SKILL - DEMO SUITE")
print("=" * 60)
print("\n⚠️ IMPORTANT:")
print("- Failsafe is ENABLED (move mouse to corner to abort)")
print("- Some demos will control your mouse and keyboard")
print("- Close important applications before continuing")
print("\n" + "=" * 60)
demos = [
("Mouse Control", demo_mouse_control),
("Window Management", demo_window_management),
("Screen Capture", demo_screen_capture),
("Hotkeys", demo_hotkeys),
("Keyboard Control", demo_keyboard_control),
("Advanced Automation", demo_advanced_automation),
]
while True:
print("\n📋 SELECT DEMO:")
for i, (name, _) in enumerate(demos, 1):
print(f" {i}. {name}")
print(f" {len(demos) + 1}. Run All")
print(" 0. Exit")
choice = input("\nEnter choice: ").strip()
if choice == '0':
print("\n👋 Goodbye!")
break
elif choice == str(len(demos) + 1):
for name, func in demos:
print(f"\n{'=' * 60}")
func()
time.sleep(1)
print(f"\n{'=' * 60}")
print("🎉 All demos complete!")
elif choice.isdigit() and 1 <= int(choice) <= len(demos):
demos[int(choice) - 1][1]()
else:
print("❌ Invalid choice!")
if __name__ == "__main__":
try:
main()
except KeyboardInterrupt:
print("\n\n⚠️ Demo interrupted by user")
except Exception as e:
print(f"\n\n❌ Error: {e}")
import traceback
traceback.print_exc()
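The centered-region arithmetic in `demo_screen_capture` generalizes to a small helper (a sketch; `centered_region` is illustrative, not part of the skill):

```python
def centered_region(screen_w: int, screen_h: int, width: int, height: int):
    # (left, top, width, height) tuple for a region centered on the screen,
    # in the format pyautogui.screenshot(region=...) expects
    return ((screen_w - width) // 2, (screen_h - height) // 2, width, height)

print(centered_region(1920, 1080, 800, 600))  # (560, 240, 800, 600)
```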