Initial commit with translated description

2026-03-29 13:04:09 +08:00
commit 8b78fef0fb
8 changed files with 1773 additions and 0 deletions
--- a/SKILL.md
+++ b/SKILL.md
@@ -0,0 +1,356 @@
+---
+name: pdf-text-extractor
+description: "Extract text from PDFs with OCR support."
+metadata:
+  {
+    "openclaw":
+      {
+        "version": "1.0.0",
+        "author": "Vernox",
+        "license": "MIT",
+        "tags": ["pdf", "ocr", "text", "extraction", "document", "digitization"],
+        "category": "tools"
+      }
+  }
+---
+
+# PDF-Text-Extractor - Extract Text from PDFs
+
+**Vernox Utility Skill - Perfect for document digitization.**
+
+## Overview
+
+PDF-Text-Extractor extracts text content from PDF files and needs no separately installed dependencies (PDF.js and Tesseract.js are bundled with the skill). It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).
+
+## Features
+
+### ✅ Text Extraction
+- Extract text from PDFs without external tools
+- Support for both text-based and scanned PDFs
+- Preserve document structure and formatting
+- Fast extraction (milliseconds for text-based)
+
+### ✅ OCR Support
+- Use Tesseract.js for scanned documents
+- Support multiple languages (English, Spanish, French, German)
+- Configurable OCR quality/speed
+- Fall back to text extraction when possible
+
+### ✅ Batch Processing
+- Process multiple PDFs at once
+- Batch extraction for document workflows
+- Progress tracking for large files
+- Error handling and retry logic
+
+### ✅ Output Options
+- Plain text output
+- JSON output with metadata
+- Markdown conversion
+- HTML output (preserving links)
+
+### ✅ Utility Features
+- Page-by-page extraction
+- Character/word counting
+- Language detection
+- Metadata extraction (author, title, creation date)
+
+## Installation
+
+```bash
+clawhub install pdf-text-extractor
+```
+
+## Quick Start
+
+### Extract Text from PDF
+
+```javascript
+const result = await extractText({
+  pdfPath: './document.pdf',
+  options: {
+    outputFormat: 'text',
+    ocr: true,
+    language: 'eng'
+  }
+});
+
+console.log(result.text);
+console.log(`Pages: ${result.pages}`);
+console.log(`Words: ${result.wordCount}`);
+```
+
+### Batch Extract Multiple PDFs
+
+```javascript
+const results = await extractBatch({
+  pdfFiles: [
+    './document1.pdf',
+    './document2.pdf',
+    './document3.pdf'
+  ],
+  options: {
+    outputFormat: 'json',
+    ocr: true
+  }
+});
+
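+// extractBatch returns an object (see Tool Functions), not a bare array
+console.log(`Extracted ${results.successCount} of ${results.results.length} PDFs`);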
+```
+
+### Extract with OCR
+
+```javascript
+const result = await extractText({
+  pdfPath: './scanned-document.pdf',
+  options: {
+    ocr: true,
+    language: 'eng',
+    ocrQuality: 'high'
+  }
+});
+
+// OCR will be used (scanned document detected)
+```
+
+## Tool Functions
+
+### `extractText`
+Extract text content from a single PDF file.
+
+**Parameters:**
+- `pdfPath` (string, required): Path to PDF file
+- `options` (object, optional): Extraction options
+  - `outputFormat` (string): 'text' | 'json' | 'markdown' | 'html'
+  - `ocr` (boolean): Enable OCR for scanned docs
+  - `ocrQuality` (string): 'low' | 'medium' | 'high' (OCR quality/speed tradeoff)
+  - `language` (string): OCR language code ('eng', 'spa', 'fra', 'deu')
+  - `preserveFormatting` (boolean): Keep headings/structure
+  - `minConfidence` (number): Minimum OCR confidence score (0-100)
+
+**Returns:**
+- `text` (string): Extracted text content
+- `pages` (number): Number of pages processed
+- `wordCount` (number): Total word count
+- `charCount` (number): Total character count
+- `language` (string): Detected language
+- `metadata` (object): PDF metadata (title, author, creation date)
+- `method` (string): 'text' or 'ocr' (extraction method)
+
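+The `method` field reports whether the text layer or OCR produced the result. The skill's actual detection logic is not documented here, so the following is only a minimal sketch of one plausible heuristic: treat a PDF as scanned when its text layer is nearly empty. The function name `chooseMethod` and the `minCharsPerPage` threshold are assumptions made for illustration.
+
+```javascript
+// Hypothetical heuristic, not the skill's documented behavior.
+// pageTexts: one string per page, as read from the PDF text layer.
+function chooseMethod(pageTexts, minCharsPerPage = 20) {
+  // No pages with a text layer at all: OCR is the only option.
+  if (pageTexts.length === 0) return 'ocr';
+
+  // Count non-whitespace characters across all pages.
+  const totalChars = pageTexts
+    .map((pageText) => pageText.replace(/\s+/g, '').length)
+    .reduce((sum, count) => sum + count, 0);
+
+  // A text-based PDF should average well above the threshold.
+  return totalChars / pageTexts.length >= minCharsPerPage ? 'text' : 'ocr';
+}
+```
+
+A threshold on the average keeps a single blank page from forcing OCR on an otherwise text-based document.
+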
+### `extractBatch`
+Extract text from multiple PDF files at once.
+
+**Parameters:**
+- `pdfFiles` (array, required): Array of PDF file paths
+- `options` (object, optional): Same as extractText
+
+**Returns:**
+- `results` (array): Array of extraction results
+- `totalPages` (number): Total pages across all PDFs
+- `successCount` (number): Successfully extracted
+- `failureCount` (number): Failed extractions
+- `errors` (array): Error details for failures
+
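+A short sketch of how the batch result fields listed above might be turned into a report. The shape of each `errors` entry is not specified in this document; the `file` and `message` properties used below are assumptions, as is the helper name `summarizeBatch`.
+
+```javascript
+// Builds a plain-text report from an extractBatch-style result object.
+// Assumes each entry in `errors` looks like { file, message } (hypothetical shape).
+function summarizeBatch(batch) {
+  const total = batch.successCount + batch.failureCount;
+  const lines = [
+    `${batch.successCount}/${total} PDFs extracted, ${batch.totalPages} pages`,
+  ];
+  for (const err of batch.errors) {
+    lines.push(`failed: ${err.file} (${err.message})`);
+  }
+  return lines.join('\n');
+}
+```
+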
+### `countWords`
+Count words in extracted text.
+
+**Parameters:**
+- `text` (string, required): Text to count
+- `options` (object, optional):
+  - `minWordLength` (number): Minimum characters per word (default: 3)
+  - `excludeNumbers` (boolean): Don't count numbers as words
+  - `countByPage` (boolean): Return word count per page
+
+**Returns:**
+- `wordCount` (number): Total word count
+- `charCount` (number): Total character count
+- `pageCounts` (array): Word count per page
+- `averageWordsPerPage` (number): Average words per page
+
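+To make the `minWordLength` and `excludeNumbers` options concrete, here is a minimal sketch of how such a count could work. It is not the skill's implementation: the name `countWordsSketch`, the whitespace tokenizer, and the rule for what counts as a number are all assumptions.
+
+```javascript
+// Illustrative word counter; tokenization rules are assumptions.
+function countWordsSketch(text, { minWordLength = 3, excludeNumbers = false } = {}) {
+  // Split on any run of whitespace and drop empty tokens.
+  const tokens = text.split(/\s+/).filter(Boolean);
+
+  const words = tokens.filter((token) => {
+    if (token.length < minWordLength) return false;
+    // Treat tokens made only of digits, dots, and commas as numbers.
+    if (excludeNumbers && /^[\d.,]+$/.test(token)) return false;
+    return true;
+  });
+
+  return { wordCount: words.length, charCount: text.length };
+}
+```
+
+Note that with the documented default of 3, short words such as "is" or "of" are not counted; pass `minWordLength: 1` for a conventional word count.
+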
+### `detectLanguage`
+Detect the language of extracted text.
+
+**Parameters:**
+- `text` (string, required): Text to analyze
+- `minConfidence` (number): Minimum confidence for detection
+
+**Returns:**
+- `language` (string): Detected language code
+- `languageName` (string): Full language name
+- `confidence` (number): Confidence score (0-100)
+
+## Use Cases
+
+### Document Digitization
+- Convert paper documents to digital text
+- Process invoices and receipts
+- Digitize contracts and agreements
+- Archive physical documents
+
+### Content Analysis
+- Extract text for analysis tools
+- Prepare content for LLM processing
+- Clean up scanned documents
+- Parse PDF-based reports
+
+### Data Extraction
+- Extract data from PDF reports
+- Parse tables from PDFs
+- Pull structured data
+- Automate document workflows
+
+### Text Processing
+- Prepare content for translation
+- Clean up OCR output
+- Extract specific sections
+- Search within PDF content
+
+## Performance
+
+### Text-Based PDFs
+- **Speed:** ~100ms for 10-page PDF
+- **Accuracy:** Exact (text is read directly from the embedded text layer, with no recognition step)
+- **Memory:** ~10MB for typical document
+
+### OCR Processing
+- **Speed:** ~1-3s per page (high quality)
+- **Accuracy:** 85-95% (depends on scan quality)
+- **Memory:** ~50-100MB peak during OCR
+
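+The per-page OCR figure above scales linearly with page count, which is worth checking before queuing long scans. A small sketch of the arithmetic, with the hypothetical helper `estimateOcrSeconds` defaulting to the 1-3s range quoted above:
+
+```javascript
+// Rough OCR time range for a scanned PDF, based on a per-page estimate.
+function estimateOcrSeconds(pageCount, secondsPerPage = { min: 1, max: 3 }) {
+  return {
+    min: pageCount * secondsPerPage.min,
+    max: pageCount * secondsPerPage.max,
+  };
+}
+```
+
+For a 40-page scan this gives 40 to 120 seconds. If `timeoutSeconds` in `config.json` applies per file (an assumption; its scope is not stated), long scans may need a higher value than the default of 30.
+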
+## Technical Details
+
+### PDF Parsing
+- Uses the bundled PDF.js library
+- Extracts text layer directly (no OCR needed)
+- Preserves document structure
+- Handles password-protected PDFs
+
+### OCR Engine
+- Tesseract.js under the hood
+- Tesseract supports 100+ languages; the default `config.json` enables four (`eng`, `spa`, `fra`, `deu`)
+- Adjustable quality/speed tradeoff
+- Confidence scoring for accuracy
+
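+Confidence scoring pairs naturally with the `minConfidence` option. The sketch below shows one way a threshold could be applied, assuming each recognized word is available as `{ text, confidence }` with confidence on a 0-100 scale. That shape, the helper name `filterByConfidence`, and the default of 60 are assumptions, not the skill's documented internals.
+
+```javascript
+// Drops low-confidence words from OCR output (hypothetical word shape).
+function filterByConfidence(words, minConfidence = 60) {
+  const kept = words.filter((word) => word.confidence >= minConfidence);
+  return {
+    text: kept.map((word) => word.text).join(' '),
+    dropped: words.length - kept.length,
+  };
+}
+```
+
+Reporting the number of dropped words gives a quick signal for when a rescan at higher quality is likely to help.
+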
+### Dependencies
+- **No dependencies to install separately**
+- PDF.js and Tesseract.js are bundled with the skill
+- Everything else uses Node.js built-in modules only
+
+## Error Handling
+
+### Invalid PDF
+- Clear error message
+- Suggest fix (check file format)
+- Skip to next file in batch
+
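+One cheap way to produce a clear "invalid PDF" error before parsing is to check the file header: PDF files begin with the bytes `%PDF-`. The sketch below is a strict version of that check using only Node.js built-ins. The helper name `looksLikePdf` is hypothetical, and whether the skill performs this check is not stated.
+
+```javascript
+const fs = require('node:fs');
+
+// Returns true when the file starts with the "%PDF-" header bytes.
+// Strict check: files with leading junk before the header are rejected.
+function looksLikePdf(filePath) {
+  const fd = fs.openSync(filePath, 'r');
+  try {
+    const header = Buffer.alloc(5);
+    const bytesRead = fs.readSync(fd, header, 0, 5, 0);
+    return bytesRead === 5 && header.toString('latin1') === '%PDF-';
+  } finally {
+    // Always release the file descriptor, even if the read throws.
+    fs.closeSync(fd);
+  }
+}
+```
+
+In a batch, a `false` result can be recorded as a failure for that file so processing moves on to the next one.
+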
+### OCR Failure
+- Report confidence score
+- Suggest rescan at higher quality
+- Fall back to basic extraction
+
+### Memory Issues
+- Stream processing for large files
+- Progress reporting
+- Graceful degradation
+
+## Configuration
+
+### Edit `config.json`:
+```json
+{
+  "ocr": {
+    "enabled": true,
+    "defaultLanguage": "eng",
+    "quality": "medium",
+    "languages": ["eng", "spa", "fra", "deu"]
+  },
+  "output": {
+    "defaultFormat": "text",
+    "preserveFormatting": true,
+    "includeMetadata": true
+  },
+  "batch": {
+    "maxConcurrent": 3,
+    "timeoutSeconds": 30
+  }
+}
+```
+
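+The `batch` settings suggest a bounded worker pool with a per-file timeout. How the skill enforces them is not documented, so the following is a sketch under that assumption. `runLimited` and its `worker` callback are hypothetical names; `worker` stands in for whatever function extracts a single file.
+
+```javascript
+// Runs `worker` over `items` with at most `maxConcurrent` in flight,
+// failing any single item that exceeds `timeoutSeconds`.
+async function runLimited(items, worker, { maxConcurrent = 3, timeoutSeconds = 30 } = {}) {
+  const results = new Array(items.length);
+  let next = 0; // index of the next unclaimed item
+
+  async function runOne(item) {
+    let timer;
+    const timeout = new Promise((_, reject) => {
+      timer = setTimeout(
+        () => reject(new Error(`timed out after ${timeoutSeconds}s`)),
+        timeoutSeconds * 1000
+      );
+    });
+    try {
+      return { ok: true, value: await Promise.race([worker(item), timeout]) };
+    } catch (error) {
+      // Record the failure instead of aborting the whole batch.
+      return { ok: false, error: error.message };
+    } finally {
+      clearTimeout(timer);
+    }
+  }
+
+  // Each lane claims items one at a time until none are left.
+  async function lane() {
+    while (next < items.length) {
+      const index = next++;
+      results[index] = await runOne(items[index]);
+    }
+  }
+
+  const laneCount = Math.min(maxConcurrent, items.length);
+  await Promise.all(Array.from({ length: laneCount }, () => lane()));
+  return results;
+}
+```
+
+Results keep the order of the input list, and a failed or timed-out file does not stop the remaining files. The timeout only stops waiting for the work; it does not cancel it.
+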
+## Examples
+
+### Extract from Invoice
+```javascript
+const invoice = await extractText({ pdfPath: './invoice.pdf' });
+console.log(invoice.text);
+// "INVOICE #12345 Date: 2026-02-04..."
+```
+
+### Extract from Scanned Contract
+```javascript
+const contract = await extractText({
+  pdfPath: './scanned-contract.pdf',
+  options: {
+    ocr: true,
+    language: 'eng',
+    ocrQuality: 'high'
+  }
+});
+console.log(contract.text);
+// "AGREEMENT This contract between..."
+```
+
+### Batch Process Documents
+```javascript
+const docs = await extractBatch({
+  pdfFiles: [
+    './doc1.pdf',
+    './doc2.pdf',
+    './doc3.pdf',
+    './doc4.pdf'
+  ]
+});
+console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
+```
+
+## Troubleshooting
+
+### OCR Not Working
+- Check if PDF is truly scanned (not text-based)
+- Try different quality settings (low/medium/high)
+- Ensure language matches document
+- Check image quality of scan
+
+### Extraction Returns Empty
+- PDF may be image-only
+- OCR failed with low confidence
+- Try different language setting
+
+### Slow Processing
+- Large PDFs take longer
+- Reduce quality for speed
+- Process in smaller batches
+
+## Tips
+
+### Best Results
+- Use text-based PDFs when possible (faster, and the text is exact)
+- Use high-quality scans for OCR (300 DPI or higher)
+- Clean background before scanning
+- Use correct language setting
+
+### Performance Optimization
+- Batch processing for multiple files
+- Disable OCR for text-based PDFs
+- Lower OCR quality for speed when acceptable
+
+## Roadmap
+
+- [ ] PDF/A support
+- [ ] Advanced OCR pre-processing
+- [ ] Table extraction from OCR
+- [ ] Handwriting OCR
+- [ ] PDF form field extraction
+- [ ] Batch language detection
+- [ ] Confidence scoring visualization
+
+## License
+
+MIT
+
+---
+
+**Extract text from PDFs. Fast, accurate, nothing extra to install.** 🔮