# PDF-Text-Extractor

**Extract text from PDFs, with OCR support for scanned documents. The only external dependency is PDF.js.**

## Quick Start

```bash
# Install
clawhub install pdf-text-extractor

# Extract text from PDF
cd ~/.openclaw/skills/pdf-text-extractor
node index.js extractText '{"pdfPath":"./document.pdf","options":{"outputFormat":"text"}}'
```

## Usage Examples

### Extract to Text
```javascript
const result = await extractText({
  pdfPath: './invoice.pdf',
  options: { outputFormat: 'text' }
});

console.log(result.text);
```

### Extract to JSON with Metadata
```javascript
const result = await extractText({
  pdfPath: './contract.pdf',
  options: {
    outputFormat: 'json',
    includeMetadata: true
  }
});

console.log(result.metadata);
console.log(`Words: ${result.wordCount}`);
```

### Batch Process Multiple PDFs
```javascript
const results = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf'
  ]
});

console.log(`Processed ${results.successCount}/${results.results.length} documents`);
```

### Extract with OCR (Scanned Documents)
```javascript
const result = await extractText({
  pdfPath: './scanned-doc.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

console.log(result.text);
```

### Count Words and Stats
```javascript
// `result` is the value returned by an earlier extractText() call
const stats = await countWords({
  text: result.text,
  options: { countByPage: true }
});

console.log(`Total words: ${stats.wordCount}`);
console.log(`Pages: ${stats.pageCounts.length}`);
console.log(`Avg per page: ${stats.averageWordsPerPage}`);
```
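
To make the `countByPage` statistics concrete, here is a minimal sketch of per-page counting. It is not the skill's implementation: the `countWordsByPage` helper and its input (an array with one string per page) are assumptions made for illustration. Only the result field names are taken from the example above.

```javascript
// Hypothetical helper, NOT the skill's countWords(). The input is assumed to
// be an array holding the text of each page.
function countWordsByPage(pages) {
  // Split on runs of whitespace; an empty or blank page counts as 0 words.
  const pageCounts = pages.map((page) => {
    const trimmed = page.trim();
    return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
  });
  const wordCount = pageCounts.reduce((sum, n) => sum + n, 0);
  const averageWordsPerPage = pages.length === 0 ? 0 : wordCount / pages.length;
  return { wordCount, pageCounts, averageWordsPerPage };
}

const sampleStats = countWordsByPage(['Invoice number 42', 'Total due:  100 USD', '']);
console.log(sampleStats.wordCount);  // 7
console.log(sampleStats.pageCounts); // [ 3, 4, 0 ]
```

Whitespace splitting is the simplest definition of a word. It treats `due:` as one word and does not handle languages written without spaces.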

### Detect Language
```javascript
// Pass any extracted string, e.g. result.text from an earlier extractText() call
const lang = await detectLanguage(result.text);

console.log(`Language: ${lang.languageName}`);
console.log(`Confidence: ${lang.confidence}%`);
```

## Features

- **Text Extraction:** Extract text from PDFs without external tools
- **OCR Support:** Use Tesseract for scanned documents
- **Batch Processing:** Process multiple PDFs at once
- **Multiple Output Formats:** Text, JSON, Markdown, HTML
- **Word Counting:** Accurate word and character counting
- **Language Detection:** Simple heuristic for common languages
- **Metadata Extraction:** Title, author, creation date
- **Page-by-Page:** Extract text with page structure
- **Zero Config Required:** Works out of the box

## Use Cases

### Document Digitization
- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents

### Content Analysis
- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports

### Data Extraction
- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows

### Text Processing
- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content

## Configuration

Edit `config.json` to customize:

```json
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium"
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true
  },
  "batch": {
    "maxConcurrent": 3
  }
}
```
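
If you keep only a few overrides in your own settings, one way to combine them with the values shown above is a small recursive merge. This is a sketch under assumptions: `DEFAULTS` and `mergeConfig` are hypothetical names, and this README does not say whether the skill itself fills in missing keys.

```javascript
// Hypothetical defaults mirroring the sample config.json above.
const DEFAULTS = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium' },
  output: { defaultFormat: 'text', preserveFormatting: true },
  batch: { maxConcurrent: 3 }
};

function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Recursively overlay `overrides` on `base` without mutating either object.
function mergeConfig(base, overrides) {
  const merged = { ...base };
  for (const [key, value] of Object.entries(overrides || {})) {
    merged[key] = isPlainObject(value) && isPlainObject(base[key])
      ? mergeConfig(base[key], value)
      : value;
  }
  return merged;
}

// A partial config that changes only two settings.
const config = mergeConfig(DEFAULTS, {
  ocr: { quality: 'high' },
  batch: { maxConcurrent: 1 }
});
console.log(config.ocr);
// prints { enabled: true, defaultLanguage: 'eng', quality: 'high' }
```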

## Test

```bash
node test.js
```

## Output Formats

### Text
Plain text, with pages separated by newlines.

### JSON
```json
{
  "text": "Document text here...",
  "pages": 10,
  "wordCount": 1500,
  "charCount": 8500,
  "language": "English",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "creationDate": "2026-02-04"
  }
}
```
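
As a rough illustration of consuming this format, the sketch below parses the sample above and builds a one-line summary. The `summarize` helper is hypothetical, and it assumes `metadata` may be absent when `includeMetadata` is not set.

```javascript
// Sample output copied from the JSON format example above.
const raw = `{
  "text": "Document text here...",
  "pages": 10,
  "wordCount": 1500,
  "charCount": 8500,
  "language": "English",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "creationDate": "2026-02-04"
  }
}`;

// Hypothetical helper: summarize one parsed result for logging.
function summarize(doc) {
  const title = (doc.metadata && doc.metadata.title) || '(untitled)';
  const wordsPerPage = doc.pages > 0 ? Math.round(doc.wordCount / doc.pages) : 0;
  return `${title}: ${doc.pages} pages, ${doc.wordCount} words (~${wordsPerPage}/page)`;
}

console.log(summarize(JSON.parse(raw)));
// prints "Document Title: 10 pages, 1500 words (~150/page)"
```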

## Performance

### Text-Based PDFs
- **Speed:** ~100 ms for a 10-page PDF
- **Accuracy:** 100% (exact text)

### OCR Processing
- **Speed:** ~1-3 s per page
- **Accuracy:** 85-95% (depends on scan quality)

## Troubleshooting

### PDF Not Parsing
- Check that the file is a valid PDF
- Ensure the PDF is not password-protected
- Verify that PDF.js is installed
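
For the first check in this list, a quick pre-flight test is to look for the `%PDF-` signature that PDF files start with. The `looksLikePdf` helper below is a hypothetical sketch, independent of whatever validation the skill performs internally.

```javascript
// Hypothetical pre-flight check, not part of the skill's API.
// PDF files begin with the ASCII signature "%PDF-" followed by a version.
function looksLikePdf(buffer) {
  return buffer.length >= 5 && buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

console.log(looksLikePdf(Buffer.from('%PDF-1.7\n')));              // true
console.log(looksLikePdf(Buffer.from('<html>not a pdf</html>')));  // false
```

For a file on disk, pass the result of `require('node:fs').readFileSync(pdfPath)`. A passing check only rules out the wrong file type (for example, an HTML error page saved as `.pdf`). It does not guarantee the document will parse.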

### OCR Low Accuracy
- Ensure the document language matches the OCR language setting
- Try a higher quality setting (slower but more accurate)
- Check the scan quality (300 DPI or higher recommended)

### Slow Processing
- Reduce batch concurrency
- Lower OCR quality for speed
- Process files individually
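
Reducing batch concurrency means capping how many extractions are in flight at once. If you drive the extraction calls yourself, a limiter like the sketch below is one way to do that. `mapWithConcurrency` and `fakeExtract` are hypothetical names, and this is not a description of how the skill schedules `batch.maxConcurrent` internally.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls
// in flight. Failures are recorded per item instead of aborting the batch.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  // Start up to `limit` runners; each pulls items until none are left.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Stand-in for a real extraction call such as extractText({ pdfPath }).
const fakeExtract = (pdfPath) =>
  new Promise((resolve) => setTimeout(() => resolve(`text of ${pdfPath}`), 10));

// A limit of 1 processes files one at a time, in order.
mapWithConcurrency(['./doc1.pdf', './doc2.pdf', './doc3.pdf'], 1, fakeExtract)
  .then((results) => console.log(results.map((r) => r.ok)));
```

Results keep the order of the input array, so a failed file can be matched back to its path by index.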

## Dependencies

```bash
npm install pdfjs-dist
```

## License

MIT

---

**Extract text from PDFs. Fast, accurate, ready to use.** 🔮