Strutex
Python AI PDF utilities: extract structured JSON from documents using LLMs.
Features
- Quick Setup: Install with pip and extract data from PDFs in minutes.
- Fully Pluggable: Every component is a plugin, so you can swap providers or add validators.
- Security Layer: Protect against prompt injection with built-in sanitizers.
- Pydantic Support: Use Pydantic models for type-safe extractions.
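To make the Security Layer item above more concrete, here is a minimal, standard-library sketch of a sanitizer chain. It is not Strutex's implementation: `strip_control_chars`, `reject_injection`, `run_chain`, and `InjectionDetected` are hypothetical names, and the two regular expressions are illustrative only.

```python
import re
from typing import Callable, List

# A sanitizer takes text and returns (possibly cleaned) text, or raises.
Sanitizer = Callable[[str], str]


class InjectionDetected(ValueError):
    """Hypothetical error raised when input looks like a prompt injection."""


def strip_control_chars(text: str) -> str:
    """Remove control characters, keeping tab, newline, and carriage return."""
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)


# Illustrative patterns only; real detection needs a much broader approach.
_SUSPICIOUS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard the (system|above) prompt", re.IGNORECASE),
]


def reject_injection(text: str) -> str:
    """Raise if the text matches a known suspicious phrase, else pass it on."""
    for pattern in _SUSPICIOUS:
        if pattern.search(text):
            raise InjectionDetected(f"suspicious phrase matched: {pattern.pattern}")
    return text


def run_chain(text: str, chain: List[Sanitizer]) -> str:
    """Apply each sanitizer in order, feeding the output of one into the next."""
    for step in chain:
        text = step(text)
    return text


chain = [strip_control_chars, reject_injection]
clean = run_chain("Invoice\x00 INV-2024-001\nTotal: 100.00", chain)
```

Running the cleaning step before the detection step means hidden control characters cannot be used to split a suspicious phrase and slip past the pattern check.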
Quick Example
from strutex import DocumentProcessor, Object, String, Number

schema = Object(properties={
    "invoice_number": String(description="Invoice ID"),
    "total": Number(description="Total amount")
})

processor = DocumentProcessor(provider="gemini")
result = processor.process("invoice.pdf", "Extract invoice data", schema)
print(result["invoice_number"])  # "INV-2024-001"

# Alternative: describe the output with a Pydantic model instead of a schema
from pydantic import BaseModel
from strutex import DocumentProcessor

class Invoice(BaseModel):
    invoice_number: str
    total: float

processor = DocumentProcessor(provider="gemini")
result = processor.process("invoice.pdf", "Extract data", model=Invoice)
# result is a validated Invoice instance
print(result.invoice_number)
Architecture
┌─────────────────────────────────────────────────────────────┐
│                      DocumentProcessor                      │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ Security │→ │ Extractor│→ │ Provider │→ │Validator │     │
│  │  Chain   │  │  Plugin  │  │  Plugin  │  │  Plugin  │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
│       ↓             ↓             ↓             ↓           │
│  ┌────────────────────────────────────────────────────┐     │
│  │                  Plugin Registry                   │     │
│  │   @register("provider") / @register("validator")   │     │
│  └────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────┘
Key Features
| Feature | Description |
|---|---|
| Plugin System | Register custom providers, validators, postprocessors |
| Security Layer | Input sanitization, prompt injection detection |
| Pydantic Support | Type-safe extractions with automatic validation |
| Structured Prompts | Build organized prompts with the fluent API |
| Multi-Provider | Gemini, OpenAI, Anthropic (extensible) |
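The Structured Prompts row refers to a fluent API. As a general illustration of that pattern, the sketch below builds a prompt from chained method calls. `PromptBuilder`, `add_rule`, `add_output_note`, and `build` are hypothetical names chosen for this example and should not be read as Strutex's actual prompt API.

```python
from typing import List


class PromptBuilder:
    """Hypothetical fluent builder: each add_* method returns self so calls chain."""

    def __init__(self, role: str) -> None:
        self._role = role
        self._rules: List[str] = []
        self._output_notes: List[str] = []

    def add_rule(self, rule: str) -> "PromptBuilder":
        self._rules.append(rule)
        return self

    def add_output_note(self, note: str) -> "PromptBuilder":
        self._output_notes.append(note)
        return self

    def build(self) -> str:
        """Join the role and any non-empty sections, separated by blank lines."""
        sections = [self._role]
        if self._rules:
            sections.append("Rules:\n" + "\n".join(f"- {r}" for r in self._rules))
        if self._output_notes:
            sections.append("Output:\n" + "\n".join(f"- {n}" for n in self._output_notes))
        return "\n\n".join(sections)


prompt = (
    PromptBuilder("You are an invoice extraction assistant.")
    .add_rule("Use the invoice number exactly as printed.")
    .add_rule("Report the total as a number without currency symbols.")
    .add_output_note("Return JSON only.")
    .build()
)
print(prompt)
```

Keeping rules and output notes in separate lists lets the builder emit them as distinct, consistently ordered sections regardless of the order in which they were added.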