Strutex
Structured Text Extraction: extract structured JSON from documents using LLMs.
The Simplest Example
import strutex
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    total: float

result = strutex.extract("invoice.pdf", model=Invoice)
print(result.invoice_number, result.total)
That's it. Everything else in strutex is optional.
What You Can Do
| Level | Features | When to use |
|---|---|---|
| Basic | extract(), schemas | Most use cases |
| Reliability | verification, validation | Production |
| Scale | caching, async, batch | High volume |
| Extensibility | plugins, hooks | Custom needs |
Most users only need the Basic level. The rest is there when you need it.
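To make the layering in the table concrete, here is a minimal, standard-library-only sketch of how a reliability check and a cache can wrap a basic extraction call. It does not use strutex's own validators or cache classes, whose signatures are not shown on this page. `fake_extract`, `is_valid`, and `CachedExtractor` are hypothetical names introduced for illustration only.

```python
import hashlib
from typing import Callable, Dict


def fake_extract(document: bytes) -> dict:
    """Hypothetical stand-in for an LLM-backed extraction call (not strutex's API)."""
    return {"invoice_number": "INV-001", "total": 42.5}


def is_valid(record: dict) -> bool:
    """Reliability layer: check that the required fields exist with the expected types."""
    total = record.get("total")
    return (
        isinstance(record.get("invoice_number"), str)
        and isinstance(total, (int, float))
        and not isinstance(total, bool)  # bool is a subclass of int, so reject it explicitly
    )


class CachedExtractor:
    """Scale layer: memoize validated results, keyed by a hash of the document bytes."""

    def __init__(self, extract_fn: Callable[[bytes], dict]) -> None:
        self._extract_fn = extract_fn
        self._cache: Dict[str, dict] = {}
        self.calls = 0  # number of times the underlying extractor actually ran

    def extract(self, document: bytes) -> dict:
        key = hashlib.sha256(document).hexdigest()
        if key not in self._cache:
            self.calls += 1
            record = self._extract_fn(document)
            if not is_valid(record):
                raise ValueError("extraction failed validation")
            self._cache[key] = record
        # Return a copy so callers cannot mutate the cached entry.
        return dict(self._cache[key])


extractor = CachedExtractor(fake_extract)
first = extractor.extract(b"invoice bytes")
second = extractor.extract(b"invoice bytes")  # served from the cache
print(first, extractor.calls)
```

Only validated records are stored, so a bad extraction is never served from the cache. Keying on a content hash rather than a filename means a renamed copy of the same document still hits the cache.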
Documentation Map
Tutorial (Start Here)
Progressive learning path from basics to advanced:
| # | Page | Description |
|---|---|---|
| 1 | Quickstart | First extraction in 5 minutes |
| 2 | Your First Schema | Define custom schemas (Pydantic & native) |
| 3 | Switching Providers | Configure GeminiProvider, OpenAIProvider, etc. |
| 4 | Adding Validation | Validators and verification loop |
| 5 | Caching | MemoryCache, SQLiteCache, FileCache |
| 6 | Processing Hooks | Pre/post processing hooks |
| 7 | Input Sanitization | Input cleaning, PII redaction |
| 8 | Batch & Async | process_batch, aprocess |
| 9 | Streaming | Real-time extraction feedback |
| 10 | Error Handling | Errors, retries, debugging |
| 11 | File Uploads | BytesIO, Flask, FastAPI |
| 12 | Integrations | LangChain, LlamaIndex (Experimental) |
| 13 | Custom Plugins | Create Provider, Extractor, SecurityPlugin |
| 14 | Advanced Processors | Deep dive into all extraction strategies |
| 15 | Use Cases | Invoice, Receipt, Resume examples |
| 16 | Prompt Engineering | StructuredPrompt builder |
User Guide
Reference documentation for core features:
| Section | Pages |
|---|---|
| Schemas | Schema Types · Built-in Schemas · Pydantic Support |
| Prompts | Prompt Builder · Verification |
| RAG | Retrieval-Augmented Generation |
Providers
LLM provider configuration and optimization:
| Page | Description |
|---|---|
| Overview | All supported providers |
| Provider Chains | Fallback and cost optimization |
| Caching Reference | Detailed cache API |
Integrations
Use with popular AI frameworks:
| Page | Description |
|---|---|
| Integrations | LangChain, LlamaIndex, Haystack, Unstructured |
Advanced
For power users and contributors:
| Page | Description |
|---|---|
| Advanced Processors | All extraction strategies (standard & advanced) |
| Plugins API | Custom provider and extractor reference |
| Hooks System | Lifecycle hooks and event system |
| CLI Reference | Advanced command-line usage |
| API Reference | Autogenerated package reference |
| CLI Commands | Command-line interface |
Architecture
Internal design and extension points:
| Page | Description |
|---|---|
| Extractors | PDF, Excel, Image extractors |
| Validators | Schema, Sum, Date validators |
| Input Sanitization | Sanitization API |
Reference
| Page | Description |
|---|---|
| API Reference | Full API documentation |
| Changelog | Version history |