Strutex

Structured Text Extraction: extract structured JSON from documents using LLMs.


The Simplest Example

import strutex
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    total: float

result = strutex.extract("invoice.pdf", model=Invoice)
print(result.invoice_number, result.total)

That's it. Everything else in strutex is optional.


What You Can Do

| Level | Features | When to use |
|---|---|---|
| Basic | extract(), schemas | Most use cases |
| Reliability | verification, validation | Production |
| Scale | caching, async, batch | High volume |
| Extensibility | plugins, hooks | Custom needs |

Most users only need the Basic level. The rest is there when you need it.
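
To make the Reliability row more concrete, here is a minimal sketch of the kind of check a validation step could run on an extracted invoice. It is plain Python and does not use strutex's validator API; `InvoiceData`, `LineItem`, and `check_invoice` are hypothetical names chosen for illustration, and the `items` field is an assumption that is not part of the `Invoice` model above.

```python
from dataclasses import dataclass, field
from math import isclose


@dataclass
class LineItem:
    description: str
    amount: float


@dataclass
class InvoiceData:
    # Stand-in for an extracted result; this is not a strutex type.
    invoice_number: str
    total: float
    items: list[LineItem] = field(default_factory=list)


def check_invoice(invoice: InvoiceData, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not invoice.invoice_number.strip():
        problems.append("invoice_number is empty")
    items_sum = sum(item.amount for item in invoice.items)
    # Only compare sums when line items were actually extracted.
    if invoice.items and not isclose(items_sum, invoice.total, abs_tol=tolerance):
        problems.append(
            f"line items sum to {items_sum:.2f}, but total is {invoice.total:.2f}"
        )
    return problems


good = InvoiceData("INV-001", 150.0, [LineItem("Design", 100.0), LineItem("Hosting", 50.0)])
bad = InvoiceData("INV-002", 175.0, [LineItem("Design", 100.0), LineItem("Hosting", 50.0)])

print(check_invoice(good))
print(check_invoice(bad))
```

Returning a list of problems rather than raising on the first one lets a caller decide whether to retry the extraction, flag the document for review, or accept the result.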


Documentation Map

📚 Tutorial (Start Here)

Progressive learning path from basics to advanced:

| # | Page | Description |
|---|---|---|
| 1 | Quickstart | First extraction in 5 minutes |
| 2 | Your First Schema | Define custom schemas (Pydantic & native) |
| 3 | Switching Providers | Configure GeminiProvider, OpenAIProvider, etc. |
| 4 | Adding Validation | Validators and verification loop |
| 5 | Caching | MemoryCache, SQLiteCache, FileCache |
| 6 | Processing Hooks | Pre/post processing hooks |
| 7 | Input Sanitization | Input cleaning, PII redaction |
| 8 | Batch & Async | process_batch, aprocess |
| 9 | Streaming | Real-time extraction feedback |
| 10 | Error Handling | Errors, retries, debugging |
| 11 | File Uploads | BytesIO, Flask, FastAPI |
| 12 | Integrations | LangChain, LlamaIndex (Experimental) |
| 13 | Custom Plugins | Create Provider, Extractor, SecurityPlugin |
| 14 | Advanced Processors | Deep dive into all extraction strategies |
| 15 | Use Cases | Invoice, Receipt, Resume examples |
| 16 | Prompt Engineering | StructuredPrompt builder |
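
As background for the Caching page, the sketch below shows the general idea behind caching extraction results: derive a key from the document bytes and the schema name, and skip the expensive call when the key has been seen before. This is a standalone illustration, not strutex's `MemoryCache`; `DictCache`, `cache_key`, and `fake_extract` are hypothetical names, and the real cache classes may key and store results differently.

```python
import hashlib
from typing import Callable


def cache_key(document: bytes, schema_name: str) -> str:
    """Derive a stable key from the schema name and the document bytes."""
    digest = hashlib.sha256()
    digest.update(schema_name.encode("utf-8"))
    digest.update(b"\x00")  # separator between the schema name and the document
    digest.update(document)
    return digest.hexdigest()


class DictCache:
    """Hypothetical in-memory cache backed by a dict."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, document: bytes, schema_name: str, compute: Callable[[bytes], dict]
    ) -> dict:
        key = cache_key(document, schema_name)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(document)
        self._store[key] = result
        return result


def fake_extract(document: bytes) -> dict:
    # Stand-in for an expensive LLM call.
    return {"length": len(document)}


cache = DictCache()
pdf_bytes = b"%PDF-1.7 example"
first = cache.get_or_compute(pdf_bytes, "Invoice", fake_extract)
second = cache.get_or_compute(pdf_bytes, "Invoice", fake_extract)
print(first == second, cache.hits, cache.misses)
```

Including the schema name in the key means the same file extracted with a different schema is treated as a separate entry.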

📖 User Guide

Reference documentation for core features:

| Section | Pages |
|---|---|
| Schemas | Schema Types · Built-in Schemas · Pydantic Support |
| Prompts | Prompt Builder · Verification |
| RAG | Retrieval-Augmented Generation |

⚡ Providers

LLM provider configuration and optimization:

| Page | Description |
|---|---|
| Overview | All supported providers |
| Provider Chains | Fallback and cost optimization |
| Caching Reference | Detailed cache API |
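
The fallback idea behind provider chains can be sketched in a few lines: try providers in order (for example, cheapest first) and move on when one fails. The code below is a generic illustration under that assumption, not strutex's provider chain API; `Provider`, `extract_with_fallback`, and both stand-in providers are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical provider signature: prompt in, parsed JSON out.
Provider = Callable[[str], dict]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has raised an error."""


def extract_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, dict]:
    """Return (provider name, result) from the first provider that succeeds."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(prompt)
        except Exception as exc:  # a real chain would likely catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def cheap_provider(prompt: str) -> dict:
    # Simulates a low-cost provider that is currently unavailable.
    raise TimeoutError("request timed out")


def reliable_provider(prompt: str) -> dict:
    # Simulates a more expensive provider that answers successfully.
    return {"invoice_number": "INV-001", "total": 150.0}


used, data = extract_with_fallback(
    "Extract the invoice fields.",
    [("cheap", cheap_provider), ("reliable", reliable_provider)],
)
print(used, data["total"])
```

Ordering the list by cost gives the cost optimization for free: the expensive provider is only called when the cheaper ones fail.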

🔌 Integrations

Use with popular AI frameworks:

| Page | Description |
|---|---|
| Integrations | LangChain, LlamaIndex, Haystack, Unstructured |

🔧 Advanced

For power users and contributors:

| Page | Description |
|---|---|
| Advanced Processors | All extraction strategies (standard & advanced) |
| Plugins API | Custom provider and extractor reference |
| Hooks System | Lifecycle hooks and event system |
| CLI Reference | Advanced command-line usage |
| API Reference | Autogenerated package reference |
| CLI Commands | Command-line interface |

๐Ÿ— Architecture

Internal design and extension points:

| Page | Description |
|---|---|
| Extractors | PDF, Excel, Image extractors |
| Validators | Schema, Sum, Date validators |
| Input Sanitization | Sanitization API |

📋 Reference

| Page | Description |
|---|---|
| API Reference | Full API documentation |
| Changelog | Version history |

Installation

pip install strutex

# With optional extras (quoted so shells such as zsh do not expand the brackets)
pip install "strutex[langchain]"
pip install "strutex[rag]"
pip install "strutex[all]"

→ Getting Started