
Strutex

Python AI PDF Utilities — Extract structured JSON from documents using LLMs.


Features

  • Quick Setup: install with pip and extract data from PDFs in minutes. (See Getting Started.)
  • Fully Pluggable: every component is a plugin, so you can swap providers or add validators. (See Plugin System.)
  • Security Layer: protect against prompt injection with built-in sanitizers. (See Security.)
  • Pydantic Support: use Pydantic models for type-safe extractions. (See Pydantic.)


Quick Example

from strutex import DocumentProcessor, Object, String, Number

schema = Object(properties={
    "invoice_number": String(description="Invoice ID"),
    "total": Number(description="Total amount")
})

processor = DocumentProcessor(provider="gemini")
result = processor.process("invoice.pdf", "Extract invoice data", schema)

print(result["invoice_number"])  # e.g. "INV-2024-001"

The same extraction with a Pydantic model:

from pydantic import BaseModel
from strutex import DocumentProcessor

class Invoice(BaseModel):
    invoice_number: str
    total: float

processor = DocumentProcessor(provider="gemini")
result = processor.process("invoice.pdf", "Extract data", model=Invoice)

# result is a validated Invoice instance!
print(result.invoice_number)

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     DocumentProcessor                        │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│  │ Security │→ │ Extractor│→ │ Provider │→ │Validator │    │
│  │  Chain   │  │  Plugin  │  │  Plugin  │  │  Plugin  │    │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘    │
│         ↓            ↓            ↓            ↓            │
│  ┌────────────────────────────────────────────────────┐    │
│  │              Plugin Registry                        │    │
│  │   @register("provider") / @register("validator")  │    │
│  └────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

Key Features

Feature              Description
Plugin System        Register custom providers, validators, postprocessors
Security Layer       Input sanitization, prompt injection detection
Pydantic Support     Type-safe extractions with automatic validation
Structured Prompts   Build organized prompts with the fluent API
Multi-Provider       Gemini, OpenAI, Anthropic (extensible)
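
As a rough illustration of what input sanitization and prompt injection detection can look like, the sketch below chains two simple steps over extracted text. It is an assumption-laden toy, not Strutex's security layer: `strip_control_chars`, `reject_injection`, `run_chain`, `InjectionDetected`, and the pattern list are all hypothetical, and a real detector would need far more than two regular expressions.

```python
import re
from typing import Callable, Iterable

# Hypothetical patterns for illustration; real detection needs broader coverage.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+the\s+system\s+prompt", re.IGNORECASE),
]


class InjectionDetected(ValueError):
    """Raised when the text matches a suspicious pattern."""


def strip_control_chars(text: str) -> str:
    """Drop non-printable characters, keeping newlines and tabs."""
    return "".join(ch for ch in text if ch in "\n\t" or ch.isprintable())


def reject_injection(text: str) -> str:
    """Raise if any suspicious pattern appears; otherwise pass text through."""
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            raise InjectionDetected(f"matched pattern: {pattern.pattern}")
    return text


def run_chain(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


chain = [strip_control_chars, reject_injection]
clean = run_chain("Total:\x00 42\n", chain)
print(repr(clean))
```

Running the sanitizing step before detection means hidden control characters cannot be used to split a suspicious phrase and slip past the pattern check.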