Your First Schema

Learn to define custom schemas for your specific documents.


Why Custom Schemas?

Built-in schemas like INVOICE_US are great for standard documents. But your documents might have:

  • Custom fields (e.g., reference_number, department_code)
  • Different structure (e.g., multiple addresses)
  • Domain-specific data (e.g., medical codes, legal citations)

Option 1: Pydantic Models

The cleanest way to define a schema is with a Pydantic model:

from pydantic import BaseModel
from typing import List, Optional
from strutex import DocumentProcessor

# Define your schema
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class PurchaseOrder(BaseModel):
    po_number: str
    vendor: str
    ship_to: str
    order_date: str
    items: List[LineItem]
    subtotal: float
    tax: Optional[float] = None
    total: float

# Use it
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    file_path="purchase_order.pdf",
    prompt="Extract all purchase order details",
    model=PurchaseOrder  # Note: 'model' not 'schema'
)

# Result is a validated Pydantic model
print(f"PO#: {result.po_number}")
for item in result.items:
    print(f"  - {item.description}: ${item.total}")
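To see what "validated" means in practice, the sketch below builds the same model from a hand-written dict that stands in for extracted data. It uses only Pydantic; the sample values are invented, and how strutex runs validation internally is not covered here.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float


class PurchaseOrder(BaseModel):
    po_number: str
    vendor: str
    ship_to: str
    order_date: str
    items: List[LineItem]
    subtotal: float
    tax: Optional[float] = None  # may be absent from the document
    total: float


# Hypothetical extracted data (illustrative values only)
raw = {
    "po_number": "PO-1001",
    "vendor": "Acme Corp",
    "ship_to": "12 Main St",
    "order_date": "2024-01-15",
    "items": [
        {"description": "Widget", "quantity": 2, "unit_price": 5.0, "total": 10.0}
    ],
    "subtotal": 10.0,
    "total": 10.0,
}

# Nested dicts become LineItem instances; the omitted tax falls back to None
order = PurchaseOrder(**raw)

# Dropping a required field makes Pydantic raise ValidationError
incomplete = {key: value for key, value in raw.items() if key != "po_number"}
try:
    PurchaseOrder(**incomplete)
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True
```

Marking tax as Optional is what lets the first construction succeed even though the sample data has no tax entry.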

Option 2: Schema Types (No Pydantic)

If you prefer not to use Pydantic:

from strutex import DocumentProcessor, Object, String, Number, Array

# Define schema using strutex types
order_schema = Object(
    description="Purchase Order",
    properties={
        "po_number": String(description="Purchase order number"),
        "vendor": String(description="Vendor name"),
        "total": Number(description="Total amount"),
        "items": Array(
            items=Object(properties={
                "description": String,  # Simplified syntax!
                "quantity": Number,
                "unit_price": Number
            })
        )
    }
)

# Use it
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    file_path="order.pdf",
    prompt="Extract purchase order",
    schema=order_schema  # Note: 'schema' not 'model'
)

# Result is a dict
print(result["po_number"])
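Because the result is a plain dict, nothing guarantees that every key is present. A minimal sketch of defensive access follows, using a hand-written dict shaped like order_schema. The values are invented, line_totals is a hypothetical helper, and whether strutex omits missing keys or sets them to None is an assumption not confirmed by this guide.

```python
# Hypothetical result dict shaped like order_schema (illustrative values only)
result = {
    "po_number": "PO-1001",
    "vendor": "Acme Corp",
    "total": 25.0,
    "items": [
        {"description": "Widget", "quantity": 2, "unit_price": 5.0},
        {"description": "Gadget", "quantity": 1, "unit_price": 15.0},
    ],
}


def line_totals(order: dict) -> list:
    """Compute quantity * unit_price per item, skipping incomplete items."""
    totals = []
    for item in order.get("items", []):
        quantity = item.get("quantity")
        unit_price = item.get("unit_price")
        if quantity is None or unit_price is None:
            continue  # tolerate items with missing numbers
        totals.append(quantity * unit_price)
    return totals


# Cross-check the extracted total against the sum of the line items
computed_total = sum(line_totals(result))
totals_match = abs(computed_total - result["total"]) < 0.01
```

A cross-check like this is one cheap way to catch extraction mistakes when no model is validating the output for you.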

Schema Best Practices

Do                                           | Don't
Use descriptive field names                  | Use abbreviations (amt instead of amount)
Add description to complex fields            | Leave descriptions empty
Use Optional for fields that may be missing  | Require everything
Keep nesting shallow (3 levels max)          | Create deeply nested structures
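As a rough illustration of these practices on the Pydantic side, the sketch below uses descriptive names, attaches descriptions through Pydantic's Field, and marks a possibly missing field as Optional. The Invoice model and its fields are hypothetical, and whether strutex forwards Field descriptions to the LLM is an assumption not confirmed here.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    # Descriptive names instead of abbreviations like "inv_no" or "amt"
    invoice_number: str = Field(description="Invoice ID as printed on the document")
    total_amount: float = Field(description="Grand total including tax")
    # Optional with a default: the model still validates when the field is absent
    department_code: Optional[str] = Field(
        default=None, description="Internal department code, if present"
    )


# Illustrative values only; department_code is deliberately left out
sample = Invoice(invoice_number="INV-1", total_amount=9.5)
```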

Field Types Reference

You can use types as classes (e.g., String) or instances (e.g., String(description="...")).

Pydantic Type | Strutex Type     | Use For
str           | String           | Text, IDs, names
int           | Integer          | Counts, quantities
float         | Number           | Prices, percentages
bool          | Boolean          | Yes/no fields
List[T]       | Array(items=T)   | Line items, tags
Optional[T]   | T(nullable=True) | Fields that may be missing
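The table can also be read as a simple lookup. The sketch below is a hypothetical helper, not part of strutex, that turns a Python annotation into the matching strutex type expression as text. It uses only the standard library and covers only the rows listed above.

```python
from typing import List, Optional, Union, get_args, get_origin

# Scalar rows of the table: Python type -> strutex type name
SCALARS = {str: "String", int: "Integer", float: "Number", bool: "Boolean"}


def strutex_name(annotation) -> str:
    """Return the strutex type expression (as text) for a Python annotation."""
    if annotation in SCALARS:
        return SCALARS[annotation]

    origin = get_origin(annotation)
    args = get_args(annotation)

    # List[T] -> Array(items=T)
    if origin is list:
        return f"Array(items={strutex_name(args[0])})"

    # Optional[T] -> T(nullable=True), handled for scalar T only
    if origin is Union and len(args) == 2 and type(None) in args:
        inner = args[0] if args[1] is type(None) else args[1]
        if inner in SCALARS:
            return f"{SCALARS[inner]}(nullable=True)"

    raise TypeError(f"No table row covers {annotation!r}")


examples = [strutex_name(t) for t in (str, List[int], Optional[float])]
```

The helper returns strings rather than strutex objects so that it runs without the library installed; combinations the table does not list raise TypeError instead of guessing.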

Next Steps

Want to...                  | Go to...
Try different LLM providers | Switching Providers
Add validation rules        | Adding Validation
See real-world schemas      | Built-in Schemas