Adding Validation
Learn to validate extracted data for accuracy and consistency.
Why Validate?
LLMs can make mistakes:
- Math errors: line items that don't add up to the stated total
- Date formatting: dates written in inconsistent formats
- Missing required fields: output that omits fields the schema requires
- Hallucinations: values that appear nowhere in the document
Validation catches these before they reach your application.
Built-in Validators
SchemaValidator (Always Active)
Validates that output matches your schema structure:
from strutex import DocumentProcessor
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    total: float

processor = DocumentProcessor(provider="gemini")
result = processor.process("invoice.pdf", "Extract", model=Invoice)
# Schema validation happens automatically
SumValidator
Checks that line items add up to totals:
from strutex import DocumentProcessor
from strutex.validators import ValidationChain, SchemaValidator, SumValidator

# Create validation chain
validators = ValidationChain([
    SchemaValidator(),
    SumValidator()  # Checks: sum(items.price) == total
])

# SumValidator flags results whose totals don't match;
# useful for invoices, receipts, and purchase orders
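To make the check concrete, here is a minimal plain-Python sketch of the comparison described in the comment above (sum of item prices against the total). It is not strutex's implementation: the `items`, `price`, and `total` field names, the `check_sum` helper, and the `tolerance` parameter are assumptions for illustration. A tolerance is used because floating-point sums rarely compare exactly equal.

```python
import math

def check_sum(data, tolerance=0.01):
    """Return a list of error strings; an empty list means the totals agree."""
    items = data.get("items") or []
    total = data.get("total")
    if total is None:
        return ["missing total"]
    # fsum reduces floating-point rounding drift when adding many prices
    computed = math.fsum(item.get("price", 0.0) for item in items)
    if abs(computed - total) > tolerance:
        return [f"line items sum to {computed:.2f}, but total is {total:.2f}"]
    return []

good = {"items": [{"price": 19.99}, {"price": 5.01}], "total": 25.00}
bad = {"items": [{"price": 19.99}, {"price": 5.01}], "total": 52.00}
print(check_sum(good))  # []
print(check_sum(bad))   # ['line items sum to 25.00, but total is 52.00']
```

The second call reports a mismatch that looks like a transposed-digit misread (25 vs. 52), which a schema check alone would accept.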
DateValidator
Ensures dates are valid and consistent:
from strutex.validators import ValidationChain, SchemaValidator, DateValidator

validators = ValidationChain([
    SchemaValidator(),
    DateValidator()  # Checks date format validity
])
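For intuition, a date-validity check can be sketched with the standard library alone. This is an illustration under stated assumptions, not how `DateValidator` works internally: the `ACCEPTED_FORMATS` tuple and the `normalize_date` helper are hypothetical names, and the list of formats is only an example.

```python
from datetime import datetime

# Formats to accept, tried in order (an assumption; adjust to your documents)
ACCEPTED_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def normalize_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if it cannot be parsed."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None

print(normalize_date("2024-03-15"))  # 2024-03-15
print(normalize_date("15/03/2024"))  # 2024-03-15
print(normalize_date("2024-02-30"))  # None (February 30 does not exist)
```

Note that a string such as `03/04/2024` is ambiguous between day-first and month-first conventions; this sketch always reads it day-first because of the order of `ACCEPTED_FORMATS`.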
Verification Loop (Self-Correction)
Let the LLM audit its own work:
result = processor.process(
    file_path="invoice.pdf",
    prompt="Extract invoice",
    model=Invoice,
    verify=True  # LLM reviews and corrects errors
)
How it works:
- First pass: Extract data
- Second pass: LLM reviews extraction against document
- Returns corrected result if errors found
This catches roughly 30% more errors, at about twice the API cost, because each document goes through two passes.
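The two-pass flow can be sketched as follows. This is a conceptual illustration, not strutex's internals: `extract_with_verification`, `extract`, and `review` are hypothetical names, and the two callables stand in for LLM calls so the example runs without an API key.

```python
def extract_with_verification(document, extract, review):
    """Run an extraction pass, then a review pass that may return corrections.

    `extract(document)` returns a dict of fields.
    `review(document, draft)` returns a dict of corrected fields (empty if none).
    Both are placeholders for LLM calls, so this makes two calls per document.
    """
    draft = extract(document)              # first pass
    corrections = review(document, draft)  # second pass
    if corrections:
        return {**draft, **corrections}    # corrected fields override the draft
    return draft

# Stub "models" so the sketch runs offline
def fake_extract(document):
    return {"invoice_number": "INV-001", "total": 1250.0}

def fake_review(document, draft):
    # Pretend the reviewer noticed the total was misread
    return {"total": 125.0} if draft["total"] != 125.0 else {}

result = extract_with_verification("invoice text", fake_extract, fake_review)
print(result)  # {'invoice_number': 'INV-001', 'total': 125.0}
```

The second call is what doubles the cost: every document pays for the review pass, whether or not the reviewer finds anything to correct.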
Custom Validation
Add your own business logic:
from datetime import datetime

from strutex import DocumentProcessor

def validate_invoice(result, context):
    errors = []

    # Check PO number format ("or ''" guards against a null value)
    if not (result.get("po_number") or "").startswith("PO-"):
        errors.append("PO number must start with 'PO-'")

    # Check date is well-formed and not in the future
    if result.get("date"):
        try:
            inv_date = datetime.strptime(result["date"], "%Y-%m-%d")
        except ValueError:
            errors.append("Invoice date must use the YYYY-MM-DD format")
        else:
            if inv_date > datetime.now():
                errors.append("Invoice date cannot be in the future")

    # Attach any problems to the result instead of raising
    if errors:
        result["_validation_warnings"] = errors
    return result

# Register as post-process hook
processor = DocumentProcessor(
    provider="gemini",
    on_post_process=validate_invoice
)
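As rules accumulate, it can help to keep each one as a small function and combine them into a single hook. The sketch below assumes the same `(result, context)` hook signature and `_validation_warnings` key as the example above; `combine_rules`, `po_number_rule`, and `positive_total_rule` are hypothetical helpers, not part of strutex.

```python
def po_number_rule(result):
    po = result.get("po_number") or ""
    return [] if po.startswith("PO-") else ["PO number must start with 'PO-'"]

def positive_total_rule(result):
    total = result.get("total")
    if isinstance(total, (int, float)) and total > 0:
        return []
    return ["Total must be a positive number"]

def combine_rules(*rules):
    """Build one hook from several rules; each rule returns a list of messages."""
    def hook(result, context=None):
        warnings = [msg for rule in rules for msg in rule(result)]
        if warnings:
            result["_validation_warnings"] = warnings
        return result
    return hook

validate_invoice = combine_rules(po_number_rule, positive_total_rule)
checked = validate_invoice({"po_number": "12345", "total": 99.5})
print(checked["_validation_warnings"])  # ["PO number must start with 'PO-'"]
```

The combined function has the same shape as a hand-written hook, so it could be passed as `on_post_process=validate_invoice` in the same way. Each rule can also be unit-tested on its own.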
Validation Strategies
| Scenario | Strategy |
|---|---|
| Invoices/receipts | SumValidator + verify=True |
| Dates matter | DateValidator |
| Critical data | All validators + custom rules |
| High volume | SchemaValidator only (fast) |
| Compliance | Custom hooks for business rules |
Error Handling
from strutex.validators import ValidationChain, SchemaValidator, SumValidator

chain = ValidationChain([SchemaValidator(), SumValidator()])

# Validate manually: `result` is the extraction output and
# `schema` is the schema it was extracted against
is_valid, errors = chain.validate(result, schema)
if not is_valid:
    print(f"Validation failed: {errors}")
    # Handle errors...
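One way to handle failures in a batch is to accept the valid results and set the rest aside for manual review. The sketch below only assumes a callable that returns an `(is_valid, errors)` tuple, as `chain.validate` does above; `triage` and `fake_validate` are hypothetical names, and the stand-in validator exists so the example runs without strutex installed.

```python
def triage(results, validate):
    """Split results into accepted ones and ones needing manual review.

    `validate(result)` must return an `(is_valid, errors)` tuple.
    """
    accepted, needs_review = [], []
    for result in results:
        is_valid, errors = validate(result)
        if is_valid:
            accepted.append(result)
        else:
            needs_review.append({"result": result, "errors": errors})
    return accepted, needs_review

# Stand-in validator (an assumption for illustration)
def fake_validate(result):
    errors = [] if "total" in result else ["missing total"]
    return (not errors, errors)

accepted, needs_review = triage(
    [{"invoice_number": "INV-001", "total": 125.0}, {"invoice_number": "INV-002"}],
    fake_validate,
)
print(len(accepted), len(needs_review))  # 1 1
```

With a real chain, the validator argument could be a small wrapper such as `lambda r: chain.validate(r, schema)`.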
Next Steps
| Want to... | Go to... |
|---|---|
| Learn about caching | Caching |
| Add security checks | Security Layer |
| Use in RAG pipelines | Integrations |