Postprocessors

Postprocessors are plugins that run after LLM extraction but before validation. They normalize, clean, or enrich the extracted data.

strutex includes several built-in postprocessors for common data cleaning tasks.

Using Postprocessors

You can use postprocessors individually or chain them together.

from strutex import DatePostprocessor, NumberPostprocessor, PostprocessorChain

# Individual usage
date_pp = DatePostprocessor()
data = date_pp.process({"invoice_date": "15.01.2024"})
# Result: {"invoice_date": "2024-01-15"}

# Chained usage
chain = PostprocessorChain([
    DatePostprocessor(),
    NumberPostprocessor(),
])
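# Example input; in practice this is the dict produced by the extraction step
raw_data = {"invoice_date": "15.01.2024", "total": "$1,234.56"}
result = chain.process(raw_data)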

Built-in Postprocessors

DatePostprocessor

Normalizes date fields to a standard ISO format (YYYY-MM-DD).

It automatically detects fields with "date" in their name, or you can specify a list of date_fields.

Features:

  • Dynamic Format Support: Automatically recognizes dates using standard separators (-, /, ., space, _).
  • No-Separator Support: Parses compact formats like 20240115 or 15012024.
  • Text Month Support: Parses formats like "January 15, 2024" or "15 Jan 2024".
  • Year Range Validation: Ignores dates outside the specified year range (default 1900-2100).
  • Configurable Output: Defaults to %Y-%m-%d but can be customized.

Configuration:

from strutex import DatePostprocessor

pp = DatePostprocessor(
    date_fields=["dob", "start_date"],  # Optional: specific fields
    separators=["-", "/", "."],         # Optional: custom separators
    output_format="%Y-%m-%d",           # Optional: custom output format
    min_year=1900,                      # Optional: min year validation
    max_year=2100                       # Optional: max year validation
)
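
To make the parsing features listed above concrete, here is a minimal standard-library sketch of one way this kind of normalization could work. It is not strutex's implementation: normalize_date and _CANDIDATE_FORMATS are hypothetical names, and reading ambiguous numeric dates as day-first is an assumption made for this example.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical helper, not part of strutex. Layouts are tried in order.
_CANDIDATE_FORMATS = [
    "%Y-%m-%d", "%d-%m-%Y",              # separator-based (separators unified to "-")
    "%Y%m%d", "%d%m%Y",                  # compact, no separators
    "%B %d, %Y", "%b %d, %Y",            # "January 15, 2024" / "Jan 15, 2024"
    "%d %B %Y", "%d %b %Y",              # "15 January 2024" / "15 Jan 2024"
]


def normalize_date(value: str, output_format: str = "%Y-%m-%d",
                   min_year: int = 1900, max_year: int = 2100) -> Optional[str]:
    """Return the date in output_format, or None if it cannot be parsed."""
    text = value.strip()
    # Rewrite "/", ".", "_" or a space as "-" when it sits between two digits,
    # so text-month inputs such as "15 Jan 2024" are left untouched.
    unified = re.sub(r"(?<=\d)[/._ ](?=\d)", "-", text)
    for fmt in _CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(unified, fmt)
        except ValueError:
            continue  # layout did not match, try the next one
        if min_year <= parsed.year <= max_year:
            return parsed.strftime(output_format)
    return None


print(normalize_date("15.01.2024"))
print(normalize_date("January 15, 2024"))
```

Month names in %B and %b follow the active locale, so the text-month layouts assume English month names under Python's default locale.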

NumberPostprocessor

Parses formatted number strings (e.g., currency, percentages) into float or int values.

It automatically detects fields like "total", "amount", "price", "cost", "sum", "qty".

Features:

  • Currency Handling: Removes symbols like $, €, £, ¥.
  • Locale Awareness: Handles US (1,234.56) and European (1.234,56) formats.
  • Negative Numbers: Treats values in parentheses as negative, e.g. (100) -> -100.

Configuration:

from strutex import NumberPostprocessor

# Default (US locale)
pp = NumberPostprocessor()

# European locale (dot as thousand separator, comma as decimal)
pp_eu = NumberPostprocessor(locale="de_DE")
pp_eu.process({"total": "1.234,56 €"})
# Result: {"total": 1234.56}
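
The cleaning steps described above (strip currency symbols, resolve thousand and decimal separators, treat parentheses as a minus sign) can be sketched with the standard library alone. This is an illustration, not strutex's code: parse_number is a hypothetical function, and its decimal_comma flag is a simplified stand-in for the locale option.

```python
import re
from typing import Optional, Union

Number = Union[int, float]


def parse_number(value: str, decimal_comma: bool = False) -> Optional[Number]:
    """Hypothetical helper: parse a formatted number string, or return None."""
    text = value.strip()
    # Accounting notation: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    # Keep only digits, separators, and a minus sign (drops $, €, %, spaces).
    cleaned = re.sub(r"[^\d.,\-]", "", text)
    if decimal_comma:
        # European style: "." groups thousands, "," marks decimals.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # US style: "," groups thousands, "." marks decimals.
        cleaned = cleaned.replace(",", "")
    try:
        number = float(cleaned)
    except ValueError:
        return None
    if negative:
        number = -abs(number)
    # Return an int when there is no fractional part.
    return int(number) if number.is_integer() else number


print(parse_number("$1,234.56"))
print(parse_number("1.234,56 €", decimal_comma=True))
print(parse_number("(100)"))
```

Passing the separator convention explicitly avoids guessing: a string such as "1.234" is a valid number under both conventions, with different values.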

CurrencyNormalizer

Converts monetary amounts to a base currency. Adds new fields with a suffix (e.g., _usd).

Features:

  • Live Rates: Can fetch current exchange rates from a public API.
  • Static Rates: Can use a provided dictionary of exchange rates.
  • Auto-Conversion: Converts fields if a currency field is present in the data.

Configuration:

from strutex import CurrencyNormalizer

# Using static rates
pp = CurrencyNormalizer(
    base_currency="USD",
    exchange_rates={"EUR": 1.10, "GBP": 1.27}
)
result = pp.process({"total": 100, "currency": "EUR"})
# Result: {"total": 100, "currency": "EUR", "total_usd": 110.0}

# Fetching live rates
pp_live = CurrencyNormalizer(
    base_currency="USD",
    fetch_rates=True
)
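
The static-rate conversion can be pictured as a lookup followed by a multiplication. The sketch below is an assumption-laden illustration, not the library's implementation: add_base_amount is a hypothetical function, rates are read as "base-currency units per one unit of the foreign currency" (which matches the 100 EUR -> 110 USD example above), and rounding to two decimals is a choice made for this example.

```python
from typing import Any, Dict


def add_base_amount(data: Dict[str, Any], field: str,
                    rates: Dict[str, float],
                    base_currency: str = "USD") -> Dict[str, Any]:
    """Hypothetical helper: add '<field>_<base>' when a rate is available."""
    currency = data.get("currency")
    amount = data.get(field)
    # Leave the record untouched if there is nothing to convert.
    if currency is None or not isinstance(amount, (int, float)):
        return data
    # An amount already in the base currency converts at a rate of 1.
    rate = 1.0 if currency == base_currency else rates.get(currency)
    if rate is None:
        return data  # unknown currency: no converted field is added
    result = dict(data)  # copy so the input record is not mutated
    result[f"{field}_{base_currency.lower()}"] = round(amount * rate, 2)
    return result


record = add_base_amount({"total": 100, "currency": "EUR"},
                         "total", {"EUR": 1.10, "GBP": 1.27})
print(record)
```

Keeping the original amount and currency alongside the converted field means the conversion can be audited or redone later with different rates.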

PostprocessorChain

Executes a list of postprocessors in sequence.

from strutex import PostprocessorChain, DatePostprocessor

chain = PostprocessorChain([
    DatePostprocessor(),
    # ... other postprocessors
])
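
Conceptually, running steps "in sequence" means each step receives the output of the previous one. The following sketch shows that pattern with plain functions; run_chain, strip_strings, and upper_title are hypothetical names used only for illustration and say nothing about how PostprocessorChain is implemented internally.

```python
from typing import Any, Callable, Dict, Iterable

Step = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_chain(steps: Iterable[Step], data: Dict[str, Any]) -> Dict[str, Any]:
    """Feed the output of each step into the next one, in list order."""
    for step in steps:
        data = step(data)
    return data


def strip_strings(data: Dict[str, Any]) -> Dict[str, Any]:
    # Trim surrounding whitespace from every string value.
    return {k: v.strip() if isinstance(v, str) else v for k, v in data.items()}


def upper_title(data: Dict[str, Any]) -> Dict[str, Any]:
    # Upper-case the title if it is present and is a string.
    title = data.get("title")
    if isinstance(title, str):
        return {**data, "title": title.upper()}
    return data


print(run_chain([strip_strings, upper_title], {"title": "  invoice ", "pages": 3}))
```

Because each step sees the previous step's output, order matters: cleaning steps generally belong before steps that parse or convert values.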

Creating Custom Postprocessors

You can create your own postprocessor by inheriting from strutex.Postprocessor.

from typing import Dict, Any
from strutex import Postprocessor, register

@register
class MyCustomPostprocessor(Postprocessor, name="my_custom"):
    priority = 50

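    def process(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Modify data in place or return a new dict.
        # Check the type first: an extracted value may be None or a non-string.
        title = data.get("title")
        if isinstance(title, str):
            data["title"] = title.upper()
        return data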