Postprocessors¶
Postprocessors are plugins that run after the LLM extraction but before validation. They are used to normalize, clean, or enrich the extracted data.
strutex includes several built-in postprocessors for common data cleaning tasks.
Using Postprocessors¶
You can use postprocessors individually or chain them together.
from strutex import DatePostprocessor, NumberPostprocessor, PostprocessorChain
# Individual usage
date_pp = DatePostprocessor()
data = date_pp.process({"invoice_date": "15.01.2024"})
# Result: {"invoice_date": "2024-01-15"}
# Chained usage
chain = PostprocessorChain([
    DatePostprocessor(),
    NumberPostprocessor(),
])
result = chain.process(raw_data)
Built-in Postprocessors¶
DatePostprocessor¶
Normalizes date fields to a standard ISO format (YYYY-MM-DD).
It automatically detects fields with "date" in their name, or you can specify a list of date_fields.
Features:
- Dynamic Format Support: Automatically recognizes dates using standard separators (-, /, ., space, _).
- No-Separator Support: Parses compact formats like 20240115 or 15012024.
- Text Month Support: Parses formats like "January 15, 2024" or "15 Jan 2024".
- Year Range Validation: Ignores dates outside the specified year range (default 1900-2100).
- Configurable Output: Defaults to %Y-%m-%d but can be customized.
Configuration:
from strutex import DatePostprocessor
pp = DatePostprocessor(
    date_fields=["dob", "start_date"],  # Optional: specific fields
    separators=["-", "/", "."],         # Optional: custom separators
    output_format="%Y-%m-%d",           # Optional: custom output format
    min_year=1900,                      # Optional: min year validation
    max_year=2100,                      # Optional: max year validation
)
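To make the features above concrete, here is a minimal standalone sketch of format detection plus year-range validation. It is not strutex's implementation: the `normalize_date` function and the `CANDIDATE_FORMATS` list are hypothetical names chosen for illustration, and the real postprocessor may detect formats quite differently.

```python
from datetime import datetime
from typing import Optional

# Hypothetical candidate formats, tried in order (an assumption for this sketch).
CANDIDATE_FORMATS = [
    "%Y-%m-%d",   # 2024-01-15
    "%d.%m.%Y",   # 15.01.2024
    "%d/%m/%Y",   # 15/01/2024
    "%Y%m%d",     # 20240115 (no separator)
    "%B %d, %Y",  # January 15, 2024
    "%d %b %Y",   # 15 Jan 2024
]


def normalize_date(
    value: str,
    output_format: str = "%Y-%m-%d",
    min_year: int = 1900,
    max_year: int = 2100,
) -> Optional[str]:
    """Return the date in output_format, or None if unparseable or out of range."""
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # This format does not match; try the next one.
        if min_year <= parsed.year <= max_year:
            return parsed.strftime(output_format)
        return None  # Parsed, but the year is outside the allowed range.
    return None
```

A caller that receives `None` can leave the original value untouched, which is one way to read "ignores dates outside the specified year range". Note that ambiguous inputs such as 01/02/2024 are resolved here purely by the order of the candidate list.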
NumberPostprocessor¶
Parses formatted number strings (e.g., currency, percentages) into float or int values.
It automatically detects fields like "total", "amount", "price", "cost", "sum", "qty".
Features:
- Currency Handling: Removes symbols like $, €, £, ¥.
- Locale Awareness: Handles US (1,234.56) and European (1.234,56) formats.
- Negative Numbers: Handles parentheses, e.g. (100) -> -100.
Configuration:
from strutex import NumberPostprocessor
# Default (US locale)
pp = NumberPostprocessor()
# European locale (dot as thousand separator, comma as decimal)
pp_eu = NumberPostprocessor(locale="de_DE")
pp_eu.process({"total": "1.234,56 €"})
# Result: {"total": 1234.56}
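The parsing steps described above (strip symbols, resolve separators by locale, treat parentheses as negative) can be sketched in plain Python. This is an illustration only, not the library's code: `parse_number` and its `decimal_sep` parameter are hypothetical, standing in for whatever the `locale` option does internally.

```python
import re
from typing import Union


def parse_number(text: str, decimal_sep: str = ".") -> Union[int, float]:
    """Parse a formatted number string.

    decimal_sep is "." for US-style input and "," for European-style input.
    """
    cleaned = text.strip()
    # Accounting notation: "(100)" means -100.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    # Keep only digits, separators, and a minus sign.
    # This drops currency symbols, percent signs, parentheses, and spaces.
    cleaned = re.sub(r"[^\d.,\-]", "", cleaned)
    thousands_sep = "," if decimal_sep == "." else "."
    cleaned = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    value = float(cleaned)
    if negative:
        value = -value
    # Return an int when there is no fractional part.
    return int(value) if value.is_integer() else value


us_total = parse_number("$1,234.56")
eu_total = parse_number("1.234,56 €", decimal_sep=",")
refund = parse_number("(100)")
```

Passing the decimal separator explicitly avoids guessing: a string like 1.234 is one thousand two hundred thirty-four in a European locale but just over one in a US locale, and nothing in the string itself says which.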
CurrencyNormalizer¶
Converts monetary amounts to a base currency. Adds new fields with a suffix (e.g., _usd).
Features:
- Live Rates: Can fetch current exchange rates from a public API.
- Static Rates: Can use a provided dictionary of exchange rates.
- Auto-Conversion: Converts fields if a currency field is present in the data.
Configuration:
from strutex import CurrencyNormalizer
# Using static rates
pp = CurrencyNormalizer(
    base_currency="USD",
    exchange_rates={"EUR": 1.10, "GBP": 1.27},
)
result = pp.process({"total": 100, "currency": "EUR"})
# Result: {"total": 100, "currency": "EUR", "total_usd": 110.0}
# Fetching live rates
pp_live = CurrencyNormalizer(
    base_currency="USD",
    fetch_rates=True,
)
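For readers who want to see the static-rate arithmetic spelled out, here is a small standalone sketch. It assumes each rate means "one unit of the source currency equals this many units of the base currency", which matches the EUR example above. The function name `add_base_currency_fields`, the `amount_fields` parameter, and the rounding to two decimals are all assumptions made for this illustration, not documented strutex behavior.

```python
from typing import Any, Dict, Sequence


def add_base_currency_fields(
    data: Dict[str, Any],
    rates: Dict[str, float],
    base_currency: str = "USD",
    amount_fields: Sequence[str] = ("total",),
) -> Dict[str, Any]:
    """Return a copy of data with converted amounts added under a suffix."""
    result = dict(data)
    currency = data.get("currency")
    # Nothing to convert without a known, non-base currency.
    if currency is None or currency == base_currency or currency not in rates:
        return result
    suffix = "_" + base_currency.lower()  # e.g. "_usd"
    for field in amount_fields:
        value = data.get(field)
        # Skip missing or non-numeric values (bool is excluded on purpose).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Rounding to 2 decimals is an assumption of this sketch.
            result[field + suffix] = round(value * rates[currency], 2)
    return result


converted = add_base_currency_fields(
    {"total": 100, "currency": "EUR"},
    {"EUR": 1.10, "GBP": 1.27},
)
```

The original amount and currency are kept alongside the converted field, so downstream validation can still see what the document actually stated.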
PostprocessorChain¶
Executes a list of postprocessors in sequence.
from strutex import PostprocessorChain, DatePostprocessor
chain = PostprocessorChain([
    DatePostprocessor(),
    # ... other postprocessors
])
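Sequential execution means each postprocessor receives the output of the previous one, so order can matter. The sketch below shows that idea with plain functions; `run_chain`, `strip_strings`, and `uppercase_title` are hypothetical helpers for illustration and are not part of strutex.

```python
from typing import Any, Callable, Dict, List

Step = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_chain(steps: List[Step], data: Dict[str, Any]) -> Dict[str, Any]:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        data = step(data)
    return data


def strip_strings(data: Dict[str, Any]) -> Dict[str, Any]:
    """Trim surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in data.items()}


def uppercase_title(data: Dict[str, Any]) -> Dict[str, Any]:
    """Upper-case the "title" field if it is a string."""
    result = dict(data)
    if isinstance(result.get("title"), str):
        result["title"] = result["title"].upper()
    return result


cleaned = run_chain([strip_strings, uppercase_title], {"title": "  invoice ", "qty": 3})
```

In a real pipeline the same reasoning applies to the built-in classes: for example, parsing numbers before converting currencies ensures the conversion step sees numeric values rather than formatted strings.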
Creating Custom Postprocessors¶
You can create your own postprocessor by inheriting from strutex.Postprocessor.
from typing import Dict, Any
from strutex import Postprocessor, register

@register
class MyCustomPostprocessor(Postprocessor, name="my_custom"):
    priority = 50

    def process(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Modify data in place or return new dict
        if "title" in data:
            data["title"] = data["title"].upper()
        return data