Caching¶
Reduce API costs and speed up repeated extractions with caching.
Enable Caching¶
Pass a cache instance to DocumentProcessor:
from strutex import DocumentProcessor, MemoryCache, GeminiProvider
processor = DocumentProcessor(
provider=GeminiProvider(),
cache=MemoryCache(max_size=100, ttl=3600) # 1 hour TTL
)
# First call: hits the LLM API
result1 = processor.process("invoice.pdf", "Extract data", schema=MySchema)
# Second call: returns cached result instantly (no API call!)
result2 = processor.process("invoice.pdf", "Extract data", schema=MySchema)
Cache Types¶
MemoryCache¶
Fast, in-memory LRU cache. Best for single-process apps.
from strutex import MemoryCache
cache = MemoryCache(
max_size=100, # Max entries (LRU eviction)
ttl=3600 # TTL in seconds (optional)
)
processor = DocumentProcessor(provider=provider, cache=cache)
Features:
- LRU eviction when full
- Optional TTL (time-to-live)
- Thread-safe
- Hit/miss statistics
SQLiteCache¶
Persistent cache that survives restarts.
from strutex import SQLiteCache
cache = SQLiteCache(
db_path="~/.cache/strutex/cache.db",
ttl=86400, # 24 hour TTL
max_size=1000 # Optional size limit
)
processor = DocumentProcessor(provider=provider, cache=cache)
Features:
- Persists across restarts
- Automatic table creation
- Lazy TTL cleanup
FileCache¶
File-based cache for simple deployments.
How Cache Keys Work¶
Cache keys are computed from:
- File content hash — Same content = same key (even if renamed)
- Prompt hash — Different prompts = different keys
- Schema hash — Different schemas = different keys
- Provider + model — Different providers = different keys
# These produce DIFFERENT cache keys:
processor.process("invoice.pdf", "Extract all", schema=InvoiceSchema)
processor.process("invoice.pdf", "Extract total only", schema=InvoiceSchema)
# This produces the SAME cache key (same content, prompt, schema):
processor.process("invoice.pdf", "Extract all", schema=InvoiceSchema)
processor.process("invoice_copy.pdf", "Extract all", schema=InvoiceSchema) # Same content
Cache Invalidation Pitfalls¶
[!WARNING] Caching can cause stale data if you're not careful. Understand when caches invalidate.
When Cache DOES Invalidate¶
| Change | Invalidates? | Why |
|---|---|---|
| File content changes | Yes | Content hash changes |
| Prompt changes | Yes | Prompt hash changes |
| Schema adds new fields | Yes | Schema hash changes |
| Provider changes | Yes | Provider in key |
| Model changes | Yes | Model in key |
When Cache Does NOT Invalidate¶
| Change | Invalidates? | Risk |
|---|---|---|
| Field descriptions change | No | Schema structure same, but extraction may differ |
| Pydantic validators change | No | Post-processing logic not in key |
| Provider credentials change | No | Key uses provider name, not credentials |
| Prompt template internals | No | Only final prompt string is hashed |
Safe Cache Usage¶
# DANGER: Schema description changed, but cache returns old extraction
class Invoice_v1(BaseModel):
total: float = Field(description="Total amount") # Old
class Invoice_v2(BaseModel):
total: float = Field(description="Total amount including tax") # Changed!
# These have THE SAME cache key because structure is identical:
# processor.process(..., model=Invoice_v1)
# processor.process(..., model=Invoice_v2) # Returns stale cached result!
# FIX: Clear cache when schema semantics change
cache.clear()
# Or use versioned prompts
processor.process(..., prompt="Extract invoice v2", model=Invoice_v2)
Recommendations¶
- Clear cache after schema changes — Especially description changes
- Use TTL — Don't cache forever:
MemoryCache(ttl=3600) - Version your prompts — Include version in prompt string
- Test with fresh cache — After model updates
Cache Statistics¶
from strutex import MemoryCache
cache = MemoryCache()
processor = DocumentProcessor(provider=provider, cache=cache)
# After some usage...
stats = cache.stats()
print(f"Hits: {stats['hits']}")
print(f"Misses: {stats['misses']}")
print(f"Hit rate: {stats['hit_rate']:.1%}")
print(f"Size: {stats['size']}")
Manual Cache Operations¶
from strutex import CacheKey, MemoryCache
cache = MemoryCache()
# Create a cache key manually
key = CacheKey.create(
file_path="invoice.pdf",
prompt="Extract",
schema=MySchema,
provider="gemini"
)
# Check if cached
if cache.has(key):
result = cache.get(key)
else:
result = processor.process(...)
cache.set(key, result)
# Delete specific entry
cache.delete(key)
# Clear all
cache.clear()
Next Steps¶
| Want to... | Go to... |
|---|---|
| Add processing hooks | Processing Hooks |
| Add security checks | Security |
| Process in batches | Batch Processing |