Processors
Strutex provides 11 specialized extraction strategies. Each processor is optimized for a specific production requirement, from simple one-shot extractions to high-stakes multi-model consensus.
1. SimpleProcessor
The standard extraction strategy. It takes a document and a schema, and returns structured data in a single LLM call.
2. VerifiedProcessor
Adds a "verification loop" in which the LLM reviews and corrects its own output. This is intended to reduce hallucinations, at the cost of additional LLM calls.
# Enable via 'verify=True' or using the strategy directly
result = processor.process("contract.pdf", "Extract clauses", verify=True)
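The verification loop can be pictured as an extract-then-review cycle. The sketch below illustrates the idea only and is not strutex's implementation: `verified_extract`, `fake_extract`, `fake_review`, and `max_rounds` are hypothetical names, and the two LLM calls are simulated with plain functions.

```python
from typing import Callable, Dict, Optional


def verified_extract(
    extract: Callable[[str], Dict],
    review: Callable[[str, Dict], Optional[Dict]],
    document: str,
    max_rounds: int = 2,
) -> Dict:
    """Extract once, then let a reviewer correct the draft until it accepts it."""
    draft = extract(document)
    for _ in range(max_rounds):
        corrected = review(document, draft)  # None means "no changes needed"
        if corrected is None:
            break
        draft = corrected
    return draft


# Hypothetical stand-ins for the two LLM calls.
def fake_extract(document: str) -> Dict:
    return {"parties": 3}  # a deliberately wrong first draft


def fake_review(document: str, draft: Dict) -> Optional[Dict]:
    actual = document.count("Party")
    return None if draft["parties"] == actual else {"parties": actual}


verified = verified_extract(fake_extract, fake_review, "Party A and Party B agree.")
```

Capping the number of rounds keeps the cost bounded when the reviewer never fully accepts the draft.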
3. RagProcessor
Retrieval-Augmented Generation for massive documents that don't fit in a single prompt. It retrieves relevant chunks before extraction.
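As a rough illustration of "retrieve relevant chunks before extraction", the sketch below splits text into chunks and ranks them by word overlap with the query. This is an assumption-laden toy: `split_into_chunks` and `retrieve` are hypothetical helpers, and production retrieval typically relies on embeddings rather than word overlap. How strutex retrieves chunks is not described here.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def retrieve(chunks: List[str], query: str, top_k: int = 2) -> List[str]:
    """Return the top_k chunks sharing the most words with the query (toy scoring)."""
    query_words = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:top_k]


chunks = [
    "The lease starts in March",
    "Rent is due on the first",
    "Termination requires ninety days notice",
    "The tenant pays utilities",
]
relevant = retrieve(chunks, "termination notice", top_k=1)
```

Only the retrieved chunks would then be passed to the extraction prompt, which keeps the prompt within the model's context limit.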
4. BatchProcessor
Processes multiple documents in parallel using a thread pool. Well suited to high-volume ingestion.
batch_context = processor.batch.process(["doc1.pdf", "doc2.pdf"], "Extract titles")
results = batch_context.results
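The thread-pool pattern behind batch processing can be sketched with the standard library alone. `process_batch` and `fake_process` are hypothetical names used for illustration; they are not part of strutex.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def process_batch(
    process_one: Callable[[str], Dict],
    paths: List[str],
    max_workers: int = 4,
) -> List[Dict]:
    """Run process_one over every path in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, paths))


# Hypothetical stand-in for a single-document extraction call.
def fake_process(path: str) -> Dict:
    return {"file": path, "title": path.rsplit(".", 1)[0].upper()}


batch_results = process_batch(fake_process, ["doc1.pdf", "doc2.pdf"])
```

Threads suit this workload because each extraction spends most of its time waiting on network I/O rather than on the CPU.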
5. FallbackProcessor
Implements a safety net. It tries multiple providers in order, moving to the next one when a provider fails, which is useful for cost optimization or for ensuring high availability.
fallback = processor.create_fallback(configs=[
    {"provider": "gemini", "model_name": "gemini-2.5-flash"},  # Fast/Cheap
    {"provider": "openai", "model_name": "gpt-4o"},  # Powerful/Reliable
])
result = fallback.process("data.pdf", "Extract")
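Conceptually, a fallback chain is a loop that returns the first successful result. The following is a minimal sketch under that assumption. `process_with_fallback`, `flaky_provider`, and `steady_provider` are hypothetical, and the way strutex decides that a provider has failed may differ.

```python
from typing import Callable, Dict, List, Tuple


def process_with_fallback(
    providers: List[Tuple[str, Callable[[str], Dict]]],
    document: str,
) -> Dict:
    """Try each provider in order and return the first successful result."""
    errors = []
    for name, call in providers:
        try:
            result = call(document)
        except Exception as exc:  # any failure triggers the next provider
            errors.append(f"{name}: {exc}")
            continue
        return {"provider": name, "data": result}
    raise RuntimeError("All providers failed: " + "; ".join(errors))


# Hypothetical providers: the first always fails, the second succeeds.
def flaky_provider(document: str) -> Dict:
    raise TimeoutError("rate limited")


def steady_provider(document: str) -> Dict:
    return {"length": len(document)}


outcome = process_with_fallback(
    [("cheap", flaky_provider), ("reliable", steady_provider)],
    "hello",
)
```

Placing the cheaper provider first means the more expensive one is only paid for when it is needed.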
6. RouterProcessor
Classifies a document before processing, then routes it to the specialized processor registered for that document type.
router = processor.create_router(routes={"invoice": invoice_proc, "id": id_proc})
result = router.process("unknown.jpg", "Extract")
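Classify-then-route can be sketched as a lookup keyed by the predicted document type. The keyword classifier below is a hypothetical stand-in for whatever classification step the router actually performs (most likely an LLM call), and `route` with its `default` handler is an assumption made for illustration.

```python
from typing import Callable, Dict


def classify(text: str) -> str:
    """Toy classifier: pick a document type from keywords."""
    lowered = text.lower()
    if "invoice" in lowered or "amount due" in lowered:
        return "invoice"
    if "passport" in lowered or "date of birth" in lowered:
        return "id"
    return "unknown"


def route(
    routes: Dict[str, Callable[[str], Dict]],
    text: str,
    default: Callable[[str], Dict],
) -> Dict:
    """Send the document to the handler registered for its type."""
    handler = routes.get(classify(text), default)
    return handler(text)


routes = {
    "invoice": lambda text: {"type": "invoice"},
    "id": lambda text: {"type": "id"},
}
routed = route(routes, "Passport No. X123, date of birth 1990-01-01", lambda text: {"type": "generic"})
```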
7. EnsembleProcessor
Runs multiple models in parallel (e.g., Gemini and GPT-4) and uses a "judge" model to resolve contradictions.
ensemble = processor.create_ensemble(providers=[p1, p2], judge=judge_p)
result = ensemble.process("medical_report.pdf", "Extract diagnosis")
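One way to picture judge-based resolution: collect every model's answer, keep the fields they agree on, and hand only the conflicting fields to a judge. This sketch runs the extractors sequentially for brevity and uses a majority vote as the judge. `run_ensemble` and `majority_judge` are hypothetical, and in the real processor the judge is a model rather than a vote.

```python
from collections import Counter
from typing import Any, Callable, Dict, List


def run_ensemble(
    extractors: List[Callable[[str], Dict[str, Any]]],
    judge: Callable[[str, List[Any]], Any],
    document: str,
) -> Dict[str, Any]:
    """Merge candidate extractions, asking the judge only about disagreements."""
    candidates = [extract(document) for extract in extractors]
    merged: Dict[str, Any] = {}
    for field in sorted({key for candidate in candidates for key in candidate}):
        values = [candidate[field] for candidate in candidates if field in candidate]
        if len(set(values)) == 1:
            merged[field] = values[0]  # unanimous, no judge needed
        else:
            merged[field] = judge(field, values)
    return merged


def majority_judge(field: str, values: List[Any]) -> Any:
    """Hypothetical judge: pick the most common value."""
    return Counter(values).most_common(1)[0][0]


extractors = [
    lambda doc: {"diagnosis": "flu", "age": 40},
    lambda doc: {"diagnosis": "flu", "age": 41},
    lambda doc: {"diagnosis": "cold", "age": 40},
]
consensus = run_ensemble(extractors, majority_judge, "medical report text")
```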
8. SequentialProcessor
Processes very long documents page-by-page, carrying a "running state" to ensure consistency across the whole file.
sequential = processor.create_sequential(chunk_size_pages=1)
result = sequential.process("legal_bundle.pdf", "Extract timeline")
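The "running state" idea amounts to a fold over pages: each step receives the state accumulated so far and returns an updated one. The sketch below assumes a state shaped as a list of events. `process_sequentially` and `fake_extract_page` are hypothetical, and the latter stands in for an LLM call that would receive the state as context.

```python
from typing import Callable, Dict, List


def process_sequentially(
    pages: List[str],
    extract_page: Callable[[str, int, Dict], Dict],
) -> Dict:
    """Process pages in order, threading the running state through each step."""
    state: Dict = {"events": []}
    for page_number, page in enumerate(pages, start=1):
        state = extract_page(page, page_number, state)
    return state


def fake_extract_page(page: str, page_number: int, state: Dict) -> Dict:
    """Stand-in for an LLM call: append events found on this page to the state."""
    events = list(state["events"])
    for line in page.splitlines():
        if line.startswith("EVENT:"):
            events.append((page_number, line[len("EVENT:"):].strip()))
    return {"events": events}


timeline = process_sequentially(
    ["EVENT: contract signed\nfiller text", "no events here", "EVENT: notice served"],
    fake_extract_page,
)
```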
9. PrivacyProcessor
Redacts PII (emails, phone numbers, SSNs) locally before sending the document to an LLM provider, then restores the original values in the final result.
privacy = processor.create_privacy()
result = privacy.process("patient_record.txt", "Extract symptoms")
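The redact-then-restore round trip can be sketched with placeholder substitution. The regular expressions here are deliberately simple assumptions (US-style phone numbers and SSNs only) and are not the patterns strutex uses. `redact` and `restore` are hypothetical helper names.

```python
import re
from typing import Dict, Tuple

# Simplistic illustrative patterns; real PII detection needs broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each PII match with a unique placeholder and remember the original."""
    mapping: Dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        def replace(match: "re.Match", label: str = label) -> str:
            placeholder = f"<{label}_{len(mapping)}>"
            mapping[placeholder] = match.group(0)
            return placeholder

        text = pattern.sub(replace, text)
    return text, mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back in place of the placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


original = "Contact jane@example.com or 555-123-4567, SSN 123-45-6789."
redacted, mapping = redact(original)
```

Only `redacted` would leave the machine. The mapping stays local and is applied to the provider's response with `restore`.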
10. ActiveLearningProcessor
Assesses extraction confidence. It flags results that require human review based on consistency across multiple trials.
active = processor.create_active(num_trials=3, confidence_threshold=0.9)
result = active.process("blurred_id.jpg", "Extract number")
# result contains "_confidence" and "_requires_review"
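A simple way to turn "consistency across multiple trials" into a number is the share of trials that agree with the most common answer. The sketch below assumes that definition. `assess` and the `value` key are hypothetical, while `_confidence` and `_requires_review` mirror the keys mentioned above. The actual scoring used by the processor may differ.

```python
from collections import Counter
from typing import Callable, Dict


def assess(
    run_trial: Callable[[int], str],
    num_trials: int = 3,
    confidence_threshold: float = 0.9,
) -> Dict:
    """Run several trials and flag the result when agreement is too low."""
    answers = [run_trial(i) for i in range(num_trials)]
    value, votes = Counter(answers).most_common(1)[0]
    confidence = votes / num_trials
    return {
        "value": value,
        "_confidence": confidence,
        "_requires_review": confidence < confidence_threshold,
    }


# Hypothetical trials: the third read of a blurred ID disagrees with the others.
noisy_reads = ["A123", "A123", "A128"]
flagged = assess(lambda i: noisy_reads[i])
stable = assess(lambda i: "A123")
```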
11. AgenticProcessor
The most advanced processor. It uses an autonomous Plan-Act-Observe loop (powered by LangGraph) to navigate complex documents using other processors and internal library utilities as tools.
result = processor.agentic.process(
    "100_page_contract.pdf",
    "Find all termination clauses and liability limits across the document",
)
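Stripped of the framework, a Plan-Act-Observe loop is a bounded loop in which a planner picks the next tool call based on what has been observed so far. The sketch below uses a scripted planner and a single search tool instead of LangGraph and an LLM. `run_agent`, `scripted_plan`, `PAGES`, and the `search` tool are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]  # (tool name, argument)

PAGES = [
    "Definitions",
    "Termination for cause",
    "Limitation of liability",
    "Termination for convenience",
]


def run_agent(
    goal: str,
    tools: Dict[str, Callable[[str], object]],
    plan: Callable[[str, List], Optional[Action]],
    max_steps: int = 5,
) -> List:
    """Repeat plan -> act -> observe until the planner stops or steps run out."""
    observations: List = []
    for _ in range(max_steps):
        action = plan(goal, observations)  # Plan
        if action is None:
            break
        tool_name, argument = action
        result = tools[tool_name](argument)  # Act
        observations.append((tool_name, argument, result))  # Observe
    return observations


def scripted_plan(goal: str, observations: List) -> Optional[Action]:
    """Stand-in for an LLM planner: walk through a fixed list of actions."""
    steps = [("search", "termination"), ("search", "liability")]
    return steps[len(observations)] if len(observations) < len(steps) else None


tools = {
    "search": lambda term: [n for n, page in enumerate(PAGES, start=1) if term in page.lower()],
}
trace = run_agent("Find termination clauses and liability limits", tools, scripted_plan)
```

The `max_steps` cap is the important design choice: it guarantees termination even if the planner never decides it is done.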
Creating Your Own Strategy
The true power of strutex lies in its extensibility. You can create entirely new extraction workflows by inheriting from strutex.processors.base.Processor.
By inheriting from the base class, your custom processor automatically gains:
- Automatic Provider Resolution: Accepts provider names as strings or provider instances.
- Global Hooks: Pre-process, post-process, and error hooks run automatically.
- Validation Infrastructure: Built-in support for Pydantic and native schemas.
- Security Integration: Automatic execution of security plugins.
Example: A custom "Debug" Processor
from strutex.processors import Processor
from typing import Any


class DebugProcessor(Processor):
    def process(self, file_path: str, prompt: str, **kwargs) -> Any:
        # Log which file is being handled before running any extraction logic
        print(f"DEBUG: Processing {file_path}")
        # Your custom logic here
        return {"status": "debug_ok"}

    async def aprocess(self, file_path: str, prompt: str, **kwargs) -> Any:
        # Async implementation: delegates to the base class
        return await super().aprocess(file_path, prompt, **kwargs)