Framework Integrations¶
Use strutex with LangChain, LlamaIndex, and other AI frameworks.
[!WARNING] > Experimental: These integrations may break with framework updates. LangChain, LlamaIndex, and Haystack evolve rapidly. Pin dependency versions.
Overview¶
Strutex integrates with popular AI/ML frameworks for RAG pipelines:
┌─────────────────────────────────────────────────────────────┐
│ Your RAG Pipeline │
├─────────────────────────────────────────────────────────────┤
│ Documents → [Strutex] → Structured JSON → Vector Store │
│ ↓ │
│ Query → [LLM] → Answer │
└─────────────────────────────────────────────────────────────┘
Installation¶
# LangChain integration
pip install strutex[langchain]
# LlamaIndex integration
pip install strutex[llamaindex]
# Both
pip install strutex[all]
LangChain Integration¶
StrutexLoader¶
Use as a LangChain document loader:
from strutex.integrations import StrutexLoader
from strutex.schemas import INVOICE_US
# Create loader
loader = StrutexLoader(
file_path="invoice.pdf",
schema=INVOICE_US,
provider="gemini"
)
# Load documents
documents = loader.load()
# Use in LangChain pipeline
print(documents[0].page_content) # JSON string
print(documents[0].metadata) # {"source": "invoice.pdf", ...}
With Vector Store¶
from strutex.integrations import StrutexLoader
from strutex.schemas import INVOICE_US
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# Load and extract invoices
loader = StrutexLoader("invoices/jan.pdf", schema=INVOICE_US)
docs = loader.load()
# Create vector store
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
# Query
results = vectorstore.similarity_search("highest total invoice")
StrutexOutputParser¶
Validate LLM output with strutex schemas:
from strutex.integrations import StrutexOutputParser
from pydantic import BaseModel
class InvoiceData(BaseModel):
vendor: str
total: float
date: str
parser = StrutexOutputParser(
schema=InvoiceData,
validators=["schema", "sum"] # Use strutex validators
)
# Parse LLM response
result = parser.parse(llm_response_text)
print(result.vendor) # Validated Pydantic model
# Get format instructions for prompts
instructions = parser.get_format_instructions()
LlamaIndex Integration¶
StrutexReader¶
Use as a LlamaIndex document reader:
from strutex.integrations import StrutexReader
from strutex.schemas import INVOICE_GENERIC
reader = StrutexReader(
schema=INVOICE_GENERIC,
provider="openai"
)
# Load data
documents = reader.load_data("invoice.pdf")
# Build index
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What was the total amount?")
StrutexNodeParser¶
Keep structured documents as single nodes (prevents chunking):
from strutex.integrations import StrutexReader, StrutexNodeParser
reader = StrutexReader(schema=MySchema)
docs = reader.load_data("complex_doc.pdf")
# Don't chunk structured JSON
parser = StrutexNodeParser()
nodes = parser.get_nodes_from_documents(docs)
# Each document stays as one node
print(len(nodes)) # Same as len(docs)
Full RAG Pipeline Example¶
from strutex.integrations import StrutexLoader
from strutex.schemas import INVOICE_US
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from pathlib import Path
# 1. Load all invoices from directory
documents = []
for pdf in Path("invoices/").glob("*.pdf"):
loader = StrutexLoader(str(pdf), schema=INVOICE_US)
documents.extend(loader.load())
print(f"Loaded {len(documents)} invoices")
# 2. Create vector store
vectorstore = Chroma.from_documents(
documents,
OpenAIEmbeddings()
)
# 3. Build QA chain
qa = RetrievalQA.from_chain_type(
llm=ChatOpenAI(),
retriever=vectorstore.as_retriever()
)
# 4. Query your invoices
answer = qa.invoke("Which vendor had the highest total?")
print(answer)
answer = qa.invoke("List all invoices from January")
print(answer)
Haystack Integration¶
Use in Haystack 2.x pipelines:
from strutex.integrations import StrutexConverter
from strutex.schemas import INVOICE_US
converter = StrutexConverter(schema=INVOICE_US)
result = converter.run(sources=["invoice.pdf"])
documents = result["documents"]
Unstructured Fallback¶
Hybrid mode with consistent error handling:
from strutex.integrations import UnstructuredFallbackProcessor, ExtractionError
from strutex.schemas import INVOICE_US
# on_fallback options: "raise" (default), "empty", "partial"
processor = UnstructuredFallbackProcessor(
schema=INVOICE_US,
provider="gemini",
on_fallback="raise" # Fail loudly for consistent handling
)
try:
result = processor.process("messy_doc.pdf")
print(result["vendor_name"]) # Always returns consistent dict shape
except ExtractionError as e:
print(f"Extraction failed: {e}")
# Handle failure explicitly
Fallback modes:
| Mode | Behavior |
|---|---|
"raise" |
Raise ExtractionError on failure (recommended) |
"empty" |
Return empty dict matching schema |
"partial" |
Return empty dict with _fallback=True metadata |
Next Steps¶
| Want to... | Go to... |
|---|---|
| Handle file uploads | DocumentInput |
| Create custom plugins | Custom Plugins |
| See built-in schemas | Built-in Schemas |