Working with File Uploads
Handle file paths, BytesIO streams, and HTTP uploads with DocumentInput.
The Problem
Your application might receive documents from different sources:
# From file system
file_path = "/path/to/invoice.pdf"
# From HTTP upload (Flask/FastAPI)
uploaded_file = request.files["document"]
file_bytes = uploaded_file.read()
# From cloud storage
blob = bucket.get_blob("invoice.pdf")
content = blob.download_as_bytes()
DocumentProcessor.process() expects a file path. How do you handle in-memory bytes?
DocumentInput: Unified Interface
DocumentInput provides a consistent interface for all input sources:
from strutex import DocumentInput, DocumentProcessor, GeminiProvider
from pathlib import Path
import io
# From file path (string)
doc = DocumentInput("invoice.pdf")
# From Path object
doc = DocumentInput(Path("invoice.pdf"))
# From BytesIO (pdf_bytes holds the raw file content, e.g. read from an upload)
bytes_data = io.BytesIO(pdf_bytes)
doc = DocumentInput(bytes_data, filename="upload.pdf")
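Upload handlers and storage clients usually return raw bytes rather than a BytesIO object, so a small normalizing step before constructing DocumentInput can be handy. The sketch below is not part of strutex: `normalize_source` is a hypothetical helper, and it assumes only the three source types shown above (string path, Path, BytesIO), wrapping raw bytes in a stream.

```python
import io
from pathlib import Path
from typing import Union

Source = Union[str, Path, bytes, bytearray, io.BytesIO]


def normalize_source(source: Source) -> Union[Path, io.BytesIO]:
    """Hypothetical helper: return a Path for path-like input, a BytesIO otherwise."""
    if isinstance(source, (str, Path)):
        return Path(source)
    if isinstance(source, (bytes, bytearray)):
        # Raw bytes (e.g. the result of uploaded_file.read()) are wrapped in a stream
        return io.BytesIO(bytes(source))
    if isinstance(source, io.BytesIO):
        return source
    raise TypeError(f"Unsupported document source: {type(source).__name__}")


from_path = normalize_source("invoice.pdf")
from_bytes = normalize_source(b"%PDF-1.4 demo")
```

For in-memory input, the returned stream can be passed to DocumentInput together with a filename, as in the BytesIO example above.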
Using with DocumentProcessor
Wrap the call in the as_file_path() context manager, which yields a path that process() can read:
import io
from strutex import DocumentInput, DocumentProcessor, GeminiProvider

processor = DocumentProcessor(provider=GeminiProvider())
# MySchema and pdf_bytes are assumed to be defined elsewhere

# Works with file path
doc = DocumentInput("invoice.pdf")
with doc.as_file_path() as path:
    result = processor.process(path, "Extract", schema=MySchema)

# Works with BytesIO
doc = DocumentInput(io.BytesIO(pdf_bytes), filename="upload.pdf")
with doc.as_file_path() as path:
    # Temp file is created automatically
    result = processor.process(path, "Extract", schema=MySchema)
# Temp file is cleaned up automatically
Flask Example
import io

from flask import Flask, request, jsonify
from strutex import DocumentInput, DocumentProcessor, GeminiProvider
from pydantic import BaseModel

app = Flask(__name__)
processor = DocumentProcessor(provider=GeminiProvider())

class Invoice(BaseModel):
    vendor: str
    total: float
    date: str

@app.route("/extract", methods=["POST"])
def extract_invoice():
    # Get uploaded file
    uploaded = request.files["document"]
    # Create DocumentInput from bytes
    doc = DocumentInput(
        io.BytesIO(uploaded.read()),
        filename=uploaded.filename
    )
    # Process
    with doc.as_file_path() as path:
        result = processor.process(path, "Extract invoice", model=Invoice)
    return jsonify(result.model_dump())
FastAPI Example
from fastapi import FastAPI, UploadFile
from strutex import DocumentInput, DocumentProcessor, GeminiProvider
from pydantic import BaseModel
import io

app = FastAPI()
processor = DocumentProcessor(provider=GeminiProvider())

class Invoice(BaseModel):
    vendor: str
    total: float
    date: str

@app.post("/extract")
async def extract_invoice(file: UploadFile):
    # Read file content
    content = await file.read()
    # Create DocumentInput
    doc = DocumentInput(
        io.BytesIO(content),
        filename=file.filename
    )
    # Process
    with doc.as_file_path() as path:
        result = processor.process(path, "Extract invoice", model=Invoice)
    return result.model_dump()
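Both handlers pass the client-supplied filename straight through. That value can be missing or empty, and it may contain directory components, so it can be worth normalizing before it reaches DocumentInput. A minimal sketch, assuming a hypothetical helper named `safe_upload_name` and a fallback name of `upload.bin`; neither comes from strutex:

```python
from pathlib import PurePosixPath
from typing import Optional


def safe_upload_name(raw: Optional[str], default: str = "upload.bin") -> str:
    """Hypothetical helper: reduce a client-supplied filename to its final component."""
    if not raw:
        return default
    # Treat Windows separators like POSIX ones, then keep only the last path component
    name = PurePosixPath(raw.replace("\\", "/")).name
    if name in ("", ".", ".."):
        return default
    return name


cleaned = safe_upload_name("../../etc/invoice.pdf")
fallback = safe_upload_name(None)
```

Because the fallback name has no meaningful extension, it pairs naturally with an explicit mime_type when the real type is known.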
DocumentInput Properties
from strutex import DocumentInput
doc = DocumentInput("invoice.pdf")
# Check source type
print(doc.is_file_path) # True for file paths, False for BytesIO
print(doc.path) # Original path (or None for BytesIO)
print(doc.filename) # "invoice.pdf"
# Get raw bytes
content = doc.get_bytes()
# Get MIME type
mime_type = doc.get_mime_type() # "application/pdf"
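MIME Type Detection
DocumentInput detects MIME types automatically. In the examples below the same bytes are passed each time, and the detected type follows the filename extension: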
doc = DocumentInput(io.BytesIO(data), filename="report.pdf")
print(doc.get_mime_type()) # "application/pdf"
doc = DocumentInput(io.BytesIO(data), filename="scan.png")
print(doc.get_mime_type()) # "image/png"
doc = DocumentInput(io.BytesIO(data), filename="data.xlsx")
print(doc.get_mime_type()) # "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
# Override MIME type explicitly
doc = DocumentInput(io.BytesIO(data), filename="file.bin", mime_type="application/pdf")
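Extension-based detection of this kind can be reproduced with Python's standard mimetypes module. The sketch below only illustrates the idea: it is an assumption about the general approach, not a description of how DocumentInput is implemented, and `guess_mime` is a hypothetical name.

```python
import mimetypes
from typing import Optional


def guess_mime(filename: str, override: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: an explicit override wins, otherwise guess from the extension."""
    if override:
        return override
    # guess_type returns a (type, encoding) tuple; only the type is needed here
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


pdf_type = guess_mime("report.pdf")
png_type = guess_mime("scan.png")
forced_type = guess_mime("file.bin", override="application/pdf")
```

When the extension says nothing useful about the content, as with `file.bin`, an explicit override is the only reliable signal, which is the case the mime_type argument above covers.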
Temp File Lifecycle
When the input is a BytesIO stream, as_file_path() writes it to a temporary file that exists only inside the with block:
doc = DocumentInput(io.BytesIO(pdf_bytes), filename="upload.pdf")
with doc.as_file_path() as path:
    # Temp file exists here
    print(path)  # e.g. /tmp/strutex_abc123.pdf
    # Process the file
    result = processor.process(path, "Extract", schema=MySchema)
# Temp file is automatically deleted here
Benefits:
- No manual cleanup needed
- Works with any processor/extractor
- Preserves file extension for MIME detection
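To make the lifecycle concrete, here is a sketch of how a context manager with this behavior could be built from the standard library. It is not the strutex implementation; `bytes_as_temp_path` and the `sketch_` prefix are hypothetical, and the only behaviors assumed are the ones listed above: the file exists inside the block, keeps its extension, and is removed afterwards.

```python
import contextlib
import io
import os
import tempfile
from pathlib import Path


@contextlib.contextmanager
def bytes_as_temp_path(stream: io.BytesIO, filename: str):
    """Hypothetical helper: write the stream to a temp file, yield its path, then delete it."""
    suffix = Path(filename).suffix  # keep ".pdf" so extension-based checks still work
    fd, tmp_path = tempfile.mkstemp(prefix="sketch_", suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(stream.getvalue())
        yield tmp_path
    finally:
        # Runs on normal exit and when the with body raises
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


with bytes_as_temp_path(io.BytesIO(b"%PDF-1.4 demo"), "upload.pdf") as path:
    existed_inside = os.path.exists(path)
    kept_suffix = path.endswith(".pdf")

exists_after = os.path.exists(path)
```

Putting the cleanup in a `finally` clause is what makes manual cleanup unnecessary: the file is removed even if processing fails partway through.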
Cloud Storage Examples
AWS S3
import boto3
import io
from strutex import DocumentInput, DocumentProcessor, GeminiProvider

s3 = boto3.client("s3")
processor = DocumentProcessor(provider=GeminiProvider())

# Download from S3
response = s3.get_object(Bucket="my-bucket", Key="invoice.pdf")
content = response["Body"].read()

# Create DocumentInput
doc = DocumentInput(io.BytesIO(content), filename="invoice.pdf")
with doc.as_file_path() as path:
    result = processor.process(path, "Extract", schema=MySchema)
Google Cloud Storage
from google.cloud import storage
import io
from strutex import DocumentInput, DocumentProcessor, GeminiProvider

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("invoice.pdf")
processor = DocumentProcessor(provider=GeminiProvider())

# Download
content = blob.download_as_bytes()

# Process
doc = DocumentInput(io.BytesIO(content), filename="invoice.pdf")
with doc.as_file_path() as path:
    result = processor.process(path, "Extract", schema=MySchema)
Best Practices
| Practice | Why |
|---|---|
| Always use as_file_path() context manager | Ensures temp files are cleaned up |
| Provide filename for BytesIO | Enables MIME detection |
| Reuse DocumentInput if processing multiple times | Avoids re-reading bytes |
| Set explicit mime_type for unknown extensions | Prevents detection failures |
Next Steps
| Want to... | Go to... |
|---|---|
| Use with LangChain | Integrations |
| Add batch processing | Batch & Async |
| Create custom plugins | Custom Plugins |