Working with File Uploads

Handle file paths, BytesIO streams, and HTTP uploads with DocumentInput.

The Problem

Your application might receive documents from different sources:

# From file system
file_path = "/path/to/invoice.pdf"

# From HTTP upload (Flask/FastAPI)
uploaded_file = request.files["document"]
file_bytes = uploaded_file.read()

# From cloud storage
blob = bucket.get_blob("invoice.pdf")
content = blob.download_as_bytes()

DocumentProcessor.process() expects a file path. How do you handle in-memory bytes?

DocumentInput: Unified Interface

DocumentInput provides a consistent interface for all input sources:

from strutex import DocumentInput, DocumentProcessor, GeminiProvider
from pathlib import Path
import io

# From file path (string)
doc = DocumentInput("invoice.pdf")

# From Path object
doc = DocumentInput(Path("invoice.pdf"))

# From BytesIO
bytes_data = io.BytesIO(pdf_bytes)
doc = DocumentInput(bytes_data, filename="upload.pdf")
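To make the idea concrete, the sketch below shows one way a wrapper could normalize these three source types into raw bytes. `read_source_bytes` is a hypothetical helper written for illustration with the standard library only. It is not part of strutex, and DocumentInput's actual implementation may differ.

```python
import io
from pathlib import Path
from typing import BinaryIO, Union

# Hypothetical type alias and helper for illustration; not part of strutex.
Source = Union[str, Path, BinaryIO]


def read_source_bytes(source: Source) -> bytes:
    """Return the raw bytes of a path-like or file-like source."""
    if isinstance(source, (str, Path)):
        # Path-like input: read the file from disk.
        return Path(source).read_bytes()
    # File-like input: rewind first so an already-read stream still yields data.
    source.seek(0)
    return source.read()


stream = io.BytesIO(b"%PDF-1.4 demo")
stream.read()  # simulate a stream that was already consumed once
print(read_source_bytes(stream))  # b'%PDF-1.4 demo'
```

Rewinding before reading is a design choice in this sketch: upload handlers often read a stream once for validation, and a second read without `seek(0)` would return empty bytes.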

Using with DocumentProcessor

Because process() takes a file path, wrap the call in the as_file_path() context manager, which yields a usable path for any source:

import io
from strutex import DocumentInput, DocumentProcessor, GeminiProvider

processor = DocumentProcessor(provider=GeminiProvider())

# Works with file path
doc = DocumentInput("invoice.pdf")
with doc.as_file_path() as path:
    result = processor.process(path, "Extract", schema=MySchema)

# Works with BytesIO
doc = DocumentInput(io.BytesIO(pdf_bytes), filename="upload.pdf")
with doc.as_file_path() as path:
    # Temp file is created automatically
    result = processor.process(path, "Extract", schema=MySchema)
# Temp file is cleaned up automatically

Flask Example

import io
from flask import Flask, request, jsonify
from strutex import DocumentInput, DocumentProcessor, GeminiProvider
from pydantic import BaseModel

app = Flask(__name__)
processor = DocumentProcessor(provider=GeminiProvider())

class Invoice(BaseModel):
    vendor: str
    total: float
    date: str

@app.route("/extract", methods=["POST"])
def extract_invoice():
    # Get uploaded file
    uploaded = request.files["document"]

    # Create DocumentInput from bytes
    doc = DocumentInput(
        io.BytesIO(uploaded.read()),
        filename=uploaded.filename
    )

    # Process
    with doc.as_file_path() as path:
        result = processor.process(path, "Extract invoice", model=Invoice)

    return jsonify(result.model_dump())

FastAPI Example

from fastapi import FastAPI, UploadFile
from strutex import DocumentInput, DocumentProcessor, GeminiProvider
from pydantic import BaseModel
import io

app = FastAPI()
processor = DocumentProcessor(provider=GeminiProvider())

class Invoice(BaseModel):
    vendor: str
    total: float
    date: str

@app.post("/extract")
async def extract_invoice(file: UploadFile):
    # Read file content
    content = await file.read()

    # Create DocumentInput
    doc = DocumentInput(
        io.BytesIO(content),
        filename=file.filename
    )

    # Process
    with doc.as_file_path() as path:
        result = processor.process(path, "Extract invoice", model=Invoice)

    return result.model_dump()
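In the handler above, process() is called without `await`, so it runs synchronously inside an `async def` endpoint and holds the event loop until it returns. One common way to avoid that is to move the blocking call to a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below is an assumption-laden illustration: `blocking_extract` is a hypothetical stand-in for the processor call, not a strutex function.

```python
import asyncio
import time


def blocking_extract(path: str) -> dict:
    """Hypothetical stand-in for a synchronous call such as processor.process()."""
    time.sleep(0.05)  # simulate slow, blocking work
    return {"path": path, "status": "done"}


async def handle_upload(path: str) -> dict:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_extract, path)


async def main() -> list:
    # Both uploads are handled concurrently; results keep the argument order.
    return await asyncio.gather(handle_upload("a.pdf"), handle_upload("b.pdf"))


results = asyncio.run(main())
print(results)
```

In a FastAPI endpoint, the same pattern would wrap the body of the `with doc.as_file_path()` block. Declaring the endpoint with plain `def` instead of `async def` is another option, since FastAPI runs such endpoints in a thread pool.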

DocumentInput Properties

from strutex import DocumentInput

doc = DocumentInput("invoice.pdf")

# Check source type
print(doc.is_file_path)  # True for file paths, False for BytesIO
print(doc.path)          # Original path (or None for BytesIO)
print(doc.filename)      # "invoice.pdf"

# Get raw bytes
content = doc.get_bytes()

# Get MIME type
mime_type = doc.get_mime_type()  # "application/pdf"

MIME Type Detection

DocumentInput automatically detects MIME types:

import io
from strutex import DocumentInput

doc = DocumentInput(io.BytesIO(data), filename="report.pdf")
print(doc.get_mime_type())  # "application/pdf"

doc = DocumentInput(io.BytesIO(data), filename="scan.png")
print(doc.get_mime_type())  # "image/png"

doc = DocumentInput(io.BytesIO(data), filename="data.xlsx")
print(doc.get_mime_type())  # "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Override MIME type explicitly
doc = DocumentInput(io.BytesIO(data), filename="file.bin", mime_type="application/pdf")
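The examples above suggest that detection is driven by the filename extension, with an explicit override taking precedence. As a rough sketch of that behavior, assuming extension-based lookup, the standard library's `mimetypes` module can do the same thing. `guess_mime_type` and its `default` fallback are hypothetical and not part of strutex, whose detection logic may differ.

```python
import mimetypes
from typing import Optional


def guess_mime_type(
    filename: str,
    override: Optional[str] = None,
    default: str = "application/octet-stream",
) -> str:
    """Pick a MIME type: explicit override first, then extension lookup, then a default."""
    if override:
        return override
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default


print(guess_mime_type("report.pdf"))  # application/pdf
print(guess_mime_type("scan.png"))    # image/png
print(guess_mime_type("file.bin", override="application/pdf"))  # application/pdf
```

Extension lookup never inspects the file contents, which is why a BytesIO source without a meaningful filename needs either a `filename` or an explicit `mime_type`.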

Temp File Lifecycle

When the source is a BytesIO stream, as_file_path() writes the bytes to a temporary file and deletes that file when the with block exits:

doc = DocumentInput(io.BytesIO(pdf_bytes), filename="upload.pdf")

with doc.as_file_path() as path:
    # Temp file exists here
    print(path)  # e.g. /tmp/strutex_abc123.pdf

    # Process the file
    result = processor.process(path, "Extract", schema=MySchema)

# Temp file is automatically deleted here

Benefits:

No manual cleanup needed
Works with any processor/extractor
Preserves file extension for MIME detection
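For readers curious how such a lifecycle can be built, here is a minimal sketch using `tempfile` and `contextlib`. `bytes_as_temp_file` and the `upload_` prefix are hypothetical names chosen for illustration. This is not strutex's implementation, only one way to get the same create, use, delete behavior.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def bytes_as_temp_file(data: bytes, filename: str) -> Iterator[str]:
    """Write bytes to a temp file, yield its path, and delete it afterwards."""
    # Keep the original extension so extension-based MIME detection still works.
    suffix = Path(filename).suffix
    fd, tmp_path = tempfile.mkstemp(prefix="upload_", suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield tmp_path
    finally:
        # Runs on normal exit and when the with-block raises.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


with bytes_as_temp_file(b"%PDF-1.4 demo", "upload.pdf") as path:
    print(os.path.exists(path))  # True
print(os.path.exists(path))      # False
```

Putting the cleanup in `finally` is what makes the "no manual cleanup" guarantee hold even when processing fails partway through.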

Cloud Storage Examples

AWS S3

import boto3
import io
from strutex import DocumentInput, DocumentProcessor, GeminiProvider

s3 = boto3.client("s3")
processor = DocumentProcessor(provider=GeminiProvider())

# Download from S3
response = s3.get_object(Bucket="my-bucket", Key="invoice.pdf")
content = response["Body"].read()

# Create DocumentInput
doc = DocumentInput(io.BytesIO(content), filename="invoice.pdf")

with doc.as_file_path() as path:
    result = processor.process(path, "Extract", schema=MySchema)

Google Cloud Storage

from google.cloud import storage
import io
from strutex import DocumentInput, DocumentProcessor, GeminiProvider

processor = DocumentProcessor(provider=GeminiProvider())

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("invoice.pdf")

# Download
content = blob.download_as_bytes()

# Process
doc = DocumentInput(io.BytesIO(content), filename="invoice.pdf")

with doc.as_file_path() as path:
    result = processor.process(path, "Extract", schema=MySchema)

Best Practices

Practice	Why
Always use `as_file_path()` context manager	Ensures temp files are cleaned up
Provide `filename` for BytesIO	Enables MIME detection
Reuse `DocumentInput` if processing multiple times	Avoids re-reading bytes
Set explicit `mime_type` for unknown extensions	Prevents detection failures

Next Steps

Want to...	Go to...
Use with LangChain	Integrations
Add batch processing	Batch & Async
Create custom plugins	Custom Plugins