Extractors
Extract text from documents for LLM processing.
Overview
Extractors convert document files (PDF, images, spreadsheets) into text that can be sent to an LLM.
from strutex import PDFExtractor, get_extractor
# Direct usage
extractor = PDFExtractor()
text = extractor.extract("invoice.pdf")
# Auto-select by MIME type
extractor = get_extractor("application/pdf")
text = extractor.extract("invoice.pdf")
Built-in Extractors
PDFExtractor
Uses a waterfall strategy: pypdf → pdfplumber → pdfminer → OCR.
from strutex import PDFExtractor
extractor = PDFExtractor()
text = extractor.extract("document.pdf")
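The waterfall idea described above can be sketched in plain Python. This is only an illustration of the pattern, not strutex's implementation: `extract_with_fallback` and the three stand-in backends are hypothetical names, and no real PDF library is called.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical backend signature: takes a file path, returns text (possibly empty).
Backend = Callable[[str], str]


def extract_with_fallback(path: str, backends: Iterable[Tuple[str, Backend]]) -> str:
    """Try each backend in order and return the first non-blank result."""
    last_error: Optional[Exception] = None
    for _name, backend in backends:
        try:
            text = backend(path)
        except Exception as exc:  # a failing backend hands over to the next one
            last_error = exc
            continue
        if text and text.strip():
            return text
    raise RuntimeError(f"no backend could extract text from {path!r}") from last_error


# Stand-in backends for illustration only.
def broken_backend(path: str) -> str:
    raise ValueError("cannot parse")


def empty_backend(path: str) -> str:
    return "   "  # e.g. a scanned PDF with no text layer


def ocr_backend(path: str) -> str:
    return f"text recovered from {path}"


result = extract_with_fallback(
    "document.pdf",
    [("fast", broken_backend), ("layout", empty_backend), ("ocr", ocr_backend)],
)
print(result)
```

Ordering the backends from cheapest to most expensive means OCR only runs when the faster text-layer parsers fail or return nothing.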
ImageExtractor
Uses Tesseract OCR. Requires pytesseract and PIL (provided by the Pillow package).
from strutex import ImageExtractor
extractor = ImageExtractor()
text = extractor.extract("scan.png")
OCR Dependencies
Install with: pip install strutex[ocr]
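To check whether the OCR modules are importable before calling the extractor, a small standard-library probe may help. This is a sketch: `missing_modules` is a hypothetical helper, not part of strutex.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pytesseract and PIL are the two modules named above.
missing = missing_modules(["pytesseract", "PIL"])
if missing:
    print(f"OCR unavailable, missing: {', '.join(missing)}. Try: pip install strutex[ocr]")
else:
    print("OCR Python dependencies found")
```

Note that pytesseract is a wrapper around the Tesseract executable, so the Tesseract binary itself also has to be installed on the system; this check covers only the Python modules.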
ExcelExtractor
Converts spreadsheets to a CSV text representation.
from strutex import ExcelExtractor
extractor = ExcelExtractor()
text = extractor.extract("data.xlsx")
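To give a feel for what a CSV text representation of a workbook can look like, here is a minimal standard-library sketch. The `sheets_to_csv_text` helper, the `# Sheet:` header line, and the sample rows are assumptions for illustration; the exact output format of ExcelExtractor may differ.

```python
import csv
import io


def sheets_to_csv_text(sheets):
    """Render {sheet_name: rows} as CSV text, one labelled section per sheet."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for name, rows in sheets.items():
        out.write(f"# Sheet: {name}\n")
        for row in rows:
            # Empty cells (None) are written as empty fields.
            writer.writerow(["" if cell is None else cell for cell in row])
    return out.getvalue()


# Made-up sample data standing in for cells read from a workbook.
sheets = {"Invoices": [["id", "customer", "total"], [1, "Acme, Inc.", 99.5], [2, None, 10]]}
text = sheets_to_csv_text(sheets)
print(text)
```

Using `csv.writer` rather than joining on commas keeps cells that contain commas or quotes unambiguous for the LLM.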
GlinerExtractor
Fast, local entity extraction using GLiNER, a zero-shot NER model.
!!! tip "When to Use"
    - Speed: 10-100x faster than LLM calls
    - Cost: Runs locally, no API costs
    - Hybrid: Pre-extract entities, refine with LLM
from strutex.extractors import GlinerExtractor
# Default labels (person, company, date, money, etc.)
extractor = GlinerExtractor()
result = extractor.extract("invoice.pdf")
# Custom labels for your domain
extractor = GlinerExtractor(
    labels=["container_number", "vessel", "port", "weight"],
    threshold=0.3,  # Confidence threshold
)
result = extractor.extract("bill_of_lading.pdf")
# Get structured output (not formatted string)
entities = extractor.extract_structured("invoice.pdf")
# Returns: {"date": [{"text": "2024-01-15", "score": 0.92}], ...}
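Given the structured shape shown above (label mapped to a list of `{"text", "score"}` candidates), post-filtering is straightforward. The sketch below keeps the best candidate per label; `best_entities`, the `min_score` cutoff, and the sample values are hypothetical and only illustrate working with that shape.

```python
def best_entities(entities, min_score=0.5):
    """Keep the highest-scoring candidate per label, dropping those below min_score."""
    best = {}
    for label, candidates in entities.items():
        kept = [c for c in candidates if c["score"] >= min_score]
        if kept:
            best[label] = max(kept, key=lambda c: c["score"])["text"]
    return best


# Sample data in the shape shown above; the values are made up.
entities = {
    "date": [{"text": "2024-01-15", "score": 0.92}, {"text": "2024-02-01", "score": 0.41}],
    "money": [{"text": "$1,250.00", "score": 0.88}],
    "person": [{"text": "Page 1", "score": 0.31}],
}
print(best_entities(entities))
```

Labels with no candidate above the cutoff are omitted entirely, which keeps low-confidence guesses out of any downstream prompt.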
Hybrid Pipeline - Use GLiNER for speed, LLM for accuracy:
from strutex import DocumentProcessor
from strutex.extractors import GlinerExtractor
from strutex.schemas import INVOICE_GENERIC
# Fast local pre-extraction
extractor = GlinerExtractor(labels=["invoice_number", "date", "amount"])
pre_extracted = extractor.extract("invoice.pdf")
# LLM refines and validates
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    "invoice.pdf",
    f"Extract invoice. Hints: {pre_extracted}",
    model=INVOICE_GENERIC,
)
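When the hints come from structured output rather than a formatted string, it can help to serialize them deterministically before interpolating them into the prompt. A minimal sketch, assuming a hypothetical `build_prompt` helper and made-up hint values:

```python
import json


def build_prompt(task, hints):
    """Append pre-extracted hints to a task instruction as compact JSON."""
    if not hints:
        return task
    hint_json = json.dumps(hints, sort_keys=True, ensure_ascii=False)
    return f"{task}\nHints (unverified, from local pre-extraction): {hint_json}"


# Hypothetical hints, e.g. the best candidate per label from a local NER pass.
prompt = build_prompt("Extract invoice.", {"invoice_number": "INV-001", "date": "2024-01-15"})
print(prompt)
```

Marking the hints as unverified leaves the LLM free to correct them, which matches the refine-and-validate role it plays in the pipeline above.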
Installation
Install with: pip install strutex[gliner]
Auto-Selection
Use get_extractor() to select an extractor automatically based on MIME type:
from strutex import get_extractor
from strutex.documents import get_mime_type
mime_type = get_mime_type("file.pdf")
extractor = get_extractor(mime_type)
text = extractor.extract("file.pdf")
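Conceptually, auto-selection is a lookup from MIME type to extractor. The sketch below shows that idea with the standard-library `mimetypes` module; `REGISTRY` and `pick_extractor` are hypothetical stand-ins and do not reflect how strutex's `get_mime_type` or `get_extractor` are implemented.

```python
import mimetypes

# Hypothetical registry: MIME type -> extractor name. Not strutex's internal table.
REGISTRY = {
    "application/pdf": "pdf",
    "image/png": "image",
    "image/jpeg": "image",
}


def pick_extractor(path):
    """Guess the MIME type from the file extension and look up an extractor name."""
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type for {path!r}")
    if mime_type not in REGISTRY:
        raise ValueError(f"no extractor registered for {mime_type!r}")
    return REGISTRY[mime_type]


print(pick_extractor("file.pdf"))
print(pick_extractor("scan.png"))
```

Guessing from the extension is cheap but trusts the file name; a library may instead inspect the file contents, so treat this only as an outline of the lookup step.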
Creating Custom Extractors
from strutex.plugins import Extractor
class XMLExtractor(Extractor, name="xml"):
    mime_types = ["application/xml", "text/xml"]

    def extract(self, file_path: str) -> str:
        import xml.etree.ElementTree as ET
        tree = ET.parse(file_path)
        return ET.tostring(tree.getroot(), encoding="unicode")

    def can_handle(self, mime_type: str) -> bool:
        return mime_type in self.mime_types
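The `name="xml"` class keyword suggests that subclasses are registered when they are defined. One way such a mechanism can work is `__init_subclass__`; the sketch below uses a stand-in `Extractor` base class with a hypothetical `registry` attribute so it runs without strutex, and it is an assumption about the pattern rather than the library's actual code.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


class Extractor:
    """Stand-in base class for illustration; not the real strutex.plugins.Extractor."""

    registry = {}
    mime_types = []

    def __init_subclass__(cls, name=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if name is not None:
            Extractor.registry[name] = cls  # register under the class keyword

    def can_handle(self, mime_type):
        return mime_type in self.mime_types


class XMLExtractor(Extractor, name="xml"):
    mime_types = ["application/xml", "text/xml"]

    def extract(self, file_path):
        tree = ET.parse(file_path)
        return ET.tostring(tree.getroot(), encoding="unicode")


# Exercise the extractor on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("<order><id>42</id></order>")
    extractor_cls = Extractor.registry["xml"]
    xml_text = extractor_cls().extract(path)

print(xml_text)
```

Because registration happens at class-definition time, importing the module that defines the subclass is enough to make it discoverable by name.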
API Reference
PDFExtractor
Bases: Extractor
Extracts text from PDF files using pdfplumber.
Serves as a robust fallback when multimodal LLM processing fails.
extract(file_path: str) -> str
Extract text from a PDF file.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the PDF file. TYPE: str |

| RETURNS | DESCRIPTION |
|---|---|
| str | Extracted text content from all pages |
Source code in strutex/extractors/pdf.py
ImageExtractor
Bases: Extractor
Image extractor using Tesseract OCR.
Requires pytesseract and PIL to be installed. Install with: pip install strutex[ocr]
| ATTRIBUTE | DESCRIPTION |
|---|---|
| mime_types | MIME types this extractor handles |
| priority | Extraction priority |
can_handle(mime_type: str) -> bool
extract(file_path: str) -> str
Extract text from an image using OCR.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | Path to the image file. TYPE: str |

| RETURNS | DESCRIPTION |
|---|---|
| str | Extracted text content |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If OCR dependencies are not installed |
Source code in strutex/extractors/image.py
ExcelExtractor
Bases: Extractor
Excel and spreadsheet extractor.
Converts spreadsheet data to a text representation suitable for LLM processing.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| mime_types | MIME types this extractor handles |
| priority | Extraction priority |