
Extractors

Extract text from documents for LLM processing.


Overview

Extractors convert document files (PDF, images, spreadsheets) into text that can be sent to an LLM.

from strutex import PDFExtractor, get_extractor

# Direct usage
extractor = PDFExtractor()
text = extractor.extract("invoice.pdf")

# Auto-select by MIME type
extractor = get_extractor("application/pdf")
text = extractor.extract("invoice.pdf")

Built-in Extractors

PDFExtractor

Uses a waterfall strategy: pypdf → pdfplumber → pdfminer → OCR. Note that the extract() source listed in the API Reference below calls pdfplumber only.

from strutex import PDFExtractor

extractor = PDFExtractor()
text = extractor.extract("document.pdf")
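
To illustrate what a waterfall strategy means in general, here is a minimal sketch that tries a list of backends in order and returns the first non-empty result. It does not use strutex internals: `extract_with_fallback` and the three stand-in backends are hypothetical names invented for this example, and the fallback rule (skip a backend that raises or returns only whitespace) is an assumption, not the library's documented behavior.

```python
from typing import Callable, Iterable, List, Tuple

# A backend is any callable that takes a file path and returns text.
Backend = Callable[[str], str]


def extract_with_fallback(
    file_path: str, backends: Iterable[Tuple[str, Backend]]
) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, text) for the first usable result."""
    errors: List[str] = []
    for name, backend in backends:
        try:
            text = backend(file_path)
        except Exception as exc:  # a failing backend should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("All backends failed: " + "; ".join(errors))


# Hypothetical stand-ins for real PDF libraries, used only to exercise the chain.
def broken_backend(path: str) -> str:
    raise ValueError("cannot parse")


def empty_backend(path: str) -> str:
    return "   "


def ocr_backend(path: str) -> str:
    return f"text recognised from {path}"


name, text = extract_with_fallback(
    "scan.pdf",
    [("pypdf", broken_backend), ("pdfplumber", empty_backend), ("ocr", ocr_backend)],
)
```

Ordering the cheapest backend first and OCR last keeps the common case fast, since OCR is usually only needed for scanned documents.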

ImageExtractor

Uses Tesseract OCR. Requires the pytesseract and Pillow (PIL) packages, plus a Tesseract installation that pytesseract can find.

from strutex import ImageExtractor

extractor = ImageExtractor()
text = extractor.extract("scan.png")

OCR Dependencies

Install with: pip install strutex[ocr]

ExcelExtractor

Converts spreadsheets to CSV text representation.

from strutex import ExcelExtractor

extractor = ExcelExtractor()
text = extractor.extract("data.xlsx")
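
As a rough picture of what a "CSV text representation" can look like, the sketch below serializes in-memory rows with the standard library `csv` module. The helper `rows_to_csv_text` is hypothetical and the sample rows are made up; the exact output format of `ExcelExtractor` may differ.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv_text(rows: Iterable[Sequence[object]]) -> str:
    """Serialize rows to CSV text, writing empty cells (None) as empty fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Illustrative data only; a real spreadsheet reader would supply these rows.
rows = [
    ["item", "qty", "note"],
    ["Widget, large", 2, None],
    ["Bolt", 10, "M8"],
]
csv_text = rows_to_csv_text(rows)
```

The `csv` writer quotes cells that contain commas, so a value such as `Widget, large` stays in one column when the text is read back.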

GlinerExtractor

Fast, local entity extraction using GLiNER - a zero-shot NER model.

!!! tip "When to Use"
    - Speed: 10-100x faster than LLM calls
    - Cost: Runs locally, no API costs
    - Hybrid: Pre-extract entities, refine with LLM

from strutex.extractors import GlinerExtractor

# Default labels (person, company, date, money, etc.)
extractor = GlinerExtractor()
result = extractor.extract("invoice.pdf")

# Custom labels for your domain
extractor = GlinerExtractor(
    labels=["container_number", "vessel", "port", "weight"],
    threshold=0.3  # Confidence threshold
)
result = extractor.extract("bill_of_lading.pdf")

# Get structured output (not formatted string)
entities = extractor.extract_structured("invoice.pdf")
# Returns: {"date": [{"text": "2024-01-15", "score": 0.92}], ...}
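
The structured result can be post-processed with plain Python. The sketch below assumes the dictionary shape shown in the comment above (label mapped to a list of `{"text", "score"}` entries) and keeps only the highest-scoring candidate per label above a cutoff. `best_entities` and `min_score` are hypothetical names, and the sample data is made up apart from the `date` entry taken from that comment.

```python
from typing import Dict, List

Entities = Dict[str, List[dict]]


def best_entities(entities: Entities, min_score: float = 0.5) -> Dict[str, str]:
    """Return the highest-scoring text per label, dropping candidates below min_score."""
    best: Dict[str, str] = {}
    for label, candidates in entities.items():
        kept = [c for c in candidates if c["score"] >= min_score]
        if kept:
            best[label] = max(kept, key=lambda c: c["score"])["text"]
    return best


# Sample in the assumed shape of extract_structured() output.
sample = {
    "date": [
        {"text": "2024-01-15", "score": 0.92},
        {"text": "2024-02-01", "score": 0.41},
    ],
    "money": [{"text": "1,250.00", "score": 0.35}],
}
hints = best_entities(sample)
```

A compact dictionary like `hints` is easier to embed in an LLM prompt than the full candidate list.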

Hybrid Pipeline - Use GLiNER for speed, LLM for accuracy:

from strutex import DocumentProcessor
from strutex.extractors import GlinerExtractor
from strutex.schemas import INVOICE_GENERIC

# Fast local pre-extraction
extractor = GlinerExtractor(labels=["invoice_number", "date", "amount"])
pre_extracted = extractor.extract("invoice.pdf")

# LLM refines and validates
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    "invoice.pdf",
    f"Extract invoice. Hints: {pre_extracted}",
    model=INVOICE_GENERIC
)

Installation

Install with: pip install strutex[gliner]


Auto-Selection

Use get_extractor() to automatically select based on MIME type:

from strutex import get_extractor
from strutex.documents import get_mime_type

mime_type = get_mime_type("file.pdf")
extractor = get_extractor(mime_type)
text = extractor.extract("file.pdf")
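
The selection step can be pictured as a lookup over extractors that each answer `can_handle(mime_type)`. The following sketch is independent of strutex: it guesses the MIME type with the standard library `mimetypes` module (an assumption, since how `get_mime_type` detects types is not shown here), and `StubExtractor` and `pick_extractor` are hypothetical names.

```python
import mimetypes
from typing import List


class StubExtractor:
    """Stand-in extractor that only knows which MIME types it accepts."""

    def __init__(self, name: str, mime_types: List[str]) -> None:
        self.name = name
        self.mime_types = mime_types

    def can_handle(self, mime_type: str) -> bool:
        return mime_type in self.mime_types


def pick_extractor(file_path: str, extractors: List[StubExtractor]) -> StubExtractor:
    """Guess the MIME type from the file name and return the first matching extractor."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    if mime_type is None:
        raise ValueError(f"Cannot determine MIME type for {file_path!r}")
    for extractor in extractors:
        if extractor.can_handle(mime_type):
            return extractor
    raise LookupError(f"No extractor registered for {mime_type}")


extractors = [
    StubExtractor("pdf", ["application/pdf"]),
    StubExtractor("image", ["image/png", "image/jpeg"]),
]
chosen = pick_extractor("file.pdf", extractors)
```

Guessing from the file extension is cheap but can be wrong for misnamed files; inspecting the file contents is a more robust alternative.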

Creating Custom Extractors

from strutex.plugins import Extractor

class XMLExtractor(Extractor, name="xml"):
    mime_types = ["application/xml", "text/xml"]

    def extract(self, file_path: str) -> str:
        import xml.etree.ElementTree as ET
        tree = ET.parse(file_path)
        return ET.tostring(tree.getroot(), encoding="unicode")

    def can_handle(self, mime_type: str) -> bool:
        return mime_type in self.mime_types
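
The `name="xml"` keyword in the class statement suggests that subclasses register themselves when they are defined. One common way to get that behavior in Python is `__init_subclass__`; the sketch below shows the pattern with a hypothetical `BaseExtractor` and `REGISTRY`, which are stand-ins and not the actual `strutex.plugins` implementation. It then runs the XML extractor against a temporary file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from typing import Dict, List, Type

# Hypothetical registry mapping plugin names to extractor classes.
REGISTRY: Dict[str, Type["BaseExtractor"]] = {}


class BaseExtractor:
    mime_types: List[str] = []

    def __init_subclass__(cls, name: str = "", **kwargs) -> None:
        # Called once per subclass; the name keyword comes from the class statement.
        super().__init_subclass__(**kwargs)
        if name:
            REGISTRY[name] = cls

    def can_handle(self, mime_type: str) -> bool:
        return mime_type in self.mime_types


class XMLExtractor(BaseExtractor, name="xml"):
    mime_types = ["application/xml", "text/xml"]

    def extract(self, file_path: str) -> str:
        tree = ET.parse(file_path)
        return ET.tostring(tree.getroot(), encoding="unicode")


# Exercise the extractor on a small temporary XML file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "order.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("<order><id>42</id></order>")
    xml_text = REGISTRY["xml"]().extract(path)
```

Returning the serialized XML keeps tags and attributes visible to the LLM; stripping tags would shrink the prompt but lose structure.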

API Reference

PDFExtractor

Bases: Extractor

Extracts text from PDF files using pdfplumber.

Serves as a robust fallback when multimodal LLM processing fails.

extract(file_path: str) -> str

Extract text from a PDF file.

PARAMETER   TYPE   DESCRIPTION
file_path   str    Path to the PDF file

RETURNS   DESCRIPTION
str       Extracted text content from all pages

Source code in strutex/extractors/pdf.py
def extract(self, file_path: str) -> str:
    """
    Extract text from a PDF file.

    Args:
        file_path: Path to the PDF file

    Returns:
        Extracted text content from all pages
    """
    if not PDFPLUMBER_AVAILABLE:
        raise ImportError("pdfplumber is required for PDFExtractor. Install with: pip install pdfplumber")

    text_content = []
    try:
        with pdfplumber.open(file_path) as pdf:
            for i, page in enumerate(pdf.pages):
                page_text = page.extract_text()
                if page_text:
                    text_content.append(f"--- Page {i+1} ---\n{page_text}")

        return "\n\n".join(text_content)

    except Exception as e:
        logger.error(f"Failed to extract text from {file_path}: {e}")
        raise RuntimeError(f"PDF extraction failed: {e}") from e
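
Because the source above prefixes each page with a `--- Page N ---` line and skips pages that yield no text, the output can be split back into pages afterwards. The helper below is a hypothetical sketch of that idea, not part of strutex; it would be confused by page text that itself contains a line in the marker format.

```python
import re
from typing import Dict

# Matches the marker line written by the extract() source above.
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---\n", re.MULTILINE)


def split_pages(text: str) -> Dict[int, str]:
    """Map page number to page text, based on the '--- Page N ---' markers."""
    pages: Dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(text))
    for i, match in enumerate(matches):
        # A page runs from the end of its marker to the start of the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pages[int(match.group(1))] = text[match.end():end].strip()
    return pages


# Page 2 is absent here, as it would be for a page with no extractable text.
sample = "--- Page 1 ---\nInvoice 1001\n\n--- Page 3 ---\nTotal: 50.00"
pages = split_pages(sample)
```

Keeping the original page numbers as keys preserves gaps, which can indicate scanned or blank pages that may need OCR.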


ImageExtractor

Bases: Extractor

Image extractor using Tesseract OCR.

Requires pytesseract and PIL to be installed. Install with: pip install strutex[ocr]

ATTRIBUTE    DESCRIPTION
mime_types   MIME types this extractor handles
priority     Extraction priority

can_handle(mime_type: str) -> bool

Check if this extractor can handle the given MIME type.

Source code in strutex/extractors/image.py
def can_handle(self, mime_type: str) -> bool:
    """Check if this extractor can handle the given MIME type."""
    return mime_type in self.mime_types

extract(file_path: str) -> str

Extract text from an image using OCR.

PARAMETER   TYPE   DESCRIPTION
file_path   str    Path to the image file

RETURNS   DESCRIPTION
str       Extracted text content

RAISES         DESCRIPTION
RuntimeError   If OCR dependencies are not installed

Source code in strutex/extractors/image.py
def extract(self, file_path: str) -> str:
    """
    Extract text from an image using OCR.

    Args:
        file_path: Path to the image file

    Returns:
        Extracted text content

    Raises:
        RuntimeError: If OCR dependencies are not installed
    """
    if not _OCR_AVAILABLE:
        raise RuntimeError(
            "OCR dependencies not installed. "
            "Install with: pip install strutex[ocr]"
        )

    try:
        image = Image.open(file_path)
        text = pytesseract.image_to_string(image)
        return text.strip()
    except Exception as e:
        logger.error(f"OCR extraction failed for {file_path}: {e}")
        raise RuntimeError(f"Failed to extract text from image: {e}")

health_check() -> bool classmethod

Check if OCR dependencies are available.

Source code in strutex/extractors/image.py
@classmethod
def health_check(cls) -> bool:
    """Check if OCR dependencies are available."""
    return _OCR_AVAILABLE
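
How `_OCR_AVAILABLE` is computed is not shown on this page. One common way to implement such an optional-dependency check is to probe for the modules without importing them, as in this sketch; `dependencies_available` and `require` are hypothetical helpers, not strutex functions.

```python
import importlib.util


def dependencies_available(*module_names: str) -> bool:
    """Return True if every named top-level module can be found on this system."""
    return all(importlib.util.find_spec(name) is not None for name in module_names)


def require(feature: str, *module_names: str, hint: str) -> None:
    """Raise a RuntimeError with an install hint if a dependency is missing."""
    if not dependencies_available(*module_names):
        raise RuntimeError(
            f"{feature} dependencies not installed. Install with: {hint}"
        )


# pytesseract and PIL are the module names mentioned in the section above.
ocr_ready = dependencies_available("pytesseract", "PIL")
```

Checking up front lets a pipeline choose another extractor or skip image files, instead of failing on the first `extract()` call.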


ExcelExtractor

Bases: Extractor

Excel and spreadsheet extractor.

Converts spreadsheet data to a text representation suitable for LLM processing.

ATTRIBUTE    DESCRIPTION
mime_types   MIME types this extractor handles
priority     Extraction priority

can_handle(mime_type: str) -> bool

Check if this extractor can handle the given MIME type.

Source code in strutex/extractors/excel.py
def can_handle(self, mime_type: str) -> bool:
    """Check if this extractor can handle the given MIME type."""
    return mime_type in self.mime_types

extract(file_path: str) -> str

Extract text from a spreadsheet file.

PARAMETER   TYPE   DESCRIPTION
file_path   str    Path to the spreadsheet file

RETURNS   DESCRIPTION
str       Text representation of the spreadsheet data

Source code in strutex/extractors/excel.py
def extract(self, file_path: str) -> str:
    """
    Extract text from a spreadsheet file.

    Args:
        file_path: Path to the spreadsheet file

    Returns:
        Text representation of the spreadsheet data
    """
    from ..documents.spreadsheet import spreadsheet_to_text
    return spreadsheet_to_text(file_path)
