Extractors¶

Extract text from documents for LLM processing.

Overview¶

Extractors convert document files (PDF, images, spreadsheets) into text that can be sent to an LLM.

from strutex import PDFExtractor, get_extractor

# Direct usage
extractor = PDFExtractor()
text = extractor.extract("invoice.pdf")

# Auto-select by MIME type
extractor = get_extractor("application/pdf")
text = extractor.extract("invoice.pdf")

Built-in Extractors¶

PDFExtractor¶

Uses a waterfall strategy: pypdf → pdfplumber → pdfminer → OCR.

from strutex import PDFExtractor

extractor = PDFExtractor()
text = extractor.extract("document.pdf")
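
The waterfall idea can be sketched independently of any PDF library. The following is a minimal illustration, not strutex's implementation: `extract_with_fallback` and the three backends are hypothetical stand-ins for steps such as pypdf or OCR. Each strategy is tried in order, and the first one that returns non-empty text wins.

```python
from typing import Callable, Iterable, List, Optional


def extract_with_fallback(
    file_path: str,
    strategies: Iterable[Callable[[str], Optional[str]]],
) -> str:
    """Return the first non-empty result; raise if every strategy fails."""
    errors: List[str] = []
    for strategy in strategies:
        name = getattr(strategy, "__name__", repr(strategy))
        try:
            text = strategy(file_path)
        except Exception as exc:
            # A failing backend hands over to the next one in the list.
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return text
        errors.append(f"{name}: empty result")
    raise RuntimeError(f"All strategies failed for {file_path}: {errors}")


# Hypothetical backends used only to exercise the fallback order.
def broken_backend(path: str) -> str:
    raise ValueError("cannot parse")


def empty_backend(path: str) -> str:
    return "   "


def working_backend(path: str) -> str:
    return f"text from {path}"


result = extract_with_fallback(
    "document.pdf", [broken_backend, empty_backend, working_backend]
)
print(result)
```

Ordering the strategies from cheapest to most expensive keeps OCR, typically the slowest step, as the last resort.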

ImageExtractor¶

Uses Tesseract OCR. Requires pytesseract and Pillow (imported as PIL).

from strutex import ImageExtractor

extractor = ImageExtractor()
text = extractor.extract("scan.png")

OCR Dependencies

Install with: pip install strutex[ocr]

ExcelExtractor¶

Converts spreadsheets to a CSV text representation.

from strutex import ExcelExtractor

extractor = ExcelExtractor()
text = extractor.extract("data.xlsx")
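
To show what a CSV text representation of tabular data can look like, here is a small sketch using only the standard library. It is an assumption about the general shape of the output, not the library's actual conversion; `rows_to_csv_text` is a hypothetical helper.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv_text(rows: Iterable[Sequence[object]]) -> str:
    """Serialize rows of cell values to CSV text, one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Empty cells become empty strings rather than the text "None".
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


rows = [
    ["item", "qty", "note"],
    ["Widget, large", 2, None],
]
csv_text = rows_to_csv_text(rows)
print(csv_text)
```

The `csv` module quotes cells that contain commas, so values such as "Widget, large" stay in a single column when the LLM reads the text.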

GlinerExtractor¶

Fast, local entity extraction using GLiNER, a zero-shot NER model.

When to Use

- Speed: 10-100x faster than LLM calls
- Cost: Runs locally, no API costs
- Hybrid: Pre-extract entities, refine with LLM

from strutex.extractors import GlinerExtractor

# Default labels (person, company, date, money, etc.)
extractor = GlinerExtractor()
result = extractor.extract("invoice.pdf")

# Custom labels for your domain
extractor = GlinerExtractor(
    labels=["container_number", "vessel", "port", "weight"],
    threshold=0.3  # Confidence threshold
)
result = extractor.extract("bill_of_lading.pdf")

# Get structured output (not formatted string)
entities = extractor.extract_structured("invoice.pdf")
# Returns: {"date": [{"text": "2024-01-15", "score": 0.92}], ...}
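
Assuming the structured output has the shape shown in the comment above (a label mapped to a list of `text`/`score` dicts), low-confidence matches can be dropped in plain Python. `filter_entities` is a hypothetical helper, not part of strutex.

```python
from typing import Dict, List, TypedDict


class Entity(TypedDict):
    text: str
    score: float


def filter_entities(
    entities: Dict[str, List[Entity]], min_score: float
) -> Dict[str, List[Entity]]:
    """Keep matches at or above min_score, best match first per label."""
    kept: Dict[str, List[Entity]] = {}
    for label, matches in entities.items():
        strong = [m for m in matches if m["score"] >= min_score]
        if strong:
            kept[label] = sorted(strong, key=lambda m: m["score"], reverse=True)
    return kept


# Sample data in the assumed output shape.
entities: Dict[str, List[Entity]] = {
    "date": [
        {"text": "2024-01-15", "score": 0.92},
        {"text": "Jan", "score": 0.35},
    ],
    "money": [{"text": "1,200.00", "score": 0.41}],
}
confident = filter_entities(entities, min_score=0.5)
print(confident)
```

A low `threshold` at extraction time combined with a stricter filter afterwards keeps recall high while letting each consumer choose its own cut-off.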

Hybrid pipeline: use GLiNER for a fast first pass, then let an LLM refine and validate the result.

from strutex import DocumentProcessor
from strutex.extractors import GlinerExtractor
from strutex.schemas import INVOICE_GENERIC

# Fast local pre-extraction
extractor = GlinerExtractor(labels=["invoice_number", "date", "amount"])
pre_extracted = extractor.extract("invoice.pdf")

# LLM refines and validates
processor = DocumentProcessor(provider="gemini")
result = processor.process(
    "invoice.pdf",
    f"Extract invoice. Hints: {pre_extracted}",
    model=INVOICE_GENERIC
)
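
When the hints come from structured entities rather than a formatted string, the prompt can be assembled explicitly. This sketch assumes the `text`/`score` entity shape shown earlier; `build_hint_prompt` is a hypothetical helper and the wording of the prompt is only an example.

```python
from typing import Dict, List, Mapping


def build_hint_prompt(
    task: str, entities: Mapping[str, List[Dict[str, object]]]
) -> str:
    """Append one hint line per label, in a stable (sorted) order."""
    lines = [task, "Hints from local pre-extraction (unverified):"]
    for label in sorted(entities):
        values = ", ".join(str(match["text"]) for match in entities[label])
        lines.append(f"- {label}: {values}")
    return "\n".join(lines)


entities = {
    "invoice_number": [{"text": "INV-001", "score": 0.9}],
    "date": [{"text": "2024-01-15", "score": 0.92}],
}
prompt = build_hint_prompt("Extract invoice.", entities)
print(prompt)
```

Marking the hints as unverified in the prompt leaves the LLM free to correct a wrong pre-extraction instead of copying it.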

Installation

Install with: pip install strutex[gliner]

Auto-Selection¶

Use get_extractor() to select an extractor automatically from the file's MIME type:

from strutex import get_extractor
from strutex.documents import get_mime_type

mime_type = get_mime_type("file.pdf")
extractor = get_extractor(mime_type)
text = extractor.extract("file.pdf")
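
The lookup pattern behind MIME-based selection can be illustrated with the standard library alone. This is a sketch under assumptions: it uses `mimetypes.guess_type` as a stand-in for `get_mime_type`, and `HANDLERS`, `register`, and `select_handler` are hypothetical names, not strutex APIs.

```python
import mimetypes
from typing import Callable, Dict

# Maps a MIME type to a function that turns a file path into text.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def register(mime_type: str, handler: Callable[[str], str]) -> None:
    HANDLERS[mime_type] = handler


def select_handler(file_path: str) -> Callable[[str], str]:
    """Guess the MIME type from the file name and look up its handler."""
    mime_type, _encoding = mimetypes.guess_type(file_path)
    if mime_type is None or mime_type not in HANDLERS:
        raise LookupError(f"No handler registered for {file_path!r} ({mime_type})")
    return HANDLERS[mime_type]


# Placeholder handler; a real one would open and parse the file.
register("application/pdf", lambda path: f"pdf text from {path}")

handler = select_handler("file.pdf")
text = handler("file.pdf")
print(text)
```

Guessing from the file extension is cheap but can be wrong for misnamed files, so failing loudly on unknown types is usually safer than falling back to a default handler.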

Creating Custom Extractors¶

Subclass Extractor, declare the MIME types the extractor handles, and implement extract() and can_handle():

from strutex.plugins import Extractor

class XMLExtractor(Extractor, name="xml"):
    mime_types = ["application/xml", "text/xml"]

    def extract(self, file_path: str) -> str:
        import xml.etree.ElementTree as ET
        tree = ET.parse(file_path)
        return ET.tostring(tree.getroot(), encoding="unicode")

    def can_handle(self, mime_type: str) -> bool:
        return mime_type in self.mime_types
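
The `extract` body above returns the re-serialized XML, tags included. If only the text content should reach the LLM, one possible variant collects the text nodes instead. `xml_text_only` is a hypothetical standalone function that could serve as the body of such an `extract` method.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_text_only(file_path: str) -> str:
    """Return the non-empty text nodes of an XML file, one per line."""
    root = ET.parse(file_path).getroot()
    chunks = (chunk.strip() for chunk in root.itertext())
    return "\n".join(chunk for chunk in chunks if chunk)


# Write a small sample file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "order.xml"
    sample.write_text(
        "<order><id>42</id><item>Widget</item></order>", encoding="utf-8"
    )
    text = xml_text_only(str(sample))

print(text)
```

Dropping the tags shortens the prompt but also loses the element names, so keeping the serialized XML is the better choice when the structure itself carries meaning.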

API Reference¶

`PDFExtractor` ¶

Bases: Extractor

Extracts text from PDF files using pdfplumber.

Serves as a robust fallback when multimodal LLM processing fails.

`extract(file_path: str) -> str` ¶

Extract text from a PDF file.

PARAMETER	DESCRIPTION
`file_path`	Path to the PDF file TYPE: `str`

RETURNS	DESCRIPTION
`str`	Extracted text content from all pages

Source code in strutex/extractors/pdf.py

def extract(self, file_path: str) -> str:
    """
    Extract text from a PDF file.

    Args:
        file_path: Path to the PDF file

    Returns:
        Extracted text content from all pages
    """
    if not PDFPLUMBER_AVAILABLE:
        raise ImportError("pdfplumber is required for PDFExtractor. Install with: pip install pdfplumber")

    text_content = []
    try:
        with pdfplumber.open(file_path) as pdf:
            for i, page in enumerate(pdf.pages):
                page_text = page.extract_text()
                if page_text:
                    text_content.append(f"--- Page {i+1} ---\n{page_text}")

        return "\n\n".join(text_content)

    except Exception as e:
        logger.error(f"Failed to extract text from {file_path}: {e}")
        raise RuntimeError(f"PDF extraction failed: {e}") from e
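
Because the listing above prefixes each page with a `--- Page N ---` marker and skips pages without text, the combined string can be split back into pages downstream. `split_pages` is a hypothetical helper that assumes exactly that marker format.

```python
import re
from typing import Dict

# Matches the marker line written before each page's text.
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---$", re.MULTILINE)


def split_pages(text: str) -> Dict[int, str]:
    """Map page numbers to page text, based on the page marker lines."""
    # With one capture group, split() yields [before, num, body, num, body, ...].
    parts = PAGE_MARKER.split(text)
    pages: Dict[int, str] = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages


# Page 2 is absent here, as it would be for a page with no extractable text.
sample = "--- Page 1 ---\nInvoice 001\n\n--- Page 3 ---\nTotal: 10.00"
pages = split_pages(sample)
print(pages)
```

Keying the result by page number rather than list position preserves gaps left by empty pages. A page whose own text contains a line in the marker format would be split incorrectly.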


`ImageExtractor` ¶

Bases: Extractor

Image extractor using Tesseract OCR.

Requires pytesseract and PIL to be installed. Install with: pip install strutex[ocr]

ATTRIBUTE	DESCRIPTION
`mime_types`	MIME types this extractor handles
`priority`	Extraction priority

`can_handle(mime_type: str) -> bool` ¶

Check if this extractor can handle the given MIME type.

Source code in strutex/extractors/image.py

def can_handle(self, mime_type: str) -> bool:
    """Check if this extractor can handle the given MIME type."""
    return mime_type in self.mime_types

`extract(file_path: str) -> str` ¶

Extract text from an image using OCR.

PARAMETER	DESCRIPTION
`file_path`	Path to the image file TYPE: `str`

RETURNS	DESCRIPTION
`str`	Extracted text content

RAISES	DESCRIPTION
`RuntimeError`	If OCR dependencies are not installed

Source code in strutex/extractors/image.py

def extract(self, file_path: str) -> str:
    """
    Extract text from an image using OCR.

    Args:
        file_path: Path to the image file

    Returns:
        Extracted text content

    Raises:
        RuntimeError: If OCR dependencies are not installed
    """
    if not _OCR_AVAILABLE:
        raise RuntimeError(
            "OCR dependencies not installed. "
            "Install with: pip install strutex[ocr]"
        )

    try:
        image = Image.open(file_path)
        text = pytesseract.image_to_string(image)
        return text.strip()
    except Exception as e:
        logger.error(f"OCR extraction failed for {file_path}: {e}")
        raise RuntimeError(f"Failed to extract text from image: {e}")

`health_check() -> bool` `classmethod` ¶

Check if OCR dependencies are available.

Source code in strutex/extractors/image.py

@classmethod
def health_check(cls) -> bool:
    """Check if OCR dependencies are available."""
    return _OCR_AVAILABLE
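
How a flag such as `_OCR_AVAILABLE` gets set is not shown in the listing. One common pattern for optional dependencies, given here as an assumption rather than the library's code, is to probe for the modules and, for Tesseract, for the executable as well. `modules_available` and `ocr_ready` are hypothetical names.

```python
import importlib.util
import shutil


def modules_available(*module_names: str) -> bool:
    """True if every named top-level module can be found without importing it."""
    return all(importlib.util.find_spec(name) is not None for name in module_names)


def ocr_ready() -> bool:
    """Check the Python packages and the tesseract executable on PATH."""
    has_packages = modules_available("pytesseract", "PIL")
    has_binary = shutil.which("tesseract") is not None
    return has_packages and has_binary


stdlib_ok = modules_available("json", "csv")
missing = modules_available("json", "surely_not_a_real_module_xyz")
print(stdlib_ok, missing, ocr_ready())
```

pytesseract is a wrapper around the Tesseract program, so the Python packages can import cleanly while OCR still fails at call time if the executable is not installed.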


`ExcelExtractor` ¶

Bases: Extractor

Excel and spreadsheet extractor.

Converts spreadsheet data to a text representation suitable for LLM processing.

ATTRIBUTE	DESCRIPTION
`mime_types`	MIME types this extractor handles
`priority`	Extraction priority

`can_handle(mime_type: str) -> bool` ¶

Check if this extractor can handle the given MIME type.

Source code in strutex/extractors/excel.py

def can_handle(self, mime_type: str) -> bool:
    """Check if this extractor can handle the given MIME type."""
    return mime_type in self.mime_types

`extract(file_path: str) -> str` ¶

Extract text from a spreadsheet file.

PARAMETER	DESCRIPTION
`file_path`	Path to the spreadsheet file TYPE: `str`

RETURNS	DESCRIPTION
`str`	Text representation of the spreadsheet data

Source code in strutex/extractors/excel.py

def extract(self, file_path: str) -> str:
    """
    Extract text from a spreadsheet file.

    Args:
        file_path: Path to the spreadsheet file

    Returns:
        Text representation of the spreadsheet data
    """
    from ..documents.spreadsheet import spreadsheet_to_text
    return spreadsheet_to_text(file_path)
