Document Processing Guide

Document-processing Optional Dependency

The document-processing extra equips Insight Ingenious with a unified API and CLI for turning born-digital documents into structured text blocks. It is ideal for:

  • RAG pipelines that need paragraph‑level text with coordinates
  • Extraction flows that must gather data from mixed PDF + DOCX + PPTX collections
  • Quickly inspecting files from the command line without writing Python

1  Installation

# minimal feature set (PyMuPDF + Azure Document Intelligence wrapper + CLI)
uv pip install -e ".[document-processing]"

# include pure‑Python PDFMiner and rich‑text Unstructured engines
uv pip install -e ".[document-processing,pdfminer,unstructured]"

Why separate extras?

  • PyMuPDF is the fastest path for standard PDFs.
  • PDFMiner avoids native code – useful on Alpine or AWS Lambda.
  • Unstructured adds DOCX, PPTX, HTML and TXT support, and handles unusual PDFs.
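Because each engine lives behind its own extra, it can help to check which backing libraries are importable before choosing one. The sketch below is a minimal illustration, not part of Insight Ingenious: the ENGINE_MODULES mapping and the installed_engines helper are hypothetical names, and the mapping assumes the usual import names (fitz for PyMuPDF, pdfminer for pdfminer.six, unstructured for Unstructured).

```python
from importlib.util import find_spec

# Engine key -> top-level module it relies on.
# Assumption: these import names match the installed packages.
ENGINE_MODULES = {
    "pymupdf": "fitz",
    "pdfminer": "pdfminer",
    "unstructured": "unstructured",
}


def installed_engines(modules=ENGINE_MODULES):
    """Return the engine keys whose backing module can be imported."""
    return sorted(
        key for key, module in modules.items() if find_spec(module) is not None
    )


# Prints whichever of the three keys are usable in this environment.
print(installed_engines())
```

find_spec only locates the module without importing it, so the check stays cheap even for heavy native packages.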

2  Command‑line quick‑start

# 1️⃣  Stream a remote PDF through the pdfminer engine
ingen document-processing https://example.com/contract.pdf --engine pdfminer --out pages_pdfminer.jsonl
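The .jsonl file written by --out can be read back in Python. The sketch below assumes the file follows the JSON Lines convention (one JSON object per line) and that each object carries the same "text" key used in the Python API section; read_jsonl is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Usage (assumes the CLI command above has already been run):
#   for record in read_jsonl("pages_pdfminer.jsonl"):
#       print(record.get("text", ""))
```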

3  Python API in three lines

from pathlib import Path
from ingenious.document_processing import extract

elements = list(extract(Path("report.pdf")))  # defaults to PyMuPDF
print(elements[0]["text"])

Choosing a specific engine

for block in extract("paper.pdf", engine="pdfminer"):
    ...

Valid engine values (all case‑sensitive):

Engine key     Dependency (extra)     Best for
pymupdf        document-processing    Fast positional PDF extraction
pdfminer       pdfminer               Pure‑Python / Alpine / Lambda builds
unstructured   unstructured           DOCX, PPTX, HTML, TXT, unusual PDFs
azdocint       document-processing    Cloud‑based Azure AI Document Intelligence
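For mixed collections, one option is to route each file to an engine by its suffix. The sketch below derives its mapping from the "Best for" column above; SUFFIX_TO_ENGINE and pick_engine are hypothetical names, and suffix-based routing is an illustrative assumption rather than something the library is known to do for you.

```python
from pathlib import Path

# Suffix -> engine key, following the "Best for" column of the engine table.
# Assumption: routing by suffix is done by the caller, not by the library.
SUFFIX_TO_ENGINE = {
    ".pdf": "pymupdf",
    ".docx": "unstructured",
    ".pptx": "unstructured",
    ".html": "unstructured",
    ".txt": "unstructured",
}


def pick_engine(path, default="unstructured"):
    """Return an engine key for `path` based on its (case-insensitive) suffix."""
    return SUFFIX_TO_ENGINE.get(Path(path).suffix.lower(), default)


# Usage with the API shown earlier:
#   for block in extract(path, engine=pick_engine(path)):
#       ...
```

The suffix is lower-cased only for the lookup; the returned keys stay lower-case because engine values are case-sensitive.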

4  Azure AI Document Intelligence engine

Cloud extraction unlocks semantic paragraphs and table metadata. Set two environment variables before running or importing:

export AZURE_DOC_INTEL_ENDPOINT="https://<resource>.cognitiveservices.azure.com"
export AZURE_DOC_INTEL_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

The service is pay‑as‑you‑go; the tiny unit tests stay within the free tier.
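A quick pre-flight check can catch a missing variable before any request is made. The sketch below only inspects the two variable names given above; missing_azure_settings is a hypothetical helper, not part of the package.

```python
import os

REQUIRED_VARS = ("AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY")


def missing_azure_settings(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Usage: fail early with a readable message.
#   missing = missing_azure_settings()
#   if missing:
#       raise SystemExit("Set these variables first: " + ", ".join(missing))
```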


5  Streaming vs list

Every extractor returns a lazy generator, so elements are produced only as you consume them. Consume it as you like:

# memory‑efficient streaming
for element in extract("big.pdf"):
    process(element)

# materialise everything (convenient, but RAM‑heavy)
all_blocks = list(extract("big.pdf"))
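A middle ground between the two is to consume the generator in fixed-size batches, for example when writing to a database or an embedding service. The sketch below uses only the standard library; batched is a hypothetical helper name and the batch size is arbitrary.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materialising the whole iterable."""
    iterator = iter(iterable)
    # islice pulls at most `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


# Usage with the API shown above:
#   for batch in batched(extract("big.pdf"), 100):
#       process_many(batch)  # process_many is a placeholder for your own code
```

Only one batch is held in memory at a time, so peak usage is bounded by the batch size rather than by the document.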

6  Running the test‑suite

Azure Document Intelligence integration assets

The Azure DI tests look for four tiny sample files inside ingenious/document_processing/tests/data_azure_doc_intell/:

  • sample.pdf
  • sample.jpg
  • sample.png
  • sample.tiff

Create that folder and drop the files in before running pytest -m integration. If the folder or any file is missing, the tests are auto-skipped, so the rest of the suite still runs cleanly.
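To see in advance whether the integration tests would be skipped, the folder can be checked for the four expected files. This is a minimal sketch of such a check, not the logic the test suite itself uses; missing_samples is a hypothetical helper.

```python
from pathlib import Path

SAMPLE_NAMES = ("sample.pdf", "sample.jpg", "sample.png", "sample.tiff")


def missing_samples(folder):
    """Return the expected sample file names that are absent from `folder`."""
    folder = Path(folder)
    # A folder that does not exist simply reports every file as missing.
    return [name for name in SAMPLE_NAMES if not (folder / name).is_file()]


# Usage:
#   print(missing_samples("ingenious/document_processing/tests/data_azure_doc_intell"))
```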

# Core engines only
uv pip install -e ".[document-processing,tests]"

Add pdfminer and unstructured extras to expand coverage:

uv pip install -e ".[document-processing,pdfminer,unstructured,tests]"

# Run all tests
uv run pytest ingenious/document_processing/tests

7  Troubleshooting

Symptom                                       Likely cause / fix
ValueError: Unknown engine 'xyz'              Typo in the engine key; run ingen document-processing extract --help
ModuleNotFoundError: fitz                     The document-processing extra is not installed
CLI exits with an "extra not installed" tip   Install the suggested extra with uv pip install
Empty output on scanned PDF                   Engine has no OCR; use Azure DI (azdocint)
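The scanned-PDF case can be handled in code by retrying with the Azure engine when a local engine returns no text. The sketch below takes the extract function as an argument so it stays self-contained; extract_with_fallback is a hypothetical helper, and it assumes elements are dict-like with a "text" key, as in the Python API example.

```python
def extract_with_fallback(extract, path, primary="pymupdf", fallback="azdocint"):
    """Return blocks from `primary`, or from `fallback` if no block has text."""
    # Materialise the primary result so it can be inspected and then returned.
    blocks = list(extract(path, engine=primary))
    if any((block.get("text") or "").strip() for block in blocks):
        return blocks
    # No usable text: likely a scanned document, so retry with the fallback engine.
    return list(extract(path, engine=fallback))


# Usage:
#   from ingenious.document_processing import extract
#   blocks = extract_with_fallback(extract, "scan.pdf")
```

This trades streaming for simplicity, since the primary result has to be fully read before deciding whether to fall back.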

Happy extracting 📑