Document Processing Guide

Document-processing Optional Dependency

The document‑processing extra equips Insight Ingenious with a unified API and CLI for turning born‑digital documents into structured text blocks. It is ideal for:

RAG pipelines that need paragraph‑level text with coordinates
Extraction flows that must gather data from mixed PDF + DOCX + PPTX collections
Quickly inspecting files from the command line without writing Python

1 Installation

# minimal feature set (PyMuPDF + Azure Document Intelligence wrapper + CLI)
uv pip install -e ".[document-processing]"

# include pure‑Python PDFMiner and rich‑text Unstructured engines
uv pip install -e ".[document-processing,pdfminer,unstructured]"

Why separate extras?

PyMuPDF is the fastest path for standard PDFs.

PDFMiner avoids native code – useful on Alpine or AWS Lambda.

Unstructured adds DOCX, PPTX, HTML and TXT support.

2 Command‑line quick‑start

# Stream a remote PDF through the pdfminer engine
ingen document-processing https://example.com/contract.pdf --engine pdfminer --out pages_pdfminer.jsonl
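The file written by --out can then be read back from Python. The sketch below assumes the file holds one JSON object per line (as the .jsonl extension suggests) and that each object carries a "text" key, as in the Python API example; read_blocks is a hypothetical helper, not part of Ingenious.

```python
import json
from pathlib import Path
from typing import Iterator


def read_blocks(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file (assumed layout)."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Example use (assumes the CLI command above has already been run):
# for block in read_blocks(Path("pages_pdfminer.jsonl")):
#     print(block.get("text", ""))
```

Reading line by line keeps memory use flat even when the output covers a large document.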

3 Python API in three lines

from pathlib import Path
from ingenious.document_processing import extract

elements = list(extract(Path("report.pdf")))  # defaults to PyMuPDF
print(elements[0]["text"])

Choosing a specific engine

for block in extract("paper.pdf", engine="pdfminer"):
    ...

Valid engine values (all case‑sensitive):

Engine key	Dependency (extra)	Best for
`pymupdf`	`document-processing`	Fast positional PDF extraction
`pdfminer`	`pdfminer`	Pure‑Python / Alpine / Lambda builds
`unstructured`	`unstructured`	DOCX, PPTX, HTML, TXT, unusual PDFs
`azdocint`	`document-processing`	Cloud‑based Azure AI Document Intelligence
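For mixed collections, the engine key can be chosen per file. The following is a minimal sketch that maps file suffixes to the engine keys from the table above; pick_engine and its mapping are illustrative assumptions, not part of the Ingenious API.

```python
from pathlib import Path

# Suffix-to-engine mapping derived from the table above (an assumption,
# adjust it to the extras you actually have installed).
_SUFFIX_TO_ENGINE = {
    ".pdf": "pymupdf",
    ".docx": "unstructured",
    ".pptx": "unstructured",
    ".html": "unstructured",
    ".txt": "unstructured",
}


def pick_engine(source: str, default: str = "pymupdf") -> str:
    """Return an engine key for a path or URL based on its file suffix."""
    # Drop any query string so URLs such as ".../file.pdf?dl=1" still match.
    suffix = Path(source.split("?", 1)[0]).suffix.lower()
    return _SUFFIX_TO_ENGINE.get(suffix, default)


# Example use (requires the relevant extras):
# for block in extract("slides.pptx", engine=pick_engine("slides.pptx")):
#     ...
```

The returned keys are lower-case, which matters because engine values are case‑sensitive.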

4 Azure AI Document Intelligence engine

Cloud extraction unlocks semantic paragraphs and table metadata. Set two environment variables before running or importing:

export AZURE_DOC_INTEL_ENDPOINT="https://<resource>.cognitiveservices.azure.com"
export AZURE_DOC_INTEL_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

The service is pay‑as‑you‑go; the tiny unit tests stay within the free tier.
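Before selecting the azdocint engine it can help to confirm that both variables are present. The check below is a sketch only; missing_azure_settings is a hypothetical helper, and the engine itself may validate its configuration differently.

```python
import os
from typing import Mapping

# Variable names taken from the export lines above.
REQUIRED_VARS = ("AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY")


def missing_azure_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example use:
# missing = missing_azure_settings()
# if missing:
#     raise SystemExit(f"Set these variables first: {', '.join(missing)}")
```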

5 Streaming vs list

Every extractor returns a lazy generator, so you can consume the output however you like:

# memory‑efficient streaming
for element in extract("big.pdf"):
    process(element)

# materialise everything (convenient, but RAM‑heavy)
all_blocks = list(extract("big.pdf"))
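A middle ground between the two is to process the stream in fixed-size batches, for example when sending blocks to an embedding service. The sketch below works on any iterable of dicts; batched is a hypothetical helper, not something the library is known to provide.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(elements: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `size` elements without reading the whole stream."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(elements)
    # islice pulls only `size` items at a time, so the source stays lazy.
    while batch := list(islice(iterator, size)):
        yield batch


# Example use (assumes extract is imported as shown earlier):
# for batch in batched(extract("big.pdf"), 64):
#     process_batch(batch)  # process_batch is a placeholder
```

Only one batch is held in memory at a time, so peak usage depends on the batch size rather than the document size.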

6 Running the test‑suite

Azure Document Intelligence integration assets

The Azure DI tests look for four tiny sample files:

sample.pdf
sample.jpg
sample.png
sample.tiff

Place them inside ingenious/document_processing/tests/data_azure_doc_intell/ (create the folder if it does not exist) before running pytest -m integration. If the folder or any file is missing, the tests are auto‑skipped, so the rest of the suite still runs cleanly.
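To see in advance which assets are absent, a small check along these lines can be run from the repository root. It is a sketch under the assumption that the four file names above are the complete list; missing_samples is a hypothetical helper, and the test suite's own skip logic may be implemented differently.

```python
from pathlib import Path

# File names taken from the list above.
SAMPLE_FILES = ("sample.pdf", "sample.jpg", "sample.png", "sample.tiff")


def missing_samples(data_dir: Path) -> list[str]:
    """Return the sample file names that are absent from `data_dir`."""
    # A non-existent folder simply reports every file as missing.
    return [name for name in SAMPLE_FILES if not (data_dir / name).is_file()]


# Example use:
# data_dir = Path("ingenious/document_processing/tests/data_azure_doc_intell")
# print(missing_samples(data_dir) or "all sample files present")
```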

# Core engines only
uv pip install -e ".[document-processing,tests]"

Add pdfminer and unstructured extras to expand coverage:

uv pip install -e ".[document-processing,pdfminer,unstructured,tests]"

# Run all tests
uv run pytest ingenious/document_processing/tests

7 Troubleshooting

Symptom	Likely cause / fix
`ValueError: Unknown engine 'xyz'`	Typo — run `ingen document-processing extract --help`
`ModuleNotFoundError: fitz`	Forgot `document-processing` extra
CLI exits with an "extra not installed" tip	Install the suggested extra with `uv pip install`
Empty output on scanned PDF	Engine has no OCR — use Azure DI
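The last symptom can be detected programmatically before deciding to retry with the Azure engine. The sketch below assumes each element is a dict with a "text" key, as in the Python API example; has_text is a hypothetical helper.

```python
from typing import Iterable


def has_text(elements: Iterable[dict]) -> bool:
    """Return True as soon as any element carries non-blank text."""
    return any(str(element.get("text", "")).strip() for element in elements)


# Example use (assumes extract is imported as shown earlier):
# if not has_text(extract("scan.pdf")):
#     blocks = extract("scan.pdf", engine="azdocint")
```

Because any() stops at the first match, the check reads only as much of the stream as it needs.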

Happy extracting 📑