Data Preparation Guide
Dataprep Crawler (Scrapfly)
This optional pack lets you scrape web pages straight from the Insight Ingenious CLI.
Installation (extras)
uv pip install -e ".[dataprep,tests]" # crawler + its test suite
Requires a Scrapfly key (SCRAPFLY_API_KEY). Add it to your environment or a .env file.
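The lookup order described above (environment first, then a .env file) can be sketched as follows. This is an illustration only: `resolve_scrapfly_key` is a hypothetical helper, not part of the pack, and the pack's real loader may behave differently (for example, it may rely on a dotenv library).

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_scrapfly_key(
    env: Optional[Mapping[str, str]] = None,
    dotenv_path: str = ".env",
) -> Optional[str]:
    """Return SCRAPFLY_API_KEY from the environment, else from a .env file.

    Hypothetical helper: the lookup order is an assumption based on the
    sentence above, not the pack's confirmed implementation.
    """
    env = os.environ if env is None else env
    key = env.get("SCRAPFLY_API_KEY")
    if key:
        return key

    path = Path(dotenv_path)
    if not path.is_file():
        return None

    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not NAME=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        # Accept both `NAME=value` and `export NAME=value`.
        if name.startswith("export "):
            name = name[len("export "):].strip()
        if name == "SCRAPFLY_API_KEY":
            return value.strip().strip("'\"") or None
    return None
```

Calling `resolve_scrapfly_key()` with no arguments checks the real process environment and a .env file in the current directory.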
3. Quick‑Start (CLI)
3.1. Single URL
# Pretty‑print a JSON envelope to stdout
SCRAPFLY_API_KEY=sk_live_… \
ingen dataprep crawl https://example.com
3.2. Batch Mode
# Scrape three URLs, write newline‑delimited JSON to a file
ingen dataprep batch \
https://a.com https://b.com https://c.com \
--out pages.ndjson --raw
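A file written by the batch command can be read back one record per line. The sketch below assumes only that each non-empty line is a standalone JSON value, which is what newline-delimited JSON means. The keys in the sample data are hypothetical, since the envelope's fields are not documented here.

```python
import io
import json
from typing import Any, Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno} is not valid JSON: {exc}") from exc


# In-memory stand-in for pages.ndjson; the "url" key is a made-up example.
sample = io.StringIO('{"url": "https://a.com"}\n\n{"url": "https://b.com"}\n')
for record in iter_ndjson(sample):
    print(sorted(record))
```

To read the real output, pass an open file handle instead, e.g. `with open("pages.ndjson", encoding="utf-8") as fh: records = list(iter_ndjson(fh))`.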
3.3. Flag Reference
Flag | Forwarded kwarg | Description |
---|---|---|
--api-key | api_key | Scrapfly key for this run (else env var) |
--max-attempts | max_attempts | Total tries per URL (default 5) |
--retry-code | retry_on_status_code | Repeatable – add HTTP codes to retry list |
--delay | delay | Initial back‑off in seconds (doubles each retry) |
--js / --no-js | extra_scrapfly_cfg.render_js | Toggle headless browser rendering |
--extra-scrapfly-cfg | extra_scrapfly_cfg | Raw JSON merged over default SDK config |
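To see how --max-attempts and --delay interact, the waits implied by "doubles each retry" can be computed directly. This is a minimal sketch under stated assumptions: waits happen only between attempts, with no jitter and no upper cap. `backoff_schedule` is a hypothetical helper, and the crawler's real retry loop may differ.

```python
from typing import List


def backoff_schedule(max_attempts: int, delay: float) -> List[float]:
    """Seconds to wait before each retry, doubling from the initial delay.

    With max_attempts total tries there are max_attempts - 1 waits.
    Assumes pure doubling (no jitter, no cap), which is not confirmed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return [delay * 2**i for i in range(max_attempts - 1)]


# Default of 5 tries with an illustrative 1-second initial delay.
waits = backoff_schedule(max_attempts=5, delay=1.0)
print(waits)       # [1.0, 2.0, 4.0, 8.0]
print(sum(waits))  # 15.0 seconds of waiting in the worst case
```

Under these assumptions, raising --max-attempts by one roughly doubles the worst-case total wait per URL.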
Run ingen dataprep crawl --help to see the full Typer‑generated help screen.
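The --js / --no-js and --extra-scrapfly-cfg rows in the flag reference both feed the same extra_scrapfly_cfg mapping. One plausible reading of "merged over default SDK config" is sketched below. It assumes a shallow merge and that an explicit --js / --no-js wins over the raw JSON. Neither is confirmed by this guide. `merge_scrapfly_cfg` and the default values are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def merge_scrapfly_cfg(
    defaults: Dict[str, Any],
    extra_json: Optional[str] = None,
    render_js: Optional[bool] = None,
) -> Dict[str, Any]:
    """Overlay user-supplied JSON, then the JS toggle, on top of defaults.

    Shallow merge and flag-over-JSON precedence are assumptions.
    """
    merged = dict(defaults)  # copy so the caller's defaults stay untouched
    if extra_json:
        extra = json.loads(extra_json)
        if not isinstance(extra, dict):
            raise ValueError("--extra-scrapfly-cfg must be a JSON object")
        merged.update(extra)
    if render_js is not None:
        merged["render_js"] = render_js
    return merged


# Hypothetical defaults; the pack's real default config is not shown here.
defaults = {"render_js": False, "country": "us"}
print(merge_scrapfly_cfg(defaults, '{"country": "de"}', render_js=True))
# {'render_js': True, 'country': 'de'}
```

If the real merge is deep rather than shallow, nested objects in the JSON would be combined key by key instead of replaced wholesale.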
3.4. Fresh‑Clone Walkthrough
Goal: go from a fresh clone to running unit tests, e2e tests, and the new CLI in one continuous shell session.
Prerequisites: uv is installed • You are in the repo root (ingenious/).
# 1️⃣ Build an isolated virtual‑env and install extras
uv venv # creates .venv/
source .venv/bin/activate
uv pip install --python .venv/bin/python -e ".[dataprep,tests]"
# 2️⃣ Supply your Scrapfly key (required for live tests / CLI)
export SCRAPFLY_API_KEY="sk_live_your_real_key_here"
# – or – add the same line to a .env at repo root
# 3️⃣ Run all tests for data prep
uv run pytest ingenious/dataprep/tests
# 4️⃣ Smoke‑test the new CLI commands
## 4.a Single‑page scrape (pretty JSON)
ingen dataprep crawl \
"https://www.medicalnewstoday.com/articles/tyrer-cuzick-score#summary"
## 4.b Batch scrape two URLs (NDJSON → file)
ingen dataprep batch \
"https://www.volparahealth.com/news/how-breast-density-impacts-lifetime-cancer-risk" \
"https://www.medicalnewstoday.com/articles/tyrer-cuzick-score#summary" \
--out pages.ndjson
Together, these steps cover everything the Dataprep pack adds: environment setup, the test suite, and both CLI commands.
Need details? See the flag reference above or call ingen dataprep crawl --help. Happy scraping! 🚀