Synthetic Data Generation Package¶
Home > Packages > Synthetic Data Generation
The Synthetic Data Generation package creates realistic test datasets at any scale. It supports multiple data patterns and both single-dataset and incremental time-series generation.
At a Glance¶
| Item | Summary |
|---|---|
| Purpose | Generate realistic synthetic datasets for testing and development |
| CLI Commands | generate, list, compile |
| Target Environments | Lakehouse, Warehouse |
| Scale Range | 1K to 1B+ rows |
| Data Domains | Retail, Finance, Healthcare, E-commerce |
| Schema Types | OLTP (transactional), OLAP (star schema) |
| Generation Modes | Single dataset, Time-series incremental |
| Key Features | Realistic relationships, configurable patterns, reproducible seeds |
Overview¶
The package provides: - Multi-Scale Generation: From thousands to billions of rows - Domain-Specific Datasets: Pre-configured datasets for retail, finance, healthcare, and e-commerce - Flexible Schema Support: OLTP (transactional) and OLAP (star schema) patterns - Single and Series Generation: One-time datasets or time-based incremental data - Runtime Configuration: Modify parameters through notebook configuration
Current CLI Commands¶
The package currently supports three main commands:
1. generate - Create Ready-to-Run Notebooks¶
ingen_fab package synthetic-data generate <config> [OPTIONS]
Options:
--mode Generation mode: 'single' or 'series' [default: single]
--parameters JSON string with generation parameters
--output-path Custom output path for generated files
--dry-run Preview without generating files
--target-environment Target: 'lakehouse' or 'warehouse' [default: lakehouse]
--no-execute Don't execute the notebook after generation
2. list - Show Available Configurations¶
ingen_fab package synthetic-data list [OPTIONS]
Options:
--type, -t What to list: 'datasets', 'templates', or 'all' [default: all]
--format, -f Output format: 'table' or 'json' [default: table]
3. compile - Create Template Notebooks¶
ingen_fab package synthetic-data compile [TEMPLATE] [OPTIONS]
Options:
--runtime-config JSON configuration for compilation
--output-format Output: 'notebook', 'ddl', or 'all' [default: all]
--target-environment Target: 'lakehouse' or 'warehouse' [default: lakehouse]
Available Datasets¶
Retail Domain¶
retail_oltp_small/retail_oltp_large- Transactional systemretail_star_small/retail_star_large- Analytics star schema
Financial Domain¶
finance_oltp_small/finance_oltp_large- Banking systemfinance_star_small/finance_star_large- Financial analytics
E-commerce Domain¶
ecommerce_star_small/ecommerce_star_large- E-commerce analytics
Healthcare Domain¶
healthcare_oltp_small- Healthcare system (anonymized)
Single Dataset Generation¶
Basic Usage¶
Generate a single dataset with default parameters:
# Generate small retail dataset
ingen_fab package synthetic-data generate retail_oltp_small
# Generate with custom parameters
ingen_fab package synthetic-data generate retail_oltp_small \
--parameters '{"target_rows": 100000, "seed_value": 42}'
Configuration Parameters¶
Working Parameters for Single Dataset¶
| Parameter | Description | Default | Example |
|---|---|---|---|
target_rows | Number of rows to generate | 10000 | 100000 |
seed_value | Random seed for reproducibility | None | 42 |
output_mode | Output format | "table" | "parquet", "csv" |
generation_mode | Generation engine | "auto" | "pyspark" |
scale_factor | Multiplier for all table sizes | 1.0 | 2.5 |
Parameters That Exist But Don't Work¶
These parameters may appear in configurations but have no effect: - chunk_size - Not implemented in generation logic - parallel_tables - Tables are always generated sequentially - memory_fraction - Memory management not configurable - enable_spill - Disk spilling not implemented - compression - Compression settings ignored - parallel_workers - No parallel processing
Generation Modes¶
auto: Automatically selects PySpark (always uses PySpark currently)pyspark: Uses distributed Spark for generationpython: Not actually available (falls back to PySpark)
Series/Incremental Generation¶
For time-based data generation, see Incremental Synthetic Data Generation.
Related Documentation¶
- Synthetic Data Generation Enhancements - Latest features and improvements
- Incremental Synthetic Data Generation - Time-series data generation
- Extract Generation - Synthetic Data Alignment - Integration with extract generation
- Incremental Synthetic Data Import Integration - Integration with flat file ingestion
Overview¶
Key points: - Generates data day-by-day for date ranges - batch_size groups days for memory management - Each day's data has correct date values embedded
Generic Templates¶
The package includes generic notebook templates that can be parameterized at runtime:
Available Templates¶
generic_single_dataset_lakehouse- Single dataset for lakehousegeneric_single_dataset_warehouse- Single dataset for warehousegeneric_incremental_series_lakehouse- Time series for lakehousegeneric_incremental_series_warehouse- Time series for warehouse
Using Generic Templates¶
# Compile all generic templates
ingen_fab package synthetic-data compile
# Compile specific template
ingen_fab package synthetic-data compile generic_single_dataset_lakehouse
# Generate using generic template
ingen_fab package synthetic-data generate generic_single_dataset_lakehouse \
--parameters '{"dataset_id": "retail_oltp_small", "target_rows": 50000}'
File Organization¶
Table Output Mode¶
Creates Delta/Parquet tables in the lakehouse:
Parquet/CSV Output Mode¶
Creates files in the Files section:
Files/synthetic_data/
└── retail_oltp_small/
├── customers.parquet
├── products.parquet
├── orders.parquet
└── order_items.parquet
DDL Scripts Created¶
The package creates configuration and logging tables:
Lakehouse: - config_synthetic_data_datasets - Dataset configurations - log_synthetic_data_generation - Generation history
Warehouse: - config.config_synthetic_data_datasets - config.log_synthetic_data_generation
Data Characteristics¶
Table Relationships¶
The generated data maintains referential integrity: - Orders reference valid customer IDs - Order items reference valid order and product IDs - Foreign key relationships are preserved
Data Distributions¶
- Customer ages: Normal distribution (18-95 years)
- Order amounts: Log-normal distribution
- Dates: Evenly distributed or seasonal patterns
- Categories: Realistic category distributions
Limitations¶
- No NULL handling: All fields are populated (no realistic NULL patterns)
- No data evolution: Customer/product attributes don't change over time
- Simple distributions: No complex multi-modal distributions
- No data quality issues: No duplicates, conflicts, or anomalies unless explicitly coded
Performance Considerations¶
Memory Usage¶
- Approximate memory needed: ~1GB per 10M rows
- Larger datasets may require cluster scaling
- No automatic memory management or spilling
Generation Speed¶
- Small datasets (<100K rows): 1-5 seconds
- Medium datasets (1M rows): 10-30 seconds
- Large datasets (100M rows): 5-15 minutes
- Very large (1B+ rows): 1-2 hours
Best Practices¶
- Start with small datasets for testing
- Use consistent seeds for reproducibility
- Monitor cluster resources for large generations
- Use parquet format for better compression
Common Issues and Solutions¶
Issue: Out of Memory¶
Solution: - Reduce target_rows - Scale up cluster - Generate tables individually
Issue: Slow Generation¶
Solution: - Ensure sufficient cluster cores - Use generation_mode: "pyspark" - Reduce data complexity
Issue: Cannot Find Dataset¶
Solution: - Run ingen_fab package synthetic-data list to see available datasets - Check spelling of dataset_id - Ensure dataset is in configuration
Examples¶
Example 1: Quick Test Dataset¶
# Generate small dataset for unit testing
ingen_fab package synthetic-data generate retail_oltp_small \
--parameters '{"target_rows": 1000, "seed_value": 42}'
Example 2: Large Analytics Dataset¶
# Generate large star schema
ingen_fab package synthetic-data generate retail_star_large \
--parameters '{"target_rows": 10000000, "output_mode": "parquet"}'
Example 3: Reproducible Dataset¶
# Always generates identical data
ingen_fab package synthetic-data generate finance_oltp_small \
--parameters '{"seed_value": 12345, "target_rows": 50000}'
Features Not Yet Implemented¶
The following features are mentioned in configurations but not functional:
CLI Commands¶
generate-incremental- Usegeneratewith mode='series' insteadgenerate-series- Usegeneratewith mode='series' insteadexecute-with-parameters- Not implemented--resumeflag - No resume capability
Configuration Options¶
- State management between runs
- Growth and churn rates
- Holiday detection and multipliers
- Data drift simulation
- Parallel table generation
- Memory management settings
- Custom schema definitions
- Data quality patterns (nulls, duplicates)
- Temporal patterns beyond day-of-week
Performance Features¶
- Chunking for very large datasets
- Disk spilling for memory management
- Parallel processing
- Delta Lake optimizations (Z-ordering, compaction)
- Adaptive query execution
Next Steps¶
- For time-series data, see Incremental Generation
- Review available datasets with
ingen_fab package synthetic-data list - Start with small datasets to understand the data structure
- Scale up gradually for performance testing