Synthetic Data Generator Enhancements¶
Home > Packages > Synthetic Data Generator Enhancements
Related: Synthetic Data Generation | Incremental Generation
This document outlines the comprehensive enhancements made to the synthetic data generator package to address the requirements for:
- Runtime parameter flexibility - Moving more code to
python_libsfor runtime configuration - Enhanced file naming and folder structures - Supporting date-based patterns and custom naming
- Comprehensive logging with date column tracking - Detailed correlation information
Overview of Changes¶
1. Enhanced Configuration System (python_libs/common/synthetic_data_config.py)¶
Key Features: - Runtime Parameter Support: Configuration can be modified at notebook execution time - Flexible Table Definitions: Tables can be configured with different types (snapshot/incremental) - Dynamic Row Calculation: Row counts calculated based on seasonal patterns, growth rates, and multipliers - Validation System: Built-in configuration validation with detailed error reporting
Classes: - TableGenerationConfig: Individual table configuration with runtime calculation methods - DatasetConfiguration: Complete dataset configuration with runtime override support - ConfigurationManager: Manages predefined configurations and template creation
Example Usage:
# Create configuration that can be modified at runtime
config = DatasetConfiguration.from_dict(base_config)
# Apply runtime overrides without recompiling
runtime_overrides = {
"table_configs": {
"customers": {"base_rows": 100000, "growth_rate": 0.003},
"orders": {"base_rows_per_day": 50000, "seasonal_enabled": True}
},
"incremental_config": {
"seasonal_multipliers": {"friday": 1.5, "saturday": 1.3}
}
}
config.apply_runtime_overrides(runtime_overrides)
2. Advanced File Path Generation (python_libs/common/file_path_utils.py)¶
Key Features: - Multiple Path Patterns: 9+ predefined patterns including nested, flat, and Hive-style - Date-Based Naming: Files can include dates in folders or filenames - Custom Pattern Support: Users can define custom path templates - Path Validation: Validates paths for common issues
Available Patterns: - nested_daily: /2024/01/15/table_name.parquet - flat_with_date: /20240115_table_name.parquet - hive_partitioned: /year=2024/month=01/day=15/table_name.parquet - quarter_based: /2024/Q1/table_name_20240115.parquet - And more...
Example Usage:
# Configure custom path pattern at runtime
runtime_overrides = {
"output_settings": {
"path_format": "hive_partitioned", # or "custom"
"custom_path_pattern": "{base_path}/year={yyyy}/month={mm}/{table_name}_{yyyymmdd}",
"file_extension": "parquet"
}
}
3. Enhanced Logging with Date Column Tracking (python_libs/common/synthetic_data_logger.py)¶
Key Features: - Date Column Analysis: Automatically identifies and analyzes date columns in generated data - Correlation Tracking: Links file names to actual date ranges in the data - Comprehensive Metrics: Tracks generation rates, durations, and data quality - Export Capabilities: Logs can be exported to JSON or CSV
Classes: - SyntheticDataLogger: Main logging interface with enhanced capabilities - TableGenerationMetrics: Comprehensive metrics for each table - DatasetGenerationSummary: Overall dataset generation summary - DateColumnInfo: Detailed information about date columns
Example Output:
π Generated orders:
π Rows: 100,000
π
Date columns: order_date (PRIMARY), shipped_date, delivered_date
ποΈ Data date range: 2024-01-15 to 2024-01-15
π File 20240115_orders.parquet contains 100,000 records with order_date = 2024-01-15
4. Enhanced Incremental Data Utils (python_libs/pyspark/incremental_synthetic_data_utils.py)¶
Key Features: - Configuration-Driven Generation: Uses enhanced configuration system - Advanced File Path Support: Integrates with new path generation system - Enhanced Logging Integration: Provides detailed metrics and correlation info - Backward Compatibility: Falls back to legacy methods when enhanced features unavailable
New Methods: - generate_dataset_from_config(): Main method using enhanced configuration - _generate_enhanced_file_path(): Uses advanced path patterns - _save_table_data_enhanced(): Enhanced saving with path validation
5. Configurable Data Generator (python_libs/pyspark/configurable_synthetic_data_generator.py)¶
Key Features: - High-Level Interface: Simplified API for complex data generation - Runtime Scaling: Apply scale factors and multipliers at runtime - Preview Mode: Preview generation plans without generating data - Template System: Create configurations from predefined templates
Classes: - ConfigurableSyntheticDataGenerator: Main generator with runtime configuration - ConfigurableDatasetBuilder: High-level dataset construction utilities
Example Usage:
# Initialize generator
generator = ConfigurableSyntheticDataGenerator(
lakehouse_utils_instance=lh_utils,
enhanced_logging=True
)
# Generate with runtime scaling
results = generator.generate_dataset_with_runtime_scaling(
base_config_id="retail_oltp_enhanced",
generation_date="2024-01-15",
scale_factor=1.5, # 50% more data
table_multipliers={"orders": 2.0} # Double orders specifically
)
# Preview before generating
plan = generator.preview_generation_plan(
dataset_config_id="retail_oltp_enhanced",
generation_date="2024-01-15"
)
6. Enhanced Notebook Template (templates/enhanced_synthetic_data_lakehouse_notebook.py.jinja)¶
Key Features: - Runtime Parameter Cells: Configurable parameters at the top of notebook - Generation Preview: Shows what will be generated before execution - Enhanced Validation: Comprehensive data quality checks - Detailed Reporting: Rich output with correlation information
Runtime Parameters:
# Runtime Configuration Overrides (Configurable at Runtime!)
runtime_overrides = {
"global_scale_factor": 1.0,
"table_configs": {
"customers": {"base_rows": 50000, "growth_enabled": True},
"orders": {"base_rows_per_day": 100000, "seasonal_enabled": True}
},
"incremental_config": {
"seasonal_multipliers": {
"friday": 1.3, "saturday": 1.2, "sunday": 0.7
}
},
"output_settings": {
"path_format": "nested_daily",
"custom_path_pattern": "{base_path}/year={yyyy}/month={mm}/{table_name}_{yyyymmdd}"
}
}
7. Enhanced Package Compiler (synthetic_data_generation.py)¶
Key Features: - Enhanced Compilation Methods: New methods for advanced configuration - Template Selection: Automatically chooses appropriate templates - Configuration Templates: Support for predefined configuration templates - Backward Compatibility: Falls back to legacy methods when needed
New Methods: - compile_enhanced_synthetic_data_notebook(): Uses enhanced features - compile_configurable_dataset_notebook(): Uses configuration templates - compile_all_enhanced_synthetic_data_notebooks(): Batch compilation with enhancements
Benefits Achieved¶
A) Runtime Parameter Flexibility β ¶
Before: Parameters hardcoded at compilation time
After: Parameters configurable at runtime
# Configurable at runtime
runtime_overrides = {
"table_configs": {
"orders": {"base_rows_per_day": 50000} # Can be changed without recompiling
},
"global_scale_factor": 1.5 # Apply 50% scaling to all tables
}
B) Enhanced Date-Based File Naming β ¶
Before: Limited to basic patterns - /YYYY/MM/DD/table_name.parquet - YYYYMMDD_table_name.parquet
After: Multiple flexible patterns - Hive partitioning: /year=2024/month=01/day=15/table_name.parquet - Quarter-based: /2024/Q1/table_name_20240115.parquet - Custom patterns: User-defined templates
C) Comprehensive Logging with Date Correlation β ¶
Before: Basic row count logging
After: Detailed correlation information
β
Generated orders:
π Rows: 100,000
π
Date columns: order_date (PRIMARY), shipped_date, delivered_date
ποΈ Data date range: 2024-01-15 to 2024-01-15
π File 20240115_orders.parquet contains 100,000 records with order_date = 2024-01-15
Usage Examples¶
1. Basic Enhanced Generation¶
from configurable_synthetic_data_generator import ConfigurableSyntheticDataGenerator
generator = ConfigurableSyntheticDataGenerator(lakehouse_utils_instance=lh_utils)
# Generate with runtime overrides
results = generator.generate_dataset_from_config(
dataset_config_id="retail_oltp_enhanced",
generation_date="2024-01-15",
runtime_overrides={
"table_configs": {"orders": {"base_rows_per_day": 50000}},
"output_settings": {"path_format": "hive_partitioned"}
}
)
2. Custom Dataset Creation¶
# Define custom tables at runtime
custom_tables = [
{
"table_name": "events",
"table_type": "incremental",
"base_rows_per_day": 10000,
"date_columns": ["event_timestamp"],
"primary_date_column": "event_timestamp"
}
]
results = generator.generate_custom_dataset(
dataset_id="custom_events",
table_definitions=custom_tables,
generation_date="2024-01-15"
)
3. Preview Generation Plan¶
# Preview what will be generated
plan = generator.preview_generation_plan(
dataset_config_id="retail_oltp_enhanced",
generation_date="2024-01-15",
runtime_overrides={"global_scale_factor": 2.0}
)
print(f"Will generate {plan['total_estimated_rows']:,} rows across {len(plan['tables'])} tables")
Migration Guide¶
For Existing Users¶
- No Breaking Changes: All existing notebooks continue to work
- Gradual Migration: Can adopt enhanced features incrementally
- Fallback Support: Enhanced features gracefully fall back to legacy methods
For New Users¶
- Use Enhanced Templates: Start with
enhanced_synthetic_data_lakehouse_notebook.py.jinja - Configure at Runtime: Modify parameters in the notebook rather than recompiling
- Leverage Advanced Features: Use custom path patterns and enhanced logging
Technical Architecture¶
Enhanced Synthetic Data Generator Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Notebook Layer β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Enhanced Template with Runtime Parameters β
β β’ Configurable parameters at notebook execution β
β β’ Preview generation plans β
β β’ Enhanced validation and reporting β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Configuration Layer β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β ConfigurableSyntheticDataGenerator β
β β’ Runtime parameter application β
β β’ Template-based configuration β
β β’ High-level dataset building β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βββββββββββββΌββββββββββββ
βΌ βΌ βΌ
βββββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β Configuration β β File Path β β Enhanced β
β Management β β Generation β β Logging β
βββββββββββββββββββ€ ββββββββββββββββ€ ββββββββββββββββ€
β β’ Runtime β β β’ 9+ patternsβ β β’ Date columnβ
β overrides β β β’ Custom β β analysis β
β β’ Validation β β templates β β β’ Correlationβ
β β’ Templates β β β’ Path β β tracking β
βββββββββββββββββββ β validation β β β’ Metrics β
ββββββββββββββββ ββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Generation Layer β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Enhanced Incremental Generator + Base Generators β
β β’ Configuration-driven generation β
β β’ Advanced file path support β
β β’ Comprehensive logging integration β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Conclusion¶
These enhancements successfully address all three requirements:
- β Runtime Flexibility: Parameters can now be modified at notebook execution time without recompilation
- β Advanced File Naming: Multiple date-based patterns support various organizational needs
- β Enhanced Logging: Comprehensive date column tracking provides clear correlation between files and data
The implementation maintains backward compatibility while providing powerful new capabilities for users who need more flexibility and control over their synthetic data generation workflows.