Incremental Synthetic Data Generation¶
Related: Synthetic Data Generation | Import Integration | Enhancements
This document describes the incremental synthetic data generation functionality for creating time-series test data.
Overview¶
The incremental generation system creates synthetic data for specific date ranges, with each day's data generated independently. This is useful for:
- Testing time-based analytics and reporting
- Creating realistic historical data patterns
- Simulating daily operational data
How It Works¶
Batch Size¶
The batch_size parameter controls how many days are grouped together for processing:
- Purpose: Memory management and progress tracking
- Example: batch_size=10 with 30 days creates 3 batches of 10 days each
- Note: Data is still generated day-by-day within each batch
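The grouping described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the function name `split_into_batches` is hypothetical, and it assumes that both start and end dates are inclusive and that the last batch may be shorter than batch_size.

```python
from datetime import date, timedelta


def split_into_batches(start_date, end_date, batch_size):
    """Group every day in [start_date, end_date] into consecutive batches.

    Hypothetical helper: assumes an inclusive date range and allows a
    shorter final batch when the range is not a multiple of batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total_days = (end_date - start_date).days + 1
    days = [start_date + timedelta(days=i) for i in range(total_days)]
    # Slice the day list into chunks of at most batch_size days.
    return [days[i:i + batch_size] for i in range(0, len(days), batch_size)]


batches = split_into_batches(date(2024, 1, 1), date(2024, 1, 30), batch_size=10)
for number, batch in enumerate(batches, start=1):
    # Days inside a batch are still processed one at a time.
    print(f"batch {number}: {batch[0]} to {batch[-1]} ({len(batch)} days)")
```

With the 30-day range from the example, this yields 3 batches of 10 days each.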
Configuration Parameters¶
Working Parameters¶
| Parameter | Description | Example Values |
|---|---|---|
| dataset_id | Dataset configuration to use | "retail_oltp_small" |
| start_date | Start date for generation | "2024-01-01" |
| end_date | End date for generation | "2024-01-30" |
| batch_size | Days per processing batch | 1, 7, 30 |
| path_format | File organization structure | "flat", "nested" |
| output_mode | Output format | "csv", "parquet", "table" |
| seed_value | Random seed for reproducibility | 12345 |
| enable_seasonal_patterns | Apply day-of-week variations | true, false |
Currently Non-Functional Parameters¶
These parameters exist in the configuration but are not yet implemented:
- parallel_workers - No parallel processing implemented
- chunk_size - Not used in generation
- enable_data_drift / drift_percentage - Data drift not implemented
- ignore_state - State is not persisted between runs anyway
Date Filtering¶
Important: The system generates new data for each date with the correct date values embedded. It does NOT filter pre-existing data. Each generated record will have its date columns set to the generation date.
File Organization¶
Nested Format¶
synthetic_data/parquet/series/retail_oltp_small/job_name/nested/
└── orders/
    └── 2024/
        └── 01/
            ├── 01/data.parquet  # January 1st
            ├── 02/data.parquet  # January 2nd
            └── 03/data.parquet  # January 3rd
Flat Format¶
synthetic_data/csv/series/retail_oltp_small/job_name/flat/
└── orders/
    ├── orders_20240101.csv
    ├── orders_20240102.csv
    └── orders_20240103.csv
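The two layouts above can be reproduced with a small path builder. This is a sketch only: `build_output_path` is a hypothetical name, and the file naming rules (`data.<ext>` for nested, `<table>_<YYYYMMDD>.<ext>` for flat) are inferred from the directory listings shown here rather than taken from the package source.

```python
from datetime import date
from pathlib import PurePosixPath


def build_output_path(base_dir, table, day, path_format, extension):
    """Return the output file path for one table and one generation day.

    Hypothetical helper mirroring the nested and flat listings above.
    """
    base = PurePosixPath(base_dir)
    if path_format == "nested":
        # <table>/<YYYY>/<MM>/<DD>/data.<ext>
        return base / table / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / f"data.{extension}"
    if path_format == "flat":
        # <table>/<table>_<YYYYMMDD>.<ext>
        return base / table / f"{table}_{day:%Y%m%d}.{extension}"
    raise ValueError(f"unsupported path_format: {path_format!r}")


nested = build_output_path(
    "synthetic_data/parquet/series/retail_oltp_small/job_name/nested",
    "orders", date(2024, 1, 2), "nested", "parquet",
)
flat = build_output_path(
    "synthetic_data/csv/series/retail_oltp_small/job_name/flat",
    "orders", date(2024, 1, 2), "flat", "csv",
)
print(nested)
print(flat)
```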
Table Types¶
Incremental Tables¶
- Generate new records for each day
- Examples: orders, order_items, transactions
- Date columns are set to the generation date
Snapshot Tables¶
- Generate full table based on frequency
- Frequencies: daily, weekly, monthly
- Examples: customers, products (reference data)
Example Configuration¶
job_configs = {
    "job_name": "test_feb_2024",
    "dataset_id": "retail_oltp_small",
    "start_date": "2024-02-01",
    "end_date": "2024-02-07",
    "batch_size": 1,  # Process day by day
    "path_format": "flat",
    "output_mode": "csv",
    "seed_value": 99999,
    "enable_seasonal_patterns": True
}
Usage¶
The incremental generation is currently available through:
1. Notebook Templates: Use the generated notebooks in Fabric
2. Python Scripts: Run the notebook Python files locally
Note: CLI commands for synthetic data generation are not yet implemented.
Seasonal Patterns¶
When enable_seasonal_patterns is enabled, the system applies day-of-week multipliers to vary the data volume:
seasonal_multipliers = {
    "monday": 0.8,
    "tuesday": 0.9,
    "wednesday": 1.0,
    "thursday": 1.1,
    "friday": 1.3,
    "saturday": 1.2,
    "sunday": 0.7
}
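As an illustration of how such multipliers could translate into daily volumes, the sketch below scales a base row count by the multiplier for the generation date's weekday. The function name `rows_for_day`, the `base_rows` parameter, and the rounding to the nearest integer are assumptions made for this example; the document does not specify how the system applies the multiplier internally.

```python
from datetime import date

# Multiplier values copied from the table above.
SEASONAL_MULTIPLIERS = {
    "monday": 0.8, "tuesday": 0.9, "wednesday": 1.0, "thursday": 1.1,
    "friday": 1.3, "saturday": 1.2, "sunday": 0.7,
}
# date.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
                 "friday", "saturday", "sunday"]


def rows_for_day(base_rows, day, enable_seasonal_patterns=True):
    """Scale a base row count by the day-of-week multiplier (assumed behavior)."""
    if not enable_seasonal_patterns:
        return base_rows
    multiplier = SEASONAL_MULTIPLIERS[WEEKDAY_NAMES[day.weekday()]]
    return round(base_rows * multiplier)


print(rows_for_day(1000, date(2024, 2, 2)))  # a Friday
print(rows_for_day(1000, date(2024, 2, 4)))  # a Sunday
```

Under these assumptions, a base of 1000 rows becomes 1300 on a Friday and 700 on a Sunday.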
Date Column Mapping¶
Each table configuration includes date column definitions:
- date_columns: List of all date fields in the table
- primary_date_column: The main date field that gets set to the generation date

Example for orders table:
- date_columns: ["order_date", "shipped_date", "delivered_date"]
- primary_date_column: "order_date"
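A minimal sketch of the mapping for the orders table might look like the following. The helper `stamp_generation_date`, the dictionary-based record shape, and storing dates as ISO strings are all assumptions for illustration. The sketch only sets the primary date column and leaves the other date columns untouched, since the document does not say how those are derived.

```python
from datetime import date

# Date column definitions from the orders example above.
orders_config = {
    "date_columns": ["order_date", "shipped_date", "delivered_date"],
    "primary_date_column": "order_date",
}


def stamp_generation_date(records, table_config, generation_date):
    """Return copies of the records with the primary date column set.

    Hypothetical helper: other date columns are left as they are.
    """
    primary = table_config["primary_date_column"]
    if primary not in table_config["date_columns"]:
        raise ValueError(f"{primary!r} is not listed in date_columns")
    stamped_value = generation_date.isoformat()
    return [{**record, primary: stamped_value} for record in records]


raw_orders = [
    {"order_id": 1, "order_date": None, "shipped_date": None},
    {"order_id": 2, "order_date": None, "shipped_date": None},
]
stamped = stamp_generation_date(raw_orders, orders_config, date(2024, 2, 1))
print(stamped[0]["order_date"])
```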
Current Limitations¶
- No State Persistence: Each run is independent; no state carried between runs
- No Growth/Churn Simulation: Table sizes remain constant
- No Holiday Detection: Holiday multipliers not implemented
- Sequential Processing: No parallel processing of batches
- No CLI Commands: Must use notebooks or Python scripts directly
Best Practices¶
- Small Batch Sizes (1-7 days): Better for testing and debugging
- Larger Batch Sizes (30+ days): More efficient for generating long date ranges
- Use Seeds: Set seed_value for reproducible test data
- Verify Dates: Always check that generated data has correct date values
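To show why a fixed seed gives reproducible test data, the sketch below draws values from a seeded generator twice and compares the results. It uses Python's standard `random.Random`; whether the package seeds its generators this way is not stated in this document, and `generate_amounts` is a hypothetical stand-in for a real column generator.

```python
import random


def generate_amounts(seed_value, count):
    """Draw `count` pseudo-random order amounts from a seeded generator."""
    rng = random.Random(seed_value)  # private generator; global state untouched
    return [round(rng.uniform(5.0, 500.0), 2) for _ in range(count)]


first_run = generate_amounts(12345, 5)
second_run = generate_amounts(12345, 5)
# The same seed_value produces the same sequence on every run.
print(first_run == second_run)
```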
Future Enhancements¶
Planned improvements include:
- CLI command support
- State persistence for consistent IDs across runs
- Growth and churn rate simulation
- Holiday detection and multipliers
- Parallel batch processing
- Data drift simulation