Skip to content

Extract Generation - Synthetic Data Alignment

Home > Packages > Extract Generation - Synthetic Data Alignment

Related: Extract Generation | Synthetic Data Generation

This document describes how the Extract Generation package has been aligned to work with the Ingenious Fabric synthetic data tables.

Overview

The Extract Generation package sample configurations have been updated to work seamlessly with the synthetic data tables generated by the synthetic_data_generation package. This alignment ensures that extract samples work out-of-the-box without requiring modifications to the synthetic data generation process.

Alignment Approach

1. Compatibility Views

Created database views that map synthetic data tables to the schema expected by the original extract samples:

  • dbo.v_customers_mapped - Maps the synthetic customers table to the expected format
  • Combines first_name and last_name into customer_name
  • Maps address_line1 to address
  • Maps registration_date to created_date

  • dbo.v_transactions - Creates a transaction-like view from orders and order_items

  • Uses order_item_id as transaction_id
  • Joins orders and order_items to simulate transaction data
  • Maintains compatibility with transaction-based extracts

  • reporting.v_sales_summary - Sales analytics view based on orders data

  • Aggregates order data by year, month, and country
  • Uses order status to filter completed transactions

  • reporting.v_product_sales_summary - Product performance analytics

  • Aggregates sales by product category and brand
  • Provides units sold and revenue metrics

2. New Sample Configurations

Added extract configurations specifically designed for synthetic data tables:

Extract Name Source Description
SAMPLE_CUSTOMERS_DAILY customers table Full customer data export
SAMPLE_PRODUCTS_CATALOG products table Product catalog in Parquet format
SAMPLE_ORDERS_EXPORT orders table Daily orders with GZIP compression
SAMPLE_ORDER_ITEMS_EXPORT order_items table Order items with file splitting
SAMPLE_CUSTOMER_SEGMENTS sp_generate_customer_segment_report Customer analytics report
SAMPLE_PRODUCT_SALES v_product_sales_summary Product sales analysis

3. Stored Procedures

Created stored procedures that work with synthetic data:

  • finance.sp_generate_financial_report - Quarterly financial summary
  • dbo.sp_validate_orders_extract - Order data validation
  • reporting.sp_generate_customer_segment_report - Customer segmentation analysis

File Structure

extract_generation/
├── ddl_scripts/
│   └── warehouse/
│       ├── 009_sample_data_insert.sql              # Original samples (legacy)
│       ├── 009_sample_data_insert_aligned.sql     # Synthetic data aligned samples
│       └── 010_synthetic_data_alignment_views.sql # Compatibility views

Usage

1. Deploy the Alignment Views

First, execute the alignment views to create the compatibility layer:

-- Run in your Fabric warehouse
:r 010_synthetic_data_alignment_views.sql

2. Load Sample Configurations

Use the aligned sample configurations:

-- Load samples that work with synthetic data
:r 009_sample_data_insert_aligned.sql

3. Run Extracts

The extracts will now work seamlessly with synthetic data:

# Example: Extract customers data
extract_name = "SAMPLE_CUSTOMERS_DAILY"
# Will extract from the actual customers table with all synthetic data columns

Benefits

  1. No Changes to Synthetic Data - Preserves the synthetic data generation as-is
  2. Backward Compatibility - Original extract samples still available
  3. Flexibility - Can extract either raw synthetic data or mapped views
  4. Real-World Testing - Extract configurations work with realistic data volumes

Example Extracts

Customer Data Export

-- Uses actual customers table
SELECT * FROM [dbo].[customers]
-- Output: All synthetic data columns including first_name, last_name, etc.

Legacy Format Export

-- Uses compatibility view
SELECT * FROM [dbo].[v_customers_mapped]
-- Output: customer_name (combined), address (mapped), etc.

Sales Analytics

-- Uses orders and customers data
SELECT * FROM [reporting].[v_sales_summary]
-- Output: Aggregated sales metrics by period and country

Best Practices

  1. Use Raw Tables - When possible, configure extracts to use raw synthetic data tables
  2. Use Views for Compatibility - Only use compatibility views when required by downstream systems
  3. Test with Real Volumes - Synthetic data provides realistic data volumes for performance testing
  4. Validate Extracts - Use the validation procedures to ensure data quality

Migration Guide

If you have existing extract configurations expecting the old schema:

  1. Update to use compatibility views (e.g., v_customers_mapped instead of customers)
  2. Or update your downstream processes to handle the new schema
  3. Consider using the output_format and column mapping features for transformation

Future Enhancements

  • Additional compatibility views for other synthetic data scenarios
  • Column-level mapping configuration in extract details
  • Automatic schema detection and mapping
  • Integration with data quality validation framework