Generate production-grade synthetic data at petabyte-scale using Apache Spark and Faker on Amazon EMR
Performance testing of big data analytics tools and engines at petabyte scale is increasingly challenging. Traditional sample test datasets may not reflect production-grade data, so they may not show how processing and query engines perform in the real world. We discuss a solution that lets users generate synthetic data in Apache Iceberg table format on Amazon S3 that reflects the actual production dataset.
Performance testing of big data applications requires representative datasets that mirror production workloads. We are seeing an increasing number of use cases in which users want to evaluate query engines and data processing solutions against a dataset that is as close to production application data as possible. However, not all industry verticals have the flexibility to use actual production data to test performance at the scale of tens or hundreds of petabytes. For industries such as cybersecurity, financial services, and healthcare and life sciences, using synthetic data is even more important.
This article discusses the process of generating production-grade synthetic data at terabyte and petabyte scale, which was carried out as part of a customer proof of concept (PoC). The solution uses Apache Spark on Amazon EMR with Faker, a Python library designed for synthetic data generation. This approach scales to create high-quality synthetic test data for various use cases.
The challenge of synthetic data generation
TPC-DS and similar benchmark datasets serve as valuable tools for database and SQL engine performance evaluation. These benchmarks offer standardized schemas and predetermined data volumes that enable consistent testing environments across different systems. They also provide well-defined query patterns and establish uniform performance metrics that facilitate straightforward comparisons between database solutions.
However, standard benchmarks often fall short in several key areas. They typically lack industry-specific data patterns, miss complex real-world relationships, and oversimplify data distributions. Their rigid schemas also fail to match schema representations that organizations frequently implement to meet their specific business requirements. These limitations significantly reduce their ability to accurately predict real-world performance across diverse business verticals.
Requirements for production-grade synthetic data
To effectively validate workload performance, synthetic data must meet several critical requirements. First, it should accurately mirror production data distributions to ensure realistic testing scenarios and reliable performance metrics. The data must maintain referential integrity across all related tables and entities, preserving the complex relationships found in actual production environments. Additionally, the synthetic data generation process should be capable of horizontal scaling to accommodate growing data volumes and performance demands. Furthermore, the data generation process should deliver consistent results across multiple runs, enabling reproducible testing and validation cycles. Finally, the system should provide deterministic output, ensuring that given the same input parameters, it will generate identical datasets, which is crucial for debugging and comparative analysis. These requirements collectively ensure that synthetic data can effectively simulate real-world scenarios and provide meaningful insights for performance testing and validation.
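One way to meet the determinism and referential-integrity requirements is sketched below. This is a minimal illustration under stated assumptions, not the PoC's actual code: the names `BASE_SEED`, `NUM_CUSTOMERS`, `row_rng`, and `generate_orders` are hypothetical. Each value is derived from a run-level seed, the column name, and the row ID, so the same inputs give the same dataset no matter how the rows are split across workers. Foreign keys are drawn only from the parent key range, so they always resolve.

```python
import hashlib
import random

BASE_SEED = 42          # hypothetical run-level seed
NUM_CUSTOMERS = 1_000   # hypothetical size of the parent (customer) key space


def row_rng(base_seed, column, row_id):
    """Return an RNG whose state depends only on (seed, column, row ID).

    hashlib is used instead of the built-in hash() because string hashes
    are randomized per Python process and would break reproducibility.
    """
    key = f"{base_seed}:{column}:{row_id}".encode()
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def customer_id_for(row_id):
    """Foreign key drawn from the parent key space, so it always resolves."""
    return row_rng(BASE_SEED, "customer_id", row_id).randrange(NUM_CUSTOMERS)


def amount_for(row_id):
    """Deterministic order amount within a bounded range."""
    return round(row_rng(BASE_SEED, "amount", row_id).uniform(1.0, 500.0), 2)


def generate_orders(start_id, end_id):
    """Generate child rows for the half-open ID range [start_id, end_id)."""
    return [
        {"order_id": i, "customer_id": customer_id_for(i), "amount": amount_for(i)}
        for i in range(start_id, end_id)
    ]


# Generating the range in one piece or in two pieces yields identical rows.
whole = generate_orders(0, 100)
pieces = generate_orders(0, 50) + generate_orders(50, 100)
print(whole == pieces)
```

In a Spark job, functions like these could be called from inside a pandas UDF keyed on the `id` column produced by `spark.range`. Faker also offers `Faker.seed` and `seed_instance` for seeding its own generators.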
Solution overview
The solution implements a synthetic data generation framework built on Apache Spark running on Amazon EMR, integrated with the Faker Python library, to create synthetic data that reflects a real production dataset at scale. Users define the table's columns, each with the needed cardinality and value range, by supplying a function that generates the value for that column. This allows customization based on the use case, value ranges, relationships, and data characteristics, so that the synthetic data can be shaped to maintain the same statistical properties as production data. For this customer’s PoC evaluation, we created a table with close to 100 columns. Users can also define the schema, which establishes the structural framework for data generation. At the core of the architecture lies the data generation logic, which uses Apache Spark's distributed processing capabilities and the Faker library's extensive synthetic data generation functions. This engine creates realistic data at the scale of hundreds of terabytes and even petabytes. Finally, the framework writes the generated data to Amazon S3 in Apache Iceberg table format.
Together, these components ensure that the generated synthetic data is consistent, high quality, and closely mirrors production data.
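The idea of one generator function per column, each with its own cardinality and value range, can be illustrated with the following standard-library sketch. Everything in it is an assumption made for illustration: the `COLUMN_SPEC` mapping, the column names, and the weights are hypothetical, and they are not the schema used in the PoC. The weights stand in for a skewed, production-like distribution.

```python
import random
from collections import Counter

# Hypothetical categorical column with an assumed production-like skew.
ORG_VALUES = ["sales", "finance", "it", "hr"]
ORG_WEIGHTS = [0.5, 0.2, 0.2, 0.1]

# Hypothetical column specification: column name -> function(rng) -> value.
COLUMN_SPEC = {
    # Low-cardinality column with a weighted distribution.
    "org_name": lambda rng: rng.choices(ORG_VALUES, weights=ORG_WEIGHTS, k=1)[0],
    # Numeric column constrained to a value range.
    "port": lambda rng: rng.randint(1024, 65535),
    # Identifier column capped at 10,000 distinct values.
    "device_id": lambda rng: f"dev-{rng.randrange(10_000):05d}",
}


def generate_rows(n, seed=0):
    """Generate n rows by calling each column's generator once per row."""
    rng = random.Random(seed)
    return [
        {name: gen(rng) for name, gen in COLUMN_SPEC.items()}
        for _ in range(n)
    ]


rows = generate_rows(10_000)
org_counts = Counter(row["org_name"] for row in rows)
print(org_counts.most_common())
```

Adding a column is then a matter of adding one entry to the mapping, which is one way to keep a table of around 100 columns manageable.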
Implementation overview
We plan to publish the solution code along with the customer PoC execution story in the future. In the meantime, the following are some of the important pieces of code that we implemented:
import random
import datetime, calendar
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd
from pyspark.sql.types import *
from faker import Faker
import argparse
.
.
args = parser.parse_args()
NUM_MESSAGES_PER_DAY = args.messages_per_day
LOOPS = args.loop
WAREHOUSE = args.warehouse_path
CATALOG = args.catalog_name
DATABASE = args.database_name
TABLE = args.table_name
spark_builder = (
    SparkSession.builder.config("spark.sql.codegen.wholeStage", "false")
    .appName("data-gen")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{CATALOG}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{CATALOG}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config(f"spark.sql.catalog.{CATALOG}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config(f"spark.sql.catalog.{CATALOG}.warehouse", WAREHOUSE)
    .config(f"spark.sql.catalog.{CATALOG}.lock.table", "myIcebergLockTable")
)
spark = spark_builder.getOrCreate()
fake = Faker()
def optimized_random_choice(pdf, values):
    return pd.Series(random.choices(values, k=len(pdf)))
org_values = ["sales", "finance", "it", "hr", "marketing", "customer_service", "n/a"]
get_random_org = F.pandas_udf(lambda x: optimized_random_choice(x, org_values), StringType())
get_random_ip = F.pandas_udf(lambda x: pd.Series([fake.ipv4() for _ in range(len(x))]), StringType())
get_random_public_ip = F.pandas_udf(lambda x: pd.Series([fake.ipv4_public() for _ in range(len(x))]), StringType())
get_random_private_ip = F.pandas_udf(lambda x: pd.Series([fake.ipv4_private() for _ in range(len(x))]), StringType())
.
.
def data_generator(row_count, date):
    df = spark.range(0, row_count)\
        .withColumn("date", F.unix_timestamp(F.lit(date),'yyyy-MM-dd').cast("timestamp"))\
        .withColumn("org_name", get_random_org(F.col("id")))\
        .withColumn("device_ip", get_random_ip(F.col("id")))\
        .withColumn("destination_ip", get_random_public_ip(F.col("id")))\
        .drop("id")
    return df
.
.
df = data_generator(daily_row_count, date_to_generate)
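# Note: partitionedBy applies when a table is created or replaced; append() writes using the existing table's partitioning.
df.writeTo(f"{CATALOG}.{DATABASE}.{TABLE}").partitionedBy("date").append()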
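The argument parsing and the per-day loop are elided in the code above. The following is a hypothetical sketch of how that driver scaffolding might look, and it is not the code used in the PoC. The flag names are inferred from the `args.` attributes read above. The helper `dates_to_generate` and the idea that each loop iteration generates one calendar day are assumptions.

```python
import argparse
import datetime


def build_parser():
    """Hypothetical CLI; flag names are inferred from the attributes read above."""
    parser = argparse.ArgumentParser(description="Synthetic data generator (sketch)")
    parser.add_argument("--messages-per-day", type=int, required=True)
    parser.add_argument("--loop", type=int, default=1)
    parser.add_argument("--warehouse-path", required=True)
    parser.add_argument("--catalog-name", required=True)
    parser.add_argument("--database-name", required=True)
    parser.add_argument("--table-name", required=True)
    return parser


def dates_to_generate(start_date, loops):
    """Assumption: each loop iteration generates one calendar day of data."""
    return [
        (start_date + datetime.timedelta(days=i)).isoformat()
        for i in range(loops)
    ]


# Example invocation with placeholder values.
example_args = build_parser().parse_args([
    "--messages-per-day", "1000000",
    "--loop", "3",
    "--warehouse-path", "s3://example-bucket/warehouse",
    "--catalog-name", "glue_catalog",
    "--database-name", "synthetic_db",
    "--table-name", "events",
])

for day in dates_to_generate(datetime.date(2024, 1, 30), example_args.loop):
    # In the real job, this is where data_generator(...) and the Iceberg
    # append would run for the given day.
    print(day, example_args.messages_per_day)
```

Writing one day per iteration keeps each append aligned with a single `date` partition. This is a design choice of the sketch and is not confirmed by the source.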
Benefits over traditional benchmark datasets
This solution offers significant advantages over conventional benchmark datasets, providing a robust and versatile approach to data generation and management.
- Customizable Data Patterns: The solution enables control over data generation patterns, allowing organizations to create datasets that accurately reflect their specific industry needs. Users can define and adjust data distributions to match real-world scenarios, ensuring the generated data closely mirrors actual business conditions. Furthermore, the system supports the implementation of complex business rules and constraints, making the synthetic data more relevant and applicable to specific use cases.
- Scalability: The architecture uses Apache Spark and Amazon EMR’s distributed computing capabilities, enabling horizontal scaling across multiple nodes for enhanced performance. The solution's efficient memory management ensures optimal resource utilization, even when generating large volumes of data. By supporting parallel data generation processes, the system can handle increasing workloads while maintaining consistent performance levels.
- Data Quality: Data integrity is maintained throughout the generation process, ensuring all relationships between different data elements remain consistent and valid. The solution provides control over data distributions, allowing users to maintain realistic patterns while avoiding common data quality issues.
- Flexibility: The solution offers remarkable adaptability in terms of data schema modifications, allowing users to easily adjust to changing business requirements. It supports multiple data output formats, partitioning, and bucketing, making it compatible with various downstream systems and applications. Users can fine-tune generation parameters to meet specific needs, providing granular control over the entire data generation process.
The solution delivers reliable, scalable, and customizable synthetic data generation, ideal for scale testing, development, and analytics benchmarking.
Conclusion
While traditional benchmark datasets serve their purpose, generating production-grade synthetic data requires a more sophisticated approach. The solution described here, using Apache Spark on Amazon EMR and Faker, provides a scalable and flexible framework for creating realistic test data that truly represents production workloads.
The combination of distributed processing capabilities from Spark and the versatile data generation features of Faker enables organizations to create high-quality synthetic data that maintains the characteristics and relationships of their production data while supporting the volume requirements of performance testing.