Hi, thanks for writing to re:Post.
I understand that you want help running TPC-DS benchmarks on EMR Serverless. The steps listed below should help you run the benchmark.
The steps generate 300 GB of data and then run the TPC-DS test. Source code: https://github.com/aws-samples/emr-on-eks-benchmark
- Pull the public Spark benchmark Docker image and open a shell inside the container
docker pull ghcr.io/aws-samples/eks-spark-benchmark:3.1.2
docker run -it --user root ghcr.io/aws-samples/eks-spark-benchmark:3.1.2 bash
- Install AWS CLI version 2 inside the container
Run the following commands:
apt update
apt install curl unzip -y
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install
Run 'aws --version' to verify that the AWS CLI was installed successfully.
- Configure the AWS CLI
Run 'aws configure'.
=> The following example shows sample values. Replace them with your own values.
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
- Prepare tools.tar.gz for the dependency
cd /opt/tpcds-kit/
chown -R hadoop:hadoop tools
tar -czvf tools.tar.gz tools
- Copy the following files to your S3 bucket using the 'aws s3 cp' command
aws s3 cp /opt/tpcds-kit/tools.tar.gz s3://<your-bucket>/
aws s3 cp /opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar s3://<your-bucket>/
The uploaded JAR and tools archive in S3 are referenced when starting the benchmark jobs in the next steps.
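Before submitting any job, it can help to confirm that both files actually landed in the bucket. The following is only a minimal sketch: 'check_artifacts' is a hypothetical helper name, and it assumes the archive is stored as tools.tar.gz, which is the name the job commands reference with --archives.

```shell
# Hypothetical helper: succeed only if both files the jobs depend on are in S3.
# Usage: check_artifacts <bucket>
check_artifacts() {
  bucket="$1"
  missing=0
  for key in tools.tar.gz eks-spark-benchmark-assembly-1.0.jar; do
    # "aws s3 ls" exits non-zero when nothing matches the given path.
    # Note that it matches by prefix, so this is a sanity check, not proof.
    if aws s3 ls "s3://${bucket}/${key}" >/dev/null 2>&1; then
      echo "found:   s3://${bucket}/${key}"
    else
      echo "missing: s3://${bucket}/${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling 'check_artifacts <your-bucket>' prints one line per file and returns a non-zero status if either one is missing.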
- Generate the TPC-DS benchmark data and write it to S3.
aws emr-serverless start-job-run \
  --application-id "<application id>" \
  --execution-role-arn "<ARN of emr-serverless-job-role>" \
  --region <region> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar",
      "entryPointArguments": [
        "s3://<your-bucket>/input-data/",
        "/home/hadoop/environment/tools", "parquet", "300", "200", "true", "true", "true"],
      "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.DataGeneration --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=27"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://<your-bucket>/logs/"
      }
    }
  }'
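Note that start-job-run returns as soon as the job is submitted, and the benchmark step reads the data this job writes, so you may want to wait for the run to finish first. A rough sketch of polling the job state follows; 'wait_for_job' is a hypothetical helper name, and the job run ID is the jobRunId value printed in the start-job-run output.

```shell
# Hypothetical helper: poll a job run until it reaches a terminal state.
# Usage: wait_for_job <application-id> <job-run-id> [poll-seconds]
wait_for_job() {
  app_id="$1"
  run_id="$2"
  interval="${3:-30}"
  while true; do
    # Ask EMR Serverless for the current state of the run as plain text.
    state="$(aws emr-serverless get-job-run \
      --application-id "$app_id" \
      --job-run-id "$run_id" \
      --query 'jobRun.state' --output text)" || return 2
    echo "job run ${run_id}: ${state}"
    case "$state" in
      SUCCESS) return 0 ;;
      FAILED|CANCELLED) return 1 ;;
    esac
    # Any other state (for example PENDING or RUNNING) means keep waiting.
    sleep "$interval"
  done
}
```

Calling 'wait_for_job <application id> <job run id>' returns 0 on SUCCESS, 1 on FAILED or CANCELLED, and 2 if the CLI call itself fails.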
- Run the TPC-DS benchmark test against the data generated in S3
aws emr-serverless start-job-run \
  --application-id "<application id>" \
  --execution-role-arn "<ARN of emr-serverless-job-role>" \
  --region <region> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar",
      "entryPointArguments": [
        "s3://<your-bucket>/input-data/",
        "s3://<your-bucket>/output/",
        "/home/hadoop/environment/tools", "parquet", "300", "3", "false", "q1-v2.4", "true"],
      "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=47"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://<your-bucket>/logs/"
      }
    }
  }'
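If editing every <your-bucket> placeholder by hand feels error-prone, one option is to generate the --job-driver JSON from shell variables. This is only a sketch: 'benchmark_job_driver' is a hypothetical helper name, and it assumes the "300" argument is the data size in GB mentioned at the top of this answer. All other arguments are copied unchanged from the benchmark command above.

```shell
# Hypothetical helper: print the --job-driver JSON for the benchmark run.
# Usage: benchmark_job_driver <bucket> <scale-gb>
benchmark_job_driver() {
  bucket="$1"
  scale_gb="$2"
  # The heredoc expands ${bucket} and ${scale_gb}; everything else is literal.
  cat <<EOF
{
  "sparkSubmit": {
    "entryPoint": "s3://${bucket}/eks-spark-benchmark-assembly-1.0.jar",
    "entryPointArguments": [
      "s3://${bucket}/input-data/",
      "s3://${bucket}/output/",
      "/home/hadoop/environment/tools",
      "parquet", "${scale_gb}", "3", "false", "q1-v2.4", "true"
    ],
    "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --archives s3://${bucket}/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=47"
  }
}
EOF
}
```

The output can then be passed to the CLI as --job-driver "$(benchmark_job_driver <your-bucket> 300)" in place of the inline JSON.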
In the steps above, update the S3 locations and the region to match your environment. Also make sure that the EMR Serverless job execution role has the required permissions on the S3 bucket to read the dependency files and write data.
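As a starting point for those permissions, the sketch below prints an S3 policy document scoped to a single bucket. The action list is an assumption about what reading the dependencies and writing data and logs typically needs, and 'job_role_s3_policy' is a hypothetical helper name, so please adjust both to your own security requirements.

```shell
# Hypothetical helper: print an S3 policy document for one bucket.
# Usage: job_role_s3_policy <bucket>
job_role_s3_policy() {
  bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}
```

You could save the output with 'job_role_s3_policy <your-bucket> > policy.json' and attach it to the job execution role with 'aws iam put-role-policy --role-name <role name> --policy-name <policy name> --policy-document file://policy.json'.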
I hope you find this information helpful.
Thank you and have a good rest of your day!