emr serverless custom image


Hi, We are comparing the EMR on EKS and EMR Serverless offerings. We are trying to run the TPC-DS performance benchmark from https://github.com/aws-samples/emr-on-eks-benchmark. We are not able to do it on EMR Serverless because some prerequisites live at the OS level, such as /opt/tpcds-kit/tools (see https://github.com/aws-samples/emr-on-eks-benchmark/blob/main/docker/benchmark-util/Dockerfile).

Is there any way to install OS-level binaries/prerequisites on Serverless workers to achieve this use case?

BR,

Abdel

asked 2 years ago · 812 views
1 Answer

Hi, Thanks for writing to re:Post.

I understand that you want help running the TPC-DS benchmark on EMR Serverless. The steps below should assist you in running the benchmark!

Below are the steps to generate 300 GB of data and run the TPC-DS test. Source code: https://github.com/aws-samples/emr-on-eks-benchmark

    • Pull the public Spark benchmark Docker image and open a shell inside the container

docker pull ghcr.io/aws-samples/eks-spark-benchmark:3.1.2
docker run -it --user root ghcr.io/aws-samples/eks-spark-benchmark:3.1.2 bash

    • Install AWS CLI version 2 inside the container

Run following commands:

apt update
apt install curl unzip -y
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install

Run 'aws --version' to verify that the AWS CLI installed successfully.

    • Configure AWS CLI

Run 'aws configure'

=> The following example shows sample values. Replace them with your own values.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

[+] https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config

    • Prepare tools.tar.gz for the tpcds-kit dependency

cd /opt/tpcds-kit/
chown -R hadoop:hadoop tools
tar -czvf tools.tar.gz tools

    • Copy the following files to your S3 bucket using the 'aws s3 cp' command

aws s3 cp /opt/tpcds-kit/tools.tar.gz s3://<your-bucket>/
aws s3 cp /opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar s3://<your-bucket>/

The JAR and tool-kit tarball uploaded to S3 will be referenced when starting the benchmark jobs in the next steps.

    • Generate the TPC-DS benchmark data and write it to S3

aws emr-serverless start-job-run \
--application-id "<application id>" \
--execution-role-arn "<ARN of emr-serverless-job-role>" \
--region <region> \
--job-driver '{ "sparkSubmit": { "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar", "entryPointArguments": [ "s3://<your-bucket>/input-data/", "/home/hadoop/environment/tools", "parquet", "300", "200", "true", "true", "true"], "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.DataGeneration --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=27" } }' \
--configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://<your-bucket>/logs/" } } }'

    • Run the TPC-DS benchmark test with the data generated in S3

aws emr-serverless start-job-run \
--application-id "<application id>" \
--execution-role-arn "<ARN of emr-serverless-job-role>" \
--region <region> \
--job-driver '{ "sparkSubmit": { "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar", "entryPointArguments": [ "s3://<your-bucket>/input-data/", "s3://<your-bucket>/output/", "/home/hadoop/environment/tools", "parquet", "300", "3", "false", "q1-v2.4", "true"], "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=47" } }' \
--configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://<your-bucket>/logs/" } } }'

In the steps above, please update the S3 locations and region for your environment. Please ensure the EMR Serverless job execution role has the required permissions on the S3 buckets to read the dependency files and write data.
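As a rough sketch of the S3 permissions mentioned above (not a complete policy; scope it to your environment and add any other permissions your role needs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<your-bucket>", "arn:aws:s3:::<your-bucket>/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::<your-bucket>/*"]
    }
  ]
}
```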

I hope you find this information helpful.

Thank you and have a good rest of your day!

AWS SUPPORT ENGINEER · answered 2 years ago
AWS EXPERT · reviewed 2 years ago
