emr serverless custom image


Hi, we are comparing the EMR on EKS and EMR Serverless offerings. We are trying to run the TPC-DS performance benchmark from https://github.com/aws-samples/emr-on-eks-benchmark. We are not able to do it on EMR Serverless because some prerequisites live at the OS level, such as /opt/tpcds-kit/tools (see https://github.com/aws-samples/emr-on-eks-benchmark/blob/main/docker/benchmark-util/Dockerfile).

Is there any way to install OS-level binaries or other prerequisites on Serverless workers to achieve this use case?

BR,

Abdel

Asked 2 years ago · 825 views
1 Answer

Hi, Thanks for writing to re:Post.

I understand that you want help running TPC-DS benchmarks on EMR Serverless. The steps listed below should help you run the benchmark.

Below are the steps to generate 300 GB of data and run the TPC-DS test. Source code: https://github.com/aws-samples/emr-on-eks-benchmark

    • Pull the public Spark benchmark Docker image and open a shell inside a container

docker pull ghcr.io/aws-samples/eks-spark-benchmark:3.1.2
docker run -it --user root ghcr.io/aws-samples/eks-spark-benchmark:3.1.2 bash

    • Install AWS CLI version 2 inside the container

Run following commands:

apt update
apt install curl unzip -y
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install

Run 'aws --version' to verify that the AWS CLI installed successfully.

    • Configure AWS CLI

Run 'aws configure'

=> The following example shows sample values. Replace them with your own values.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

[+] https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config

    • Prepare tools.tar.gz as a dependency archive

cd /opt/tpcds-kit/
chown -R hadoop:hadoop tools
tar -czvf tools.tar.gz tools

    • Copy the following files to your S3 location using the 'aws s3 cp' command

aws s3 cp /opt/tpcds-kit/tools.tar.gz s3://<your-bucket>/
aws s3 cp /opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar s3://<your-bucket>/

The jar and tool-kit tarball uploaded to S3 will be referenced when starting the benchmark jobs in the next steps.

    • Generate the TPC-DS benchmark data and upload it to S3

aws emr-serverless start-job-run \
    --application-id "<application-id>" \
    --execution-role-arn "<ARN of emr-serverless-job-role>" \
    --region <region> \
    --job-driver '{ "sparkSubmit": { "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar", "entryPointArguments": [ "s3://<your-bucket>/input-data/", "/home/hadoop/environment/tools", "parquet", "300", "200", "true", "true", "true" ], "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.DataGeneration --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=27" } }' \
    --configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://<your-bucket>/logs/" } } }'
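After submitting, you can check on the job before moving to the benchmark step. A sketch using 'aws emr-serverless get-job-run' (the application ID and the job-run ID returned by 'start-job-run' are placeholders here):

```
# Poll the state of the submitted job run; wait until it reaches SUCCESS
# before starting the benchmark job that reads the generated data.
aws emr-serverless get-job-run \
    --application-id "<application-id>" \
    --job-run-id "<job-run-id>" \
    --region <region> \
    --query 'jobRun.state'
```

The '--query' filter is optional; without it the full job-run description is returned.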

    • Run the TPC-DS benchmark test against the data generated in S3

aws emr-serverless start-job-run \
    --application-id "<application-id>" \
    --execution-role-arn "<ARN of emr-serverless-job-role>" \
    --region <region> \
    --job-driver '{ "sparkSubmit": { "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar", "entryPointArguments": [ "s3://<your-bucket>/input-data/", "s3://<your-bucket>/output/", "/home/hadoop/environment/tools", "parquet", "300", "3", "false", "q1-v2.4", "true" ], "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=47" } }' \
    --configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://<your-bucket>/logs/" } } }'

In the above steps, please update the S3 locations and region to match your environment. Also ensure that the EMR Serverless job execution role has the required permissions on the S3 buckets to read the dependency files and write data.
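For reference, a minimal sketch of the S3 permissions the execution role needs for these jobs might look like the following policy fragment (the bucket name is a placeholder; scope the resources down to your own bucket and prefixes):

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<your-bucket>",
        "arn:aws:s3:::<your-bucket>/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::<your-bucket>/*"]
    }
  ]
}
```

Read access covers the jar, tools.tar.gz, and input data; write access covers the generated data, benchmark output, and job logs.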

I hope you find this information helpful.

Thank you and have a good rest of your day!

AWS
Support Engineer
Answered 2 years ago
AWS
Expert
Reviewed 2 years ago
