EMR Serverless custom image


Hi, we are comparing the EMR on EKS and EMR Serverless offerings. We are trying to run the TPC-DS performance benchmark from https://github.com/aws-samples/emr-on-eks-benchmark. We are not able to do it on EMR Serverless because some prerequisites live on the OS, such as /opt/tpcds-kit/tools (see https://github.com/aws-samples/emr-on-eks-benchmark/blob/main/docker/benchmark-util/Dockerfile).

Is there any way to set up OS-level/binary prerequisites on Serverless workers to achieve this use case?

BR,

Abdel

Asked 2 years ago · 825 views
1 Answer

Hi, thanks for writing to re:Post.

I understand that you want help running the TPC-DS benchmark on EMR Serverless. The steps listed below should assist you in running the benchmark.

Below are the steps to generate 300 GB of data and run the TPC-DS test. Source code: https://github.com/aws-samples/emr-on-eks-benchmark

    • Pull the public Spark benchmark Docker image and open a root shell inside the container

docker pull ghcr.io/aws-samples/eks-spark-benchmark:3.1.2
docker run -it --user root ghcr.io/aws-samples/eks-spark-benchmark:3.1.2 bash

    • Install AWS CLI version 2 inside the container

Run the following commands:

apt update
apt install curl unzip -y
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install

Run 'aws --version' to verify that the AWS CLI installed successfully.

    • Configure the AWS CLI

Run 'aws configure'

=> The following example shows sample values. Replace them with your own values.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

See: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config

    • Prepare tools.tar.gz for the toolkit dependency

cd /opt/tpcds-kit/
chown -R hadoop:hadoop tools
tar -czvf tools.tar.gz tools
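If you want to sanity-check the packaging step, a minimal sketch (using a scratch directory and a dummy file in place of the real /opt/tpcds-kit contents) would look like this:

```shell
# Sketch: mirror the packaging step against a scratch directory so the
# archive layout can be inspected without the real tpcds-kit.
mkdir -p /tmp/tpcds-demo/tools
echo "dummy binary" > /tmp/tpcds-demo/tools/dsdgen   # placeholder file
cd /tmp/tpcds-demo
tar -czvf tools.tar.gz tools
# Verify the archive keeps the top-level 'tools/' directory --
# the Spark jobs below expect to find <unpack dir>/tools.
tar -tzf tools.tar.gz
```

The important detail is that the `tools` directory itself is the top-level entry in the archive; the job arguments later reference it by that name.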

    • Copy the following files to your S3 location using 'aws s3 cp'

aws s3 cp /opt/tpcds-kit/tools.tar.gz s3://<your-bucket>/
aws s3 cp /opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar s3://<your-bucket>/

The jar and toolkit tarball uploaded to S3 are referenced when starting the benchmark jobs in the next steps.

    • Generate the TPC-DS benchmark data and write it to S3

aws emr-serverless start-job-run \
--application-id "<application id>" \
--execution-role-arn "<ARN of emr-serverless-job-role>" \
--region <region> \
--job-driver '{ "sparkSubmit": { "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar", "entryPointArguments": [ "s3://<your-bucket>/input-data/", "/home/hadoop/environment/tools", "parquet", "300", "200", "true", "true", "true" ], "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.DataGeneration --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=27" } }' \
--configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://<your-bucket>/logs/" } } }'
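The `--archives s3://<your-bucket>/tools.tar.gz#environment` flag is what makes the toolkit available without a custom image: Spark downloads the archive and unpacks it into a directory named after the `#environment` alias in each container's working directory, which is why the job argument points at /home/hadoop/environment/tools. Locally, the unpacking is roughly equivalent to this sketch (using a placeholder archive, not the real toolkit):

```shell
# Sketch of what Spark's --archives tools.tar.gz#environment does on each
# worker: fetch the archive, then extract it under the alias directory.
mkdir -p /tmp/archives-demo && cd /tmp/archives-demo
mkdir -p tools && echo "dsdgen placeholder" > tools/dsdgen
tar -czf tools.tar.gz tools          # stand-in for the S3 download
mkdir -p environment                 # the '#environment' alias
tar -xzf tools.tar.gz -C environment
ls environment/tools                 # the job sees <workdir>/environment/tools
```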

    • Run the TPC-DS benchmark test with the data generated in S3

aws emr-serverless start-job-run \
--application-id "<application id>" \
--execution-role-arn "<ARN of emr-serverless-job-role>" \
--region <region> \
--job-driver '{ "sparkSubmit": { "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar", "entryPointArguments": [ "s3://<your-bucket>/input-data/", "s3://<your-bucket>/output/", "/home/hadoop/environment/tools", "parquet", "300", "3", "false", "q1-v2.4", "true" ], "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=47" } }' \
--configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://<your-bucket>/logs/" } } }'

In the steps above, update the S3 locations and region to match your environment. Please ensure the EMR Serverless job execution role has the required permissions on the S3 buckets to read the dependency files and write the generated data and logs.
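As an illustration only, a minimal S3 policy for the job execution role might look like the following. The bucket name, policy name, and role name are placeholders, and the `put-role-policy` call is left commented out because it needs real credentials; adapt everything to your environment.

```shell
# Hypothetical minimal S3 policy for the EMR Serverless job execution role.
# <your-bucket> is a placeholder bucket name.
cat > /tmp/emr-serverless-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::<your-bucket>", "arn:aws:s3:::<your-bucket>/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::<your-bucket>/*"]
    }
  ]
}
EOF
# Check the document is valid JSON before attaching it:
python3 -m json.tool /tmp/emr-serverless-s3-policy.json > /dev/null && echo OK
# aws iam put-role-policy --role-name <emr-serverless-job-role> \
#   --policy-name s3-benchmark-access \
#   --policy-document file:///tmp/emr-serverless-s3-policy.json
```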

I hope you find this information helpful.

Thank you and have a good rest of your day!

AWS
Support Engineer
Answered 2 years ago
Reviewed 2 years ago by an AWS Expert
