EMR Serverless custom image


Hi, we are comparing the EMR on EKS and EMR Serverless offerings. We are trying to run the TPC-DS performance benchmark from https://github.com/aws-samples/emr-on-eks-benchmark. We are not able to do it on EMR Serverless because some prerequisites live at the OS level, such as /opt/tpcds-kit/tools (https://github.com/aws-samples/emr-on-eks-benchmark/blob/main/docker/benchmark-util/Dockerfile).

Is there any way to install OS-level binaries or other prerequisites on Serverless workers to achieve this use case?

BR,

Abdel

1 Answer

Hi, Thanks for writing to re:Post.

I understand that you want help running the TPC-DS benchmark on EMR Serverless. The steps listed below should assist you in running it.

Below are the steps to generate 300 GB of data and run the TPC-DS test. Source code: https://github.com/aws-samples/emr-on-eks-benchmark

    • Pull the public Spark benchmark Docker image and open a shell inside a container

docker pull ghcr.io/aws-samples/eks-spark-benchmark:3.1.2
docker run -it --user root ghcr.io/aws-samples/eks-spark-benchmark:3.1.2 bash
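Inside the container you can also confirm that the TPC-DS kit mentioned in the question is present at the expected path (a quick sanity check):

ls /opt/tpcds-kit/tools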

    • Install AWS CLI version 2 inside the container

Run the following commands:

apt update
apt install curl unzip -y
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install

Run 'aws --version' to verify that the AWS CLI was installed successfully.

    • Configure AWS CLI

Run 'aws configure'

=> The following example shows sample values. Replace them with your own values.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

[+] https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config
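Alternatively, since this is a throwaway container, you can supply the same values through the standard AWS CLI environment variables instead of persisting them with 'aws configure' (the values below are placeholders):

export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="<region>"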

    • Prepare tools.tar.gz as a dependency archive

cd /opt/tpcds-kit/
chown -R hadoop:hadoop tools
tar -czvf tools.tar.gz tools
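Before uploading, a quick sanity check that the archive actually contains the TPC-DS binaries (the tools directory in this image is expected to hold dsdgen and dsqgen):

tar -tzf tools.tar.gz | head
# expect entries such as tools/dsdgen and tools/dsqgen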

    • Copy the following files to your S3 location using the 'aws s3 cp' command

aws s3 cp /opt/tpcds-kit/tools.tar.gz s3://<your-bucket>/
aws s3 cp /opt/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar s3://<your-bucket>/

The jar and toolkit tarball uploaded to S3 are referenced when starting the benchmark jobs in the next steps.
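The job commands below assume an existing EMR Serverless application and a job execution role. If you still need to create the application, here is a minimal sketch with the CLI (the application name and release label are placeholders; pick a Spark release available in your region):

aws emr-serverless create-application \
--type SPARK \
--name tpcds-benchmark \
--release-label emr-6.9.0 \
--region <region>

Note the applicationId returned; it is the <application id> referenced in the job commands below.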

    • Generate the TPC-DS benchmark data and put it in S3

aws emr-serverless start-job-run \
--application-id "<application id>" \
--execution-role-arn "<ARN of emr-serverless-job-role>" \
--region <region> \
--job-driver '{ "sparkSubmit": { "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar", "entryPointArguments": [ "s3://<your-bucket>/input-data/", "/home/hadoop/environment/tools", "parquet", "300", "200", "true", "true", "true" ], "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.DataGeneration --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=27" } }' \
--configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://<your-bucket>/logs/" } } }'
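start-job-run returns a jobRunId; a minimal sketch of polling the run state with the CLI until it reaches SUCCESS (application and job run IDs are placeholders):

aws emr-serverless get-job-run \
--application-id "<application id>" \
--job-run-id "<job run id>" \
--region <region> \
--query 'jobRun.state'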

    • Run the TPC-DS benchmark test with the data generated in S3

aws emr-serverless start-job-run \
--application-id "<application id>" \
--execution-role-arn "<ARN of emr-serverless-job-role>" \
--region <region> \
--job-driver '{ "sparkSubmit": { "entryPoint": "s3://<your-bucket>/eks-spark-benchmark-assembly-1.0.jar", "entryPointArguments": [ "s3://<your-bucket>/input-data/", "s3://<your-bucket>/output/", "/home/hadoop/environment/tools", "parquet", "300", "3", "false", "q1-v2.4", "true" ], "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --archives s3://<your-bucket>/tools.tar.gz#environment --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.executor.cores=4 --conf spark.executor.memory=10g --conf spark.executor.instances=47" } }' \
--configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://<your-bucket>/logs/" } } }'

In the above steps, please update the S3 locations and region to match your environment. Please ensure the EMR Serverless job execution role has the required permissions on the S3 buckets to read the dependency files and write data.
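Once a job run completes, the driver logs land under the logUri configured above; a sketch of listing and reading the driver stdout with the CLI (the key layout shown is the standard EMR Serverless S3 log structure, assumed here):

aws s3 ls s3://<your-bucket>/logs/applications/<application id>/jobs/<job run id>/ --recursive
aws s3 cp s3://<your-bucket>/logs/applications/<application id>/jobs/<job run id>/SPARK_DRIVER/stdout.gz - | gunzip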

I hope you find this information helpful.

Thank you and have a good rest of your day!

AWS SUPPORT ENGINEER
answered 2 years ago
