EMR Serverless disk specification


I'm having a lot of problems with disk space in EMR Serverless:

org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1101b82b : No space left on device 

I have set disk space to 200 GB in initial capacity, which appears to be the maximum for EMR Serverless, but I'm not sure whether the initial capacity configuration is also used for all the workers that will be spun up, since dynamic allocation is enabled. So my question is: is there another place to specify disk space for workers?

Allan
asked 10 months ago · 931 views
2 Answers

The disk space used by EMR Serverless is primarily for storing log files, shuffle data, and resources such as auxiliary libraries and user JARs. When you configure an EMR Serverless application, you can specify the disk size for your pre-initialized capacity and a maximum disk limit for the application. The initialCapacity parameter that you set when configuring your application refers to the number of workers that are kept pre-initialized, ready to respond quickly. The disk space, along with other resources such as CPU and memory, that you specify for these pre-initialized workers is what those workers start with.

https://repost.aws/questions/QUYkOGBhEaRSyW7__XB4AOsA/emr-serverless-application-disk-space

When it comes to specifying disk space for workers, you can configure each worker with temporary storage disks with a minimum size of 20 GB and a maximum of 200 GB. You only pay for the additional storage beyond 20 GB that you configure per worker.

https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-capacity.html

The disk requirement for each worker instance of a worker type is also configurable. This configuration is not required; it accepts a string value that matches the pattern ^[1-9][0-9]*(\s)?(GB|gb|gB|Gb)$. It's worth noting that this is a separate configuration from the initial capacity and should be set according to the expected load on each worker.

https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_WorkerResourceConfig.html
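
For illustration, a WorkerResourceConfig that requests the maximum per-worker disk could look like the following JSON (the cpu and memory values here are only examples; pick sizes that are supported for your worker type):

    {
        "cpu": "4vCPU",
        "memory": "16GB",
        "disk": "200GB"
    }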

In your case, since you're running into disk space issues, you might want to increase the disk configuration for the workers if it's currently set below the maximum allowable limit, or optimize your application to use disk space more efficiently. It's also important to remember that Spark jobs use more memory than the specified container sizes due to memory overhead (in Spark, the default overhead is typically the larger of 384 MB or 10% of the container memory), so take that into account when choosing your worker sizes.

EXPERT
answered 10 months ago
EXPERT
reviewed 10 months ago
  • Thanks for the answer. But do you have an example of how to specify disk space for the workers? I couldn't figure out how to use API_WorkerResourceConfig with the AWS CLI.

  • Let me give an example of how to specify the worker configuration.

    You can use the following command:

    aws emr-serverless update-application

    For detailed information, see:

    https://docs.aws.amazon.com/cli/latest/reference/emr-serverless/update-application.html

    The link also shows the JSON configuration for the API.
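
    For example, the call could look something like this (the application ID, worker counts, and sizes are placeholders; adjust them for your workload):

    aws emr-serverless update-application \
        --application-id <application-id> \
        --initial-capacity '{
            "DRIVER": {
                "workerCount": 1,
                "workerConfiguration": { "cpu": "4vCPU", "memory": "16GB", "disk": "200GB" }
            },
            "EXECUTOR": {
                "workerCount": 10,
                "workerConfiguration": { "cpu": "4vCPU", "memory": "16GB", "disk": "200GB" }
            }
        }'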

  • The link that you sent only shows how to specify workerConfiguration inside the initialCapacity section, which I'm already doing. But it looks like there is no way to specify workerConfiguration for workers besides the initial ones. If you could paste a piece of code in your answer showing how to do that, it would be very helpful. Thanks.

  • The initial capacity is reused, but you can also specify per-job resource configurations by using spark.emr-serverless.driver.disk and spark.emr-serverless.executor.disk, as listed in the Spark job properties. You would use these as part of the sparkSubmitParameters in your job driver (there is an example in the docs).

    I also noticed your driver disk size is only 20 GB. I'm not sure whether the error you mention is on an executor or on the Spark driver, but it's worth double-checking.
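
    For example, a job run that raises both disk sizes might look like this (the application ID, role ARN, and entry point are placeholders):

    aws emr-serverless start-job-run \
        --application-id <application-id> \
        --execution-role-arn <execution-role-arn> \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://<bucket>/scripts/my-job.py",
                "sparkSubmitParameters": "--conf spark.emr-serverless.driver.disk=100g --conf spark.emr-serverless.executor.disk=200g"
            }
        }'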

Accepted Answer

As @dacort said in the comments, the way to specify disk size besides initialCapacity is to use the Spark job properties spark.emr-serverless.executor.disk and spark.emr-serverless.driver.disk.

Thanks

Allan
answered 10 months ago
AWS SUPPORT ENGINEER
reviewed a month ago
