As @dacort said in your comment, the way to specify disk size besides initialCapacity is with the Spark job properties spark.emr-serverless.executor.disk and spark.emr-serverless.driver.disk. Thanks!
The disk space used by EMR Serverless is primarily for storing log files, shuffle data, and resources like auxiliary libraries and user jars. When you configure an EMR Serverless application, you can specify the disk size for your pre-initialized capacity and a maximum disk limit for the application. The initialCapacity parameter that you set when configuring your application refers to the number of workers that are kept pre-initialized, ready to respond quickly. The disk space you specify for these pre-initialized workers, along with other resources like CPU and memory, is what those workers start with.
https://repost.aws/questions/QUYkOGBhEaRSyW7__XB4AOsA/emr-serverless-application-disk-space
When it comes to specifying disk space for workers, you can configure each worker with temporary storage disks with a minimum size of 20 GB and a maximum of 200 GB. You only pay for additional storage beyond 20 GB that you configure per worker.
https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-capacity.html
The disk requirement for each worker instance of a worker type is also configurable. This configuration is not required, and it accepts a string value that follows the pattern ^[1-9][0-9]*(\s)?(GB|gb|gB|Gb)$. It's worth noting that this is separate from the worker count you set in initialCapacity, and it should be sized according to the expected load on each worker.
https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_WorkerResourceConfig.html
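For illustration, here is a sketch of how the disk field fits inside the initialCapacity section of a create-application or update-application request. The worker counts and sizes below are placeholder values, and the DRIVER/EXECUTOR keys follow the examples in the docs linked above, so adjust them to your own setup:

```json
{
    "DRIVER": {
        "workerCount": 1,
        "workerConfiguration": { "cpu": "2vCPU", "memory": "8GB", "disk": "30GB" }
    },
    "EXECUTOR": {
        "workerCount": 4,
        "workerConfiguration": { "cpu": "4vCPU", "memory": "16GB", "disk": "100GB" }
    }
}
```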
In your case, if you're running into disk space issues, you might want to increase the disk configuration for the workers (if it's currently set below the 200 GB maximum) or optimize your application to use disk more efficiently. It's also important to remember that Spark jobs use more memory than the specified container sizes because of memory overhead, so take that into account when choosing your worker sizes.
Thanks for the answer, but do you have an example of how to specify disk space for the workers? I couldn't figure out how to use WorkerResourceConfig with the AWS CLI.
Let me give an example of how to specify the worker configuration. You can use the aws emr-serverless update-application command, as in the block below. For detailed information, see https://docs.aws.amazon.com/cli/latest/reference/emr-serverless/update-application.html; the page also shows the JSON configuration for the API.
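As a sketch, something like the following should set the disk size for the pre-initialized workers. The application ID and the CPU/memory/disk values are placeholders, and if I recall correctly the application needs to be in a stopped state before you can update it:

```bash
# Hypothetical application ID and sizes; disk can be 20 GB to 200 GB per worker.
aws emr-serverless update-application \
    --application-id 00example1234567 \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": { "cpu": "2vCPU", "memory": "8GB", "disk": "30GB" }
        },
        "EXECUTOR": {
            "workerCount": 4,
            "workerConfiguration": { "cpu": "4vCPU", "memory": "16GB", "disk": "100GB" }
        }
    }'
```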
The link that you sent only shows how to specify workerConfiguration inside the initialCapacity section, which I'm doing already. But it looks like there is no way to specify workerConfiguration for workers beyond the initial ones. So if you could paste a piece of code in your answer showing how to do that, it would be very helpful. Thanks!
The initial capacity is reused, but you can also specify per-job resource configurations by using spark.emr-serverless.driver.disk and spark.emr-serverless.executor.disk, as listed in the Spark job properties. You would pass these as part of the sparkSubmitParameters in your job driver (there's an example in the docs, and a sketch below). I also noticed your driver disk size is only 20 GB. I'm not sure whether the error you mention is on an executor or on the Spark driver, but it's worth double-checking.
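Roughly, it looks like this with the AWS CLI. The application ID, role ARN, and script path are placeholders, and double-check the exact value format for the disk properties against the Spark job properties page (I'm using the lowercase "g" form here):

```bash
# Placeholder IDs and paths; the two --conf entries set per-job disk for the driver and executors.
aws emr-serverless start-job-run \
    --application-id 00example1234567 \
    --execution-role-arn arn:aws:iam::123456789012:role/EMRServerlessJobRole \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/my_job.py",
            "sparkSubmitParameters": "--conf spark.emr-serverless.driver.disk=30g --conf spark.emr-serverless.executor.disk=100g"
        }
    }'
```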