My Apache Spark or Apache Hive job on Amazon EMR fails with an HTTP 503 "Slow Down" AmazonS3Exception.
Short description
When the Amazon Simple Storage Service (Amazon S3) request rate for your application exceeds the typical sustained rates and Amazon S3 internally optimizes performance, you receive the following error:
"java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: 2E8B8866BFF00645; S3 Extended Request ID: oGSeRdT4xSKtyZAcUe53LgUf1+I18dNXpL2+qZhFWhuciNOYpxX81bpFiTw2gum43GcOHR+UlJE=), S3 Extended Request ID: oGSeRdT4xSKtyZAcUe53LgUf1+I18dNXpL2+qZhFWhuciNOYpxX81bpFiTw2gum43GcOHR+UlJE="
Resolution
Configure CloudWatch request metrics
To identify the issue with too many requests, it's a best practice to configure Amazon CloudWatch request metrics for the Amazon S3 bucket.
Turn on CloudWatch request metrics for the bucket and define a filter for the prefix.
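For example, a metrics configuration that filters on a single prefix looks similar to the following. The ID requests-by-prefix and the prefix data/ are placeholder values; you apply the configuration with the Amazon S3 PutBucketMetricsConfiguration API or the equivalent console steps:

```json
{
  "Id": "requests-by-prefix",
  "Filter": {
    "Prefix": "data/"
  }
}
```

After the configuration takes effect, watch the 5xxErrors and TotalRequestLatency request metrics for that prefix in CloudWatch to confirm which jobs drive the throttling.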
Modify the retry strategy for Amazon S3 requests
By default, EMR File System (EMRFS) uses an exponential backoff strategy to retry requests to Amazon S3. The default EMRFS retry limit is 15. However, you can increase the retry limit on a new cluster, on a running cluster, or at application runtime.
To increase the retry limit, change the value of the fs.s3.maxRetries parameter.
Note: If you set a very high value for this parameter, then you might experience longer job durations.
Set the parameter to a high value, for example, 20, and monitor the duration overhead of your jobs. Then, adjust the parameter based on your use case.
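The retry behavior that fs.s3.maxRetries controls can be illustrated with a minimal exponential backoff sketch in Python. The function and parameter names below are hypothetical, and EMRFS implements this logic internally in Java; the sketch only shows why a higher retry limit trades longer job duration for fewer failures:

```python
import random
import time

def call_with_backoff(request, max_retries=20, base_delay=0.1, max_delay=30.0):
    """Retry `request` with exponential backoff and jitter, similar to how
    EMRFS retries throttled (HTTP 503 Slow Down) Amazon S3 calls."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except IOError:
            if attempt == max_retries:
                raise  # retry limit reached; the job fails with the 503 error
            # Sleep for an exponentially growing, jittered delay, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))

# Example: a request that is throttled twice, then succeeds on the third attempt.
attempts = {"n": 0}

def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("503 Slow Down")
    return "ok"

print(call_with_backoff(flaky_request, base_delay=0.01))  # prints "ok"
```

A larger max_retries makes a transient throttle less likely to fail the job, but each additional retry adds up to max_delay seconds of wait time, which is why you monitor job duration after you raise the limit.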
For a new cluster, you can add a configuration object similar to the following when you launch the cluster:
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxRetries": "20"
    }
  }
]
After you launch the cluster, Spark and Hive applications that run on Amazon EMR use the new limit.
To increase the retry limit on a running cluster, complete the following steps:
- Open the Amazon EMR console.
- Choose the active cluster that you want to reconfigure.
- Choose the Configurations tab.
- In the Filter dropdown list, select the instance group that you want to reconfigure.
- In the Reconfigure dropdown list, choose Edit in table.
- In the configuration classification table, choose Add configuration, and then use the following values:
  - For Classification, use emrfs-site.
  - For Property, use fs.s3.maxRetries.
  - For Value, use the new value for the retry limit. For example, 20.
- Select Apply this configuration to all active instance groups.
- Choose Save changes.
After you deploy the configuration, Spark and Hive applications use the new limit.
To increase the retry limit at runtime for a Spark application, use a Spark shell session to modify the fs.s3.maxRetries parameter similar to the following example:
spark> sc.hadoopConfiguration.set("fs.s3.maxRetries", "20")
spark> val source_df = spark.read.csv("s3://awsexamplebucket/data/")
spark> source_df.write.save("s3://awsexamplebucket2/output/")
To increase the retry limit at runtime for a Hive application, run a command similar to the following example:
hive> set fs.s3.maxRetries=20;
hive> select ....
Adjust the number of concurrent Amazon S3 requests
- If you have multiple jobs (Spark, Apache Hive, or s3-dist-cp) that read from and write to the same Amazon S3 prefix, then you can adjust the concurrency. Start with the most read-heavy or write-heavy jobs, and lower their concurrency to avoid excessive parallelism.
Note: If you configured cross-account access for Amazon S3, then other AWS accounts might also submit jobs to the same prefix.
- If you see errors when the job tries to write to the destination bucket, then reduce excessive write parallelism. For example, use the Spark .coalesce() or .repartition() operations to reduce the number of Spark output partitions before you write to Amazon S3. You can also reduce the number of cores per executor or the number of executors.
- If you see errors when the job tries to read from the source bucket, then adjust the object sizes. To reduce the number of objects that the job reads, aggregate smaller objects into larger ones. For example, use s3-dist-cp to merge a large number of small files into a smaller number of large files.
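The write-parallelism adjustment above comes down to simple arithmetic: each running task writes one output partition, so the number of simultaneous S3 writers is bounded by both the task slots and the partition count. The following back-of-the-envelope sketch is not an EMR or Spark API, and the executor counts are hypothetical:

```python
def concurrent_s3_writers(num_executors, cores_per_executor, output_partitions):
    """Approximate number of simultaneous S3 write streams: at most
    num_executors * cores_per_executor tasks run at once, and each task
    writes one output partition."""
    max_parallel_tasks = num_executors * cores_per_executor
    return min(max_parallel_tasks, output_partitions)

# Before tuning: 50 executors x 4 cores writing 2,000 partitions -> 200 parallel writers.
print(concurrent_s3_writers(50, 4, 2000))  # 200
# After .coalesce(100): at most 100 partitions, so at most 100 parallel writers.
print(concurrent_s3_writers(50, 4, 100))   # 100
# Alternatively, halve the cores per executor: 50 x 2 cores -> 100 parallel writers.
print(concurrent_s3_writers(50, 2, 2000))  # 100
```

Either lever, fewer output partitions or fewer task slots, reduces the request rate against the prefix; coalescing has the added benefit of producing fewer, larger objects for downstream readers.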
Related information
Best practices design patterns: optimizing Amazon S3 performance
Why does my Amazon EMR application fail with an HTTP 403 "Access Denied" AmazonS3Exception?
Why does my Amazon EMR application fail with an HTTP 404 "Not Found" AmazonS3Exception?