EMRFS and S3 503 Slow Down responses


I've been seeing the following exception from EMRFS on EMR 6.5.0:

Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: XXX; S3 Extended Request ID: XXX; Proxy: null), S3 Extended Request ID: XXX
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5445)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5392)
        at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1368)
        at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:26)
        at com.amazon.ws.emr.hadoop.fs.s3.lite.call.GetObjectMetadataCall.perform(GetObjectMetadataCall.java:12)
        at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor$CallPerformer.call(GlobalS3Executor.java:108)
        at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:135)
        at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:191)
        at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:186)
        at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.getObjectMetadata(AmazonS3LiteClient.java:96)
        at com.amazon.ws.emr.hadoop.fs.s3.lite.AbstractAmazonS3Lite.getObjectMetadata(AbstractAmazonS3Lite.java:43)
        at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.getFileMetadataFromCacheOrS3(Jets3tNativeFileSystemStore.java:592)
        at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:318)
        at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:509)
        at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
        at org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker.getFileSize(BasicWriteStatsTracker.scala:70)
        at org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker.$anonfun$statCurrentFile$1(BasicWriteStatsTracker.scala:96)
        at org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker.$anonfun$statCurrentFile$1$adapted(BasicWriteStatsTracker.scala:95)

My Spark application sets the following fs.s3.* configuration:

    // EMRFS S3 client settings (Hadoop configuration, fs.s3.* namespace).
    hadoopConfiguration.set("fs.s3.maxConnections", "2")             // max concurrent S3 connections
    hadoopConfiguration.set("fs.s3.maxRetries", "120")               // retry attempts before giving up
    hadoopConfiguration.set("fs.s3.retryPeriodSeconds", "1")         // wait between retries
    hadoopConfiguration.set("fs.s3.buckets.create.enabled", "false") // never auto-create buckets

The maxRetries value is obviously crazy large, just to see whether it had any impact on this issue (answer: no).
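For reference, the same keys can also be passed as spark.hadoop.*-prefixed properties when the session is built, which Spark copies into every Hadoop Configuration it creates on the driver and the executors. A minimal sketch (the app name is illustrative):

    import org.apache.spark.sql.SparkSession

    // Sketch: spark.hadoop.* properties are copied into the Hadoop
    // Configurations Spark builds, so the EMRFS settings also reach
    // executor-side code, not just the driver's configuration object.
    val spark = SparkSession.builder()
      .appName("emrfs-retry-config") // hypothetical name
      .config("spark.hadoop.fs.s3.maxRetries", "120")
      .config("spark.hadoop.fs.s3.retryPeriodSeconds", "1")
      .getOrCreate()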

Is there an additional setting that needs to be added or changed to get this codepath to honour the retry configuration?

asked 2 years ago · 1832 views
1 Answer

503 Slow Down occurs when you exceed the request-rate limits for the S3 bucket/prefix at a given point in time. Your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket.
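Because the limit is per prefix, spreading objects across more key prefixes raises the aggregate rate the job can sustain. A minimal sketch, assuming a SparkSession `spark` and an `event_date` column (paths and names are illustrative):

    // Each partition directory (event_date=...) is a distinct S3 prefix,
    // and each prefix gets its own per-prefix request-rate budget.
    val df = spark.read.parquet("s3://my-bucket/input/") // hypothetical input
    df.write
      .partitionBy("event_date")
      .parquet("s3://my-bucket/output/")                 // hypothetical output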

Did you mean that your Spark application doesn't retry at all, or that it failed after 120 retries?

120 retries is definitely a big number, and if the job is failing even after 120 retries it would mean there are sustained S3 API calls against the bucket/prefix in your account. In this case you need to look at the options described in the AWS Knowledge Center Article.
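One of those options is application-level retry with exponential backoff and jitter, on top of what EMRFS does internally. A minimal sketch of such a wrapper (this is not EMRFS's built-in retry; names and values are illustrative):

    import scala.annotation.tailrec
    import scala.util.{Failure, Success, Try}

    // Retry an S3-touching operation on 503 Slow Down, with exponential
    // backoff and full jitter, capping the delay at 60 seconds.
    def withBackoff[T](maxAttempts: Int = 8, baseDelayMs: Long = 500)(op: => T): T = {
      @tailrec
      def loop(attempt: Int): T = Try(op) match {
        case Success(v) => v
        case Failure(e) if attempt < maxAttempts - 1 &&
            e.getMessage != null && e.getMessage.contains("Slow Down") =>
          val capped = math.min(60000L, baseDelayMs << attempt)
          Thread.sleep((math.random() * capped).toLong)
          loop(attempt + 1)
        case Failure(e) => throw e
      }
      loop(0)
    }

For example, given a Hadoop FileSystem `fs` and a Path `path`, withBackoff() { fs.getFileStatus(path).getLen } would retry the kind of metadata call that appears in the stack trace above.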

In addition, it would help to increase the "fs.s3.retryPeriodSeconds" value as well.
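For example, a sketch with illustrative values (not tuned for any particular workload):

    // Longer waits between retries give a throttled bucket/prefix time to recover.
    hadoopConfiguration.set("fs.s3.maxRetries", "20")
    hadoopConfiguration.set("fs.s3.retryPeriodSeconds", "10")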

Would recommend reaching out to AWS Support through the case management system with details (ClusterId, ApplicationId, etc.) so that the team can look into the specifics of this error in your account.

AWS
SUPPORT ENGINEER
answered 2 years ago
