S3: "AmazonS3Exception: Please reduce your request rate" and "Timeout waiting for connection from pool" exceptions when running load tests

Hi, I'm running load tests against a library that essentially interacts with S3 to put and get objects. It works well up to 3000-3500 TPS, but beyond that it starts throwing the following exceptions.

With 0 retries, I see throttling exceptions:

Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: GTAPGZK7E9KQQX3V; S3 Extended Request ID: O8XMzbmn0G51h01gogNrm0zHRkyGbGnhVJs+6WdvyTnAQ62kV/K7Au4wdz5z69WuhEFyaoXc4gqasaFEd8mSvA==; Proxy: null)

With an adaptive backoff retry strategy and a maximum of 5 retries, I see connection errors instead. The majority of them are "Timeout waiting for connection from pool":

com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
com.amazonaws.SdkClientException: Unable to execute HTTP request: The target server failed to respond

I have gone through multiple articles and ensured that:

  1. All objects are closed explicitly or used in try-with-resources blocks.
  2. I'm increasing the TPS incrementally to allow a sufficient warm-up period. I started with 500 TPS for 5 minutes, then 1000 TPS for 5 minutes, and then added 1000 TPS every 5 minutes up to 8000 TPS. It starts failing at about 4000 TPS.
  3. The partition prefix is unique per object, so the TPS per partition should not exceed the limits.
final String message = UUID.randomUUID().toString();
final String key = String.format("%s/%s/%s.%s", getPartitionPrefix(message), message, message, "kms");
    private String getPartitionPrefix(final String uniqueMessageId) {
        try
        {
            final byte[] hashBytes =
                    MessageDigest.getInstance("MD5").digest(uniqueMessageId.getBytes("UTF-8"));
            final String hashString = Base64.getEncoder()
                    .encodeToString(hashBytes).substring(0, PARTITION_CHARS);
            // If the partition prefix is "soap", S3 interprets the request as an attempt
            // to hit the S3 soap API and returns an error (HTTP 405)
            // So map "soap" to an alternate partition name.
            if(SOAP.equals(hashString))
            {
                return ALTERNATE_PARTIION_NAME_FOR_SOAP;
            }
            return hashString;
        }
        catch (final NoSuchAlgorithmException | UnsupportedEncodingException e)
        {
            throw new RuntimeException(e.getClass() + ": " + e.getMessage(), e);
        }
    }
  4. This is how my S3 client is built:
AmazonS3ClientBuilder.standard()
                            .withCredentials(
                                    CredentialsModule.getCredentialProvider(clientId, CBEV2_S3_WRITE_ROLE_ARN_MAP.get(clientId))
                            )
                            .withClientConfiguration(new ClientConfiguration()
                                    .withConnectionTimeout(1000)
                                    .withSocketTimeout(4000)
                                    .withMaxConnections(8000)
                                    .withProtocol(Protocol.HTTPS)
                                    .withCacheResponseMetadata(false)
                                    .withTcpKeepAlive(true)
                                    .withRetryPolicy(RetryPolicy.builder()
                                            .withMaxErrorRetry(5)
                                            .withRetryMode(RetryMode.ADAPTIVE)
                                            .withBackoffStrategy(PredefinedRetryPolicies
                                                    .getDefaultBackoffStrategy(RetryMode.ADAPTIVE))
                                            .build()))
                            // Allow global bucket access as buckets are not guaranteed to be in the same region as the endpoint.
                            .withForceGlobalBucketAccessEnabled(true)
                            .build();

I have tried a range of max connections values, from 100 to 100,000.

  5. I'm running the load tests with a multi-host TPSGenerator (about 58 hosts are in use for 4000 TPS, which I believe is heavily overprovisioned for that load).
Pranavi
asked 7 months ago, 1247 views
1 Answer
Hello,

As S3 is a distributed service, a small percentage of 5xx errors (typically less than 0.01% of the total request rate) is expected. As mentioned in the documentation, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. However, this request rate is not available to a bucket or a prefix as soon as it is created. As the request rate increases gradually, S3 should be able to scale up automatically to accommodate the higher request rates. Refer to this blog to understand how S3 scales up.
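To make the per-prefix figures concrete, here is a minimal sketch of the arithmetic. `PrefixEstimate` and `prefixesNeeded` are hypothetical names, not part of any AWS SDK, and the calculation assumes that load is spread evenly across prefixes and that S3 has already partitioned them, which, as noted above, is not the case right after creation.

```java
// Hypothetical helper for illustration only; not an AWS SDK class.
final class PrefixEstimate {
    // Per-prefix baselines quoted in the answer above.
    static final int WRITE_RPS_PER_PREFIX = 3_500; // PUT/COPY/POST/DELETE
    static final int READ_RPS_PER_PREFIX = 5_500;  // GET/HEAD

    /**
     * Smallest number of prefixes such that targetRps spread evenly
     * across them stays at or below perPrefixRps.
     * Assumption: perfectly even distribution and fully scaled prefixes.
     */
    static int prefixesNeeded(int targetRps, int perPrefixRps) {
        if (targetRps < 0 || perPrefixRps <= 0) {
            throw new IllegalArgumentException("targetRps must be >= 0 and perPrefixRps > 0");
        }
        // Integer ceiling division.
        return (targetRps + perPrefixRps - 1) / perPrefixRps;
    }

    public static void main(String[] args) {
        System.out.println(prefixesNeeded(8_000, WRITE_RPS_PER_PREFIX)); // 3
        System.out.println(prefixesNeeded(8_000, READ_RPS_PER_PREFIX));  // 2
    }
}
```

By this estimate an 8000 TPS target needs only a handful of scaled prefixes, so with well-distributed keys the 503s are more likely to reflect partitions that have not scaled yet than a hard ceiling.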

To monitor the number of 5xx status error responses that you receive, use one of these options:

You can increase the number of retries to a higher value (for example, 10) to reduce the number of 503s you receive, because under a constantly high load S3 will eventually scale up.
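As a rough illustration of what a higher retry count costs in added latency, the sketch below computes capped exponential backoff with full jitter. It is not the SDK's implementation: `BackoffSketch`, the 100 ms base delay, and the 20 s cap are assumptions chosen for the example, so check your SDK version for the real values.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only; class name and constants are assumptions,
// not values read from the AWS SDK.
final class BackoffSketch {
    static final long BASE_DELAY_MS = 100;   // assumed base delay
    static final long MAX_DELAY_MS = 20_000; // assumed cap per retry

    /** Upper bound of the wait before retry number retryIndex (0-based). */
    static long delayCeilingMs(int retryIndex) {
        if (retryIndex < 0) {
            throw new IllegalArgumentException("retryIndex must be >= 0");
        }
        // Clamp the shift so the multiplication cannot overflow a long.
        int shift = Math.min(retryIndex, 30);
        return Math.min(MAX_DELAY_MS, BASE_DELAY_MS << shift);
    }

    /** Full jitter: a uniformly random wait in [0, ceiling]. */
    static long jitteredDelayMs(int retryIndex) {
        return ThreadLocalRandom.current().nextLong(delayCeilingMs(retryIndex) + 1);
    }

    /** Worst-case total wait if every one of maxRetries retries is used. */
    static long worstCaseTotalMs(int maxRetries) {
        long total = 0;
        for (int i = 0; i < maxRetries; i++) {
            total += delayCeilingMs(i);
        }
        return total;
    }
}
```

Under these assumed constants, 5 retries add at most about 3 seconds of waiting per request, while 10 retries can add over a minute. Requests that wait longer also hold their caller threads longer, which is worth keeping in mind when sizing the load test.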

Another option you could explore is to reduce the increment from 1000 TPS to 500 TPS and lengthen the interval between increments from 5 to 10 minutes. This extends the warm-up period and gives S3 more time to scale up.
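The gentler ramp can be written out as a schedule to see how long the test would run. This is a minimal sketch; `RampPlan` and its methods are hypothetical names for illustration and are not part of TPSGenerator or any AWS tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for planning a load-test ramp; illustration only.
final class RampPlan {
    /** TPS levels from startTps up to maxTps (inclusive) in steps of stepTps. */
    static List<Integer> levels(int startTps, int stepTps, int maxTps) {
        if (startTps <= 0 || stepTps <= 0 || maxTps < startTps) {
            throw new IllegalArgumentException("invalid ramp parameters");
        }
        List<Integer> out = new ArrayList<>();
        for (int tps = startTps; tps <= maxTps; tps += stepTps) {
            out.add(tps);
        }
        return out;
    }

    /** Total run time when each level is held for holdMinutes. */
    static int totalMinutes(List<Integer> levels, int holdMinutes) {
        return levels.size() * holdMinutes;
    }

    public static void main(String[] args) {
        List<Integer> plan = levels(500, 500, 8_000);
        System.out.println(plan.size() + " levels, "
                + totalMinutes(plan, 10) + " minutes in total");
    }
}
```

With 500 TPS steps held for 10 minutes each, reaching 8000 TPS takes 16 levels and 160 minutes, compared with roughly 45 minutes for the schedule described in the question.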

If you still see a high 5xx error rate, investigating the issue requires details that are non-public information. Please open a support case with AWS using the following link.

Regarding the SDK-related errors, the possible reasons for 'Timeout waiting for connection from pool' include:

  • Connections are not being closed properly by the client
  • Number of concurrent connections is greater than the max number of connections setting
  • Size of files being downloaded or uploaded is large causing the connections to be engaged for a longer time

Since you mentioned that you are closing connections explicitly, could you please check whether the number of concurrent connections ever exceeds the MaxConnections value, and what the average file size is? You can also engage the AWS Java SDK team directly via GitHub issues. Refer to the linked repos for Java SDK V1 and Java SDK V2.
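One way to estimate the number of concurrent connections before measuring it is Little's law: average requests in flight equal the request rate multiplied by the average time each request holds a connection. The sketch below applies it. `PoolSizing` is a hypothetical helper, the latency figures are made-up inputs, and it assumes all traffic goes through a single client instance with one connection per in-flight request.

```java
// Hypothetical helper for illustration only; not an AWS SDK class.
final class PoolSizing {
    /**
     * Little's law: average in-flight requests =
     * arrival rate (requests/second) x average time in system (seconds).
     */
    static double expectedInFlight(double requestsPerSecond, double avgLatencySeconds) {
        if (requestsPerSecond < 0 || avgLatencySeconds < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        return requestsPerSecond * avgLatencySeconds;
    }

    /** True when the estimated in-flight count is above the pool size. */
    static boolean exceedsPool(double requestsPerSecond, double avgLatencySeconds,
                               int maxConnections) {
        return expectedInFlight(requestsPerSecond, avgLatencySeconds) > maxConnections;
    }

    public static void main(String[] args) {
        // Assumed inputs: 4000 requests/second through one client.
        System.out.println(expectedInFlight(4_000, 0.25));   // 1000.0
        System.out.println(exceedsPool(4_000, 0.25, 8_000)); // false
        System.out.println(exceedsPool(4_000, 2.5, 8_000));  // true
    }
}
```

The point of the estimate is that latency matters as much as TPS: retries, backoff waits, and large objects all lengthen the time a connection is held, so the pool can run dry at a request rate that looks safe on paper.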

Thanks

AWS
SUPPORT ENGINEER
answered 7 months ago
