Getting rate limited when downloading massively from S3 (AWS public datasets)


Hi all,

I have a large EMR cluster with the typical VPC, private network, public network, and internet gateway. On it, each vCPU tries to download a WARC file from S3. All the instances are in the same VPC, but I am getting rate limited. I don't think this should happen: the instances should connect to the URLs independently, over different network paths. However, I do not know how to set up independent connections/IPs, rotating IPs, or something similar. This must be a common issue with a standard solution; otherwise, how do people work at scale with AWS Public Datasets, or even their own S3 buckets, without getting limited?

Edit: the rate limiting starts with more than 160 vCPUs, and cluster performance degrades from 95% vCPU usage to ~10%.

Thanks, David

1 Answer

You haven't said how your VPC is connecting to S3. For example, if you're using a NAT Gateway then you may be hitting throughput limits there rather than at the instance level.
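To check which path applies, you could look at the route tables attached to your subnets. Below is a minimal sketch, assuming you pass it the "RouteTables" list returned by boto3's ec2.describe_route_tables(). The function name classify_s3_path is made up for illustration. It only looks at route targets and does not confirm that a vpce- route belongs to S3 rather than another gateway service such as DynamoDB.

```python
def classify_s3_path(route_tables):
    """Guess, per route table, how traffic bound for S3 leaves the subnet.

    `route_tables` is assumed to have the shape of the "RouteTables" list
    in an EC2 describe_route_tables response.
    """
    result = {}
    for table in route_tables:
        routes = table.get("Routes", [])
        path = "unknown"
        for route in routes:
            # Gateway endpoints show up as prefix-list routes whose target
            # is a vpce-... ID. This sketch does not check which service
            # the prefix list belongs to.
            if route.get("GatewayId", "").startswith("vpce-"):
                path = "gateway-endpoint"
                break
        if path == "unknown":
            for route in routes:
                if route.get("DestinationCidrBlock") != "0.0.0.0/0":
                    continue
                if route.get("NatGatewayId"):
                    path = "nat-gateway"
                elif route.get("GatewayId", "").startswith("igw-"):
                    path = "internet-gateway"
        result[table["RouteTableId"]] = path
    return result


# Hand-written sample data; the IDs are placeholders, not real resources.
sample = [
    {"RouteTableId": "rtb-private",
     "Routes": [{"DestinationCidrBlock": "0.0.0.0/0",
                 "NatGatewayId": "nat-0123"}]},
    {"RouteTableId": "rtb-public",
     "Routes": [{"DestinationCidrBlock": "0.0.0.0/0",
                 "GatewayId": "igw-0123"}]},
]
paths = classify_s3_path(sample)
```

A table reported as "nat-gateway" means every download from that subnet shares the NAT Gateway's bandwidth.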

My suggestion would be to create an S3 Gateway Endpoint in your VPC and see if that improves the situation.
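If you want to script it, here is a rough sketch using boto3's create_vpc_endpoint call. The helper names and any IDs you pass in are placeholders of mine, not something from your setup. It assumes your credentials are allowed to create VPC endpoints and that the route tables you list are the ones used by the subnets running your EMR nodes. Keep in mind that a gateway endpoint only covers S3 traffic within the same Region as the VPC.

```python
def s3_gateway_endpoint_params(vpc_id, region, route_table_ids):
    """Build the arguments for ec2.create_vpc_endpoint (no AWS call made)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # S3 gateway endpoint service names follow this pattern.
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Routes to S3 are added to each of these route tables.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, region, route_table_ids):
    """Create the endpoint and return its ID. Requires boto3 and credentials."""
    import boto3  # imported here so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_params(vpc_id, region, route_table_ids)
    )
    return response["VpcEndpoint"]["VpcEndpointId"]


# Placeholder values for illustration only.
params = s3_gateway_endpoint_params("vpc-0123", "us-east-1", ["rtb-private"])
```

Once the endpoint exists, S3 traffic from the associated subnets takes the endpoint route instead of the NAT Gateway, so you can rerun the job and compare throughput.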

answered 10 months ago
