Getting rate limited when mass-downloading from S3 (AWS public datasets)


Hi all,

I have a large EMR cluster with the typical setup: a VPC, a private network, a public network, and an internet gateway. Each vCPU tries to download a WARC file from S3. All the instances are in the same VPC, and I am getting rate limited. I don't think this should happen: each instance should connect to the URLs independently, over its own network path. However, I don't know how to set up independent connections or IPs, rotating IPs, or something similar. This must be a common issue with a standard solution; otherwise, how do people work at scale with AWS Public Datasets, or even their own S3 buckets, without being rate limited?

Edit: with more than 160 vCPUs the rate limiting kicks in, and the cluster's vCPU usage drops from 95% to ~10%.

Thanks, David

Asked 2 years ago · 229 views
1 Answer

You haven't said how your VPC is connecting to S3. For example, if you're using a NAT Gateway then you may be hitting throughput limits there rather than at the instance level.

My suggestion would be to create an S3 gateway endpoint in your VPC and see if that improves the situation.
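As a rough illustration of that suggestion, here is a minimal sketch of creating an S3 gateway endpoint with boto3. The VPC ID, route table ID, and region in the usage note are placeholders, not values from the question, and the helper names are made up for this example. It assumes the EMR subnets use the listed route tables and that the bucket is in the same region as the VPC, since a gateway endpoint only serves S3 traffic within its own region.

```python
def s3_gateway_endpoint_params(vpc_id, region, route_table_ids):
    """Build keyword arguments for the EC2 create_vpc_endpoint call.

    All three arguments are supplied by the caller; nothing here is
    taken from the original question.
    """
    if not route_table_ids:
        # A gateway endpoint has no effect until it is attached to the
        # route tables used by the subnets that host the instances.
        raise ValueError("at least one route table ID is required")
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # Regional S3 service name, e.g. com.amazonaws.us-east-1.s3
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, region, route_table_ids):
    """Create the endpoint and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    params = s3_gateway_endpoint_params(vpc_id, region, route_table_ids)
    response = ec2.create_vpc_endpoint(**params)
    return response["VpcEndpoint"]["VpcEndpointId"]
```

For example, `create_s3_gateway_endpoint("vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"])` would add S3 routes to that (placeholder) route table, so S3 traffic from subnets using it goes through the endpoint instead of a NAT gateway.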

AWS Expert
Answered 2 years ago
