Getting rate limited when mass-downloading from S3 (AWS public datasets)


Hi all,

I have a large EMR cluster with the typical VPC, private network, public network, and internet gateway. On it, each vCPU tries to download a WARC file from S3. All the instances are in the same VPC, and I am getting rate limited. I don't think this should be happening: the instances should connect to the URLs independently, over different network paths. However, I do not know how to set up independent connections/IPs, rotating IPs, or something similar. This should be a common issue with a standard solution; otherwise, how do people work at scale with not only AWS Public Datasets but also their own S3 buckets without getting rate limited?

Edit: with more than 160 vCPUs the rate limiting kicks in and vCPU usage across the cluster drops from 95% to ~10%.

Thanks, David

Asked 2 years ago · Viewed 226 times
1 Answer

You haven't said how your VPC is connecting to S3. For example, if you're using a NAT Gateway then you may be hitting throughput limits there rather than at the instance level.
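
One way to check is to look at the route table attached to the subnets your EMR nodes run in. The sketch below is only an illustration: `classify_s3_path` is a hypothetical helper, and it assumes the route entries have the shape of the `Routes` list returned by the EC2 `describe_route_tables` call (keys such as `DestinationCidrBlock`, `NatGatewayId`, `GatewayId`, and `DestinationPrefixListId`). The sample data is made up.

```python
def classify_s3_path(routes):
    """Guess how traffic to S3 leaves a subnet, given its route entries.

    `routes` is assumed to look like the "Routes" list of one route table
    in an EC2 describe_route_tables response.
    """
    # A gateway endpoint appears as a prefix-list route targeting a vpce-* ID.
    # Note: a DynamoDB gateway endpoint looks the same, so confirm which
    # service the prefix list belongs to before drawing conclusions.
    for route in routes:
        target = route.get("GatewayId") or ""
        if route.get("DestinationPrefixListId") and target.startswith("vpce-"):
            return "gateway-endpoint"

    # Without an endpoint route, S3 traffic follows the default route.
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("NatGatewayId"):
            return "nat-gateway"
        if (route.get("GatewayId") or "").startswith("igw-"):
            return "internet-gateway"

    return "unknown"


# Made-up example: a private subnet whose default route goes through a NAT Gateway.
sample_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0"},
]
print(classify_s3_path(sample_routes))
```

If this reports `nat-gateway` for the subnets holding your core and task nodes, all of the S3 traffic from those nodes is going through that one NAT Gateway.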

My suggestion would be to create an S3 Gateway Endpoint in your VPC and see if that improves the situation.
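
As a rough sketch of what that could look like with boto3: the helper names, VPC ID, route table ID, and region below are all placeholders I made up, so substitute your own. The request uses the EC2 `create_vpc_endpoint` call with `VpcEndpointType="Gateway"`.

```python
def s3_gateway_endpoint_params(vpc_id, region, route_table_ids):
    """Build the arguments for EC2 create_vpc_endpoint (S3 gateway type)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # S3 service name for the region the VPC lives in.
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Route tables of the subnets that should reach S3 via the endpoint.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, region, route_table_ids):
    """Create the endpoint and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the parameter helper works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_vpc_endpoint(
        **s3_gateway_endpoint_params(vpc_id, region, route_table_ids)
    )
    return response["VpcEndpoint"]["VpcEndpointId"]


# Placeholder IDs for illustration only; this just prints the request.
params = s3_gateway_endpoint_params(
    "vpc-0123456789abcdef0", "us-east-1", ["rtb-0123456789abcdef0"]
)
print(params)
```

Keep in mind that a gateway endpoint only covers S3 traffic to buckets in the same region as the VPC, so it helps most when the cluster runs in the region that hosts the dataset.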

AWS Expert, answered 2 years ago
