Getting rate limited when downloading at scale from S3 (AWS public datasets)


Hi all,

I have a large EMR cluster with the typical VPC setup: private network, public network, and internet gateway. Each vCPU tries to download a WARC file from S3. All the instances are in the same VPC, but I am getting rate limited. I don't think this should happen: the instances should connect to the URLs independently, over different network paths. However, I don't know how to set up independent connections/IPs, rotating IPs, or something similar. This must be a common issue with a standard solution; otherwise, how do people work at scale with AWS Public Datasets, or even their own S3 buckets, without getting rate limited?

Edit: with more than 160 vCPUs, the rate limiting kicks in and cluster performance degrades from 95% vCPU usage to ~10%.

Thanks, David

Asked 2 years ago, 214 views
1 Answer

You haven't said how your VPC connects to S3. If you're using a NAT gateway, for example, you may be hitting throughput limits there rather than at the instance level.
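One way to check which path the S3 traffic takes is to look at the route table attached to the subnets the instances run in. The sketch below is only an illustration: `classify_s3_path` and `fetch_route_tables` are hypothetical helper names, and it assumes the route table is given in the dictionary shape returned by boto3's `describe_route_tables`. It also does not verify that a prefix-list route belongs to S3 unless you pass the S3 prefix list ID yourself.

```python
from typing import Any, Dict, List, Optional


def classify_s3_path(route_table: Dict[str, Any],
                     s3_prefix_list_id: Optional[str] = None) -> str:
    """Guess how S3 traffic leaves a subnet, given one route table.

    Returns 'gateway-endpoint', 'nat-gateway', 'internet-gateway' or 'unknown'.
    """
    routes: List[Dict[str, Any]] = route_table.get("Routes", [])

    # Gateway endpoint routes target a prefix list (pl-...) and point at a
    # vpce-... ID. They are more specific than 0.0.0.0/0, so they win.
    for route in routes:
        prefix_list = route.get("DestinationPrefixListId")
        target = str(route.get("GatewayId", ""))
        if prefix_list and target.startswith("vpce-"):
            if s3_prefix_list_id is None or prefix_list == s3_prefix_list_id:
                return "gateway-endpoint"

    # Otherwise S3 traffic follows the default route.
    for route in routes:
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            if route.get("NatGatewayId"):
                return "nat-gateway"
            if str(route.get("GatewayId", "")).startswith("igw-"):
                return "internet-gateway"
    return "unknown"


def fetch_route_tables(subnet_id: str) -> List[Dict[str, Any]]:
    """Fetch route tables explicitly associated with a subnet (needs boto3).

    Assumption: the subnet has an explicit association; subnets that rely on
    the VPC's main route table will return an empty list here.
    """
    import boto3  # imported lazily so classify_s3_path works without it

    ec2 = boto3.client("ec2")
    response = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    return response["RouteTables"]
```

If this reports `nat-gateway` for the subnets your core and task nodes run in, every download from those nodes is funnelled through the same NAT gateway, which would fit the symptoms described in the question.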

My suggestion would be to create an S3 gateway endpoint in your VPC and see if that improves the situation.
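As a rough sketch of what creating the endpoint could look like with boto3: the helper names `gateway_endpoint_params` and `create_s3_gateway_endpoint` are made up for illustration, and the VPC ID, Region, and route table IDs are placeholders you would replace with your own. The route tables passed in should be the ones used by the subnets where the EMR instances run.

```python
from typing import Any, Dict, Iterable


def gateway_endpoint_params(vpc_id: str, region: str,
                            route_table_ids: Iterable[str]) -> Dict[str, Any]:
    """Build the arguments for EC2 create_vpc_endpoint (S3 gateway endpoint)."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # S3 service name follows the pattern com.amazonaws.<region>.s3
        "ServiceName": f"com.amazonaws.{region}.s3",
        # A route to S3 is added to each of these route tables
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id: str, region: str,
                               route_table_ids: Iterable[str]) -> str:
    """Create the endpoint and return its ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the parameter builder has no dependency

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.create_vpc_endpoint(
        **gateway_endpoint_params(vpc_id, region, route_table_ids)
    )
    return response["VpcEndpoint"]["VpcEndpointId"]


if __name__ == "__main__":
    # Placeholder IDs, shown only to illustrate the call shape.
    print(gateway_endpoint_params("vpc-0123456789abcdef0", "us-east-1",
                                  ["rtb-0123456789abcdef0"]))
```

Keep in mind that, as far as I know, a gateway endpoint only carries traffic to buckets in the same Region as the VPC, so it helps only if the cluster runs in the Region where the dataset's bucket lives.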

AWS Expert
Answered 2 years ago
