Getting rate limited when mass-downloading from S3 (AWS Public Datasets)

Hi all,

I have a large EMR cluster with the typical VPC setup: a private network, a public network, and an internet gateway. Each vCPU tries to download a WARC file from S3. All the instances are in the same VPC, but I am getting rate limited. I don't think this should happen: the instances should connect to the URLs independently, over different network paths. However, I don't know how to set up independent connections/IPs, rotating IPs, or something similar. This should be a common issue with a standard solution; otherwise, how do people work at scale with not only AWS Public Datasets but also their own S3 buckets without getting limited?

Edit: with more than 160 vCPUs, the rate limiting kicks in and cluster performance degrades from 95% vCPU usage to ~10%.

Thanks, David

1 Answer

You haven't said how your VPC is connecting to S3. For example, if you're using a NAT Gateway then you may be hitting throughput limits there rather than at the instance level.
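One way to check which path your instances use is to look at the routes in the VPC's route tables. The following is only a rough sketch: `classify_s3_paths` and `fetch_route_tables` are hypothetical helper names, and the logic assumes the route shape returned by boto3's `describe_route_tables` (a default route carrying `NatGatewayId`, or a prefix-list route whose `GatewayId` starts with `vpce-`). It does not distinguish an S3 gateway endpoint from a DynamoDB one, and it ignores pagination.

```python
def classify_s3_paths(route_tables):
    """Summarise, per route table, how S3-bound traffic would likely leave the VPC.

    `route_tables` is the "RouteTables" list from EC2 describe_route_tables.
    """
    summary = {}
    for table in route_tables:
        routes = table.get("Routes", [])
        # A default route that targets a NAT gateway.
        uses_nat = any(
            r.get("DestinationCidrBlock") == "0.0.0.0/0" and r.get("NatGatewayId")
            for r in routes
        )
        # Gateway endpoints appear as prefix-list routes targeting a vpce-... ID.
        has_gateway_endpoint = any(
            r.get("DestinationPrefixListId")
            and str(r.get("GatewayId", "")).startswith("vpce-")
            for r in routes
        )
        if has_gateway_endpoint:
            path = "gateway-endpoint"
        elif uses_nat:
            path = "nat-gateway"
        else:
            path = "other"
        summary[table["RouteTableId"]] = path
    return summary


def fetch_route_tables(vpc_id, region):
    """Fetch the route tables of one VPC (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so classify_s3_paths works without it

    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return resp["RouteTables"]
```

If the route tables used by the worker subnets come back as `nat-gateway`, all S3 traffic from those subnets is funnelled through the NAT Gateway, which would fit the symptoms described.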

My suggestion would be to create an S3 Gateway Endpoint in your VPC and see if that improves the situation.
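As a minimal sketch of what that could look like with boto3: the helper names `gateway_endpoint_params` and `create_s3_gateway_endpoint` are made up for illustration, and the VPC ID, route table IDs, and Region are placeholders you would need to replace with your own. The call itself is EC2's `create_vpc_endpoint` with `VpcEndpointType="Gateway"`.

```python
def gateway_endpoint_params(vpc_id, route_table_ids, region):
    """Build the arguments for EC2 create_vpc_endpoint for an S3 gateway endpoint."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # S3 service name for the given Region, e.g. com.amazonaws.us-east-1.s3
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Route tables that should receive the S3 prefix-list route.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, route_table_ids, region):
    """Create the endpoint and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parameter builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.create_vpc_endpoint(
        **gateway_endpoint_params(vpc_id, route_table_ids, region)
    )
    return resp["VpcEndpoint"]["VpcEndpointId"]
```

For example, `create_s3_gateway_endpoint("vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"], "us-east-1")` would attach the endpoint to that one route table (both IDs are placeholders). Keep in mind that a gateway endpoint only carries traffic to S3 in the endpoint's own Region, so it helps only if the cluster runs in the same Region as the bucket you are reading from.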

AWS
EXPERT
answered 2 years ago
