Best way to run GPU containers


I am bringing an existing containerized solution into AWS and have a few questions. I uploaded the containers to ECR, created RDS and so on, created the cluster and task definitions, and started everything; all good. Now I need to figure out how to run a containerized custom NVIDIA Triton server in AWS. This needs 1-2 A100/80GB GPUs or similar.

  1. What is the best instance type for running one or two A100 GPUs? If the instance is a p4d.24xlarge, can I use only one GPU from it, or does billing run for all 8?
  2. I am planning to turn the GPU task on only when needed, for example 8 hours a day, and destroy it when it is not needed. I investigated and there are many ways to do this, but what is the recommended way?
  3. Should I use service discovery to get the GPU server's IP and port, given that it would run in task2 while the container that accesses it runs in task1? Is there a simpler method for the case where the GPU server is repeatedly started and destroyed/stopped? If the instance is started and destroyed, the system will give it new IP addresses. Is there a better way? The idea is to save $$$ on GPU cost.

Thanks in advance!

asked 5 months ago, 345 views
1 Answer

Hello,

You can run A100 GPUs on a p4d.24xlarge, which is charged at $32.7726/hr. This is the On-Demand price, and you are charged for the whole instance, i.e. all 8 GPUs, even if you use only one of them. You can find more pricing information here [1].
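To put the 8-hours-a-day idea into numbers, here is a rough back-of-the-envelope sketch. It assumes the On-Demand rate quoted above, a 30-day month, and no other charges (storage, data transfer, and so on); the function name `monthly_cost` is just for illustration, and actual prices vary by region and over time.

```python
# Rough On-Demand cost comparison for one p4d.24xlarge.
# Assumptions: the hourly rate quoted in this answer, a 30-day month,
# and compute charges only (no storage or data transfer).
HOURLY_RATE_USD = 32.7726


def monthly_cost(hours_per_day: float, days: int = 30,
                 rate: float = HOURLY_RATE_USD) -> float:
    """Return the approximate compute cost in USD, rounded to cents."""
    return round(hours_per_day * days * rate, 2)


always_on = monthly_cost(24)
eight_hours = monthly_cost(8)

print(f"24 h/day: ${always_on:,.2f} per month")
print(f" 8 h/day: ${eight_hours:,.2f} per month")
print(f"Savings:  ${always_on - eight_hours:,.2f} per month")
```

Running the instance for a third of the day costs roughly a third as much, which is why tearing it down outside working hours is worth the extra setup.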

As you mentioned, there are several ways to run a workload only for a specified time. Which one is best depends on your use case and requirements, so it is worth pinning those down first. Service discovery would be the straightforward method for locating the GPU server. Because the price above is On-Demand, you may also consider Spot Instances if they suit your use case, and try to get the most out of the resources in the time they are running.
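On the client side, service discovery usually comes down to resolving a DNS name each time you connect instead of storing an IP address. The following is only a minimal sketch using the Python standard library; the name `triton.gpu.local` is a hypothetical service discovery name, and port 8001 assumes Triton's default gRPC port, so substitute whatever your own setup registers.

```python
import socket


def resolve_endpoints(hostname: str, port: int) -> list:
    """Return sorted 'ip:port' strings for every IPv4 address the name resolves to.

    Raises socket.gaierror if the name does not resolve, for example
    while the GPU task is stopped and nothing is registered.
    """
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (ip, port) tuple.
    return sorted({f"{info[4][0]}:{info[4][1]}" for info in infos})


# Example usage (hypothetical name; resolve again before each new connection):
#   try:
#       endpoints = resolve_endpoints("triton.gpu.local", 8001)
#   except socket.gaierror:
#       endpoints = []  # GPU task is not running right now
```

Because the lookup happens at connect time, task1 picks up the new address whenever the GPU task is recreated, and an empty result is a simple signal that the GPU server is currently down.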

I have included some links below; you may find them helpful.

[1] https://aws.amazon.com/ec2/pricing/on-demand/
[2] https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html#gpu-considerations
[3] https://aws.amazon.com/blogs/compute/optimizing-gpu-utilization-for-ai-ml-workloads-on-amazon-ec2/
[4] https://aws.amazon.com/blogs/compute/10-things-you-can-do-today-to-reduce-aws-costs/
[5] https://aws.amazon.com/ec2/cost-and-capacity/

AWS
sanju_s
answered 5 months ago
