How do I make an ECS cluster spawn GPU instances with more root volume than default?


I need to deploy an ML app that needs GPU access for its response times to be acceptable (it uses some heavy networks that run too slowly on CPU). The app is containerized and uses an nvidia/cuda base image so that it can make use of its host machine's GPU. The image alone weighs ~10GB, and during startup it pulls several ML models and data, which take up roughly another 10GB of disk.

We were previously running this app on Elastic Beanstalk, but we realized it doesn't support GPU usage, even when specifying a Deep Learning AMI, so we migrated to ECS, which provides more configurability than the former. However, we soon ran into a new problem: selecting a g4dn instance type when creating a cluster, which defaults the AMI to an ECS GPU one, turns the Root EBS Volume Size field into a Data EBS Volume Size field.

This leaves the instance's 22GB root volume (the only one that comes formatted and mounted) too small for pulling our image and downloading the data it needs during startup. The other volume (of whatever size I specify during creation in the new Data EBS Volume Size field) is not mounted and therefore not accessible by the container. Additionally, the g4dn instances come with a 125GB SSD that is not mounted either. If either of these were usable, or if it were possible to enlarge the root volume (which it is when using the default non-GPU AMI), ECS would be the perfect solution for us at this time.

At the moment, we work around this issue by creating an empty cluster in ECS and then manually creating an Auto Scaling group and attaching it to the cluster, since with a launch configuration or launch template the root volume's size can be specified correctly, even when using the exact same ECS GPU AMI that ECS picks. However, this is a tiresome process, and it makes us lose valuable ECS functionality such as automatically spawning a new instance during a rolling update to maintain capacity.
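
For reference, the launch template part of this workaround looks roughly like the sketch below. It is a minimal illustration, not our exact setup: the function names, the default instance type, the `gp3` volume type, and the `/dev/xvda` root device name are assumptions (the root device name should be checked against the AMI actually in use), and the IAM instance profile that ECS container instances need is left out for brevity.

```python
import base64


def build_launch_template_data(cluster_name, ami_id, root_volume_gib,
                               instance_type="g4dn.xlarge",
                               root_device="/dev/xvda"):
    """Build a LaunchTemplateData dict with an enlarged root volume.

    root_device is an assumption; it must match the AMI's root device name,
    otherwise the mapping adds a second volume instead of resizing the root.
    """
    # User data that registers the instance with the (empty) ECS cluster on boot.
    user_data = (
        "#!/bin/bash\n"
        f"echo ECS_CLUSTER={cluster_name} >> /etc/ecs/ecs.config\n"
    )
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "BlockDeviceMappings": [
            {
                "DeviceName": root_device,
                "Ebs": {
                    "VolumeSize": root_volume_gib,  # size in GiB
                    "VolumeType": "gp3",            # assumed volume type
                    "DeleteOnTermination": True,
                },
            }
        ],
        # Launch templates expect base64-encoded user data.
        "UserData": base64.b64encode(user_data.encode()).decode(),
    }


def create_launch_template(name, data):
    """Create the launch template (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    ec2 = boto3.client("ec2")
    return ec2.create_launch_template(
        LaunchTemplateName=name,
        LaunchTemplateData=data,
    )
```

The resulting template is then referenced from the Auto Scaling group that we attach to the cluster by hand, which is the step we would like to avoid.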

Am I missing something here? Is this a bug that will be fixed at some point? If it's not, is there a simpler way to achieve what I need? Maybe by specifying a custom launch configuration for the ECS cluster, or by automatically mounting the SSD on instance launch?
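
To make the last idea concrete, this is roughly the kind of user data I have in mind for mounting the SSD at launch. It is only a sketch under assumptions I have not verified: the helper name is made up, `/dev/nvme1n1` is my guess at where the instance store appears on a g4dn instance, and `/mnt/instance-store` is an arbitrary mount point. The container would still need that path bind-mounted through the task definition.

```python
def build_mount_user_data(device="/dev/nvme1n1",
                          mount_point="/mnt/instance-store"):
    """Return a shell user-data script that formats and mounts one device.

    Both defaults are assumptions; the real device name should be confirmed
    on a running instance (for example with lsblk) before relying on this.
    """
    lines = [
        "#!/bin/bash",
        # Instance store contents are ephemeral, so formatting at boot is
        # acceptable here; note that this erases whatever is on the device.
        f"mkfs -t ext4 {device}",
        f"mkdir -p {mount_point}",
        f"mount {device} {mount_point}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_mount_user_data())
```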

Any help is more than appreciated. Thanks in advance!
