Which GPU instances are supported by the SageMaker algorithm forecasting-deepar?


I previously ran a hyperparameter tuning job for SageMaker DeepAR on the instance type ml.c5.18xlarge, but it was not fast enough to complete the tuning job within the max_run time allowed for my account. When I tried the GPU instance ml.g4dn.16xlarge instead, I got this error: "Instance type ml.g4dn.16xlarge is not supported by algorithm forecasting-deepar."

I cannot find any documentation that lists the instance types supported by DeepAR. Which GPU or CPU instances have more compute capacity than ml.c5.18xlarge and can be used for my tuning job?

If there are none, I would appreciate any recommendations on how to shorten the run time of the job. The tuning job must complete within the max run time of 432000 seconds (5 days). Thank you in advance!

Hi, thanks for pointing this out. Indeed, none of the g4dn instances are currently supported by the forecasting-deepar algorithm, and as you rightly point out, this is not documented. I will raise this with the service team so that it is included in the documentation.

In the meantime, you can try out the P3 instances instead - these are also powerful GPU instances and should help you speed up the training time.

answered a month ago
  • I appreciate the quick response @Heiko! I see that three P3 instance sizes are available for training: 2xlarge, 8xlarge and 16xlarge. It would be super helpful if you could confirm which of these are supported by DeepAR.

    Additionally, I was hoping you could help me understand how the parameter 'instance_count' in the SageMaker Estimator class affects training time. As I understand it, the value of this parameter is the number of EC2 instances of the specified instance type that are allocated. For example, with instance_count = 3 we would have 3 EC2 instances, each of type p3.2xlarge (for example), launched to parallelize training.

    If so, which would you say is better for improving training speed: a higher instance_count, or a single instance with more compute capacity? Thank you!
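
To make the two options in this comment concrete, here is a minimal sketch of how they might be expressed with the SageMaker Python SDK. The helper names (`deepar_estimator_kwargs`, `build_estimator`) are hypothetical and only for illustration, and the thread does not confirm which P3 sizes DeepAR accepts, so the instance types below are placeholders. Note that `instance_count` is passed as an integer.

```python
from typing import Any, Dict

MAX_RUN_SECONDS = 432000  # run-time limit quoted in the question (5 days)


def deepar_estimator_kwargs(instance_type: str, instance_count: int) -> Dict[str, Any]:
    """Collect the Estimator arguments discussed in this thread (hypothetical helper)."""
    if instance_count < 1:
        raise ValueError("instance_count must be a positive integer")
    return {
        "instance_type": instance_type,    # e.g. "ml.p3.2xlarge"; support is unconfirmed here
        "instance_count": instance_count,  # an int, not the string '3'
        "max_run": MAX_RUN_SECONDS,
    }


def build_estimator(role: str, region: str, **kwargs: Any):
    """Create the Estimator; needs the sagemaker package and AWS credentials."""
    # Imported lazily so the helper above can be run without the SDK installed.
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    image_uri = image_uris.retrieve("forecasting-deepar", region)
    return Estimator(image_uri=image_uri, role=role, **kwargs)


# Option A: several smaller instances (scale out).
scale_out = deepar_estimator_kwargs("ml.p3.2xlarge", 3)
# Option B: one larger instance (scale up).
scale_up = deepar_estimator_kwargs("ml.p3.8xlarge", 1)
```

Either dictionary could then be passed as `build_estimator(role, region, **scale_out)`, where `role` and `region` are your own IAM role ARN and AWS region. Which option trains faster is not settled by this sketch; timing both on a small data sample is one way to find out.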

