Optimal notebook instance type for DeepAR in AWS SageMaker


I am currently using an ml.c4.2xlarge instance type for a DeepAR use case to run an Automatic Model Tuning job. The data consists of 7157 time series, with 152 timesteps per series in the training set and 52 timesteps in the test set. I estimate the run time for the tuning job on this instance type at about 4-5 days. I'd like to know whether DeepAR is engineered to take advantage of GPU computing for training, and whether it would be advisable to use a 'p' or 'g' compute instance instead for faster results. I would also appreciate recommendations as to which Accelerated Computing instance would be optimal for this scenario.

Asked 2 years ago · 537 views
1 answer

Accepted Answer

Yes - as detailed on the algorithm details page, the SageMaker DeepAR algorithm implementation is able to train on GPU-accelerated instances to speed up more challenging jobs. There's also a handy reference table here listing all the SageMaker built-in algorithms and whether they're likely to benefit from GPU acceleration.

However, to be clear, it shouldn't be the notebook instance type that affects this. Typically when training models on SageMaker, the notebook provides your interactive compute environment, but you run training in separate training jobs - for example using the SageMaker Python SDK Estimator class, as shown in the sample notebooks for DeepAR electricity and synthetic. The instance type you select for training is independent of the instance type you use for your notebook - for example, in the electricity notebook it's set as follows:

estimator = sagemaker.estimator.Estimator(
    image_uri=image_name,
    sagemaker_session=sagemaker_session,
    role=role,
    instance_count=1,  # <-- Setting training instance count
    instance_type="ml.c4.2xlarge",  # <-- Setting training instance type
    base_job_name="deepar-electricity-demo",
    output_path=s3_output_path,
)

So normally I wouldn't expect you to need to change your notebook instance type to speed up training - just edit the configuration of your training job from within the notebook.
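
For instance, here's a minimal sketch (not taken from the sample notebook) of pointing the same training job at a single-GPU instance type - ml.g4dn.xlarge here is just one option, and image_name, role, sagemaker_session and s3_output_path are assumed to already be defined as in the electricity example:

import sagemaker

estimator = sagemaker.estimator.Estimator(
    image_uri=image_name,
    sagemaker_session=sagemaker_session,
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",  # <-- GPU instance for the training job; the notebook instance is unchanged
    base_job_name="deepar-electricity-demo",
    output_path=s3_output_path,
)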

Suggesting a particular type is tricky, because DeepAR hyperparameters like context_length, embedding_dimension, and mini_batch_size affect how much GPU capacity is needed for a particular run. Since you're coming from a CPU-only baseline, I'd suggest starting small by trying out single-GPU ml.g4dn.xlarge, ml.g5.xlarge, or ml.p3.2xlarge instances, perhaps beginning with the lowest cost per hour. You can keep an eye on your jobs' GPUUtilization and GPUMemoryUtilization metrics to check whether utilization is low on instances like p3 with "bigger" GPUs. Increasing mini_batch_size should help fill extra capacity and complete your job faster, but it will probably affect model convergence - so you may need to tune other parameters like learning_rate to compensate. Considering all of this, you may find trade-offs between speed and total cost, or speed and accuracy, for good hyperparameter combinations on your dataset. Of course, you could also scale up to multi-GPU instance types if you'd like to accelerate further.
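
To make that concrete, here's a rough sketch of setting these hyperparameters on the estimator above and launching the job - every value is an illustrative placeholder (including the weekly time_freq and the S3 channel paths), not a recommendation for your dataset:

# Illustrative values only - not tuned recommendations.
estimator.set_hyperparameters(
    time_freq="W",            # assumed weekly data; use your series' actual frequency
    context_length=52,        # how much history the model sees per training sample
    prediction_length=52,     # e.g. matching the 52-timestep test windows
    epochs=100,
    mini_batch_size=512,      # larger batches help fill a GPU, but can affect convergence
    learning_rate=1e-3,       # may need re-tuning if mini_batch_size changes
)

# DeepAR expects "train" and (optionally) "test" channels of JSON Lines data in S3;
# the paths below are placeholders.
estimator.fit({"train": s3_train_path, "test": s3_test_path})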

If I understood correctly, you're also using SageMaker Automatic Model Tuning to search these parameters - something like this XGBoost notebook with the HyperparameterTuner class?

In that case, I would also mention:

  • Increasing the max_parallel_jobs parameter may reduce the overall run time (by running more of the individual training jobs in parallel), with a trade-off in how much information from completed jobs is available when each new training job in the budget is kicked off (a rough setup is sketched after this list).
  • If you're planning to run this training regularly on a dataset which evolves over time, you probably don't need to run HPO every time: you will likely see good results reusing your previously-optimized hyperparameters, unless something materially changes in the nature of the data and its patterns.
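
For reference, a rough sketch of what that tuner setup might look like over the DeepAR estimator above - the parameter ranges, the job budget, and the test:RMSE objective are illustrative only:

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="test:RMSE",  # DeepAR also emits test:mean_wQuantileLoss
    objective_type="Minimize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-2),
        "mini_batch_size": IntegerParameter(128, 1024),
        "context_length": IntegerParameter(26, 104),
    },
    max_jobs=20,           # total training-job budget for the search
    max_parallel_jobs=4,   # more parallelism = shorter wall-clock time, but each new
                           # job launches with less information from completed ones
)

tuner.fit({"train": s3_train_path, "test": s3_test_path})
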
Answered 2 years ago by Alex_T (AWS Expert)
