Using data parallelization with SageMaker JumpStart


I am trying to train the Faster-RCNN model available on SageMaker JumpStart, and I wonder whether the data parallelization feature can make the job finish faster, since the training dataset is large. I set the environment variable "LAUNCH_SM_DDP_ENV_NAME" to True inside the estimator.JumpStartEstimator class and increased the number of instances to 10 (as an example). The result is that it simply launches 10 training jobs running in parallel, but training does not finish any faster (in fact, it finishes in the same time as with 1 instance). Any hint is appreciated!

alex
asked 5 months ago, 2081 views
1 Answer

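I am not sure exactly which model you are using, but I suggest looking at the training script that JumpStart runs for it and checking whether it contains any DDP (distributed data parallel) implementation. If the script does not shard the data and synchronize gradients across workers, adding instances will not shorten training.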

AWS
Marc
answered 5 months ago
  • As per the documentation, the only fine-tunable PyTorch object detection model on SageMaker JumpStart is "pytorch-od1-fasterrcnn-resnet50-fpn". I checked its training script, and it does not seem to implement DDP, so I assume this model cannot benefit from the DDP strategy on JumpStart. I will probably have to implement it myself by updating transfer_learning.py (the Docker image entry point).
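
For anyone hitting the same behaviour, here is a rough sketch of why 10 instances take as long as 1 when the script has no DDP support. It is plain Python, not the JumpStart training script: `shard_indices` and `steps_per_epoch` are hypothetical helpers that only mimic the striding idea behind PyTorch's `DistributedSampler`. Without sharding, every instance iterates over the full dataset, so the work per instance does not shrink.

```python
import math


def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Sample indices one worker would process under data parallelism.

    Hypothetical helper: strides through the dataset so that each rank
    sees a distinct slice, wrapping around to pad uneven splits.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size  # padded length, divisible by world_size
    return [i % num_samples for i in range(rank, total, world_size)]


def steps_per_epoch(num_samples: int, world_size: int,
                    batch_size: int, sharded: bool) -> int:
    """Batches each worker runs per epoch, with or without sharding."""
    per_worker = math.ceil(num_samples / world_size) if sharded else num_samples
    return math.ceil(per_worker / batch_size)


if __name__ == "__main__":
    # Assumed numbers for illustration only: 10,000 images, batch size 8.
    print(steps_per_epoch(10_000, 10, 8, sharded=False))  # every instance does it all
    print(steps_per_epoch(10_000, 10, 8, sharded=True))   # each instance does 1/10
    print(shard_indices(10, 4, rank=2))
```

In a real PyTorch script the equivalent pieces would be initializing a process group, passing a `DistributedSampler` to the `DataLoader`, and wrapping the model in `DistributedDataParallel` so gradients are averaged across workers. Whether the JumpStart entry point can be patched that way is something I have not verified.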
