Using data parallelization with SageMaker JumpStart

I am trying to train the Faster-RCNN model available on SageMaker JumpStart and wonder whether I can use the data parallelization feature to finish the job faster, since the training dataset is large. I set the environment variable "LAUNCH_SM_DDP_ENV_NAME" to True in the estimator.JumpStartEstimator class and increased the number of instances to 10 (as an example). The result is that it launches 10 training jobs running in parallel, but training does not finish any faster (in fact, it takes the same time as with 1 instance). Any hint is appreciated!

alex
asked 5 months ago, viewed 2081 times
1 Answer
While I am not sure which exact model you are using, I suggest looking at the training script that JumpStart uses to see whether it contains any implementation of DDP (distributed data parallel).
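One rough way to do that check is to search the script's source for names that usually accompany DDP in PyTorch code. The sketch below is only a heuristic: `find_ddp_markers` and the marker list are hypothetical helpers written for this answer, not part of SageMaker or JumpStart, and a match (or no match) still needs to be confirmed by reading the script.

```python
from typing import List

# Names that commonly show up when a PyTorch script supports data parallelism.
# This list is an assumption for illustration and is not exhaustive.
DDP_MARKERS = (
    "DistributedDataParallel",  # torch.nn.parallel wrapper
    "init_process_group",       # torch.distributed setup call
    "DistributedSampler",       # per-worker data sharding
    "smdistributed",            # SageMaker distributed library import
)


def find_ddp_markers(source: str) -> List[str]:
    """Return the markers found in the script source, in DDP_MARKERS order."""
    return [marker for marker in DDP_MARKERS if marker in source]
```

You would call it on the text of the downloaded entry-point script, for example `find_ddp_markers(open(path_to_script).read())`, where `path_to_script` is wherever you saved the file. An empty list suggests the script trains on a single process only.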

AWS
Marc
answered 5 months ago
  • According to the documentation, the only fine-tunable PyTorch object detection model on SageMaker JumpStart is "pytorch-od1-fasterrcnn-resnet50-fpn". I checked its training script, and it does not seem to implement DDP, so I assume this model cannot benefit from the DDP strategy on JumpStart as it stands. I will probably have to implement it myself by updating transfer_learning.py (the Docker image entry point).
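For anyone landing here, this is a minimal sketch of the piece the script is missing, under the assumption that the lack of speedup comes from every instance iterating over the full dataset. In a data-parallel run each worker handles only its own slice of the samples. `shard_indices` is a hypothetical helper that shows the idea in plain Python; in a real PyTorch script this role is normally played by `torch.utils.data.distributed.DistributedSampler`, with the model wrapped in `torch.nn.parallel.DistributedDataParallel` so that gradients are averaged across workers.

```python
from typing import List


def shard_indices(num_samples: int, rank: int, world_size: int) -> List[int]:
    """Sample indices handled by worker `rank` out of `world_size` workers.

    Uses a strided split: worker 0 takes 0, world_size, 2*world_size, ...
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(range(rank, num_samples, world_size))


# Without sharding, each of 10 workers would loop over all 1000 samples.
# With sharding, each worker loops over 100 of them per epoch.
per_worker = [len(shard_indices(1000, rank, 10)) for rank in range(10)]
```

The shards are disjoint and together cover the whole dataset, which is why the per-epoch work per instance drops as instances are added. Sharding alone is not enough, though: without the gradient synchronization that DDP provides, each worker would train a separate model on its slice.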
