Using data parallelization with SageMaker JumpStart

I am trying to train the Faster R-CNN model available on SageMaker JumpStart and wonder whether I can use the data parallelization feature to finish the job faster, since the training data is large. I set the environment variable "LAUNCH_SM_DDP_ENV_NAME" to True inside the estimator.JumpStartEstimator class and increased the number of instances to 10 (as an example). What happens is that it simply launches 10 training jobs running in parallel, but training does not finish any faster (in fact, it finishes in the same time as with 1 instance). Any hint is appreciated!

alex
asked 4 months ago, 2065 views
1 Answer
I am not sure exactly which model you are using, but I suggest looking at the training script that JumpStart uses to see whether it implements DDP (distributed data parallel) at all.

AWS
Marc
answered 4 months ago
  • As per the documentation, the only fine-tunable PyTorch object detection model on SageMaker JumpStart is "pytorch-od1-fasterrcnn-resnet50-fpn". I checked its training script, and it does not seem to have DDP implemented. So I assume one cannot benefit from the DDP strategy with this model on JumpStart, and that I will have to implement it myself by updating transfer_learning.py (the Docker image entry point).
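
To illustrate why adding instances alone does not help when the training script has no DDP support, here is a minimal, standard-library-only sketch. It is not code from the JumpStart training script; the function names `shard_indices` and `samples_per_worker` are hypothetical. It assumes a simplified strided split of the dataset across workers (similar in spirit to what a distributed sampler does, but without padding or shuffling). Without such sharding, every instance iterates over the full dataset, so the wall-clock time stays about the same.

```python
from typing import List

# Illustrative only: a simplified strided partition of dataset indices.
# This is an assumption for explanation, not the JumpStart implementation.


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices that worker `rank` would process."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Each worker takes every world_size-th sample, starting at its own rank.
    return list(range(rank, num_samples, world_size))


def samples_per_worker(num_samples: int, world_size: int, sharded: bool) -> List[int]:
    """Number of samples each worker processes per epoch."""
    if sharded:
        return [len(shard_indices(num_samples, world_size, r))
                for r in range(world_size)]
    # Without sharding, every worker iterates over the full dataset.
    return [num_samples] * world_size


if __name__ == "__main__":
    # 10 instances, no data-parallel logic in the script: full work on each.
    print(samples_per_worker(1000, 10, sharded=False))
    # 10 instances with the dataset sharded across workers.
    print(samples_per_worker(1000, 10, sharded=True))
```

In a real DDP setup the workers would also need to synchronize gradients after each step; this sketch only covers the data-splitting part.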
