Using data parallelism with SageMaker JumpStart


I am trying to train the Faster R-CNN model available on SageMaker JumpStart. Because the training dataset is large, I would like to know whether the data parallelism feature can be used to finish the job faster. I set the environment variable "LAUNCH_SM_DDP_ENV_NAME" to True in the estimator.JumpStartEstimator class and increased the number of instances to 10 (as an example). The result is that 10 training jobs are launched in parallel, but training does not finish any faster (in fact, it takes the same time as with 1 instance). Any hint is appreciated!

1 Answer

I am not sure which exact model you are using, but I suggest looking at the training script that JumpStart uses to check whether it implements DDP (distributed data parallel).
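One rough way to do that check is to search the script's source for identifiers that usually appear when DDP is wired up. The sketch below is only a heuristic and makes assumptions: the marker list is my own choice of common PyTorch and SageMaker names, `find_ddp_markers` is a hypothetical helper, and a match inside a comment would also be reported.

```python
from typing import Dict, List

# Identifiers that commonly appear in scripts using distributed data parallel.
# This list is an assumption for illustration, not an exhaustive rule.
DDP_MARKERS = (
    "DistributedDataParallel",
    "init_process_group",
    "DistributedSampler",
    "smdistributed",
)


def find_ddp_markers(source: str) -> Dict[str, List[int]]:
    """Map each marker found in `source` to the 1-based line numbers where it occurs."""
    hits: Dict[str, List[int]] = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        for marker in DDP_MARKERS:
            if marker in line:
                hits.setdefault(marker, []).append(lineno)
    return hits


if __name__ == "__main__":
    # Inline sample standing in for the real training script's contents.
    sample = (
        "import torch\n"
        "model = torch.nn.parallel.DistributedDataParallel(model)\n"
    )
    print(find_ddp_markers(sample))
```

To use it on the real script, read the file contents into a string and pass them in. An empty result suggests (but does not prove) that the script trains on a single process per job.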

AWS
Marc
answered 5 months ago
  • According to the documentation, the only fine-tunable PyTorch object detection model on SageMaker JumpStart is "pytorch-od1-fasterrcnn-resnet50-fpn". I checked its training script, and it does not seem to implement DDP. So I assume this model cannot benefit from the DDP strategy on JumpStart as is, and that I will have to implement it myself by updating transfer_learning.py (the Docker image entry point).
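To illustrate why adding instances alone would not shorten training, here is a minimal pure-Python sketch of the per-worker data sharding that a data-parallel setup relies on. It is similar in spirit to what PyTorch's DistributedSampler does without shuffling, but `shard_indices` is a hypothetical helper written for this example, not code from the JumpStart script.

```python
import math
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices that worker `rank` would process in one epoch."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    per_worker = math.ceil(num_samples / world_size)
    total = per_worker * world_size
    # Wrap around so every worker receives the same number of samples.
    padded = [i % num_samples for i in range(total)]
    # Round-robin assignment: worker `rank` takes every world_size-th index.
    return padded[rank:total:world_size]


if __name__ == "__main__":
    # Without sharding, each of 10 instances iterates over all 1000 samples.
    print(len(shard_indices(1000, world_size=1, rank=0)))
    # With sharding, each of 10 workers handles only its own slice.
    print(len(shard_indices(1000, world_size=10, rank=3)))
```

If the training script never splits the data this way (and never synchronizes gradients across workers), each instance repeats the full epoch on its own, which would match the unchanged runtime reported in the question.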
