Why is my SageMaker training job slower than a notebook on studiolab.sagemaker.aws?

I trained a TensorFlow neural network on Studio Lab and got:

Epoch 145/4000
1941/1941 - 10s - ... - 10s/epoch - 5ms/step

Then I tried running the same training as a SageMaker training job in script mode on ml.c5.xlarge:

import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='untitled.py',
                       source_dir='./training/',
                       instance_type='ml.c5.xlarge',
                       instance_count=1,
                       output_path="s3://sagemaker-[skip]",
                       role=sagemaker.get_execution_role(),
                       framework_version='2.8.0',
                       py_version='py39',
                       hyperparameters={...},
                       metric_definitions=[...],
                       script_mode=True)

and it got:

Epoch 19/4000
1941/1941 - 49s - ... - 49s/epoch - 25ms/step

Why is it 5 times slower than the Studio Lab notebook? Is it because of the instance type?
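For reference, the slowdown can be read straight off the two progress lines quoted above. This is only a small sketch: `parse_keras_progress` is a hypothetical helper name, and it assumes the Keras `verbose=2` line format shown in the logs.

```python
import re


def parse_keras_progress(line):
    """Return (seconds per epoch, milliseconds per step) from a Keras progress line."""
    epoch = re.search(r"(\d+(?:\.\d+)?)s/epoch", line)
    step = re.search(r"(\d+(?:\.\d+)?)ms/step", line)
    if epoch is None or step is None:
        raise ValueError(f"unrecognized progress line: {line!r}")
    return float(epoch.group(1)), float(step.group(1))


# The two lines quoted in the question, copied as-is.
studiolab_line = "1941/1941 - 10s - ... - 10s/epoch - 5ms/step"
training_job_line = "1941/1941 - 49s - ... - 49s/epoch - 25ms/step"

lab_epoch, lab_step = parse_keras_progress(studiolab_line)
job_epoch, job_step = parse_keras_progress(training_job_line)

epoch_ratio = job_epoch / lab_epoch  # 49 / 10
step_ratio = job_step / lab_step     # 25 / 5

print(f"epoch slowdown: {epoch_ratio:.1f}x, step slowdown: {step_ratio:.1f}x")
```

Both runs report 1941 steps per epoch, so the per-step time is a like-for-like comparison, and the gap is in how long each step takes rather than in the number of steps.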

2 Answers

Which instance type are you using when you train locally in your notebook? Among the factors that influence training performance, the hardware spec of the training node is critical. You might be bottlenecked on CPU, storage, or memory. See here for more details.
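One way to compare the two environments is to print the same hardware summary in both the notebook and the training script. The snippet below is a minimal sketch, not an official diagnostic: `describe_environment` is a hypothetical helper, and the TensorFlow part only runs if TensorFlow is importable. Note that ml.c5.xlarge is a CPU-only instance, so if the notebook reports a GPU, that alone could plausibly account for the difference.

```python
import os
import platform
import shutil


def describe_environment(path="."):
    """Collect basic hardware facts that often explain training speed differences."""
    usage = shutil.disk_usage(path)
    info = {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "logical_cpus": os.cpu_count(),
        "disk_free_gb": round(usage.free / 1e9, 1),
    }
    try:
        # Optional: report the TensorFlow version and any GPUs it can see.
        import tensorflow as tf

        info["tensorflow"] = tf.__version__
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    except ImportError:
        info["tensorflow"] = None
        info["gpus"] = []
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Running this once in the notebook and once at the top of the entry-point script (its output ends up in the training job logs) gives a side-by-side view of CPU count, GPU visibility, and framework version.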

AWS
EXPERT
answered 2 years ago

I see the same issue when training with the SageMaker SDK (TensorFlow estimator) versus training in a Jupyter notebook on SageMaker. The SageMaker SDK (TensorFlow estimator) is much slower (about 3x) with exactly the same compute power, model, and data.

answered a year ago
