Why is my SageMaker training job slower than a notebook on studiolab.sagemaker.aws?

I trained a TensorFlow neural network in a Studio Lab notebook and got:

Epoch 145/4000
1941/1941 - 10s - ... - 10s/epoch - 5ms/step

Then I created a training job in script mode on ml.c5.xlarge:

import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='untitled.py',
                       source_dir='./training/',
                       instance_type='ml.c5.xlarge',
                       instance_count=1,
                       output_path="s3://sagemaker-[skip]",
                       role=sagemaker.get_execution_role(),
                       framework_version='2.8.0',
                       py_version='py39',
                       hyperparameters={...},
                       metric_definitions=[...],
                       script_mode=True)

and it got:

Epoch 19/4000
1941/1941 - 49s - ... - 49s/epoch - 25ms/step

Why is it 5 times slower than the Studio Lab notebook? Is it because of the instance type?
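For reference, the "5 times" figure can be checked directly from the two log lines quoted above. This is only arithmetic on the reported numbers (10 s and 49 s per epoch, 1941 steps per epoch); the variable names are my own.

```python
# Figures copied from the two log lines quoted above.
studiolab_epoch_s = 10      # Studio Lab notebook: 10s/epoch
training_job_epoch_s = 49   # ml.c5.xlarge training job: 49s/epoch
steps_per_epoch = 1941      # same in both runs

# Ratio of epoch times, and the per-step time implied by each run.
slowdown = training_job_epoch_s / studiolab_epoch_s
studiolab_step_ms = 1000 * studiolab_epoch_s / steps_per_epoch
training_job_step_ms = 1000 * training_job_epoch_s / steps_per_epoch

print(f"slowdown: {slowdown:.1f}x")
print(f"per step: {studiolab_step_ms:.1f} ms vs {training_job_step_ms:.1f} ms")
```

The per-step values round to the 5ms/step and 25ms/step that Keras printed, so the slowdown is spread evenly across steps rather than coming from a one-off startup cost.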

2 Answers

May I know which instance type (runtime) your notebook uses when you train there? Among the factors that influence training performance, the hardware spec of the training node is critical. You might be bottlenecked on CPU, storage, or memory. See here for more details
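One way to answer the hardware question is to print the same basic facts in both places: once in a notebook cell and once near the top of the training script. The following is a minimal sketch, not an official diagnostic. `describe_environment` is a hypothetical helper name, and the GPU check assumes TensorFlow is importable (it is skipped otherwise). Note that ml.c5.xlarge is a CPU-only instance with 4 vCPUs, so if the notebook runtime has a GPU attached, that alone could plausibly explain a gap of this size.

```python
import os
import platform


def describe_environment():
    """Collect basic hardware facts for a like-for-like comparison."""
    info = {
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
        "gpus": None,  # stays None if TensorFlow cannot be imported
    }
    try:
        import tensorflow as tf
        # Lists the GPUs TensorFlow can see; an empty list means CPU-only.
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    except ImportError:
        pass
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the two outputs differ in `cpu_count` or `gpus`, the runs are not on comparable hardware and the per-step times should not be expected to match.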

AWS
EXPERT
answered 2 years ago

I get the same issue when training through the SageMaker SDK (TensorFlow estimator) versus training directly in a SageMaker Jupyter notebook. The SageMaker SDK (TensorFlow estimator) run is much slower (3x slower) with exactly the same compute power, model, and data.
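When the hardware really is identical, it may help to check whether the extra time is spent reading data rather than computing. Below is a rough sketch that times iteration over the batches without running any training step. `seconds_per_batch` is a hypothetical helper, and `batches` is assumed to be any iterable of batches, such as a `tf.data.Dataset` or a Python generator.

```python
import time


def seconds_per_batch(batches, max_batches=200):
    """Iterate over `batches` without training; return mean seconds per batch."""
    start = time.perf_counter()
    count = 0
    for _ in batches:
        count += 1
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    # NaN signals that the iterable produced no batches at all.
    return elapsed / count if count else float("nan")


# Stand-in iterable for illustration; replace with the real dataset.
print(seconds_per_batch(range(1000)))
```

Running this in both environments on the real dataset gives a lower bound on step time that comes from the input pipeline alone. If that number is already much larger in the training job, the bottleneck is likely data access rather than the model.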

answered a year ago
