Why my sagemaker training job slower than notebook from studiolab.sagemaker.aws?

0

I run neural network tensorflow train on studiolab. and I got:

Epoch 145/4000
1941/1941 - 10s - ... - 10s/epoch - 5ms/step

then I try to make a train job with script_mode with ml.c5.xlarge

estimator = TensorFlow(entry_point='untitled.py',
                       source_dir='./training/',
                       instance_type='ml.c5.xlarge',
                       instance_count=1,
                       output_path="s3://sagemaker-[skip]",
                       role=sagemaker.get_execution_role(),
                       framework_version='2.8.0',
                       py_version='py39',
                       hyperparameters={...},
                       metric_definitions=[...],
                       script_mode=True)

and its got:

Epoch 19/4000
1941/1941 - 49s - ... - 49s/epoch - 25ms/step

Why is it 5 times slower than studiolab notebook? Is it because instance type?

2 Antworten
0

May I know which instance type you are using for training locally on your notebook instance. Including factors that influence training performance, hardware spec of the training node is very critical. You might be either getting bottlenecked on CPU, Storage or Memory. See here for more details

profile pictureAWS
EXPERTE
beantwortet vor 2 Jahren
0

I get the same issue when using the Sagemaker SDK (Tensorflow estimator) vs training using the Sagemaker with jupyter notebook. Sagemaker SDK (Tensorflow estimator) is much slower (3X slower) with exactly the same: compute power, model and data.

beantwortet vor einem Jahr

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen