Why is my SageMaker training job slower than a notebook on studiolab.sagemaker.aws?


I run a TensorFlow neural network training on StudioLab and I get:

Epoch 145/4000
1941/1941 - 10s - ... - 10s/epoch - 5ms/step

Then I try to create a training job with script_mode on an ml.c5.xlarge instance:

import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='untitled.py',
                       source_dir='./training/',
                       instance_type='ml.c5.xlarge',
                       instance_count=1,
                       output_path="s3://sagemaker-[skip]",
                       role=sagemaker.get_execution_role(),
                       framework_version='2.8.0',
                       py_version='py39',
                       hyperparameters={...},
                       metric_definitions=[...],
                       script_mode=True)
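
I launch the job with a call roughly like this (the channel name and S3 prefix are placeholders for my actual training data):

# launch the training job; 'training' channel and S3 prefix are placeholders
estimator.fit({'training': 's3://sagemaker-[skip]/train'})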

and I get:

Epoch 19/4000
1941/1941 - 49s - ... - 49s/epoch - 25ms/step

Why is it 5 times slower than the StudioLab notebook? Is it because of the instance type?

2 Answers

May I know which instance type you are using for training locally on your notebook instance? Among the factors that influence training performance, the hardware spec of the training node is very critical. You might be getting bottlenecked on CPU, storage, or memory. See here for more details.
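
One quick way to compare the two environments is to log, from inside the training script, the hardware the process actually sees. A minimal sketch (what exactly you print is up to you; a value of 0 for the thread settings just means TensorFlow picked a default):

import multiprocessing
import tensorflow as tf

# Print what the container/notebook actually exposes, so the StudioLab run
# and the SageMaker training job can be compared on the same basis.
print("Visible CPU cores:", multiprocessing.cpu_count())
print("TensorFlow version:", tf.__version__)
print("Physical devices:", tf.config.list_physical_devices())
print("Intra-op threads:", tf.config.threading.get_intra_op_parallelism_threads())
print("Inter-op threads:", tf.config.threading.get_inter_op_parallelism_threads())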

AWS
EXPERT
answered 2 years ago

I get the same issue when using the SageMaker SDK (TensorFlow estimator) vs. training with SageMaker in a Jupyter notebook. The SageMaker SDK (TensorFlow estimator) is much slower (3x) with exactly the same compute power, model, and data.

answered 1 year ago
