Why is my SageMaker training job slower than a notebook on studiolab.sagemaker.aws?


I run a TensorFlow neural network training on StudioLab and I get:

Epoch 145/4000
1941/1941 - 10s - ... - 10s/epoch - 5ms/step

Then I try to create a training job with script_mode on an ml.c5.xlarge instance:

import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='untitled.py',
                       source_dir='./training/',
                       instance_type='ml.c5.xlarge',
                       instance_count=1,
                       output_path="s3://sagemaker-[skip]",
                       role=sagemaker.get_execution_role(),
                       framework_version='2.8.0',
                       py_version='py39',
                       hyperparameters={...},
                       metric_definitions=[...],
                       script_mode=True)
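
I launch the job with a call roughly like this (the channel name and S3 prefix are placeholders for my actual training data):

# launch the training job; 'training' channel and S3 prefix are placeholders
estimator.fit({'training': 's3://sagemaker-[skip]/train'})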

and I get:

Epoch 19/4000
1941/1941 - 49s - ... - 49s/epoch - 25ms/step

Why is it 5 times slower than the StudioLab notebook? Is it because of the instance type?

2 Answers

May I know which instance type you are using for training locally on your notebook instance? Among the factors that influence training performance, the hardware spec of the training node is very critical. You might be getting bottlenecked on CPU, storage, or memory. See here for more details.
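
One quick way to compare the two environments is to log, from inside the training script, the hardware the process actually sees. A minimal sketch (what exactly you print is up to you; a value of 0 for the thread settings just means TensorFlow picked a default):

import multiprocessing
import tensorflow as tf

# Print what the container/notebook actually exposes, so the StudioLab run
# and the SageMaker training job can be compared on the same basis.
print("Visible CPU cores:", multiprocessing.cpu_count())
print("TensorFlow version:", tf.__version__)
print("Physical devices:", tf.config.list_physical_devices())
print("Intra-op threads:", tf.config.threading.get_intra_op_parallelism_threads())
print("Inter-op threads:", tf.config.threading.get_inter_op_parallelism_threads())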

AWS
EXPERT
answered 2 years ago

I get the same issue when using the SageMaker SDK (TensorFlow estimator) vs. training with SageMaker in a Jupyter notebook. The SageMaker SDK (TensorFlow estimator) is much slower (3x) with exactly the same compute power, model, and data.

answered 1 year ago
