HPO training using SageMaker SDK vs SageMaker Jupyter notebook



When I run HPO training through the SageMaker SDK, it is much slower than running the same training in a SageMaker Jupyter notebook. Both variants have the same:

  1. Hyperparameters (the same Model)
  2. Data - Train / Test (exactly same data)
  3. Resources - 'ml.m5.24xlarge' machine

The SDK is slower at both training and inference (about 3x slower).

Could it be due to different TensorFlow versions? Is this a common issue? What should I check?

Thanks, Or

2 Answers

It's not quite clear exactly what comparison you're running here, but I can probably suggest some tips...

The basics - Infrastructure start-up

Assuming you're talking about end-to-end run times for workflows with your own algorithm/training script, the main difference between running a SageMaker job "with the Python SDK" versus in the notebook is that a SageMaker training (Estimator), processing (Processor), or inference (Transformer) job will spin up infrastructure on demand to run your job.

This means if I take exactly the same code and run it either 1) in my (already-running) notebook compute or 2) through a SM training/etc job on the same instance type - then (2) will probably run at similar speed but need an extra couple of minutes or so to spin up the instance and initialize the container image.
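To see how much of a job's wall-clock time is start-up rather than your script, one option is to break down the phases that SageMaker reports for the job. Below is a minimal sketch, assuming you have fetched the job description with boto3's `describe_training_job` and that it contains the `SecondaryStatusTransitions` list; the helper names `phase_durations` and `overhead_seconds` are made up for illustration.

```python
def phase_durations(description):
    """Return seconds spent in each completed secondary status of a training job.

    `description` is assumed to be the dict returned by
    boto3.client("sagemaker").describe_training_job(TrainingJobName=...).
    """
    durations = {}
    for transition in description.get("SecondaryStatusTransitions", []):
        end = transition.get("EndTime")
        if end is None:
            # This phase has not finished yet, so it has no duration to report.
            continue
        seconds = (end - transition["StartTime"]).total_seconds()
        status = transition["Status"]
        # A status can in principle appear more than once, so accumulate.
        durations[status] = durations.get(status, 0.0) + seconds
    return durations


def overhead_seconds(durations, compute_phase="Training"):
    """Sum the time spent in every phase except the one running the script."""
    return sum(sec for phase, sec in durations.items() if phase != compute_phase)


# Hypothetical usage (needs boto3 and AWS credentials; the job name is a placeholder):
#   import boto3
#   desc = boto3.client("sagemaker").describe_training_job(TrainingJobName="my-job")
#   print(phase_durations(desc), overhead_seconds(phase_durations(desc)))
```

If the "Training" phase alone is already about 3x the notebook run time, then start-up overhead is probably not the explanation and the other ideas in this answer are more relevant.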

The valuable trade-off here is that with an ml.m5.24xlarge notebook instance I would be paying for the compute for all the time it's running (perhaps often at low utilization), whereas in a typical SageMaker workflow I might use just an ml.t3.medium for the notebook itself and only pay for the seconds of ml.m5.24xlarge that my training job(s) take to run: as soon as the training/processing/inference job completes, the instance is shut down automatically. Furthermore, while a notebook can typically only scale up to bigger instances, SageMaker jobs can easily be scaled out to multiple instances so long as your algorithm/script supports it.

If you'd like to orchestrate multiple steps together (e.g. pre-processing, training, tuning, inference), you also don't necessarily need to manage this from a notebook: You could set up a DAG with SageMaker Pipelines and just trigger that. If you're running lots of sequential jobs and would like to accelerate infrastructure start-up, check out SageMaker Warm Pools.
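As a rough illustration of the warm pool option, the sketch below builds keyword arguments for the SageMaker Python SDK's `TensorFlow` estimator with `keep_alive_period_in_seconds` set. It assumes a recent SDK version and that your account has warm pool quota for the instance type; the helper names, the default of 600 seconds, and any role, script, or version values you pass in are placeholders, not a recommendation.

```python
def training_job_kwargs(role_arn, entry_point, framework_version, py_version,
                        instance_type="ml.m5.24xlarge", keep_alive_seconds=600):
    """Build estimator keyword arguments that request a warm pool.

    All argument values are supplied by the caller; nothing here is
    specific to the original poster's setup.
    """
    if keep_alive_seconds < 0:
        raise ValueError("keep_alive_seconds must not be negative")
    return {
        "entry_point": entry_point,
        "role": role_arn,
        "instance_count": 1,
        "instance_type": instance_type,
        "framework_version": framework_version,
        "py_version": py_version,
        # Keep the provisioned instance around after the job ends so that a
        # matching follow-up job can reuse it instead of starting cold.
        "keep_alive_period_in_seconds": keep_alive_seconds,
    }


def make_estimator(**kwargs):
    """Create the estimator (needs the sagemaker package and AWS credentials)."""
    from sagemaker.tensorflow import TensorFlow  # imported lazily on purpose
    return TensorFlow(**kwargs)
```

Keeping the kwargs in a plain dict makes it easy to print and compare the exact job configuration against what the notebook run used before anything is launched.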

Other ideas

While the infra start-up period seems the most likely factor from your original question, maybe you already knew to adjust for that... There are a few other things worth mentioning in case:

  • If initial data loading is a bottleneck, check out the available training data access modes in SageMaker. For example using FastFile mode may speed up training jobs in some cases, depending on the access patterns. The default File mode downloads all the data up-front before starting your training job, so you'd see this as added start-up time before your script runs.
  • You mentioned "HPO", so I'm not sure if you're comparing a notebook run to a single training job with your own HPO procedure/script, or a SageMaker Automatic Model Tuning job. It's worth mentioning that SageMaker AMT spins up multiple training jobs (with configurable parallelism) so direct comparison to an all-in-notebook HPO would be difficult.
  • If you're comparing your own code's performance to a SageMaker pre-built algorithm, then of course this is apples-to-oranges: for example, open-source XGBoost versus the SageMaker XGBoost in algorithm mode, or a public HuggingFace training script versus an algorithm from SageMaker JumpStart.
  • For training, you could try disabling SageMaker Debugger to turn off some hooks which get enabled by default (although I wouldn't usually expect the difference to be this large).
  • As you mention, it's probably worth checking whether your TensorFlow and other key library versions are different between the notebook and the jobs.
  • For inference in particular, a lot depends on how you're running your model... Batch transform? Processing job? Hopefully not deploying an endpoint then manually sending data through it and turning it off? Is the inference well-optimized?
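Two of the checks in the list above are cheap to try. The sketch below, which is only an illustration, shows (1) a helper that reports installed library versions so you can run it in both the notebook and the job's script and compare the results, and (2) estimator keyword arguments that switch to FastFile input mode and turn off the Debugger hook and profiler. The function names and the default package list are assumptions; `input_mode`, `debugger_hook_config` and `disable_profiler` are estimator parameters in the SageMaker Python SDK.

```python
import platform
from importlib import metadata


def report_versions(packages=("tensorflow", "numpy", "pandas")):
    """Return installed versions of the given packages (None if missing)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(first, second):
    """Return {name: (first_version, second_version)} for entries that differ."""
    names = sorted(set(first) | set(second))
    return {n: (first.get(n), second.get(n)) for n in names
            if first.get(n) != second.get(n)}


def lean_job_overrides():
    """Extra estimator kwargs to rule out data download and Debugger overhead."""
    return {
        "input_mode": "FastFile",       # stream from S3 instead of downloading up front
        "debugger_hook_config": False,  # do not attach the Debugger hook
        "disable_profiler": True,       # do not run the profiler
    }
```

Printing `report_versions()` at the top of the training script puts the job's versions into its CloudWatch logs, and `diff_versions` can then be applied to that output and the notebook's.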

One last thing: I was a little surprised to read that you're using TensorFlow but not an accelerated instance type, for example GPU-based p3 or g4dn, or Inferentia-based inf1/inf2 for inference. If the workload is parallel and compute-bound, like a lot of neural networks, you might be able to get higher throughput per dollar by using accelerators.

answered a year ago

Alex_T's answer already covers the most likely factors behind the observed performance difference. In addition, you could use the SageMaker Debugger feature with the SageMaker Python SDK to determine whether there are any issues with your job.

If you require further assistance, I would recommend opening a support case with the specific configuration and reproducers for your two approaches, so one of our SageMaker Support Engineers can inspect the differences in closer detail.

answered a year ago
