SageMaker Studio performance issues - why so slow, or am I missing a trick?


Hi all, I am fairly new to SageMaker Studio, but I am concerned about the speed of my model pipeline. When I run preprocessing (TF-IDF), training, and model evaluation locally, they take under a minute; in SageMaker Studio the same pipeline has taken at least 13 minutes.

I have been running an XGBoost model based on the Abalone pipeline example, using my own data of around 5,500 rows of text (~1.5k characters per example on average).

The example below shows multiple pipeline runs with a HyperparameterTuner. Even with an ml.c5.18xlarge instance and the hyperparameter ranges below, the fastest full pipeline run took 14 minutes (including the pre-processing step, hyperparameter tuning, model registration, and model evaluation). FYI, even without the tuning it still took around 14 minutes.

I was wondering if I am missing a trick here, or does SageMaker just take a while to start? Any help would be much appreciated!


   from sagemaker.tuner import ContinuousParameter

   hyperparameter_ranges = {
       "lambda": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
       # ... other ranges omitted
   }

(screenshot: pipeline run durations)

1 Answer

If I understand correctly:

  • You're using a multi-step SageMaker Pipeline based on the Abalone example, something like the screenshot shown here in the docs... with a small dataset of 5,500 examples (but I'd guess ~8MB, since there are lots of characters per example)
  • Your comparison point is pre-processing (the TF-IDF), training, and evaluating the models on your local machine


SageMaker jobs (e.g. processing, training, transform) generally run on on-demand compute infrastructure: meaning you can select the number and size of instances for each step, and only pay for the resources you use, without having to manage clusters.

On the other hand, this means that when a SageMaker job starts, it needs to provision your compute and download your container image + scripts. In my experience, job start-up can take ~2-4 minutes or even longer, depending on several factors (including instance type, container image size, data size and mode, etc.). Reducing that time continues to be a priority for our engineering teams, and a lot of progress has already been made since SageMaker was first launched.

So if your workload takes ~a minute on a modestly sized local machine, the majority of the pipeline run time you're seeing is likely dominated by infrastructure provisioning: especially if you have a set of serial steps, e.g. a pre-processing job, a training job, a transform job, etc. Switching to a larger instance type will typically help little if your actual training/processing workload is not the bottleneck: but some instance types might be faster to provision than others.
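As a rough back-of-envelope illustration (the per-step start-up time below is an assumed figure, not a measurement), a ~1 minute workload split across a few serial steps can easily become a ~13 minute pipeline:

```python
# Illustrative model of pipeline wall-clock time. Actual start-up times vary
# with instance type, container image size, data size, etc.
startup_per_step_min = 4.0  # assumed provisioning + image pull per job
serial_steps = 3            # e.g. preprocessing, training, evaluation
compute_min = 1.0           # total actual workload time (as measured locally)

total = serial_steps * startup_per_step_min + compute_min
overhead_fraction = (total - compute_min) / total

print(f"estimated pipeline time: {total:.0f} min")             # 13 min
print(f"time spent on provisioning: {overhead_fraction:.0%}")  # 92%
```

Which is why a bigger instance barely moves the needle: it only shrinks the 1-minute compute term, not the provisioning term.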

There are many benefits to running your ML steps through SageMaker jobs rather than locally, and chaining those jobs through automated pipelines. For example:

  • Automated tracking of parameters, artifacts, and lineage
  • Confidence that each job is running in a clean, reproducible, containerised environment
  • Integration of logging and metrics
  • Other useful features like SageMaker Distributed for large workloads, SageMaker Debugger, and so on
  • Multi-step pipeline automation and the ability to e.g. run your pipelines on a regular schedule, or version control the pipeline definition.

In my experience this extra governance and tooling is useful for production-ready workflows, even if infrastructure introduces some extra end-to-end latency... But of course every situation is different.

If the trade-off is particularly painful in the current phase of your project, and interactivity and speed matter more, some tips I could suggest:

  • It is still possible to prepare data, train, and evaluate models directly in a SageMaker Studio notebook, just like any other notebook environment... And you can switch your notebook's compute using the toolbar menu... Just be aware that you won't benefit from SageMaker features like job lineage/experiment tracking.
  • If you still want to use SageMaker jobs but optimize your pipeline, you could probably squeeze everything into a single XGBoostProcessor job, so that data pre-processing, model training, and evaluation all happen in one script... You could even register the model from that same script via boto3/the SageMaker Python SDK. But again, you lose some of the per-step tracking benefits of SageMaker jobs.
  • If you're particularly interested in running multiple training jobs and re-using the infrastructure, check out the recently launched warm pooling feature which lets you keep a training cluster alive and re-use it for faster start-up in a follow-on job.
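For reference, warm pools are enabled by setting a keep-alive period on the estimator. A minimal configuration sketch (the image URI, role, and instance type below are placeholders for your own values; note that keep-alive time is billed):

```python
from sagemaker.estimator import Estimator

# Sketch only: substitute your own image URI, execution role, and instance type.
estimator = Estimator(
    image_uri="<your-xgboost-image-uri>",
    role="<your-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Keep the provisioned instance(s) alive for up to 10 minutes after the
    # job finishes, so a follow-on job with matching configuration skips most
    # of the provisioning delay.
    keep_alive_period_in_seconds=600,
)
```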
answered 18 days ago
