
SageMaker Studio Classic Crashing after a certain amount of time


Summary

Over the past couple months my Sagemaker Studio Classic Kernel has become unavailable after 1-2 weeks. I tried bumping the instance size assuming that it was just getting bogged down on memory or CPU, but this doesn't seem to have any effect.

When I click to open the Studio Classic instance, the page just infinitely loads and then sometimes gives a connection error. This happens sporadically after 1-2 weeks of normal use. Usually there is some slowdown before the app becomes completely unresponsive. During these slowdowns, I have used htop to look at the CPU / memory usage but have not seen anything that looks out of the ordinary.

asked 2 years ago · 406 views
1 Answer

By far the most common cause of UI slowdowns I saw with SageMaker Studio Classic was working inside local git repository folders with a very large number of active changes. There is an unfortunate interaction between the open-source jupyterlab-git extension (which frequently polls git's changed-file list) and EFS's latency characteristics (per-file network traffic, rather than a block store like EBS). It meant that if 1) you had thousands of changed/added files that weren't gitignored, and 2) you navigated to any folder within the scope of that git repository, the UI could get extremely slow. For example, I saw one GitHub sample that created 30,000+ tiny JPEG files from MNIST data and didn't .gitignore them. Ouch!

In this particular case, my usual solution was to navigate to any folder outside the affected git repo (to unblock the UI), then use the terminal to add a .gitignore for the relevant files in the repository.
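As a sketch of those terminal steps (done here in a throwaway repo so it's safe to run anywhere; the `data/*.jpg` pattern is just an illustrative example of the offending files):

```shell
# Demo in a throwaway repo; in practice you'd run the status/ignore
# steps inside your real repository on EFS.
cd "$(mktemp -d)"
git init -q .
mkdir data && touch data/img1.jpg data/img2.jpg

# Count the changed/untracked paths jupyterlab-git has to poll -
# a large number here, on EFS, is the classic slowdown trigger
git status --porcelain -uall | wc -l

# Ignore the offending files, then confirm the count drops
echo "data/*.jpg" >> .gitignore
git status --porcelain -uall | wc -l
```

Once the count is small again, the extension's polling stops hammering EFS and the UI recovers without restarting anything.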

However, that problem usually shows up instantaneously, not gradually, unless something is steadily creating lots of extra files. My only other suggestion would be to check whether you're running any custom JupyterLab extensions.

I did note you said the kernel was becoming unresponsive, but "the page just infinitely loads and then sometimes gives a connection error" made it sound more like a JupyterServer issue. If it really is the Kernel Gateway app that's bogging down, then 1) it shouldn't affect your JupyterLab UI much - just the kernel itself being very slow to respond to requests or failing to connect; and 2) do check that you're running your htop/etc. checks on an image terminal and not a system terminal.
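A quick way to confirm which app a given terminal belongs to is to look at the Studio resource metadata file. The path and the `AppType` field below are assumptions based on what Studio apps typically expose, so treat this as a sketch rather than a guaranteed interface:

```shell
# In Studio Classic, a system terminal runs on the JupyterServer app while
# an image terminal runs inside a KernelGateway container - so htop in the
# wrong one measures the wrong host. "AppType" in the metadata file
# (assumed path) tells you which one you're in.
METADATA=/opt/ml/metadata/resource-metadata.json
if [ -f "$METADATA" ]; then
  grep -o '"AppType": *"[^"]*"' "$METADATA"
else
  echo "No Studio metadata file found - probably not inside a Studio app"
fi
```

If it reports `JupyterServer`, your htop numbers describe the server app, not the kernel container where your notebook code actually runs.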

AWS EXPERT · answered 2 years ago
  • Hey Alex, thanks for your reply. I am running htop from an image terminal as root.

    It's very possible this is a JupyterServer issue. The kernel becoming unresponsive is usually just the first symptom. When the issue occurs in full, the launcher UI loads very slowly (on the order of 10 minutes). When it finally loads, it is so unresponsive that I am unable to perform any actions in it, be it launching a new notebook or an image terminal.

    We aren't using any custom JupyterLab extensions. We do have a folder mounted onto the server through an EFS volume that has a large number of incoming changes. It is not a git repo, however. We also have some cronjobs that run in the background on this server about once an hour. They don't use a lot of CPU and tend to run quite smoothly.

    To be clear, the only solution I've found is to restart both the Kernel and the Server at the same time, so I think both are becoming overwhelmed/unavailable.
