Stuck at training cell in Notebook Instance in SageMaker


Hello, I am trying to train a model using Ultralytics YOLOv8.2.4. Most of the prerequisites should already be installed. However, whenever I run the training cell, it gets stuck at the following:

Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  0%|          | 0/143 [00:00<?, ?it/s]

After that, it stays stuck there for hours with nothing changing.

On one occasion I stopped the notebook and tried to restart it, and I got the following error message:

IOStream.flush timed out

Does anyone know what the issue is? My Jupyter notebook instance is currently running on ml.t3.medium.

1 Answer

Is the JupyterLab backend as a whole becoming unresponsive (e.g. not able to finish saving files, move between folders, or other actions that require Jupyter to respond) when this happens?

Recently, I've seen some cases where running out of RAM caused this behaviour in SageMaker Studio JupyterLab... Which is a shame, because Studio Classic handled it more gracefully: there, the JupyterServer ran separately from the kernel.

I'd suggest stopping your notebook, changing it to a larger instance type e.g. ml.t3.large or ml.t3.xlarge (sorry I haven't benchmarked YOLOv8 myself so not sure how much you'd need), and trying again?
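Before resizing, it may be worth confirming that memory really is the bottleneck. Below is a minimal sketch, assuming a Linux notebook instance where /proc/meminfo is readable; the function names parse_meminfo and available_fraction are made up for illustration. Run it in a cell just before training (or from a terminal while the cell is stuck) to see how much RAM is left.

```python
import os


def parse_meminfo(text):
    """Parse the text of /proc/meminfo into {field_name: value_in_kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name:   12345 kB"-style lines with a numeric value
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Fraction of total RAM the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


if __name__ == "__main__":
    path = "/proc/meminfo"  # Linux only; assumed to exist on the notebook instance
    if os.path.exists(path):
        with open(path) as f:
            info = parse_meminfo(f.read())
        if "MemTotal" in info and "MemAvailable" in info:
            print(f"Total RAM: {info['MemTotal'] / 1024**2:.1f} GiB")
            print(f"Available: {available_fraction(info):.0%}")
```

If the available fraction is already close to zero before or shortly after training starts, a larger instance type is the likely fix.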


...But in general I would note that training neural networks on t* instances is going to be slow, and probably not the best price-performance trade-off. Running training in the notebook itself is also convenient, but you should consider using remote SageMaker Training Jobs instead because:

  • The history of your jobs is automatically recorded, which is useful for tracking your experiments' inputs and outputs
  • Jobs run on separate on-demand instances that are released as soon as the job ends, so you can keep your notebook on t3 and pay only for the seconds you actually spend training on a GPU-accelerated instance like g5, p3, etc.
  • The containerized job environment is re-initialized each time, so should be more reliably repeatable
  • They can then be orchestrated and automated in SageMaker Pipelines etc. to drive MLOps workflows
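To make the second point concrete, here is a rough sketch of what launching a job from a small notebook could look like with the sagemaker SDK's PyTorch estimator. The entry point train.py, the src folder, the framework and Python versions, the instance type, and the role ARN and S3 URI passed in are all assumptions for illustration, not values from the question.

```python
def build_estimator_kwargs(role_arn, instance_type="ml.g5.xlarge", epochs=100):
    """Collect the estimator settings in one place (all values are assumptions)."""
    return {
        "entry_point": "train.py",     # hypothetical training script
        "source_dir": "src",           # hypothetical folder with train.py + requirements.txt
        "role": role_arn,              # IAM execution role for the job
        "instance_count": 1,
        "instance_type": instance_type,
        "framework_version": "2.1",    # assumed PyTorch container version
        "py_version": "py310",
        "hyperparameters": {"epochs": epochs},  # passed to the script as --epochs
    }


def launch_training_job(role_arn, train_s3_uri):
    """Start a remote training job; needs the sagemaker SDK and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # imported lazily so the sketch loads without the SDK

    estimator = PyTorch(**build_estimator_kwargs(role_arn))
    # The "training" channel is exposed inside the job as SM_CHANNEL_TRAINING
    estimator.fit({"training": train_s3_uri})
    return estimator
```

The notebook only submits the job and streams its logs, so it can stay on a small instance while the GPU instance exists just for the duration of the job.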

So if you just want to play with training a model ASAP, I'd try bumping up your notebook instance type and trying again... But if you're able to spend a bit longer learning to get the most out of SageMaker, you might want to:

  • Factor your YOLOv8 training into a train.py script and requirements.txt file (If you're using the YOLOv8 CLI, you could just use subprocess to launch it from Python)
  • Run it as a training job on SageMaker with script mode for PyTorch, through the sagemaker SDK that's already installed in SageMaker notebook environments but also available via PyPI.
  • Consider requesting quota for and using SageMaker Warm Pools to accelerate your experiments (Warm Pool jobs usually take only a few seconds to start instead of a few minutes, but the trade-off is you're billed for the warm pool instance while it's kept alive)
  • Alternatively, consider using SageMaker Local Mode (after first enabling and installing it, if you're in SageMaker Studio) to further accelerate your initial functional debugging by spinning up the training container locally rather than launching an API-side training job at all.
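As a sketch of the first step above, a train.py that wraps the YOLOv8 CLI with subprocess might look something like this. It assumes ultralytics is listed in requirements.txt, that the dataset definition is a file named data.yaml inside the training channel, and that yolov8n.pt is the starting checkpoint; those names and the helper functions are hypothetical. SM_CHANNEL_TRAINING and SM_MODEL_DIR are the environment variables SageMaker sets inside the training container.

```python
import argparse
import os
import subprocess


def build_yolo_command(data_yaml, model_dir, epochs=100, model="yolov8n.pt"):
    """Build the `yolo` CLI call as an argument list (no shell quoting needed)."""
    return [
        "yolo", "detect", "train",
        f"data={data_yaml}",
        f"model={model}",
        f"epochs={epochs}",
        f"project={model_dir}",  # write run outputs where SageMaker collects them
    ]


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100)
    # Fall back to local folders so the script can also be tried outside SageMaker
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "runs"))
    # Ignore any extra arguments the platform may pass along
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(argv=None):
    args = parse_args(argv)
    data_yaml = os.path.join(args.train, "data.yaml")  # assumed dataset file name
    cmd = build_yolo_command(data_yaml, args.model_dir, args.epochs)
    print("Running:", " ".join(cmd))
    # check=True makes the training job fail if the CLI exits with an error
    subprocess.run(cmd, check=True)
```

To use it as an entry point, call main() under the usual `if __name__ == "__main__":` guard at the bottom of the file. Keeping the command builder separate makes it easy to check the arguments locally before paying for a GPU instance.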
AWS
EXPERT
Alex_T
answered 3 months ago
