Is the JupyterLab backend as a whole becoming unresponsive (e.g. not able to finish saving files, move between folders, or other actions that require Jupyter to respond) when this happens?
Recently, I've seen some cases where running out of RAM caused this behaviour in SageMaker Studio JupyterLab... which is a shame, because Studio Classic handled it more gracefully: the JupyterServer ran separately from the kernel.
I'd suggest stopping your notebook, changing it to a larger instance type (e.g. `ml.t3.large` or `ml.t3.xlarge` - sorry, I haven't benchmarked YOLOv8 myself so I'm not sure how much you'd need), and trying again?
...But in general I would note that training neural networks on `t*` instances is going to be slow, and probably not the best price-performance trade-off. Running training in the notebook itself is also convenient, but you should consider using remote SageMaker Training Jobs instead because:
- The history of your jobs is automatically recorded, which is useful for tracking your experiments' inputs and outputs
- Jobs run on (separate) on-demand instances that are released as soon as the job ends: so you can keep your notebook on `t3` and pay for only the number of seconds you need to train on a GPU-accelerated instance like `g5`, `p3`, etc.
- The containerized job environment is re-initialized each time, so should be more reliably repeatable
- They can then be orchestrated and automated in SageMaker Pipelines etc. to drive MLOps workflows
So if you just want to play with training a model ASAP, I'd try bumping up your notebook instance type and trying again... But if you're able to spend a bit longer learning to get the most out of SageMaker, you might want to:
- Factor your YOLOv8 training into a `train.py` script and `requirements.txt` file (if you're using the YOLOv8 CLI, you could just use `subprocess` to launch it from Python)
- Run it as a training job on SageMaker with script mode for PyTorch, through the `sagemaker` SDK that's already installed in SageMaker notebook environments but also available via PyPI (see the sketches after this list)
- Consider requesting quota for and using SageMaker Warm Pools to accelerate your experiments (Warm Pool jobs usually take only a few seconds to start instead of a few minutes, but the trade-off is you're billed for the warm pool instance while it's kept alive)
- Alternatively, consider using SageMaker Local Mode (after first enabling and installing it, if you're in SMStudio) to further accelerate your initial functional debugging by spinning up the training container locally rather than launching an API-side training job at all.
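For example, here's a rough sketch of what the `train.py` wrapper could look like if you go the YOLOv8 CLI + `subprocess` route. The dataset layout (a `data.yaml` at the root of the training channel) and the CLI arguments are just assumptions to illustrate the pattern - adjust them to your own data:

```python
# train.py - rough sketch wrapping the YOLOv8 CLI (assumes "ultralytics" is listed in
# requirements.txt, and that your training channel contains a data.yaml for the dataset)
import os
import subprocess

if __name__ == "__main__":
    # SageMaker mounts input channels and collects the model from these standard paths:
    data_dir = os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")
    model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

    # Launch the YOLOv8 CLI as a subprocess - the arguments here are illustrative only
    subprocess.run(
        [
            "yolo", "detect", "train",
            f"data={data_dir}/data.yaml",
            "model=yolov8n.pt",
            "epochs=10",
            f"project={model_dir}",  # write run outputs where SageMaker will archive the model
        ],
        check=True,  # fail the training job if the CLI exits non-zero
    )
```

...and a minimal launcher using the `sagemaker` SDK's PyTorch estimator. The role ARN, bucket, instance type, framework version and folder names below are placeholders, not recommendations - check what's currently supported for your account and region:

```python
# launch_training.py - run from your notebook, assuming train.py + requirements.txt are in ./src
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",                    # requirements.txt here gets pip-installed in the job
    role="<your-SageMaker-execution-role-ARN>",
    framework_version="2.0",             # check the framework versions currently supported
    py_version="py310",
    instance_type="ml.g5.xlarge",        # GPU instance for the job; "local" or "local_gpu"
                                         # would run it in Local Mode instead
    instance_count=1,
    # keep_alive_period_in_seconds=600,  # uncomment to request a Warm Pool (quota needed)
)

# The "training" channel is mounted at SM_CHANNEL_TRAINING inside the job container
estimator.fit({"training": "s3://<your-bucket>/<your-dataset-prefix>/"})
```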