You should probably be using SageMaker training jobs for this, rather than trying to scale up your notebook instance.
SageMaker is more than a managed Jupyter service. By running your model training through the training job APIs (for example as discussed here, using the high-level SageMaker Python SDK), you get the following benefits:
- Automatic tracking of runs (e.g. input parameters and code, output artifacts, logs, resource usage metrics, custom algorithm metrics, container image, etc.)
- Reproducible containerized environments (pre-built containers with requirements.txt support, in case you don't want to build customized containers yourself)
- Right-sizing your infrastructure usage to optimize cost - keep your notebook instance small, request bigger instance(s) for your training job, and only pay for the time the training job is actually running.
- Integration with SageMaker options for model deployment / batch inference, etc.
- Training runs separately from the notebook, so you can restart your notebook kernel, stop the notebook instance, or lose connectivity during training with no impact on the job.
So I would suggest setting up your training job by referring to the "Using XYZ with the SageMaker Python SDK" sections of the developer guide, and the Amazon SageMaker Examples. This likely won't immediately fix your scaling challenge, but it should put you in a better position for scaling further (e.g. distributed training) and for tracking your work. For most of my work, I just use small notebooks (e.g. t3.medium) and interact with the SageMaker APIs to run jobs on on-demand infrastructure.
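As a rough illustration of what launching a job from a small notebook can look like, here is a minimal sketch. It assumes the v2-style SageMaker Python SDK and its Scikit-Learn estimator purely as an example, since the framework in use hasn't been stated. The script name `train.py`, the `src` folder, the instance type, the container version, and the hyperparameter are all placeholders rather than recommendations.

```python
def build_estimator_kwargs(role_arn, instance_type="ml.m5.4xlarge", instance_count=1):
    """Collect the settings for a training job (all values are illustrative)."""
    return {
        "entry_point": "train.py",        # hypothetical training script
        "source_dir": "src",              # hypothetical folder: train.py + requirements.txt
        "role": role_arn,                 # IAM execution role the job runs under
        "instance_type": instance_type,   # size the job, not the notebook
        "instance_count": instance_count,
        "framework_version": "1.2-1",     # assumed pre-built Scikit-Learn container version
        "hyperparameters": {"max_depth": 8},  # placeholder, passed to train.py as CLI args
    }


def launch_training_job(role_arn, train_s3_uri):
    """Start the job; needs the sagemaker package and AWS credentials."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from sagemaker.sklearn.estimator import SKLearn

    estimator = SKLearn(**build_estimator_kwargs(role_arn))
    # The "train" channel is mounted inside the container for train.py to read.
    estimator.fit({"train": train_s3_uri})
    return estimator
```

Calling `launch_training_job(role_arn, "s3://<your-bucket>/<prefix>/")` from the notebook would block until the job finishes by default, while the compute runs on the requested training instance rather than on the notebook itself.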
With that being said, your instance already sounds very large (160GB RAM, 16TB disk). The most common cause I've seen of a kernel dying is a failure to allocate memory. So if you're using in-memory libraries like Scikit-Learn, perhaps one of them is not able to handle a single massive data structure even though physical memory is available, e.g. due to something assuming 32-bit indexing, or some other aspect of the script or libraries being used. It's interesting that you manage to get 20% of the way through training, since ML training is usually pretty homogeneous (e.g. for gradient descent, if you can complete one epoch, you can often run them all). Perhaps you have a memory leak somewhere? Giving more details about what framework and model type you're using might help guide suggestions, but ultimately I think it might require debugging your code to see where exactly things are going wrong.
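One way to start checking the memory-leak theory is to record how much memory is held after each training step and see whether it keeps climbing. The sketch below uses Python's standard `tracemalloc` module. The helper `track_memory_growth` and the two toy step functions are hypothetical stand-ins for your own training loop, and `tracemalloc` only sees allocations made through Python's allocator hooks, so treat it as a first diagnostic rather than a full picture.

```python
import tracemalloc


def track_memory_growth(step_fn, n_steps):
    """Run step_fn(i) n_steps times; return traced bytes held after each step."""
    tracemalloc.start()
    try:
        usage = []
        for i in range(n_steps):
            step_fn(i)
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
        return usage
    finally:
        tracemalloc.stop()


# Hypothetical leaky step: every call keeps a reference to a new buffer.
_history = []


def leaky_step(i):
    _history.append(bytearray(100_000))


# Hypothetical well-behaved step: the buffer is released before returning.
def steady_step(i):
    tmp = bytearray(100_000)
    del tmp


if __name__ == "__main__":
    for name, fn in [("leaky", leaky_step), ("steady", steady_step)]:
        usage = track_memory_growth(fn, 5)
        print(name, "growth in bytes:", usage[-1] - usage[0])
```

If the per-step numbers for your real loop grow steadily the way the leaky example does, something is accumulating references between iterations. If they stay flat until the crash, a single oversized allocation is the more likely culprit.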