12hr login session expiry stopped my model training halfway in SageMaker.


I am developing a machine learning model in SageMaker, and my code takes about 2-3 days to finish running. However, I was automatically logged out of the console, which killed the kernel that was running.

I found the FAQ https://aws.amazon.com/console/faq-console/ which states that my login session expires after 12 hours and that, to continue, I have to 'Click login to continue'. I did that, but my kernel had been killed and I had to run my code from the start. It doesn't pick up where it left off.

This is a problem because, as mentioned above, my code takes up to 3 days to finish running, but the SageMaker kernel gets killed every 12 hours.

I've tried this guide: https://docs.aws.amazon.com/singlesignon/latest/userguide/howtosessionduration.html which explains how to set the session duration by setting up SSO and configuring permission sets, but that duration is limited to 12 hours as well.

Is there a way to remove or extend the automatic sign-out?

1 Answer

@mollythejams 2 approaches:

  1. Use the Amazon SageMaker API to run your training as a training job. The SageMaker documentation covers this and includes examples. You can choose the type and number of instances (assuming your model can be trained in parallel), and you only pay for the time the compute resources are actually in use. You can probably also reduce the time needed to train your model. Once a training job has started, you can monitor it from the SageMaker console.
  2. Even if you are logged out because of a timeout, the kernel should still be active (unless you enabled a lifecycle policy that automatically shuts down the kernel to save costs). This is not guaranteed, but it is generally the case. The problem is that the Jupyter UI does not automatically resync with the state of the kernel, which gives the impression that the kernel also stopped. You can add an instruction to your training loop that appends entries to a log file, so that you can confirm training is still running, and check its progress, by looking at the log file.
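The logging idea in approach 2 can be sketched roughly as follows. This is a minimal, standard-library-only illustration rather than your actual training code: the `train` function, the file names `training.log` and `checkpoint.json`, and the placeholder loss are all assumptions made for the example. The checkpoint file goes slightly beyond what is described above; it is included only because it lets a restarted kernel continue from the last completed epoch instead of from the start.

```python
import json
import logging
import os
import tempfile


def train(total_epochs, workdir):
    """Toy training loop that logs progress and checkpoints after each epoch.

    All names here (train, training.log, checkpoint.json) are hypothetical.
    """
    os.makedirs(workdir, exist_ok=True)
    log_path = os.path.join(workdir, "training.log")
    ckpt_path = os.path.join(workdir, "checkpoint.json")

    # Write progress to a file so it can be inspected without the Jupyter UI.
    logger = logging.getLogger("training")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)

    # Resume from the last completed epoch if a checkpoint exists.
    state = {"epoch": 0, "loss": None}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)

    try:
        for epoch in range(state["epoch"], total_epochs):
            loss = 1.0 / (epoch + 1)  # placeholder for a real training step
            state = {"epoch": epoch + 1, "loss": loss}

            # Write to a temp file, then rename, so a crash cannot leave
            # a half-written checkpoint behind.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)

            logger.info("epoch %d/%d loss=%.4f", epoch + 1, total_epochs, loss)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return state


if __name__ == "__main__":
    print(train(total_epochs=3, workdir=tempfile.mkdtemp()))
```

After logging back in, you can open a terminal and inspect the log file (for example with `tail -f training.log`) to see whether new entries are still being appended.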

I strongly recommend adopting the first approach, which decouples the instance used to write the code (the one running the Jupyter kernel) from the compute resources used for model training.
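As a rough sketch of the first approach, a training job can be submitted through the `create_training_job` call of the boto3 SageMaker client, after which it runs independently of any console or notebook session. The helper names `build_training_request` and `submit`, the default instance type, the volume size, and every argument value in the usage section are placeholders assumed for illustration. You would need your own container image, IAM role, and S3 locations, and most real jobs also need an `InputDataConfig` section, which is omitted here.

```python
def build_training_request(job_name, image_uri, role_arn, output_s3_uri,
                           instance_type="ml.m5.xlarge", instance_count=1,
                           max_runtime_days=3):
    """Build the keyword arguments for create_training_job.

    Hypothetical helper: the defaults are illustrative, not recommendations.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # your training container image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,                 # IAM role SageMaker assumes
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,
        },
        # Set the runtime limit explicitly for a multi-day job, since the
        # default limit may be shorter than the job needs.
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_days * 24 * 3600,
        },
    }


def submit(request):
    """Submit the job; requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder works without boto3

    client = boto3.client("sagemaker")
    return client.create_training_job(**request)


if __name__ == "__main__":
    # Placeholder values only; replace with your own resources.
    request = build_training_request(
        job_name="my-long-training-job",
        image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<image>:latest",
        role_arn="arn:aws:iam::<account>:role/<sagemaker-role>",
        output_s3_uri="s3://<bucket>/output/",
    )
    print(request["StoppingCondition"])
```

Because the job runs on SageMaker-managed instances, a console session timeout does not affect it.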

Ale C
answered 2 years ago
