1 Answer
@mollythejams There are two approaches:
- Use the Amazon SageMaker API to train your model. The SageMaker documentation describes how, and examples are available. You can choose the type and number of instances (more than one assumes your model can be trained in parallel), and you pay only for the time the compute resources are in use. With more or larger instances you can probably also reduce the training time. Once a training job has started, you can monitor it from the SageMaker console.
- Even if you're logged out because of a timeout, the kernel should still be active (unless you enabled a lifecycle policy that automatically shuts down the kernel to save costs). This is not guaranteed, but it is generally the case. The problem is that the Jupyter UI does not automatically resync with the kernel's state, which gives the impression that the kernel has stopped too. You can have your training loop append entries to a log file, so that you can confirm training is still running and check its progress by looking at that file.
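For the second approach, the log-file idea could look roughly like the sketch below. It uses only Python's standard `logging` module. The `train` function, the default file name `training_progress.log`, and the placeholder loss value are all hypothetical and stand in for your real training loop.

```python
import logging


def make_progress_logger(log_path):
    """Return a logger that appends one line per call to log_path."""
    logger = logging.getLogger("training_progress")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Drop handlers left over from an earlier run in the same kernel.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()
    handler = logging.FileHandler(log_path)  # opens the file in append mode
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def train(num_epochs, log_path="training_progress.log"):
    """Hypothetical training loop that records progress after each epoch."""
    logger = make_progress_logger(log_path)
    loss = None
    for epoch in range(1, num_epochs + 1):
        loss = 1.0 / epoch  # placeholder for one real epoch of training
        # FileHandler flushes after every record, so the file on disk
        # stays current even when the notebook UI is out of sync.
        logger.info("epoch=%d/%d loss=%.4f", epoch, num_epochs, loss)
    return loss
```

After reconnecting, you can open the log file from the Jupyter file browser, or run `tail -f training_progress.log` in a terminal on the instance, to see whether new lines are still being written.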
I strongly recommend adopting the first approach, which decouples the instance used to write the code (running the Jupyter kernel) from the compute resources used for model training.
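As a rough illustration of the first approach, the sketch below builds a request for the boto3 `create_training_job` call and submits it. This is only one possible way to start a job, not necessarily the one the linked examples use. The helper names, the default instance type, the volume size, and every argument value (job name, image URI, role ARN, S3 path) are placeholders you would replace with your own. A real job usually also needs `InputDataConfig` and hyperparameters, which are left out here.

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.m5.xlarge", instance_count=1,
                               max_runtime_seconds=3600):
    """Build the request dict for SageMaker's CreateTrainingJob API.

    All arguments are caller-supplied placeholders (assumptions).
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # container image holding your training code
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,              # IAM role that SageMaker assumes
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,     # type of training instance
            "InstanceCount": instance_count,   # >1 only if training can run in parallel
            "VolumeSizeInGB": 50,              # assumed value
        },
        # Caps the billed runtime; the job is stopped when this is reached.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


def submit_training_job(request):
    """Submit the request; requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("sagemaker")
    return client.create_training_job(**request)
```

The job runs on its own instances, so a notebook timeout does not affect it. Besides the console, its status can be read with `describe_training_job(TrainingJobName=...)` on the same client, under the `TrainingJobStatus` key.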
answered 2 years ago