Hi!
As of a few days ago, the "Uploading" phase of my SageMaker training jobs jumped from 2 minutes to 3+ hours. The size of my artifacts did not change, but I did enable checkpointing (although this shouldn't affect the zipping and S3 upload of the artifacts, which live in a different directory).
Is there any way to see what SageMaker is doing during that time? I have set sagemaker_container_log_level = 10 (debug), but no additional logs are published (and I assume that anything after the Training phase will not be logged).
Hmm... have you also checked the training job's metrics in CloudWatch? If so, maybe you could add any interesting findings to your question: https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html
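In addition to CloudWatch, the DescribeTrainingJob response includes a SecondaryStatusTransitions list with a start time, an end time, and a status message for each phase, which may at least confirm where the time is going. Below is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names phase_durations and describe are my own, not part of any SageMaker API.

```python
from datetime import datetime, timezone


def phase_durations(description, now=None):
    """Return a list of (status, seconds, message) tuples, one per phase.

    `description` is the dict returned by describe_training_job.
    A phase that is still running has no EndTime, so `now` is used instead.
    """
    now = now or datetime.now(timezone.utc)
    rows = []
    for transition in description.get("SecondaryStatusTransitions", []):
        end = transition.get("EndTime", now)
        seconds = (end - transition["StartTime"]).total_seconds()
        rows.append(
            (transition["Status"], seconds, transition.get("StatusMessage", ""))
        )
    return rows


def describe(job_name):
    """Fetch the training job description (hypothetical helper name)."""
    # Imported lazily so phase_durations can be used without AWS access.
    import boto3

    client = boto3.client("sagemaker")
    return client.describe_training_job(TrainingJobName=job_name)
```

Calling phase_durations(describe("my-job-name")) (the job name is a placeholder) and printing each tuple should show how long the Uploading phase lasted and any status message SageMaker attached to it. This will not reveal what is happening inside the upload itself, only its timing.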