AWS SageMaker jobs are reported as successful even though an error inside the VM stops the execution.


Issue 1

In AWS SageMaker Studio, I ran a pipeline and checked its status under Pipelines -> cltv-long-term-predict. It seemingly finished successfully: (screenshot)

However, when I open it and click on the step -> Logs, I see that it actually failed (a library was not installed; I should have commented out the import): (screenshot)

What bothers me is that this failure is not reported in the first screenshot, which makes such occurrences harder to detect and debugging slower and clumsier. When I look at the same execution 10 days later in SageMaker Studio, I can no longer see the logs. Are they deleted after some time? (screenshot)
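(As far as I know, CloudWatch keeps log events indefinitely unless a retention policy is set on the log group, so my assumption is that a retention period was configured on the processing-job log group. A minimal boto3 sketch to check this, assuming the standard log group name for processing jobs:)

```python
import boto3

logs = boto3.client("logs")

# Processing-job logs land in this log group, one stream per job.
response = logs.describe_log_groups(
    logGroupNamePrefix="/aws/sagemaker/ProcessingJobs"
)
for group in response["logGroups"]:
    # "retentionInDays" is absent when the group is set to "Never expire".
    print(group["logGroupName"], group.get("retentionInDays", "never expires"))
```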

I went to the SageMaker console -> left panel -> Processing -> Processing jobs, and again I see the job as Completed: (screenshot)

It was difficult even to find my specific execution, as the processing job's name is quite different from the one in Studio Pipelines, but I think it is the same run. As you can see, it is marked as finished successfully. If I open it: (screenshots)
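(To map a pipeline step to the underlying processing job without guessing from names, the execution steps can be listed via boto3; the execution ARN below is a hypothetical placeholder:)

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical execution ARN; copy it from Studio or list_pipeline_executions().
execution_arn = (
    "arn:aws:sagemaker:eu-west-1:123456789012:pipeline/"
    "cltv-long-term-predict/execution/abc123"
)

steps = sm.list_pipeline_execution_steps(PipelineExecutionArn=execution_arn)
for step in steps["PipelineExecutionSteps"]:
    metadata = step.get("Metadata", {})
    if "ProcessingJob" in metadata:
        # The ARN ends with the processing-job name shown in the console.
        print(step["StepName"], "->", metadata["ProcessingJob"]["Arn"])
```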

However, when I click on View logs, you can see that it failed: (screenshot)
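(As far as I understand, SageMaker derives the job status from the container's exit code, so one defensive workaround is to wrap the entry point and force a nonzero exit on any exception. A minimal sketch; main() stands for whatever the script actually does, and the raise is just a stand-in for the failing import:)

```python
import sys
import traceback

def main():
    # ... the actual processing logic, including imports that may fail ...
    raise ImportError("stand-in for the missing library")

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # Print the traceback so it reaches CloudWatch, then exit nonzero
        # so the job should be marked Failed instead of Completed.
        traceback.print_exc()
        sys.stdout.flush()
        sys.exit(1)
```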

Issue 2

Another (even more severe) bug occurred when I ran an incorrect query with awswrangler: an error is never reported; the VM (sagemaker.processing.FrameworkProcessor) just silently shuts down, and the job is reported as completed.

In SageMaker -> Studio -> Pipelines, I see it as completed successfully: (screenshot)

When I click on the logs, I see no errors this time (even though the files that this job should produce are not on S3): (screenshot)
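(Until the root cause is clear, a cheap guard is to assert at the end of the script that the expected outputs actually exist; the prefix below is hypothetical:)

```python
import awswrangler as wr

expected_prefix = "s3://my-bucket/cltv/output/"  # hypothetical output prefix

# Fail loudly if the job produced no files, instead of ending silently.
files = wr.s3.list_objects(expected_prefix)
if not files:
    raise RuntimeError(f"No output files found under {expected_prefix}")
```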

If I follow the link to CloudWatch, I also see no errors. I see the query that the script prints before calling awswrangler.redshift.unload. You can see that it contains ' characters, which should have been replaced/escaped (e.g. as \'), hence it fails. After that print, just silence; no error is reported: (screenshots)

However, if I copy-paste that query into Jupyter, connect to the Redshift DB from my local laptop, and execute it, I do get the error: (screenshots)
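(For the escaping itself: in Redshift SQL a single quote inside a string literal is escaped by doubling it (''), and UNLOAD additionally embeds the whole SELECT as a quoted string, which is presumably where my raw ' characters break things. A sketch of how interpolated values could be sanitized before the call; table, values, secret name, bucket, and IAM role are all made up:)

```python
import awswrangler as wr

def sql_literal(value: str) -> str:
    # Standard SQL escaping: double every single quote inside the literal.
    return "'" + value.replace("'", "''") + "'"

name = "O'Brien"  # a value like this breaks a naively quoted literal
query = f"SELECT * FROM customers WHERE name = {sql_literal(name)}"

con = wr.redshift.connect(secret_id="my-redshift-secret")
try:
    df = wr.redshift.unload(
        sql=query,
        path="s3://my-bucket/unload/",
        con=con,
        iam_role="arn:aws:iam::123456789012:role/redshift-unload",
    )
finally:
    con.close()
```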

Edit: In the second-to-last screenshot, I can see that an error is actually reported in CloudWatch, but it is not the last entry in the log. Why are entries chronologically shuffled? Or, asked differently, why do so many entries in the log have exactly the same timestamp (clearly they could not have happened simultaneously)?
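(My current guess, unverified: Python buffers stdout when it is not attached to a terminal, so many buffered lines get flushed to the CloudWatch agent in one batch and are ingested with almost the same timestamp, while stderr is a separate stream that can interleave out of order with stdout. Forcing line-buffered or unbuffered output should make the log order match wall-clock time more closely:)

```python
import sys

# Option 1: flush each print explicitly.
print("before unload", flush=True)

# Option 2: make both streams line-buffered once at startup (Python 3.7+).
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)

# Option 3: set PYTHONUNBUFFERED=1 in the processor's environment
# so all output is unbuffered.
```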
