There's not quite enough info here to know for sure what happened, but here are some places you can look to find out:
- Since the CloudWatch alarm was in the Alarm state, it should have been notifying Auto Scaling every minute. You'll see the result of the first notification in the alarm history; make sure it shows the action was triggered.
- Check the Auto Scaling activity history. If nothing comes up, try adding the `--include-not-scaled-activities` flag, though I'm guessing that won't be needed here, since most likely:
- Auto Scaling tried to scale, but SageMaker couldn't fulfill the request for some reason (vCPU limits, maybe?). Check on the SageMaker side to see whether its desired capacity was changed and whether it reported any errors. You could also check CloudTrail for API calls from Auto Scaling to SageMaker around that time trying to scale. (See the sketch after this list.)
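For concreteness, here is a minimal boto3 sketch of those checks plus the CloudTrail lookup. The alarm name, endpoint name, and variant name are placeholders, not values from this thread:

```python
import boto3

# Placeholder names; substitute your own alarm, endpoint, and variant.
ALARM_NAME = "my-scaling-alarm"
ENDPOINT_NAME = "my-endpoint"
VARIANT_NAME = "AllTraffic"

# 1) Confirm the alarm actually fired its scaling action.
cw = boto3.client("cloudwatch")
history = cw.describe_alarm_history(
    AlarmName=ALARM_NAME,
    HistoryItemType="Action",
)
for item in history["AlarmHistoryItems"]:
    print(item["Timestamp"], item["HistorySummary"])

# 2) Inspect the Application Auto Scaling activity history, including
#    activities where the service decided not to scale.
aas = boto3.client("application-autoscaling")
activities = aas.describe_scaling_activities(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}",
    IncludeNotScaledActivities=True,
)
for act in activities["ScalingActivities"]:
    print(act["StatusCode"], act.get("StatusMessage", ""))

# 3) Check on the SageMaker side whether the desired instance count moved.
sm = boto3.client("sagemaker")
endpoint = sm.describe_endpoint(EndpointName=ENDPOINT_NAME)
for variant in endpoint["ProductionVariants"]:
    print(variant["VariantName"],
          "current:", variant["CurrentInstanceCount"],
          "desired:", variant.get("DesiredInstanceCount"))

# 4) Optionally, find the scaling call in CloudTrail. Application Auto
#    Scaling adjusts endpoint capacity via UpdateEndpointWeightsAndCapacities.
ct = boto3.client("cloudtrail")
events = ct.lookup_events(
    LookupAttributes=[{
        "AttributeKey": "EventName",
        "AttributeValue": "UpdateEndpointWeightsAndCapacities",
    }],
)
for ev in events["Events"]:
    print(ev["EventTime"], ev["EventName"])
```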
Thanks @Shahad_C for your helpful comment; your guess was right.
- Based on your suggestion, I checked the CloudWatch logs and saw that Auto Scaling tried to scale but failed.
- The error was "pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found".
- I used Python 3.8, torch==1.12.1, onnxruntime-gpu==1.14.1, and an ml.g4dn.xlarge GPU instance.
For further information: with the initially provisioned instances I can invoke the deployed model successfully (using the GPU, of course), but when it tries to scale out I get the error above. Do you have any idea what the problem is?
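NVMLError_FunctionNotFound generally means pynvml called an NVML entry point that the NVIDIA driver on the instance doesn't export, which points at a driver/library version mismatch. A quick diagnostic sketch you could run on the instance; nothing in it is specific to this endpoint:

```python
# Check that the NVIDIA driver and the NVML bindings agree, and that
# onnxruntime-gpu can actually see the CUDA execution provider.
import pynvml
import onnxruntime as ort

pynvml.nvmlInit()
print("driver version:", pynvml.nvmlSystemGetDriverVersion())
print("NVML version:", pynvml.nvmlSystemGetNVMLVersion())
pynvml.nvmlShutdown()

# 'CUDAExecutionProvider' should appear in this list on a GPU instance.
print("ORT providers:", ort.get_available_providers())
```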
Glad you found the error :D I'm not as familiar with the workings of SageMaker itself. I do see another user who had the same issue (although outside of SageMaker), and it looks like it was a dependency issue. So my best guess would be a version mismatch somewhere.
I discussed this with a coworker who works with SageMaker more. Are you using an AWS-provided image for the model? If so, can you provide the URI of the container used for deploying the endpoint?
Hi @Shahad_C
- I used an AWS-provided image; the deployment code is as follows:
```python
import boto3
from sagemaker.pytorch import PyTorchModel

# Environment variable pointing the container at a requirements file.
env = {'SAGEMAKER_REQUIREMENTS': 'requirements.txt'}

# role and jets_model_data are defined earlier in the notebook.
model = PyTorchModel(
    entry_point="inference.py",
    source_dir="code",
    role=role,
    env=env,
    model_data=jets_model_data,
    framework_version="1.12.1",
    py_version="py38",
)

sagemaker_client = boto3.client('sagemaker')
```
And for more information, I tried to run an exported ONNX model using the onnxruntime-gpu library.
- What does "URI of the container" mean? Could you describe it in more detail?
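For context, the container URI is the ECR address of the Docker image the endpoint runs. A sketch of how to look it up for a deployed endpoint, with "my-endpoint" as a placeholder name (multi-container models expose a Containers list instead of PrimaryContainer):

```python
import boto3

sm = boto3.client("sagemaker")

# Walk from the endpoint to its config, then to the model definition.
endpoint = sm.describe_endpoint(EndpointName="my-endpoint")
config = sm.describe_endpoint_config(
    EndpointConfigName=endpoint["EndpointConfigName"]
)
model_name = config["ProductionVariants"][0]["ModelName"]
model = sm.describe_model(ModelName=model_name)

# The container image URI the endpoint was deployed with.
print(model["PrimaryContainer"]["Image"])
```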
One quick comment: you crossed out the endpoint name, which IMO isn't very sensitive, but you left your account ID in the output above it; you might want to redact that.
Thanks Shahad_C, I fixed it!