Monitoring SageMaker Notebook Instances with CloudWatch Custom Metrics

Content level: Intermediate

AWS customers who use SageMaker notebooks to write and debug code, and sometimes to train smaller machine learning models, often run into resource utilization issues. Large models or large blobs of data can consume excessive CPU and memory on the instance, which can lead to resource contention and errors. Customers need a way to monitor critical OS-level metrics, such as CPU and memory utilization, on SageMaker notebooks. This lets them right-size instances and avoid overutilization, latency, or downtime.

Publishing Custom Metrics to CloudWatch

While Amazon SageMaker provides powerful notebooks for ML development, there is no straightforward documentation on monitoring notebook instance resources in real time. In this post, we'll go through the steps to publish custom metrics from the operating system of a SageMaker notebook instance to Amazon CloudWatch.

The key steps are:

  1. Open a terminal on the SageMaker notebook instance (for example, from the Jupyter interface)

  2. Install the latest updates

    sudo yum update
    sudo yum install epel-release

  3. Configure the CloudWatch agent using the wizard

    sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

  4. Attach the CloudWatchAgentServerPolicy to the notebook IAM role

  5. Restart the agent to pick up new metrics

    sudo systemctl restart amazon-cloudwatch-agent
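
The configuration wizard writes a JSON file that tells the agent which OS-level metrics to collect. As a rough illustration of what such a file can contain, the Python sketch below builds a minimal configuration by hand. The function name `build_agent_config` and the namespace `SageMakerNotebook/Custom` are assumptions made for this example, not values from the steps above, and the file the wizard generates for you may look different.

```python
import json


def build_agent_config(namespace="SageMakerNotebook/Custom", interval_seconds=60):
    """Return a minimal CloudWatch agent config for CPU, memory, and disk.

    The namespace is a hypothetical example; if it is omitted from the
    config, the agent publishes under its default namespace, CWAgent.
    """
    return {
        "agent": {
            # How often (in seconds) the agent samples each metric.
            "metrics_collection_interval": interval_seconds,
        },
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                "cpu": {
                    "measurement": [
                        "cpu_usage_idle",
                        "cpu_usage_user",
                        "cpu_usage_system",
                    ],
                    # Report one aggregate value across all cores.
                    "totalcpu": True,
                },
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {
                    "measurement": ["used_percent"],
                    # Only watch the root filesystem in this sketch.
                    "resources": ["/"],
                },
            },
        },
    }


if __name__ == "__main__":
    # Print the config so it can be compared with the wizard's output.
    print(json.dumps(build_agent_config(), indent=2))
```

Printing the result makes it easy to compare against the wizard's output and to see which measurements drive the CPU and memory metrics that appear in CloudWatch.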

By following these steps, customers can gain visibility into instance utilization and avoid resource bottlenecks. They can right-size their SageMaker notebooks to maximize efficiency and minimize cost.
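
Once the agent is publishing, the metrics can be read back to inform a sizing decision. The sketch below shows one hypothetical way to do that with the boto3 `get_metric_statistics` call. The helper names, the `CWAgent` namespace (the agent's default), the `mem_used_percent` metric, the `host` dimension, and the 20% and 80% thresholds are all assumptions to adjust to your own agent configuration.

```python
from datetime import datetime, timedelta, timezone


def summarize(datapoints):
    """Reduce CloudWatch datapoints to the mean of averages and the peak maximum."""
    if not datapoints:
        return None
    mean = sum(p["Average"] for p in datapoints) / len(datapoints)
    peak = max(p["Maximum"] for p in datapoints)
    return {"mean": mean, "peak": peak}


def sizing_hint(summary, low=20.0, high=80.0):
    """Map a summary to a coarse hint. The thresholds are illustrative only."""
    if summary is None:
        return "no data"
    if summary["peak"] >= high:
        return "consider a larger instance"
    if summary["peak"] < low:
        return "consider a smaller instance"
    return "size looks reasonable"


def fetch_datapoints(host, metric="mem_used_percent", namespace="CWAgent", hours=24):
    """Fetch 5-minute Average/Maximum datapoints for one host (assumed dimension)."""
    import boto3  # imported here so the helpers above work without boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        # Dimensions must match what the agent actually publishes.
        Dimensions=[{"Name": "host", "Value": host}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    return response["Datapoints"]


# Example usage (requires AWS credentials and a real host value):
#   print(sizing_hint(summarize(fetch_datapoints("<your-host-dimension>"))))
```

Reading metrics requires the `cloudwatch:GetMetricStatistics` permission, which is separate from the publish permission the agent needs, so the role running this sketch may need an additional policy.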

Conclusion

With a few simple configuration changes, AWS customers can monitor SageMaker notebook instances through CloudWatch custom metrics. This enables data-driven decisions about resource allocation and performance. SageMaker notebooks can be sized appropriately, which reduces both overutilized notebooks that cause errors and oversized notebooks that waste spend. Instance sizing is then backed by real-time usage data.