I want to troubleshoot missing Amazon CloudWatch metrics for my Amazon SageMaker AI endpoint.
Resolution
Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
Check your CloudWatch metric namespace and dimensions
Make sure that you're checking the correct CloudWatch namespace:
- The AWS/SageMaker namespace includes invocation metrics from API calls to InvokeEndpoint, such as Invocations and ModelLatency.
- The /aws/sagemaker/Endpoints namespace includes instance metrics for your endpoint, such as CPUUtilization and MemoryUtilization.
- The /aws/sagemaker/InferenceComponents namespace includes metrics from API calls to InvokeEndpoint for endpoints that host inference components.
For more information, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch.
Also, the dimensions for SageMaker AI endpoint metrics are EndpointName and VariantName.
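To confirm which metrics exist for a given namespace and dimension pair, you can list them programmatically. The following is a minimal sketch, not an official procedure. It assumes that you pass in a boto3 CloudWatch client (for example, boto3.client("cloudwatch")), and the function name list_endpoint_metric_names is hypothetical.

```python
from typing import Any, Dict, List, Set


def list_endpoint_metric_names(
    client: Any, namespace: str, endpoint_name: str, variant_name: str
) -> List[str]:
    """Return the sorted metric names that CloudWatch lists for one endpoint variant.

    `client` is assumed to be a boto3 CloudWatch client (or any object that
    exposes a compatible list_metrics method).
    """
    kwargs: Dict[str, Any] = {
        "Namespace": namespace,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
    }
    names: Set[str] = set()
    while True:
        page = client.list_metrics(**kwargs)
        names.update(m["MetricName"] for m in page.get("Metrics", []))
        token = page.get("NextToken")
        if not token:
            return sorted(names)
        # Follow pagination until CloudWatch stops returning a token.
        kwargs["NextToken"] = token
```

An empty result for AWS/SageMaker but a non-empty result for /aws/sagemaker/Endpoints suggests that the endpoint is running but hasn't received invocations. Note that ListMetrics doesn't return metrics that haven't reported data in the past two weeks.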
Check your IAM permissions
To publish metrics to CloudWatch and manage log groups, the AWS Identity and Access Management (IAM) role that's associated with your endpoint must have the required IAM permissions.
Example permissions:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:PutMetricData",
      "cloudwatch:ListMetrics",
      "cloudwatch:GetMetricData",
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ],
    "Resource": "*"
  }]
}
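As a quick first pass, you can compare a policy document against the actions in the example above. The following sketch is deliberately simplified: it looks only at Allow statements and their Action lists, and it ignores Deny statements, NotAction, Resource, and Condition elements, so it isn't a substitute for the IAM policy simulator. The function name missing_actions is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Actions taken from the example policy in this article.
REQUIRED_ACTIONS = (
    "cloudwatch:PutMetricData",
    "cloudwatch:ListMetrics",
    "cloudwatch:GetMetricData",
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
)


def missing_actions(policy: dict, required: Iterable[str] = REQUIRED_ACTIONS) -> List[str]:
    """Return the required actions that no Allow statement in `policy` matches."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]

    allowed_patterns: List[str] = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lowercase.
        allowed_patterns.extend(a.lower() for a in actions)

    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in allowed_patterns)
    ]
```

An empty list means that every action in the example is matched by at least one Allow statement. It doesn't prove that the role can use those actions, because an explicit Deny or a permissions boundary can still block them.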
Check your CloudWatch metrics 20 minutes after you create or update your endpoint
After you create or update an endpoint, CloudWatch metrics might take up to 20 minutes to become available. Wait at least 20 minutes before you check your metrics.
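To decide whether enough time has passed, you can compare the endpoint's last modification time with the current time. The sketch below assumes that you read LastModifiedTime from the SageMaker AI DescribeEndpoint response (boto3 returns it as a timezone-aware datetime). The helper name time_until_metrics_expected is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Waiting period stated in this article.
METRIC_DELAY = timedelta(minutes=20)


def time_until_metrics_expected(
    last_modified: datetime, now: Optional[datetime] = None
) -> timedelta:
    """Return how much longer to wait, or zero if the 20 minutes have elapsed.

    `last_modified` must be timezone-aware, like the LastModifiedTime value
    that boto3 returns from describe_endpoint.
    """
    current = now or datetime.now(timezone.utc)
    remaining = last_modified + METRIC_DELAY - current
    return max(remaining, timedelta(0))
```

A result of timedelta(0) means that the waiting period is over and that missing metrics likely have another cause.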
Check your SageMaker AI CloudWatch Logs
Check your SageMaker AI logs to identify issues that might prevent your metrics from publishing to CloudWatch. To access your SageMaker AI logs, use the CloudWatch console. Or, run the following get-log-events command:
aws logs get-log-events --log-group-name example-sagemaker-log-group-name --log-stream-name example-log-stream-name
Note: Replace example-sagemaker-log-group-name with the name of your SageMaker AI log group and example-log-stream-name with the name of a log stream in that log group.
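If you don't know which log stream to read, you can look up the most recently active one first. The following sketch assumes a boto3 CloudWatch Logs client (boto3.client("logs")) and assumes that the endpoint's log group follows the /aws/sagemaker/Endpoints/<endpoint-name> naming pattern. Verify the log group name in your account. The function name latest_log_messages is hypothetical.

```python
from typing import Any, List


def latest_log_messages(logs_client: Any, endpoint_name: str, limit: int = 50) -> List[str]:
    """Return up to `limit` recent messages from the endpoint's newest log stream."""
    # Assumption: endpoint logs live in this log group; adjust if yours differs.
    log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

    # Find the stream that received events most recently.
    streams = logs_client.describe_log_streams(
        logGroupName=log_group,
        orderBy="LastEventTime",
        descending=True,
        limit=1,
    ).get("logStreams", [])
    if not streams:
        return []

    # startFromHead=False returns the latest events in the stream.
    events = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=streams[0]["logStreamName"],
        limit=limit,
        startFromHead=False,
    ).get("events", [])
    return [event["message"] for event in events]
```

An empty list can mean that the log group has no streams yet, which itself points to a permissions or startup problem on the endpoint.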
Check your metrics retention period
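CloudWatch retains metric data for up to 15 months, but at progressively coarser granularity: 1-minute data points are available for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days. If you query an older time range with a period that's too short, then the graph might appear empty.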
To view historical CloudWatch metrics for your SageMaker AI endpoint, complete the following steps:
- Open the CloudWatch console.
- In the navigation pane, choose Metrics, and then choose All metrics.
- Choose the metric that you want to view.
- To view a graph that displays historical data for your metric in a specified time period, set a time range.
For more information, see Logging with CloudWatch.
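When you query older data, the period that you request must match what CloudWatch still retains. The following sketch encodes the standard-resolution retention tiers (1-minute, 5-minute, and 1-hour data points) as a lookup. The function name finest_period_seconds is hypothetical.

```python
from datetime import timedelta
from typing import Optional

# (maximum data age, finest period in seconds still retained at that age)
RETENTION_TIERS = (
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
)


def finest_period_seconds(age: timedelta) -> Optional[int]:
    """Return the shortest period that CloudWatch still retains for data of this age.

    Returns None when the data is older than the 15-month retention limit.
    """
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            return period
    return None
```

For example, to graph an endpoint's metrics from 30 days ago, use a period of at least 300 seconds.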
Check your endpoint invocation activity
SageMaker AI publishes invocation metrics to CloudWatch only when your endpoint receives requests. Check whether there's traffic or invocation activity on your SageMaker AI endpoint.
To check your endpoint invocation history, complete the following steps:
- Open the SageMaker AI console.
- In the navigation pane, choose Inference, and then choose Endpoints.
- Select your endpoint.
- Choose the Monitor tab, and then choose View invocation history.
Or, run the following sagemaker-runtime command to retrieve your endpoint invocation history:
aws sagemaker-runtime get-invocation-history --endpoint-name example-endpoint-name [--max-results example-number] [--starting-time example-timestamp]
Note: Replace example-endpoint-name with your endpoint name, example-number with the maximum number of results that you want to view, and example-timestamp with the start time.
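Another way to gauge recent traffic is to sum the Invocations metric in the AWS/SageMaker namespace. The following sketch assumes a boto3 CloudWatch client (boto3.client("cloudwatch")). The function name total_invocations and its defaults are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def total_invocations(
    cw_client: Any,
    endpoint_name: str,
    variant_name: str,
    hours: int = 1,
    now: Optional[datetime] = None,
) -> float:
    """Sum the Invocations metric for one endpoint variant over the last `hours`."""
    end_time = now or datetime.now(timezone.utc)
    response = cw_client.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end_time - timedelta(hours=hours),
        EndTime=end_time,
        Period=300,  # 5-minute buckets
        Statistics=["Sum"],
    )
    # No data points at all yields 0.
    return sum(point["Sum"] for point in response.get("Datapoints", []))
```

A result of 0 is ambiguous: the endpoint might have received no requests, or the metrics might not be publishing. Compare the result with request counts from your client application to tell the two cases apart.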