CloudWatch: Why is there a delay between the threshold breach and the alarm being triggered?


Hello,

I have a CloudWatch alarm that triggers when the InvocationsPerInstance metric for our SageMaker endpoint breaches a threshold in a one-minute period. The general setup:

MetricName='InvocationsPerInstance',
Namespace='AWS/SageMaker',
Statistic='Sum',
Period=60,
EvaluationPeriods=1,
DatapointsToAlarm=1

From this, I would expect the alarm to be triggered as soon as we see a one-minute period above the threshold. However, there seems to be a 3-minute delay between when the data breaches the threshold and when the alarm is triggered. I want the alarm to trigger as soon as a datapoint breaches the threshold, if possible.

Here is the newState of the alarm. You can see that the breaching datapoint is timestamped 15:24:00, but the alarm is not triggered until 15:27:26.

"newState": { "stateValue": "ALARM", "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [29752.0 (17/01/24 15:24:00)] was greater than the threshold (28800.0) (minimum 1 datapoint for OK -> ALARM transition).", "stateReasonData": { "version": "1.0", "queryDate": "2024-01-17T15:27:26.429+0000", "startDate": "2024-01-17T15:24:00.000+0000", "statistic": "Sum", "period": 60, "recentDatapoints": [ 29752 ], "threshold": 28800, "evaluatedDatapoints": [ { "timestamp": "2024-01-17T15:24:00.000+0000", "sampleCount": 29752, "value": 29752 } ] } } }

1 Answer

I'd recommend checking the "Missing data treatment" setting of the alarm.

As per the documentation: "CloudWatch enables you to specify how to treat missing data points when evaluating an alarm. This helps you to configure your alarm so that it goes to ALARM state only when appropriate for the type of data being monitored. You can avoid false positives when missing data doesn't indicate a problem."

For each alarm, you can configure CloudWatch to treat missing data points as any of the following:

notBreaching – Missing data points are treated as "good" and within the threshold

breaching – Missing data points are treated as "bad" and breaching the threshold

ignore – The current alarm state is maintained

missing – If all data points in the alarm evaluation range are missing, the alarm transitions to INSUFFICIENT_DATA.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
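As a rough sketch, one of the four options listed above can be passed as the TreatMissingData parameter when the alarm is created. The example below only builds the keyword arguments for boto3's put_metric_alarm call, reusing the metric settings and threshold from the question. The alarm name, endpoint name, and variant name are hypothetical placeholders, and "notBreaching" is picked purely for illustration, not as a confirmed fix for the delay.

```python
# The four missing-data options accepted by CloudWatch alarms.
VALID_TREAT_MISSING_DATA = {"breaching", "notBreaching", "ignore", "missing"}


def build_alarm_params(treat_missing_data,
                       alarm_name="sagemaker-invocations-high",  # hypothetical
                       endpoint_name="my-endpoint",              # hypothetical
                       variant_name="AllTraffic"):               # hypothetical
    """Build keyword arguments for put_metric_alarm with a missing-data policy."""
    if treat_missing_data not in VALID_TREAT_MISSING_DATA:
        raise ValueError(f"unsupported TreatMissingData value: {treat_missing_data!r}")
    return {
        "AlarmName": alarm_name,
        "MetricName": "InvocationsPerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": 28800.0,  # threshold shown in the alarm's stateReason
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": treat_missing_data,
    }


params = build_alarm_params("notBreaching")
print(params["TreatMissingData"])

# To actually create or update the alarm (needs boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter construction separate from the API call makes it easy to try each option and compare how the alarm behaves.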

Hope that helps you get the alarm into the ALARM state right after a one-minute period above the threshold.

Thanks

AWS
Takeda
answered 3 months ago
