CloudWatch: Why is there a delay between the threshold breach and the alarm being triggered?


Hello,

I have a CloudWatch alarm that triggers when the InvocationsPerInstance metric for our SageMaker endpoint breaches a threshold in a one-minute period. The general setup:

MetricName='InvocationsPerInstance',
Namespace='AWS/SageMaker',
Statistic='Sum',
Period=60,
EvaluationPeriods=1,
DatapointsToAlarm=1

From this, I would expect the alarm to trigger as soon as we see a single one-minute period above the threshold. However, there seems to be a delay of about 3 minutes between when the data breaches the threshold and when the alarm is triggered. If possible, I want the alarm to trigger as soon as there is data that breaches the threshold.

Here is the newState of the alarm. You can see that the breaching datapoint is timestamped 15:24:00, but the alarm is not triggered until 15:27:26:

"newState": {
  "stateValue": "ALARM",
  "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [29752.0 (17/01/24 15:24:00)] was greater than the threshold (28800.0) (minimum 1 datapoint for OK -> ALARM transition).",
  "stateReasonData": {
    "version": "1.0",
    "queryDate": "2024-01-17T15:27:26.429+0000",
    "startDate": "2024-01-17T15:24:00.000+0000",
    "statistic": "Sum",
    "period": 60,
    "recentDatapoints": [ 29752 ],
    "threshold": 28800,
    "evaluatedDatapoints": [
      {
        "timestamp": "2024-01-17T15:24:00.000+0000",
        "sampleCount": 29752,
        "value": 29752
      }
    ]
  }
}
}
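To put a number on the delay, here is a minimal Python sketch that takes the two timestamps from the state data above and measures the gap. It assumes the datapoint timestamp marks the start of the 60-second period, so the period itself only closes at 15:25:00. The helper name `alarm_delay` is made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the alarm's stateReasonData fields.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"


def alarm_delay(start_date, query_date, period_seconds):
    """Return (delay since period start, delay since period end)."""
    start = datetime.strptime(start_date, FMT)
    query = datetime.strptime(query_date, FMT)
    since_start = query - start
    # Assumption: the datapoint timestamp is the START of the period,
    # so the data is only complete period_seconds later.
    since_end = since_start - timedelta(seconds=period_seconds)
    return since_start, since_end


since_start, since_end = alarm_delay(
    "2024-01-17T15:24:00.000+0000",  # startDate from the alarm state
    "2024-01-17T15:27:26.429+0000",  # queryDate from the alarm state
    60,                              # period from the alarm state
)
print(since_start.total_seconds())
print(since_end.total_seconds())
```

Under that assumption, about one minute of the observed gap is simply the period filling up, and the remainder is the part still to be explained.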

1 Answer

I'd recommend checking the "Missing data treatment" setting on the alarm.

As per the documentation: "CloudWatch enables you to specify how to treat missing data points when evaluating an alarm. This helps you to configure your alarm so that it goes to ALARM state only when appropriate for the type of data being monitored. You can avoid false positives when missing data doesn't indicate a problem."

For each alarm, you can configure CloudWatch to treat missing data points in any of the following ways:

notBreaching – Missing data points are treated as "good" and within the threshold

breaching – Missing data points are treated as "bad" and breaching the threshold

ignore – The current alarm state is maintained

missing – If all data points in the alarm evaluation range are missing, the alarm transitions to INSUFFICIENT_DATA.
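As a rough sketch of where this setting lives, the four options above map to the `TreatMissingData` parameter of boto3's `put_metric_alarm`. The example below only builds the keyword arguments, so it runs without AWS credentials. The alarm name and the dimension values are hypothetical placeholders, and the threshold and comparison operator are inferred from the alarm state in the question. Which treatment actually suits this alarm is not settled here.

```python
# The four documented missing-data treatments.
VALID_TREATMENTS = {"notBreaching", "breaching", "ignore", "missing"}


def build_alarm_kwargs(treat_missing_data="missing"):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    if treat_missing_data not in VALID_TREATMENTS:
        raise ValueError(f"unknown treatment: {treat_missing_data!r}")
    return {
        "AlarmName": "example-invocations-alarm",  # hypothetical name
        "MetricName": "InvocationsPerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [  # hypothetical endpoint and variant values
            {"Name": "EndpointName", "Value": "my-endpoint"},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": 28800.0,  # from the alarm state in the question
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": treat_missing_data,
    }


kwargs = build_alarm_kwargs("notBreaching")
print(kwargs["TreatMissingData"])
# To apply (needs credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Validating the value before the call keeps a typo such as "notbreaching" from reaching the API.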

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

I hope this helps you get the alarm into the ALARM state as soon as a one-minute period is above the threshold.

Thanks

AWS
Takeda
Answered 3 months ago
