CloudWatch: Why is there a delay between the threshold breach and the alarm being triggered?


Hello,

I have a CloudWatch alarm that triggers when the InvocationsPerInstance metric for our SageMaker endpoint breaches a threshold in a one-minute period. The general setup:

MetricName='InvocationsPerInstance',
Namespace='AWS/SageMaker',
Statistic='Sum',
Period=60,
EvaluationPeriods=1,
DatapointsToAlarm=1

From this, I would expect the alarm to trigger as soon as we see a single one-minute period above the threshold. However, there seems to be a 3-minute delay between when the data breaches the threshold and when the alarm is triggered. If possible, I want the alarm to trigger as soon as there is a breaching datapoint.

Here is the newState of the alarm. You can see that the breaching datapoint is timestamped 15:24:00, but the alarm is not triggered until 15:27:26:

"newState": {
  "stateValue": "ALARM",
  "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [29752.0 (17/01/24 15:24:00)] was greater than the threshold (28800.0) (minimum 1 datapoint for OK -> ALARM transition).",
  "stateReasonData": {
    "version": "1.0",
    "queryDate": "2024-01-17T15:27:26.429+0000",
    "startDate": "2024-01-17T15:24:00.000+0000",
    "statistic": "Sum",
    "period": 60,
    "recentDatapoints": [ 29752 ],
    "threshold": 28800,
    "evaluatedDatapoints": [
      {
        "timestamp": "2024-01-17T15:24:00.000+0000",
        "sampleCount": 29752,
        "value": 29752
      }
    ]
  }
}
}
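To put a number on the gap, here is a minimal sketch that computes it from the queryDate, startDate, and period values in the state data above. The function name evaluation_delay is just illustrative, and the "from period end" figure assumes that the datapoint timestamp marks the start of the 60-second period, which is my understanding of how CloudWatch timestamps datapoints.

```python
from datetime import datetime, timedelta

# Values copied from the alarm's stateReasonData shown above.
state_reason_data = {
    "queryDate": "2024-01-17T15:27:26.429+0000",
    "startDate": "2024-01-17T15:24:00.000+0000",
    "period": 60,
}

# Timestamp layout used in the alarm state JSON (milliseconds plus UTC offset).
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def evaluation_delay(data):
    """Return (delay since period start, delay since period end) as timedeltas."""
    query = datetime.strptime(data["queryDate"], TS_FORMAT)
    start = datetime.strptime(data["startDate"], TS_FORMAT)
    from_start = query - start
    # Assumption: the datapoint timestamp is the START of the period,
    # so the period only finishes `period` seconds later.
    from_end = from_start - timedelta(seconds=data["period"])
    return from_start, from_end


from_start, from_end = evaluation_delay(state_reason_data)
print("seconds since period start:", from_start.total_seconds())
print("seconds since period end:  ", from_end.total_seconds())
```

Measured this way, the alarm fired roughly 3 minutes 26 seconds after the start of the breaching period, so part of the apparent delay is simply the minute it takes for that period to complete.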

DavidJ
Asked 4 months ago, 368 views
1 Answer

I'd recommend checking the alarm's "Missing data treatment" setting.

As per the documentation: "CloudWatch enables you to specify how to treat missing data points when evaluating an alarm. This helps you to configure your alarm so that it goes to ALARM state only when appropriate for the type of data being monitored. You can avoid false positives when missing data doesn't indicate a problem."

For each alarm, you can configure CloudWatch to treat missing data points in any of the following ways:

notBreaching – Missing data points are treated as "good" and within the threshold

breaching – Missing data points are treated as "bad" and breaching the threshold

ignore – The current alarm state is maintained

missing – If all data points in the alarm evaluation range are missing, the alarm transitions to INSUFFICIENT_DATA.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
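As a rough sketch of where this setting lives, the missing data treatment corresponds to the TreatMissingData parameter of boto3's put_metric_alarm call. The example below reuses the values from the question; the alarm name, endpoint name, variant name, and the helper names build_alarm_params and apply_alarm are placeholders I made up, and the comparison operator is inferred from the "was greater than the threshold" text in the state reason. It only shows how to set the option and does not by itself guarantee a shorter delay.

```python
# The four missing-data treatments listed above.
VALID_TREATMENTS = {"breaching", "notBreaching", "ignore", "missing"}


def build_alarm_params(treat_missing_data="missing"):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    if treat_missing_data not in VALID_TREATMENTS:
        raise ValueError(f"unknown missing-data treatment: {treat_missing_data!r}")
    return {
        "AlarmName": "invocations-per-instance-high",  # hypothetical name
        "MetricName": "InvocationsPerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint"},  # hypothetical value
            {"Name": "VariantName", "Value": "AllTraffic"},    # hypothetical value
        ],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": 28800.0,  # threshold taken from the question's state reason
        "ComparisonOperator": "GreaterThanThreshold",  # inferred from the state reason
        "TreatMissingData": treat_missing_data,
    }


def apply_alarm(params):
    """Create or update the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the parameters can be built offline

    boto3.client("cloudwatch").put_metric_alarm(**params)


params = build_alarm_params("notBreaching")
print(params["TreatMissingData"])
```

Calling apply_alarm(params) would then create or update the alarm with the chosen treatment.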

Hope that helps you get the alarm into the ALARM state after a single one-minute period above the threshold.

Thanks

AWS
Takeda
Answered 3 months ago
