Alarm not triggered even when the metric crosses the threshold


Hi,

I have configured a CloudWatch alarm based on an expression that involves 2 metrics.

Expression = m2 - m1, where,

m2: AWS/AutoScaling - GroupInServiceInstances

m1: ECS/ContainerInsights - TaskCount

Alarm configuration: Expression <= 5 for 1 datapoint within 1 minute

Missing data is treated as breaching the threshold.
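For reference, the alarm described above could be expressed as `put_metric_alarm` parameters roughly like this. It is a sketch under assumptions: the helper name `build_alarm_params`, the dimension names (`AutoScalingGroupName`, `ClusterName`), the `Average` statistic and the 60-second period are guesses for illustration, not details taken from the question.

```python
from typing import Any, Dict


def build_alarm_params(alarm_name: str, asg_name: str, cluster_name: str,
                       threshold: float = 5.0) -> Dict[str, Any]:
    """Hypothetical helper: put_metric_alarm kwargs for an "m2 - m1 <= threshold" alarm."""
    def stat(query_id: str, namespace: str, metric: str, dim_name: str, dim_value: str) -> Dict[str, Any]:
        # Assumed: Average over 60-second periods; dimension names are assumptions too.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": metric,
                    "Dimensions": [{"Name": dim_name, "Value": dim_value}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,  # inputs only; the expression is what gets alarmed on
        }

    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": 1,      # 1 datapoint within 1 minute
        "DatapointsToAlarm": 1,
        "TreatMissingData": "breaching",
        "Metrics": [
            stat("m2", "AWS/AutoScaling", "GroupInServiceInstances", "AutoScalingGroupName", asg_name),
            stat("m1", "ECS/ContainerInsights", "TaskCount", "ClusterName", cluster_name),
            {"Id": "e1", "Expression": "m2 - m1", "Label": "m2 - m1", "ReturnData": True},
        ],
    }
```

With boto3 installed and credentials configured, the result would be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Only the expression query `e1` returns data, because a metric math alarm must have exactly one query with `ReturnData` set to true.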

The problem is that the alarm is not triggered when expected. Once the expression breaches the threshold (5), the configuration above (1 datapoint within 1 minute) means the alarm should trigger after about a minute. Instead, it triggers after an inconsistent delay, ranging from 2 to 15 minutes, which delays the actions (autoscaling) associated with the alarm.
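To make the expectation concrete, here is a simplified model of how a 1-out-of-1 alarm with `<= 5` and missing data treated as breaching picks its first breaching datapoint. It is only a sketch: the function name and sample values are hypothetical, and it does not model when the underlying metrics are published or when CloudWatch actually runs the evaluation, so any gap between its answer and the real state change has to come from something it leaves out.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def first_breaching_minute(
    start: datetime,
    values: Sequence[Optional[float]],
    threshold: float,
) -> Optional[datetime]:
    """Return the timestamp of the first breaching 1-minute datapoint, or None.

    values[i] is the expression value for the minute starting at start + i minutes;
    None stands for a missing datapoint.
    """
    for i, value in enumerate(values):
        # TreatMissingData=breaching: a gap counts the same as a low value.
        if value is None or value <= threshold:
            return start + timedelta(minutes=i)
    return None


# Hypothetical expression values starting at 7:20 (not the asker's real data).
print(first_breaching_minute(datetime(2022, 1, 21, 7, 20), [8, 7, 6, 5, 4], 5.0))
# prints 2022-01-21 07:23:00
```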

Please refer to this link for a screenshot. The blue line in the first graph is the expression value and the red line is the threshold. As the graph shows, the expression crosses the threshold at 7:23, but the alarm is triggered at 7:40: the In alarm state (red bar) in the second chart starts 17 minutes later than it should.

Any help is really appreciated.

Asked 2 years ago · Viewed 935 times
2 answers

From the screenshot, it actually looks like the alarm is breached when your expression is greater than the threshold, not lower.
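The direction of the comparison is what this answer points at. The sketch below restates the four static-threshold `ComparisonOperator` values as plain Python comparisons of "datapoint versus threshold"; the dictionary and function names are hypothetical, and anomaly-detection operators are left out.

```python
import operator

# Static-threshold ComparisonOperator values and the check each one applies
# as "datapoint <op> threshold".
COMPARISONS = {
    "GreaterThanOrEqualToThreshold": operator.ge,
    "GreaterThanThreshold": operator.gt,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def is_breaching(value: float, threshold: float, comparison_operator: str) -> bool:
    """Return True if a single datapoint breaches the threshold for this operator."""
    return COMPARISONS[comparison_operator](value, threshold)


# The same low value breaches under one operator and not under the other.
print(is_breaching(3.0, 5.0, "LessThanOrEqualToThreshold"))     # True
print(is_breaching(3.0, 5.0, "GreaterThanOrEqualToThreshold"))  # False
```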

Answered 2 years ago

Hi Dhruv,

The alarm configuration you described and the screenshot you provided don't quite align with each other.

I suggest looking at the alarm's History and checking the StateUpdate that happened at 7:40, as you mentioned. Its reason should explain why the alarm triggered when it did. In the StateUpdate history entry, look for a section that starts like the following example:

...
"newState": {
      "stateValue": "ALARM",
      "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [42.7118644067809 (21/01/22 13:00:00)] was greater than or equal to the threshold (40.0) (minimum 1 datapoint for OK -> ALARM transition).",
...
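The same history can also be pulled programmatically. This sketch assumes boto3's CloudWatch client method `describe_alarm_history`, which returns `AlarmHistoryItems` whose `HistoryData` field is a JSON string like the excerpt above. The helper name `state_updates` is hypothetical, and the client is passed in so the function can be tried without AWS access.

```python
import json
from datetime import datetime
from typing import Any, Dict, List


def state_updates(client: Any, alarm_name: str, start: datetime, end: datetime) -> List[Dict[str, Any]]:
    """Return the new state and reason of every StateUpdate between start and end, oldest first.

    `client` should behave like boto3.client("cloudwatch").
    """
    request: Dict[str, Any] = {
        "AlarmName": alarm_name,
        "HistoryItemType": "StateUpdate",
        "StartDate": start,
        "EndDate": end,
        "ScanBy": "TimestampAscending",
    }
    updates: List[Dict[str, Any]] = []
    while True:
        page = client.describe_alarm_history(**request)
        for item in page.get("AlarmHistoryItems", []):
            # HistoryData is a JSON string containing oldState/newState.
            new_state = json.loads(item["HistoryData"]).get("newState", {})
            updates.append({
                "timestamp": item["Timestamp"],
                "stateValue": new_state.get("stateValue"),
                "stateReason": new_state.get("stateReason"),
            })
        token = page.get("NextToken")
        if not token:
            return updates
        request["NextToken"] = token  # fetch the next page of history items
```

For example, `state_updates(boto3.client("cloudwatch"), "my-alarm", start, end)`, where `my-alarm` is a placeholder, lists each ALARM and OK transition with its reason, which makes it easy to line the 7:40 entry up against the graph.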

This section explains how and why the alarm was triggered. The reason data also lets you confirm the alarm's configured threshold and comparison operator.

Further down, the stateReasonData fields such as recentDatapoints, threshold, and evaluatedDatapoints give more detail about the StateUpdate.
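Once you have a StateUpdate's `HistoryData`, the `stateReasonData` fields mentioned above can be pulled out directly. A minimal sketch, assuming `stateReasonData` sits under `newState` and that each `evaluatedDatapoints` entry carries a `timestamp` and usually a `value`; the function name is hypothetical. Comparing the evaluated timestamps with the time of the state change shows which minute of data the alarm was actually looking at when it fired.

```python
import json
from typing import Any, Dict, List


def evaluated_datapoints(history_data: str) -> List[Dict[str, Any]]:
    """List the datapoints a StateUpdate says it evaluated, together with the threshold.

    history_data is the HistoryData JSON string of one StateUpdate history item.
    """
    new_state = json.loads(history_data).get("newState", {})
    reason_data = new_state.get("stateReasonData", {})
    if isinstance(reason_data, str):
        # Be tolerant in case stateReasonData arrives as an encoded JSON string.
        reason_data = json.loads(reason_data)
    threshold = reason_data.get("threshold")
    return [
        {
            "timestamp": point.get("timestamp"),
            "value": point.get("value"),  # None if the point carried no value
            "threshold": threshold,
        }
        for point in reason_data.get("evaluatedDatapoints", [])
    ]
```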

I hope this helps you troubleshoot your alarm configuration and its transitions to the ALARM state.

AWS
Support Engineer
Answered 2 years ago
  • Hi Munkhbat_T,

    Thanks for the detailed response.

    Below are more details from the History section. They are for a different timeframe (24/01/2022) than the one in the original question.

    Here is a screenshot of the alarm and metric values: https://ibb.co/s6RjBTH. As it shows, the alarm triggered at 4:12 when it was supposed to trigger at 3:49.

    Here is a screenshot from the History section for the same period (bottom rows): https://ibb.co/6gyjkgR

    Here is a screenshot of the state change data for the alarm triggering at 4:12: https://ibb.co/7vLgPQR
