Why did my CloudWatch alarm trigger when its metric doesn't have any breaching data points?


My Amazon CloudWatch alarm changed to the ALARM state. When I check the metric that's being monitored, the CloudWatch graph doesn't show any breaching datapoints. However, the Alarm History contains an entry with a breaching data point. Why did my CloudWatch alarm trigger?

Short description

CloudWatch alarms evaluate metrics based on the datapoints that are available at a given moment. Alarm History captures a record of the datapoints that the alarm evaluated at that timestamp. However, it's possible for new samples to be published after the alarm evaluation occurred. These new samples might impact the value that's calculated when CloudWatch aggregates the metric data.

Resolution

Find the breaching datapoints

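If your CloudWatch graph doesn't show any breaching datapoints, then additional samples were most likely published for that period after the alarm evaluated it. The graph aggregates every sample that's now available, but the alarm evaluated only the samples that existed at evaluation time. To understand how this happens, refer to the following example.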

In this example, X number of samples are available when an alarm evaluation occurs, resulting in an aggregated value of A. Later, new samples are posted, resulting in Y number of samples that are retrieved for the same timestamp. This results in a different aggregated value of B.

In this situation, an alarm is configured with the following parameters:

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 60 seconds (1 minute)
  • Evaluation Period: 1

When the alarm evaluates the period from 12:00:00 - 12:01:00 UTC, it retrieves the following samples for the metric:

Sample-1: 12:00:00 UTC, numeric value: 0.00675
Sample-2: 12:00:00 UTC, numeric value: 0.00789
Sample-3: 12:00:00 UTC, numeric value: 0.00421

The average of these values is 0.006283333, which breaches the threshold of 0.005 seconds. Therefore, the alarm changes to the ALARM state. The alarm's history captures the aggregated values that exceed the threshold.

The host might temporarily experience a performance issue, which impacts the client application that's responsible for publishing metrics. As a result, the host might not post datapoints at equally spaced intervals. In this situation, additional samples for 12:00 were published after the alarm evaluation occurred. The following are all the samples for the 12:00 timestamp:

Sample-1: 12:00:00 UTC, numeric value: 0.00675
Sample-2: 12:00:00 UTC, numeric value: 0.00789
Sample-3: 12:00:00 UTC, numeric value: 0.00421
Sample-4: 12:00:00 UTC, numeric value: 0.00002
Sample-5: 12:00:00 UTC, numeric value: 0.00007

After receiving an alert from this alarm, the user renders a CloudWatch graph to review the metric behavior. CloudWatch retrieves the five samples from 12:00:00 - 12:01:00 UTC and aggregates them as an average of 0.003788. This is different from the previously calculated value and is below the threshold. Therefore, the breaching datapoints are not visible in the time range because additional samples were posted after the alarm evaluation occurred.
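The arithmetic in this example can be checked with a short Python sketch. The sample values and threshold come from the example above. The variable names are illustrative only.

```python
from statistics import mean

THRESHOLD = 0.005  # seconds, from the example alarm configuration

# Samples available when the alarm evaluated the 12:00 period
at_evaluation = [0.00675, 0.00789, 0.00421]
# Samples published for the same period after the evaluation occurred
late_samples = [0.00002, 0.00007]

# Value the alarm calculated (recorded in Alarm History)
avg_at_evaluation = mean(at_evaluation)
# Value the graph calculates later from all five samples
avg_on_graph = mean(at_evaluation + late_samples)

print(f"alarm evaluated: {avg_at_evaluation:.9f} breaching={avg_at_evaluation > THRESHOLD}")
print(f"graph shows:     {avg_on_graph:.9f} breaching={avg_on_graph > THRESHOLD}")
```

The first value is above the threshold and the second is below it, which matches the difference between the Alarm History entry and the graph.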

Increase the Alarm Evaluation Interval

An alarm's evaluation interval is the number of evaluation periods (the data points that the alarm evaluates) multiplied by the period. Configuring an "M out of N" Datapoints to Alarm setting uses more evaluation periods, which results in a longer evaluation interval. When an alarm generates false alerts because metric data arrives late, a longer evaluation interval gives delayed datapoints time to be included in the alarm evaluation. This reduces the number of false alerts.

You can increase the evaluation interval in one of two ways:

1.    Increase the period. In the following example, the period is increased to five minutes:

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 300 seconds (5 minutes)
  • Evaluation Period: 1

-or-

2.    Configure "M out of N" Datapoints to Alarm.

In the following example, M out of N datapoints are configured to two out of three.

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 60 seconds (1 minute)
  • Evaluation Period (N): 3
  • Datapoints To Alarm (M): 2
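As a rough sketch, the two-out-of-three configuration above could be expressed as parameters for the CloudWatch PutMetricAlarm API, called here through boto3's put_metric_alarm. The alarm name is hypothetical, and splitting the dimension into a name of "host" and a value of "h_04254448d4e964956" is an assumption about how the example dimension is meant to be read. The API call itself is wrapped in a function that isn't invoked, because it requires boto3 and valid AWS credentials.

```python
def build_alarm_params(period_seconds, evaluation_periods, datapoints_to_alarm):
    """Return keyword arguments for a PutMetricAlarm request."""
    return {
        "AlarmName": "web-app-response-time",  # hypothetical name, not from this article
        "Namespace": "Web_App",
        "MetricName": "ResponseTime",
        # Assumed reading of "host,h_04254448d4e964956" as name/value
        "Dimensions": [{"Name": "host", "Value": "h_04254448d4e964956"}],
        "Statistic": "Average",
        "Threshold": 0.005,
        "ComparisonOperator": "GreaterThanThreshold",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,   # N
        "DatapointsToAlarm": datapoints_to_alarm,  # M
    }


def apply_alarm(params):
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the sketch runs without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Two out of three datapoints with a one-minute period
params = build_alarm_params(period_seconds=60, evaluation_periods=3, datapoints_to_alarm=2)
print(params["Period"] * params["EvaluationPeriods"], "second evaluation interval")
```

Calling build_alarm_params(300, 1, 1) instead would correspond to the first option, where only the period is increased.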

When you configure Evaluation Periods and Datapoints to Alarm as different values, you set an "M out of N" alarm. Datapoints to Alarm is M and Evaluation Period is N. For example, if you configure four out of five data points with a period of one minute, then the evaluation interval is five minutes. Similarly, if you configure three out of three data points with a period of ten minutes, the evaluation interval is thirty minutes.
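The two interval calculations above are simple multiplication, shown here as a minimal Python sketch. The function name is illustrative and isn't part of any AWS API.

```python
def evaluation_interval_seconds(period_seconds, evaluation_periods):
    """Evaluation interval = period multiplied by the number of evaluation periods (N)."""
    return period_seconds * evaluation_periods


# Four out of five datapoints, one-minute period: N is 5
print(evaluation_interval_seconds(60, 5))   # 300 seconds (5 minutes)

# Three out of three datapoints, ten-minute period: N is 3
print(evaluation_interval_seconds(600, 3))  # 1800 seconds (30 minutes)
```

Note that only N affects the interval. M changes how many of those datapoints must breach, not how far back the alarm looks.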

With Datapoints to Alarm configured in this way, CloudWatch evaluates more data points and changes the alarm state only when at least M of the N most recent data points breach the threshold. Depending on the values that you choose, the alarm can trigger on a single breaching datapoint or require multiple breaching datapoints before it transitions to the ALARM state.
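The "M out of N" rule can be illustrated with a simplified Python model. This is only a sketch of the counting logic for a GreaterThanThreshold alarm. It assumes that every period has a datapoint and doesn't model how CloudWatch treats missing data or selects its evaluation range. The function name and the aggregated values are made up for illustration.

```python
def should_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return True when at least M of the N most recent datapoints exceed the threshold."""
    recent = datapoints[-evaluation_periods:]             # the N most recent datapoints
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm               # at least M must breach


THRESHOLD = 0.005

# One-minute averages, oldest first. Only the last one breaches.
one_spike = [0.004, 0.004, 0.0063]
print(should_alarm(one_spike, THRESHOLD, 1, 1))  # True: 1 out of 1 breaches
print(should_alarm(one_spike, THRESHOLD, 3, 2))  # False: only 1 out of 3 breaches

# Two of the last three averages breach.
sustained = [0.004, 0.0063, 0.0071]
print(should_alarm(sustained, THRESHOLD, 3, 2))  # True: 2 out of 3 breach
```

In this model, a single breaching datapoint that's caused by late samples doesn't change the state of a two-out-of-three alarm, but a sustained breach still does.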

For more information, see Create a CloudWatch alarm based on a static threshold and Configuring how CloudWatch alarms treat missing data.


Related information

Why didn't I receive an Amazon Simple Notification Service (Amazon SNS) notification for my CloudWatch alarm trigger?

Why is my CloudWatch alarm in INSUFFICIENT_DATA state?

Why did my CloudWatch alarm send me a notification after a single breached data point?

AWS OFFICIAL, updated a year ago