My Amazon CloudWatch alarm changed to the ALARM state. When I check the metric that's monitored, I don't see breaching datapoints on the CloudWatch graph. However, the Alarm History contains an entry with a breaching datapoint. I want to know what initiated my CloudWatch alarm.
Short description
If your application publishes additional datapoints for the same time period after the alarm evaluates, then the CloudWatch graph shows an updated aggregated value. The updated aggregated value includes the delayed datapoints. The graph might not show a breach even though the alarm history recorded one based on the datapoints available during evaluation.
Resolution
Find the breaching datapoints
The following example shows how delayed datapoints cause the CloudWatch graph to differ from the alarm history.
The alarm is configured with the following parameters:
- Namespace: Web_App
- Metric: ResponseTime
- Dimension: host,h_04254448d4e964956
- Statistic: Average
- Threshold: 0.005
- ComparisonOperator: GreaterThanThreshold
- Period: 60 seconds (1 minute)
- Evaluation Period: 1
When the alarm evaluates the period from 12:00:00 - 12:01:00 UTC, it retrieves the following values for the metric:
Sample-1: 12:00:00 UTC, numeric value: 0.00675
Sample-2: 12:00:00 UTC, numeric value: 0.00789
Sample-3: 12:00:00 UTC, numeric value: 0.00421
The average of these values is 0.006283333 seconds, which exceeds the threshold of 0.005 seconds, so the alarm changes to the ALARM state. The alarm history records the aggregated value that exceeded the threshold.
If the host temporarily experiences a performance issue, then the issue also affects the client application that publishes the metrics. As a result, the host might not post datapoints at equal intervals. In this case, your application publishes additional samples for the 12:00 timestamp after the alarm evaluation occurs.
The following example represents all samples for the 12:00 timestamp:
Sample-1: 12:00:00 UTC, numeric value: 0.00675
Sample-2: 12:00:00 UTC, numeric value: 0.00789
Sample-3: 12:00:00 UTC, numeric value: 0.00421
Sample-4: 12:00:00 UTC, numeric value: 0.00002
Sample-5: 12:00:00 UTC, numeric value: 0.00007
When you receive an alert from the alarm, you generate a CloudWatch graph to review the metric behavior. CloudWatch retrieves the five samples from 12:00:00 - 12:01:00 UTC and aggregates them as an average of 0.003788. This value differs from the value that the alarm calculated and is below the threshold. If your application publishes additional samples after the alarm evaluation occurs, then the breaching datapoint might not be visible on the graph for that time range.
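The arithmetic in this example can be reproduced with a short Python sketch. It uses only the sample values and the threshold listed above. The variable names are illustrative, and the sketch models only the averaging, not how CloudWatch stores or aggregates metrics internally.

```python
from statistics import mean

THRESHOLD = 0.005  # seconds, from the alarm configuration above

# Samples that were available when the alarm evaluated the 12:00 period.
at_evaluation = [0.00675, 0.00789, 0.00421]

# Samples published for the same period after the evaluation occurred.
delayed = [0.00002, 0.00007]

# Average that the alarm calculated and recorded in its history.
alarm_average = mean(at_evaluation)

# Average that the graph shows after the delayed samples arrive.
graph_average = mean(at_evaluation + delayed)

print(f"alarm history: {alarm_average:.9f} breach={alarm_average > THRESHOLD}")
print(f"graph:         {graph_average:.9f} breach={graph_average > THRESHOLD}")
```

The same period produces a breaching average at evaluation time and a non-breaching average afterward, which is why the graph and the alarm history disagree.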
Increase the alarm evaluation interval
When an alarm generates false alerts because of delayed metrics, increase the alarm's evaluation interval. A longer evaluation interval includes the delayed datapoints in the alarm evaluation, which reduces the number of false alerts. You can lengthen the evaluation interval by increasing the period or by configuring Datapoints to Alarm.
To increase the evaluation interval, use one of the following options.
Increase the period
In the following example, the period is increased to 5 minutes:
- Namespace: Web_App
- Metric: ResponseTime
- Dimension: host,h_04254448d4e964956
- Statistic: Average
- Threshold: 0.005
- ComparisonOperator: GreaterThanThreshold
- Period: 300 seconds (5 minutes)
- Evaluation Period: 1
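As a rough sketch, the 5-minute configuration above could be expressed as keyword arguments for the boto3 CloudWatch `put_metric_alarm` call. The helper `build_alarm_params` and the alarm name are hypothetical, and the dimension is assumed to have the name `host` and the value `h_04254448d4e964956`. The call itself is left as a comment because it requires boto3 and AWS credentials.

```python
def build_alarm_params(period_seconds: int, evaluation_periods: int = 1) -> dict:
    """Return keyword arguments shaped for boto3's CloudWatch put_metric_alarm."""
    return {
        "AlarmName": "web-app-response-time",  # hypothetical alarm name
        "Namespace": "Web_App",
        "MetricName": "ResponseTime",
        # Assumes the dimension name is "host" and the rest is its value.
        "Dimensions": [{"Name": "host", "Value": "h_04254448d4e964956"}],
        "Statistic": "Average",
        "Threshold": 0.005,
        "ComparisonOperator": "GreaterThanThreshold",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
    }


params = build_alarm_params(period_seconds=300)

# Evaluation interval in seconds: period multiplied by evaluation periods.
print(params["Period"] * params["EvaluationPeriods"])

# To apply the configuration (requires boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Because `put_metric_alarm` overwrites the existing configuration of an alarm with the same name, a real call would also need to include any alarm actions and other settings that this sketch omits.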
Configure M out of N Datapoints to Alarm
Configure the alarm to require multiple breaching datapoints before the alarm changes to the ALARM state.
In the following example, the alarm requires two out of three datapoints (M = 2, N = 3) to breach the threshold:
- Namespace: Web_App
- Metric: ResponseTime
- Dimension: host,h_04254448d4e964956
- Statistic: Average
- Threshold: 0.005
- ComparisonOperator: GreaterThanThreshold
- Period: 60 seconds (1 minute)
- Evaluation Period (N): 3
- Datapoints To Alarm (M): 2
When you configure Evaluation Periods and Datapoints to Alarm as different values, you create an M out of N alarm. Datapoints to Alarm is set to M and Evaluation Periods is set to N. The evaluation interval is the number of evaluation periods multiplied by the period. For example, if you configure four out of five datapoints with a period of 1 minute, then the evaluation interval is 5 minutes. If you configure three out of three datapoints with a period of 10 minutes, then the evaluation interval is 30 minutes.
When you configure an M out of N alarm, CloudWatch evaluates the N most recent datapoints and changes the alarm state when at least M of those datapoints breach the threshold. Use Datapoints to Alarm to adjust whether the alarm activates on a single breaching datapoint or requires multiple breaching datapoints to transition to the ALARM state.
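The evaluation interval arithmetic and the M out of N rule can be sketched in Python as follows. This is a simplified model, not CloudWatch's implementation: it ignores how missing data is treated, both function names are hypothetical, and the datapoint values are made up for illustration.

```python
def evaluation_interval_seconds(period_seconds: int, evaluation_periods: int) -> int:
    """Evaluation interval: the period multiplied by the evaluation periods (N)."""
    return period_seconds * evaluation_periods


def breaches_m_of_n(datapoints, threshold, m, n):
    """Return True if at least m of the n most recent datapoints exceed threshold."""
    recent = datapoints[-n:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= m


# Intervals from the text: 4 out of 5 at 1 minute, 3 out of 3 at 10 minutes.
print(evaluation_interval_seconds(60, 5))   # seconds in 5 minutes
print(evaluation_interval_seconds(600, 3))  # seconds in 30 minutes

# Illustrative per-minute averages, oldest first (not taken from the article).
one_spike = [0.004, 0.004, 0.0063]
sustained = [0.004, 0.0061, 0.0063]

print(breaches_m_of_n(one_spike, 0.005, m=1, n=1))  # single datapoint alarm
print(breaches_m_of_n(one_spike, 0.005, m=2, n=3))  # 2 out of 3 alarm
print(breaches_m_of_n(sustained, 0.005, m=2, n=3))  # 2 out of 3 alarm
```

In this model, one isolated breaching datapoint activates the 1 out of 1 alarm but not the 2 out of 3 alarm, while two breaching datapoints within the last three activate both.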
For more information, see Create a CloudWatch alarm based on a static threshold and Configuring how CloudWatch alarms treat missing data.
Related information
Why didn't I receive an Amazon SNS notification for my CloudWatch alarm trigger?
How do I troubleshoot my CloudWatch alarm in the INSUFFICIENT_DATA state?
Why did my CloudWatch alarm send me a notification after a single breached data point?