Understanding CloudWatch Alarm Evaluation: Why Alarms Delay, Fire Unexpectedly, or Get Stuck
CloudWatch alarms fire when the graph looks clean. They take minutes to react to obvious spikes. They get stuck in INSUFFICIENT_DATA for no apparent reason. These are among the most common questions on re:Post, and they all trace back to the same root cause: alarm evaluation works differently than most people expect. This article walks through how it actually works: the core settings, why delays happen, how missing data is handled, and practical patterns for batch jobs that only run once a day.
The Three Building Blocks of Alarm Evaluation
Every CloudWatch alarm depends on three settings. Most of the confusion comes from not understanding how they interact.
Period
The length of time (in seconds) that CloudWatch uses to aggregate metric data into a single data point. For example, a Period of 300 seconds (5 minutes) means CloudWatch takes all the raw data points published during that 5-minute window and aggregates them into one value using the statistic you chose (Average, Sum, Maximum, etc.).
Think of Period as the "bucket size" for your data.
Evaluation Periods
The number of the most recent periods (data points) that CloudWatch looks at when deciding whether to change the alarm state. If your Period is 5 minutes and Evaluation Periods is 3, the alarm considers the last 15 minutes of aggregated data.
Datapoints to Alarm
The number of data points within the Evaluation Periods window that must be breaching the threshold to trigger the ALARM state. These breaching data points do not need to be consecutive — they just need to fall within the evaluation window.
When Datapoints to Alarm equals Evaluation Periods, every data point in the window must breach. When it's lower, you get an M out of N alarm, for example "2 out of 3 data points must breach." M out of N works well on most alarms: compared with a 1 out of 1 alarm it filters out transient spikes, and compared with an N out of N alarm it still fires when one point in the window is missing or non-breaching.
How They Work Together
Evaluation Interval = Evaluation Periods × Period
Example:
Period = 1 minute
Evaluation Periods = 5
Datapoints to Alarm = 3
→ "3 out of 5" alarm
→ Looks at the last 5 minutes of 1-minute data points
→ Fires if any 3 of those 5 are breaching
Note: If the evaluation interval exceeds one day, the alarm becomes a multi-day alarm and is evaluated once per hour instead of once per minute. See the Infrequent Workloads section for details.
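The way the three settings combine can be sketched in a few lines of Python. This is a simplified illustration, not CloudWatch's actual implementation: the function names are invented for this example, and missing data is left out entirely.

```python
from typing import Sequence

ONE_DAY_SECONDS = 24 * 60 * 60


def evaluation_interval_seconds(period_seconds: int, evaluation_periods: int) -> int:
    """Evaluation Interval = Evaluation Periods x Period."""
    return period_seconds * evaluation_periods


def is_multi_day(period_seconds: int, evaluation_periods: int) -> bool:
    """An interval longer than one day makes the alarm a multi-day alarm."""
    return evaluation_interval_seconds(period_seconds, evaluation_periods) > ONE_DAY_SECONDS


def m_of_n_breached(breaching: Sequence[bool], evaluation_periods: int,
                    datapoints_to_alarm: int) -> bool:
    """True if at least M of the most recent N data points are breaching.

    `breaching` is ordered oldest to newest; the breaching points do not
    need to be consecutive.
    """
    window = list(breaching)[-evaluation_periods:]
    return sum(window) >= datapoints_to_alarm


# The "3 out of 5" example with 1-minute periods
print(evaluation_interval_seconds(60, 5))                        # 300
print(m_of_n_breached([True, False, True, False, True], 5, 3))   # True
print(m_of_n_breached([True, False, False, False, True], 5, 3))  # False
```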
Why Your Alarm Seems Delayed
People ask this constantly: "My metric breached the threshold, but the alarm didn't fire for 3 minutes. Why?"
Three things add up to create that delay:
1. Metric Reporting Lag
Most AWS service metrics are reported at standard resolution (one-minute granularity), though some services like EC2 default to 5-minute intervals under basic monitoring. The data point for a given period may not be available in CloudWatch the instant that period ends — the time it takes varies by service.
2. Alarm Evaluation Frequency
For alarms with a Period of 60 seconds or longer, CloudWatch evaluates the alarm once per minute. This means even with a 1-minute Period, the alarm checks once per minute and uses whatever data is available at that moment.
For high-resolution alarms (Period of 10, 20, or 30 seconds), evaluation happens every 10 seconds.
3. The Evaluation Window Itself
If your alarm requires multiple breaching data points (Evaluation Periods > 1 and Datapoints to Alarm > 1), the alarm must wait for enough periods to pass before it can determine the state.
Putting It All Together: The Timeline
Let's trace a concrete example. You have an alarm with:
- Period: 1 minute
- Evaluation Periods: 3
- Datapoints to Alarm: 3 (all 3 must breach)
Time    Event
─────   ──────────────────────────────────────────────────
10:00   Metric spikes above threshold
10:01   Data point for 10:00–10:01 not yet available (reporting lag)
10:02   Data point for 10:00–10:01 arrives. Alarm evaluates:
        - Only 1 breaching data point available → stays OK
10:03   Data point for 10:01–10:02 arrives. Alarm evaluates:
        - 2 breaching data points → stays OK (need 3)
10:04   Data point for 10:02–10:03 arrives. Alarm evaluates:
        - 3 breaching data points → transitions to ALARM ✓
Result: ~4 minutes between the actual spike and the alarm firing. That's just the math — reporting lag plus three 1-minute periods that all need to breach.
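The arithmetic behind that timeline can be written as a rough estimate. The sketch below assumes the breach starts on a period boundary, every following period breaches, and each data point becomes visible a fixed lag after its period ends. It ignores the up-to-one-minute wait for the next evaluation, and the function name is made up for illustration.

```python
def minutes_until_alarm(period_minutes: float, datapoints_to_alarm: int,
                        reporting_lag_minutes: float) -> float:
    """Rough time from the start of a sustained breach to the ALARM transition.

    The alarm needs `datapoints_to_alarm` complete breaching periods, and the
    last of them only becomes visible after the reporting lag.
    """
    return datapoints_to_alarm * period_minutes + reporting_lag_minutes


# The example above: 1-minute Period, 3 out of 3, about 1 minute of lag
print(minutes_until_alarm(1, 3, 1))  # 4 -> spike at 10:00, ALARM at 10:04
# Same metric with a 2 out of 3 alarm
print(minutes_until_alarm(1, 2, 1))  # 3
```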
How to Reduce the Delay
| Approach | Trade-off |
|---|---|
| Lower the Period (e.g., 10 seconds with high-resolution metrics) | Higher cost, requires high-resolution metric publishing |
| Reduce Evaluation Periods and Datapoints to Alarm (e.g., 1 out of 1) | Faster detection, but more susceptible to false positives from transient spikes |
| Use M out of N (e.g., 2 out of 3) | Good balance — tolerates one missing/non-breaching point while still catching sustained issues |
It comes down to how many false pages you're willing to tolerate vs. how fast you need to know. For most production alarms, a good default is "2 out of 3" with a 1-minute Period, then adjust from there.
The Three Alarm States (and Evaluation States)
CloudWatch alarms are always in one of three states:
| State | Meaning |
|---|---|
| OK | The metric is within the defined threshold |
| ALARM | The metric is outside the defined threshold |
| INSUFFICIENT_DATA | The alarm just started, the metric is not available, or not enough data exists to evaluate |
In addition to the alarm state, each alarm has an evaluation state that provides information about the evaluation process itself. You can view this in the alarm details in the console or via the describe-alarms CLI command:
| Evaluation State | Meaning |
|---|---|
| PARTIAL_DATA | Not all available data could be retrieved due to quota limitations |
| EVALUATION_ERROR | Configuration errors in alarm setup that require review (check the StateReason field) |
| EVALUATION_FAILURE | Temporary CloudWatch issues; monitor manually until resolved |
Why Is My Alarm Stuck in INSUFFICIENT_DATA?
This comes up almost as often as the delay question. Usual suspects:
- The alarm was just created. New alarms start in INSUFFICIENT_DATA and transition within a few minutes once CloudWatch has enough data points to evaluate.
- The metric stopped reporting. If the metric source (an EC2 instance, a Lambda function, etc.) stops publishing data, and the alarm's treat-missing-data setting is missing (the default), the alarm transitions to INSUFFICIENT_DATA.
- Wrong metric dimensions. The alarm is watching a metric with dimensions that don't match any published data. Double-check the namespace, metric name, and dimension values.
- The metric only reports on events. Metrics like error counts or throttled requests only publish data points when events occur. Between events, there's no data: not zero, but no data at all. This is where treat-missing-data becomes critical.
Treat Missing Data: The Setting Most People Get Wrong
Most people never touch this setting and leave it at the default. That's a mistake. It controls what CloudWatch does when a data point is missing from the evaluation window, and the default is wrong for a lot of common use cases.
The Four Options
| Setting | Behavior | Best For |
|---|---|---|
| missing (default) | Missing data points are treated as missing. If there aren't enough real data points in the evaluation range to make a determination, the alarm transitions to INSUFFICIENT_DATA | Metrics that should always report data |
| notBreaching | Treat missing data as "good" (within threshold) | Error-count metrics that only report when errors occur (e.g., ALB HTTPCode_ELB_5XX_Count, Lambda Errors) |
| breaching | Treat missing data as "bad" (threshold violated) | Heartbeat/health-check metrics where silence means trouble |
| ignore | Maintain the current alarm state | When you want the alarm to "remember" its last known state during data gaps |
Note on DynamoDB: Alarms on metrics in the AWS/DynamoDB namespace default to ignoring missing data. When a DynamoDB metric has missing data, the alarm remains in its current state. For details on this behavior, see Configuring how CloudWatch alarms treat missing data.
The Most Common Mistake
Leaving the default (missing) on an error-count metric. Here's what happens in practice:
- Your application has no errors → metric publishes nothing (not zero, just no data)
- Alarm evaluates → all data points in the evaluation range are missing → transitions to INSUFFICIENT_DATA
- You get paged for INSUFFICIENT_DATA or the alarm becomes invisible in your dashboard
- You lose trust in the alarm and ignore it
Fix: Set treat-missing-data to notBreaching for any metric that only reports data when something goes wrong.
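As a sketch of what that fix looks like in practice, the function below builds the parameters for a Lambda Errors alarm using the PutMetricAlarm parameter names. The alarm name, function name, and threshold are placeholders chosen for this example, not recommendations.

```python
def error_count_alarm_params(alarm_name: str, function_name: str) -> dict:
    """Build PutMetricAlarm parameters for a Lambda Errors alarm.

    The threshold and the "2 out of 3" shape are illustrative values.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "Threshold": 5.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No errors means no data points, so count silence as healthy.
        "TreatMissingData": "notBreaching",
    }


params = error_count_alarm_params("orders-fn-errors", "orders-fn")
print(params["TreatMissingData"])  # notBreaching
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.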
Decision Guide
Does the metric report data continuously (CPU, memory, request count)?
├─ YES → Use "missing" (default), or "breaching" if silence = problem
└─ NO → Does the metric only report when something bad happens?
   ├─ YES → Use "notBreaching"
   └─ NO → Does the metric report intermittently by design?
      └─ YES → Use "ignore" to avoid state flapping
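The same decision guide can be expressed as a small helper, for example to lint alarm definitions in a review script. The function and parameter names are hypothetical, and the final fallback to the default is an assumption, because the guide does not say what to do when no branch matches.

```python
def suggest_treat_missing_data(reports_continuously: bool,
                               silence_means_trouble: bool = False,
                               only_reports_on_bad_events: bool = False,
                               intermittent_by_design: bool = False) -> str:
    """Walk the decision guide and return a treat-missing-data value."""
    if reports_continuously:
        return "breaching" if silence_means_trouble else "missing"
    if only_reports_on_bad_events:
        return "notBreaching"
    if intermittent_by_design:
        return "ignore"
    # Assumption: fall back to the CloudWatch default when nothing matches.
    return "missing"


print(suggest_treat_missing_data(True))                                     # missing
print(suggest_treat_missing_data(True, silence_means_trouble=True))         # breaching
print(suggest_treat_missing_data(False, only_reports_on_bad_events=True))   # notBreaching
print(suggest_treat_missing_data(False, intermittent_by_design=True))       # ignore
```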
How the Evaluation Range Works Behind the Scenes
This part explains the "phantom alarm" — you look at the graph, see no breaching data points, but the alarm fired anyway. What happened?
CloudWatch doesn't just look at the most recent N data points (where N = Evaluation Periods). It actually retrieves a larger window of data points called the evaluation range. The evaluation range extends further back in time to account for potentially missing data points.
Here's the key behavior:
- If no data points are missing: CloudWatch uses only the most recent N data points. The extra retrieved data is ignored.
- If some data points are missing but enough real data exists: CloudWatch uses the most recent real data points, potentially pulling from further back in the evaluation range. Your treat-missing-data setting is ignored because there's enough real data.
- If there aren't enough real data points: CloudWatch fills in the gaps using your treat-missing-data setting, using as few synthetic points as possible.
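A deliberately simplified model of those three cases is sketched below. It is not CloudWatch's real algorithm. It only handles the breaching and notBreaching settings, returns the real points unchanged for missing and ignore, and does not model premature alarm state prevention. The function name and the greater-than comparison are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def points_used(evaluation_range: Sequence[Optional[float]], threshold: float,
                evaluation_periods: int, treat_missing: str) -> List[bool]:
    """Return the breaching flags a simplified alarm would evaluate.

    `evaluation_range` is ordered oldest to newest; None marks a missing point.
    """
    real = [value > threshold for value in evaluation_range if value is not None]
    if len(real) >= evaluation_periods:
        # Enough real data: use the most recent N real points and
        # ignore the treat-missing-data setting.
        return real[-evaluation_periods:]
    gap = evaluation_periods - len(real)
    if treat_missing == "breaching":
        return real + [True] * gap
    if treat_missing == "notBreaching":
        return real + [False] * gap
    # "missing" and "ignore" add no synthetic points in this sketch.
    return real


# Two gaps, but three real points exist: the setting is never consulted.
print(points_used([None, 5, None, 12, 15], 10, 3, "missing"))
# Only one real point: two synthetic non-breaching points fill the gap.
print(points_used([None, None, None, 12, None], 10, 3, "notBreaching"))
```

The first call prints `[False, True, True]` and the second prints `[True, False, False]`.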
Premature Alarm State Prevention
CloudWatch includes logic to avoid false alarms when data is intermittent. Consider an alarm with Evaluation Periods = 3 and Datapoints to Alarm = 3, where treat-missing-data is set to breaching. The evaluation range retrieves 5 data points (more than the 3 Evaluation Periods) to account for potential gaps:
- If the most recent data is "- - - - X" (four missing, one breaching as the latest), the alarm does not immediately go to ALARM. CloudWatch holds off because the next data point might be non-breaching.
- However, if the pattern is "- - X - -" (the breaching data point is old enough, at least as old as the Datapoints to Alarm position, and all more recent data points are missing), the alarm will go to ALARM.
This logic applies to M out of N alarms as well. The key rule: if the oldest breaching data point in the evaluation range is at least as old as the Datapoints to Alarm value, and all more recent data points are either breaching or missing, the alarm transitions to ALARM regardless of the M value.
Why the Graph Doesn't Match the Alarm
The alarm evaluated at a specific moment using the data available at that moment. If your application publishes additional data points for the same time period after the alarm evaluated (late-arriving data), the CloudWatch graph updates to show the new aggregated value — but the alarm already made its decision based on what it had.
This is why the Alarm History is your source of truth, not the metric graph. The history shows exactly which data points the alarm used for each evaluation. You can access it in the CloudWatch console (Alarms → select your alarm → History tab) or via the describe-alarm-history CLI command.
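Pulling the history programmatically might look like the sketch below. It takes any client object that exposes `describe_alarm_history`, such as a boto3 CloudWatch client. The layout of the JSON inside `HistoryData` is treated as an assumption and read defensively, and the function name is invented for this example.

```python
import json
from typing import Any, Dict, List


def recent_state_changes(client: Any, alarm_name: str,
                         max_records: int = 10) -> List[Dict[str, Any]]:
    """Return recent state updates for one alarm, in the order the API returns them."""
    response = client.describe_alarm_history(
        AlarmName=alarm_name,
        HistoryItemType="StateUpdate",
        MaxRecords=max_records,
    )
    changes = []
    for item in response.get("AlarmHistoryItems", []):
        # HistoryData is a JSON string; missing keys simply yield None here.
        data = json.loads(item.get("HistoryData") or "{}")
        new_state = data.get("newState", {})
        changes.append({
            "timestamp": item.get("Timestamp"),
            "summary": item.get("HistorySummary"),
            "reason": new_state.get("stateReason"),
        })
    return changes
```

Assuming boto3 is installed and credentials are configured, a call could look like `recent_state_changes(boto3.client("cloudwatch"), "my-alarm")`, where `my-alarm` is a placeholder.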
Re-evaluation After Metric Stops Flowing
Here's one that catches people off guard: after a metric stops flowing (say, you terminate an EC2 instance), CloudWatch may keep re-evaluating the last set of data points for a while. That re-evaluation can cause the alarm to change state and re-fire actions even though nothing new happened. Shorter periods reduce the window where this can occur.
Practical Pattern: Alarming on Infrequent or Batch Workloads
This scenario comes up on re:Post all the time, and there's no single obvious answer. You have a data pipeline or batch job that runs once every 24 hours. You want to:
- Get alerted when it fails
- Have the alarm clear when you re-run the job successfully
- Not get stuck in ALARM or INSUFFICIENT_DATA between runs
The Problem with Naive Approaches
Short Period (e.g., 5 minutes): The alarm fires on failure, but clears after 5 minutes because there are no more error data points. You might miss the alert.
Long Period (e.g., 24 hours): The alarm fires on failure and stays in ALARM for 24 hours, even if you manually re-run the job and it succeeds, because "1 error in 24 hours" is still true.
Solution 1: Custom Metric with State Tracking
A practical approach is to publish a custom metric that explicitly tracks the job outcome:
- Publish a custom metric at the end of each job run:
  - Value 1 for success
  - Value 0 for failure
- Configure the alarm:
  - Metric: your custom JobStatus metric
  - Statistic: Minimum
  - Period: 1 hour (or shorter)
  - Evaluation Periods: 1
  - Datapoints to Alarm: 1
  - Threshold: < 1 (alarm when the value is 0)
  - Treat Missing Data: ignore
- Why this works:
  - When the job fails → publishes 0 → alarm fires
  - When you re-run and it succeeds → publishes 1 → alarm clears
  - Between runs → no data → ignore keeps the alarm in its last state (OK after success, ALARM after failure)
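A minimal sketch of both halves of this pattern follows: one function builds the data point the job would publish, the other builds the alarm parameters listed above. The namespace `Custom/Batch`, the `JobName` dimension, and the function names are assumptions for illustration.

```python
def job_status_datum(job_name: str, succeeded: bool) -> dict:
    """One MetricData entry for PutMetricData: 1 for success, 0 for failure."""
    return {
        "MetricName": "JobStatus",
        "Dimensions": [{"Name": "JobName", "Value": job_name}],
        "Value": 1.0 if succeeded else 0.0,
        "Unit": "None",
    }


def job_failed_alarm_params(job_name: str, namespace: str = "Custom/Batch") -> dict:
    """PutMetricAlarm parameters matching the configuration described above."""
    return {
        "AlarmName": f"{job_name}-failed",
        "Namespace": namespace,
        "MetricName": "JobStatus",
        "Dimensions": [{"Name": "JobName", "Value": job_name}],
        "Statistic": "Minimum",
        "Period": 3600,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",  # fires when the value is 0
        # Between runs there is no data, so keep the last known state.
        "TreatMissingData": "ignore",
    }


print(job_status_datum("nightly-export", succeeded=False)["Value"])  # 0.0
print(job_failed_alarm_params("nightly-export")["AlarmName"])        # nightly-export-failed
```

With a boto3 CloudWatch client, the job would publish with `put_metric_data(Namespace="Custom/Batch", MetricData=[datum])` and the alarm would be created once with `put_metric_alarm(**params)`.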
Solution 2: Multi-Day Alarms
Since January 2025, CloudWatch supports alarming on data up to 7 days old. When the evaluation interval (Evaluation Periods × Period) exceeds one day, the alarm becomes a multi-day alarm with different evaluation behavior:
- The alarm is evaluated once per hour instead of once per minute.
- Each evaluation considers metrics only up to the current hour at the :00 minute mark.
For example, a job that runs every 3 days at 10:00:
- At 10:02, the job fails.
- At 10:03, the alarm evaluates but stays OK (evaluation considers data only up to 10:00).
- At 11:03, the alarm evaluates, sees the failure data up to 11:00, and transitions to ALARM.
- At 11:43, you fix the error and the job succeeds.
- At 12:03, the alarm evaluates, sees the success, and returns to OK.
To create a multi-day alarm, specify a Period of at least 3,600 seconds (1 hour) and set the number of Evaluation Periods so that the total interval exceeds 24 hours.
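The two rules above, an interval longer than 24 hours and an hourly evaluation that only sees data up to the :00 mark, can be sketched as follows. This is a model of the behavior described in this section, not CloudWatch code, and the function names are invented.

```python
from datetime import datetime

ONE_DAY_SECONDS = 24 * 60 * 60


def min_evaluation_periods_for_multi_day(period_seconds: int) -> int:
    """Smallest Evaluation Periods value whose interval exceeds 24 hours."""
    if period_seconds < 3600:
        raise ValueError("multi-day alarms need a Period of at least 3,600 seconds")
    return ONE_DAY_SECONDS // period_seconds + 1


def data_cutoff(evaluation_time: datetime) -> datetime:
    """A multi-day evaluation considers data only up to the current hour's :00 mark."""
    return evaluation_time.replace(minute=0, second=0, microsecond=0)


def evaluation_sees(event_time: datetime, evaluation_time: datetime) -> bool:
    """True if data published at `event_time` falls inside the evaluation's cutoff."""
    return event_time <= data_cutoff(evaluation_time)


failure = datetime(2025, 3, 1, 10, 2)
print(min_evaluation_periods_for_multi_day(3600))              # 25
print(evaluation_sees(failure, datetime(2025, 3, 1, 10, 3)))   # False
print(evaluation_sees(failure, datetime(2025, 3, 1, 11, 3)))   # True
```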
Solution 3: Using FILL() with Metric Math
If you can't modify the job to publish a custom metric, you can use the FILL() metric math function to handle gaps:
FILL(m1, REPEAT)
FILL(metric, REPEAT) carries forward the last known value into periods with no data. This prevents the alarm from seeing missing data and avoids INSUFFICIENT_DATA transitions.
Other FILL options:
- FILL(m1, 0): fill missing data with zero
- FILL(m1, value): fill with a specific value
- FILL(m1, REPEAT): repeat the last known value
- FILL(m1, LINEAR): linear interpolation between the values at the beginning and end of the gap
Watch out with FILL in alarms: If your metric has even a slight publishing delay and the most recent period never has data at evaluation time, FILL will replace that gap with the fill value every single time. The alarm ends up always seeing the FILL value as its latest data point, which can lock it into OK or ALARM permanently. Pair FILL with an M out of N alarm so one filled data point can't dominate the evaluation.
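To make the REPEAT behavior concrete, the sketch below first imitates it on a plain list, then builds the parameters for a metric math alarm that evaluates `FILL(m1, REPEAT)` as 2 out of 3. The local `fill_repeat` function only imitates the idea and is not the CloudWatch implementation. The period, statistic, and threshold are placeholder values.

```python
from typing import List, Optional, Sequence


def fill_repeat(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Carry the last known value forward into gaps (None = missing)."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fill_repeat_alarm_params(alarm_name: str, namespace: str, metric_name: str) -> dict:
    """PutMetricAlarm parameters for an alarm on FILL(m1, REPEAT)."""
    return {
        "AlarmName": alarm_name,
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": 300,
                    "Stat": "Sum",
                },
                "ReturnData": False,  # input only, not the alarmed series
            },
            {
                "Id": "e1",
                "Expression": "FILL(m1, REPEAT)",
                "Label": "errors with gaps repeated",
                "ReturnData": True,   # the alarm watches this expression
            },
        ],
        # M out of N, so a single filled point cannot decide the state alone.
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }


print(fill_repeat([1, None, None, 0, None]))  # [1, 1, 1, 0, 0]
```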
Composite Alarms: Fewer Pages, More Signal
If you have dozens of metric alarms, you're likely getting more notifications than you can act on. Composite alarms let you combine them with boolean logic (AND, OR, NOT) so you only get paged when a meaningful combination of conditions is true. A single composite alarm can reference up to 100 child alarms (metric alarms or other composite alarms).
Example: Alert Only When Both Error Rate AND Latency Are High
ALARM("HighErrorRate") AND ALARM("HighLatency")
The composite alarm only fires when both children are in ALARM at the same time. A transient latency spike alone won't page you.
You can also use OK("AlarmName") and INSUFFICIENT_DATA("AlarmName") in rule expressions for more nuanced logic — for example, firing only when one alarm is in ALARM and another is NOT in OK. See Create a composite alarm for the full rule expression syntax.
Action Suppression with Suppressor Alarms
You can also designate a suppressor alarm on a composite alarm. When the suppressor goes into ALARM, the composite alarm stops taking actions entirely — no notifications, no Lambda triggers, nothing. When the suppressor returns to OK, actions resume.
A common pattern: create a suppressor alarm tied to a custom "MaintenanceMode" metric. Flip the metric to 1 before a deployment, and your composite alarm goes quiet until you flip it back.
Two timing parameters control the suppression:
- ActionsSuppressorWaitPeriod: How long to wait for the suppressor alarm to enter ALARM after the composite alarm enters ALARM. This compensates for delays in the suppressor alarm's own evaluation. If the suppressor doesn't enter ALARM within this window, the composite alarm takes its actions.
- ActionsSuppressorExtensionPeriod: How long to keep suppressing actions after the suppressor alarm returns to OK. This gives the composite alarm time to also return to OK before actions resume.
AWS recommends setting both parameters to at least 60 seconds, since metric alarms are evaluated every minute.
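Putting the rule expression and the suppressor settings together, a composite alarm definition might be assembled as in the sketch below, which uses the PutCompositeAlarm parameter names. The alarm names, the suppressor name, and the SNS topic ARN are placeholders.

```python
def alarm_rule_all_in_alarm(*alarm_names: str) -> str:
    """Build a rule that is true only when every child alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm_params(name: str, children: list, suppressor_alarm: str,
                           topic_arn: str) -> dict:
    """PutCompositeAlarm parameters with a suppressor and 60-second timing windows."""
    return {
        "AlarmName": name,
        "AlarmRule": alarm_rule_all_in_alarm(*children),
        "AlarmActions": [topic_arn],
        "ActionsSuppressor": suppressor_alarm,
        # Wait this long for the suppressor to enter ALARM before acting.
        "ActionsSuppressorWaitPeriod": 60,
        # Keep suppressing this long after the suppressor returns to OK.
        "ActionsSuppressorExtensionPeriod": 60,
    }


params = composite_alarm_params(
    "checkout-degraded",
    ["HighErrorRate", "HighLatency"],
    suppressor_alarm="MaintenanceMode",
    topic_arn="arn:aws:sns:us-east-1:123456789012:oncall",  # placeholder ARN
)
print(params["AlarmRule"])  # ALARM("HighErrorRate") AND ALARM("HighLatency")
```

With boto3, the dictionary could be passed to `put_composite_alarm(**params)`.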
If the suppressor alarm pattern is more than you need, look at Alarm Mute Rules instead — they let you silence alarm actions for a defined time window without wiring up a separate alarm. When both mute rules and suppressor alarms are active, mute rules win. See Alarm Mute Rules.
Quick Reference: Common Alarm Configurations
These configurations work well as starting points. Your thresholds will differ — the important thing is getting the structural parameters (Period, M out of N, treat-missing-data) right for each metric type.
Web Application (ALB + ECS/EKS)
| Alarm | Namespace | Metric | Statistic | Threshold (example) | Period | Eval Periods | Datapoints | Treat Missing |
|---|---|---|---|---|---|---|---|---|
| High 5xx Rate | AWS/ApplicationELB | HTTPCode_ELB_5XX_Count | Sum | > 10 per period | 1 min | 3 | 2 | notBreaching |
| High Latency | AWS/ApplicationELB | TargetResponseTime | p99 | > 2 seconds | 1 min | 5 | 3 | ignore |
| Unhealthy Targets | AWS/ApplicationELB | UnHealthyHostCount | Minimum | >= 1 | 1 min | 3 | 3 | missing |
Note on percentile alarms: When using percentile statistics like p99 on low-traffic services, be aware that CloudWatch requires a minimum number of data samples for percentile evaluation. If there are fewer than 10/(1-percentile) data points during the evaluation period — for p99, that's 1,000 samples — CloudWatch lets you choose whether to evaluate the alarm anyway or ignore the metric until enough data is available. See Percentile-based alarms and low data samples.
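The 10/(1-percentile) rule is easy to compute for any percentile. The helper below is a small sketch with an invented name. It takes the percentile as a string and uses exact fractions so that values such as 99.9 do not pick up floating-point error.

```python
import math
from fractions import Fraction


def min_samples_for_percentile(percentile: str) -> int:
    """Samples needed per period before a percentile counts as statistically
    significant: 10 / (1 - percentile), with the percentile given as "99" or "99.9"."""
    p = Fraction(percentile) / 100
    if not 0 < p < 1:
        raise ValueError("percentile must be between 0 and 100, exclusive")
    return math.ceil(10 / (1 - p))


print(min_samples_for_percentile("99"))    # 1000
print(min_samples_for_percentile("99.9"))  # 10000
print(min_samples_for_percentile("50"))    # 20
```

The choice between evaluating anyway and waiting for enough samples is made on the alarm itself through the `EvaluateLowSampleCountPercentile` parameter of PutMetricAlarm.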
EC2 Instance
| Alarm | Namespace | Metric | Statistic | Threshold (example) | Period | Eval Periods | Datapoints | Treat Missing |
|---|---|---|---|---|---|---|---|---|
| High CPU | AWS/EC2 | CPUUtilization | Average | > 80% | 5 min | 3 | 3 | missing |
| Status Check Failed | AWS/EC2 | StatusCheckFailed | Maximum | >= 1 | 1 min | 3 | 3 | missing |
| High Memory (agent) | CWAgent | mem_used_percent | Average | > 85% | 5 min | 3 | 2 | missing |
Note on StatusCheckFailed alarms: EC2 metrics can temporarily enter INSUFFICIENT_DATA if metric reporting is interrupted, even when the instance is healthy. For alarms that take recovery actions (stop, terminate, reboot, recover), AWS recommends treating missing data as missing and configuring the alarm to trigger actions only on ALARM state transitions, not on INSUFFICIENT_DATA. See Create CloudWatch alarms for EC2 status checks.
Lambda Function
| Alarm | Namespace | Metric | Statistic | Threshold (example) | Period | Eval Periods | Datapoints | Treat Missing |
|---|---|---|---|---|---|---|---|---|
| High Error Rate | AWS/Lambda | Errors | Sum | > 5 per period | 1 min | 5 | 3 | notBreaching |
| Throttling | AWS/Lambda | Throttles | Sum | > 0 | 1 min | 3 | 1 | notBreaching |
| Duration Near Timeout | AWS/Lambda | Duration | p99 | > 90% of timeout | 1 min | 3 | 3 | ignore |
Batch / Infrequent Job
| Alarm | Namespace | Metric | Statistic | Threshold (example) | Period | Eval Periods | Datapoints | Treat Missing |
|---|---|---|---|---|---|---|---|---|
| Job Failed | Custom | JobStatus | Minimum | < 1 | 1 hr | 1 | 1 | ignore |
Note: Threshold values above are examples — adjust them to your workload's baseline. For the full official list of recommended alarms per AWS service, see AWS Recommended Alarms.
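One detail the tables hide is that percentile statistics such as p99 go into a different PutMetricAlarm parameter (`ExtendedStatistic`) than Average, Sum, or Minimum (`Statistic`). The sketch below turns one table row into alarm parameters and handles that split. The function name is hypothetical, and `Dimensions` are omitted for brevity even though a real alarm on these metrics would normally need them.

```python
import re

# Matches percentile statistics such as p99, p90, or p99.9.
_PERCENTILE = re.compile(r"^p\d{1,2}(\.\d+)?$")


def row_to_alarm_params(name: str, namespace: str, metric: str, statistic: str,
                        comparison: str, threshold: float, period_seconds: int,
                        evaluation_periods: int, datapoints_to_alarm: int,
                        treat_missing: str) -> dict:
    """Convert one quick-reference row into PutMetricAlarm parameters."""
    params = {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": datapoints_to_alarm,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "TreatMissingData": treat_missing,
    }
    # Percentiles use ExtendedStatistic; the two keys are mutually exclusive.
    if _PERCENTILE.match(statistic):
        params["ExtendedStatistic"] = statistic
    else:
        params["Statistic"] = statistic
    return params


latency = row_to_alarm_params("High Latency", "AWS/ApplicationELB", "TargetResponseTime",
                              "p99", "GreaterThanThreshold", 2.0, 60, 5, 3, "ignore")
cpu = row_to_alarm_params("High CPU", "AWS/EC2", "CPUUtilization",
                          "Average", "GreaterThanThreshold", 80.0, 300, 3, 3, "missing")
print(latency["ExtendedStatistic"], cpu["Statistic"])  # p99 Average
```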
The Short Version
- The delay isn't broken, it's math. Reporting lag + evaluation frequency + your Evaluation Periods setting. Understanding the math helps set the right expectations.
- Review your treat-missing-data settings. The default (missing) often isn't the right choice for error-count and event-based metrics. Note that DynamoDB metrics default to ignoring missing data; check the documentation for details on that behavior.
- For alarm debugging, use the Alarm History, not the graph. Late-arriving data changes the graph after the fact. The alarm already made its decision with the data it had at evaluation time. Check the Alarm History tab or run describe-alarm-history to see exactly what the alarm saw.
- Consider M out of N for most alarms. "2 out of 3" or "3 out of 5" will prevent many false pages.
- For batch jobs, publish a custom status metric and set treat-missing-data to ignore. If the job runs on a multi-day schedule, look at multi-day alarms. If you're using FILL(metric, REPEAT), pair it with M out of N or you'll get stuck alarms.
- Consider composite alarms. If you have more than a handful of alarms on a service, combine them. Page on the combination, not each individual signal. Use suppressor alarms or mute rules to keep things quiet during deployments.
Found this useful? Leave a comment or share your own alarm patterns with the community. If you're dealing with an alarm issue affecting production workloads, you can ask follow-up questions here on re:Post, or if you have an AWS Support plan, open a support case for direct assistance.
