
Best practice for per-instance CloudWatch alarms at scale with CloudFormation?

0

Hi everyone,

I'm looking for some guidance on the best approach to create per-instance CloudWatch alarms (not composite) for ~40 EC2 instances using CloudFormation.

Constraints:

  • More instances will be added over time
  • Don't want one stack per instance
  • Prefer to avoid manual alarm creation

What's the recommended best practice for scaling this?

I'm considering a single template that loops with Fn::ForEach. But each time instances are added, I have to pass the full list of existing and new instance IDs again to update the stack. Alternatively, I could rerun the template and create a separate stack for the new instances, but that feels inefficient and harder to manage.

Thanks in advance!

3 Answers
5
Accepted Answer

While the AI recommendation to use tag-based automation is the industry standard for large fleets, here is how you should handle it if you want to stick strictly to CloudFormation without the headache of manual ID lists:

1. The "Component" Pattern (Best for IaC consistency)

The most scalable way to use CloudFormation for this is to decouple the logic. Instead of one master list of Alarms, define the AWS::CloudWatch::Alarm resource inside the same template (or module) that defines the AWS::EC2::Instance.

Setting the InstanceId dimension to !Ref MyInstance points the alarm at the instance and gives CloudFormation an implicit dependency between the two resources. Every time you add a new instance resource to your template, its corresponding alarm is defined alongside it (automatically, if both live in a shared module), so the alarm is created and deleted together with the instance. This ties the alarm's lifecycle directly to the instance and eliminates the need for any "list management."
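To make the pairing concrete, here is a minimal sketch in Python that builds a JSON CloudFormation template in which each instance resource comes with an alarm that references it via Ref. The helper name instance_with_alarm, the logical IDs, the placeholder AMI ID, and the CPU threshold are all assumptions chosen for illustration, not part of the original answer.

```python
import json


def instance_with_alarm(logical_id, image_id, instance_type):
    """Return CloudFormation resources for one instance plus its CPU alarm.

    Hypothetical helper: the alarm's InstanceId dimension uses
    {"Ref": logical_id}, so it always targets the instance defined next to it.
    """
    return {
        logical_id: {
            "Type": "AWS::EC2::Instance",
            "Properties": {"ImageId": image_id, "InstanceType": instance_type},
        },
        f"{logical_id}CpuAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "AlarmDescription": f"High CPU on {logical_id}",
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [
                    {"Name": "InstanceId", "Value": {"Ref": logical_id}}
                ],
                "Statistic": "Average",
                "Period": 300,
                "EvaluationPeriods": 2,
                "Threshold": 80,  # assumed threshold, tune per workload
                "ComparisonOperator": "GreaterThanThreshold",
            },
        },
    }


template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": {}}
for name in ("WebServer", "Worker"):  # illustrative logical IDs
    # "ami-EXAMPLE" is a placeholder, not a real AMI ID.
    template["Resources"].update(
        instance_with_alarm(name, "ami-EXAMPLE", "t3.micro")
    )

print(json.dumps(template, indent=2))
```

Adding a third instance means adding one more name to the loop; no separate list of instance IDs is maintained anywhere.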

2. Why Fn::ForEach is a "Trap"

You are absolutely right: using Fn::ForEach with a list of IDs is a maintenance trap. CloudFormation is designed to manage the state of resources it creates. If your instances are created outside of CloudFormation (manually or via CLI), using CloudFormation just to "attach" alarms to them is an anti-pattern because the stack doesn't "own" the underlying resource. In that specific case, the event-driven approach (Lambda/EventBridge) mentioned in the AI response is the only sustainable path.

3. Scaling the View with Composite Alarms

Since you are managing ~40 instances, 40 individual "Low Disk Space" or "High CPU" alarms will create a lot of noise. Even if you create them individually for granular tracking, consider grouping them into an AWS::CloudWatch::CompositeAlarm. This allows you to have a single "System Health" alarm that only triggers if, for example, more than 10% of your fleet is failing simultaneously.
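As a rough illustration, the AlarmRule of a composite alarm is a boolean expression over child alarm states. The sketch below builds only the simplest form, an OR over per-instance alarms, which fires as soon as any child is in ALARM; a fleet-percentage rule like the one described above would need a different expression and is not shown. The function name and alarm names are hypothetical.

```python
def any_in_alarm_rule(alarm_names):
    """Build an AlarmRule string that is true when any child alarm is in ALARM.

    Hypothetical helper; the result is meant for the AlarmRule property of
    AWS::CloudWatch::CompositeAlarm (or the PutCompositeAlarm API).
    """
    if not alarm_names:
        raise ValueError("at least one child alarm name is required")
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


# Example with made-up per-instance alarm names:
rule = any_in_alarm_rule(["i-0aaa-cpu-high", "i-0bbb-cpu-high"])
print(rule)
```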

Summary:

  • If you create EC2 via CloudFormation: Put the Alarm resource in the same stack/module as the Instance.
  • If EC2s are created externally: Use the amazon-cloudwatch-auto-alarms Lambda-based solution to maintain an "observe-and-react" posture.
EXPERT
answered a month ago
  • Thanks for your answer, Florian. EC2 instances are sometimes created externally, so I think your suggestion to use Lambda is ideal.

    I have a couple of follow-up questions about the approach: Some of the custom metrics aren't immediately available when an instance enters the running state. How would you recommend handling this?

    Also, am I correct in understanding that I would rely on Lambda to:

    • bulk-create alarms for existing instances,
    • update alarms (for example, when thresholds change or when new metrics are added),
    • and delete alarms when instances are terminated?
  • By using EventBridge (triggering on running and terminated states), the Lambda acts as a background controller that ensures your monitoring always reflects your actual fleet, regardless of how the instances were launched.

3

Regarding your last comment:

1. Handling Delayed Metrics

CloudWatch Alarms can be created even if the underlying metric does not exist yet.

  • Initial State: The alarm will simply stay in the INSUFFICIENT_DATA state until the first data point is received from the CloudWatch Agent.
  • Best Practice: Set the TreatMissingData property to notBreaching. This prevents the alarm from triggering a "False Positive" before the instance has finished its bootstrap process and sent its first metric.
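A minimal sketch of such an alarm definition, written as the keyword arguments you would pass to PutMetricAlarm. The CWAgent namespace, the mem_used_percent metric, the single InstanceId dimension, and the threshold are assumptions that depend on how your CloudWatch Agent is configured; the helper name is hypothetical.

```python
def agent_alarm_params(instance_id, metric_name="mem_used_percent", threshold=90.0):
    """Build PutMetricAlarm arguments for a CloudWatch Agent metric.

    Assumes the agent publishes to the "CWAgent" namespace with only an
    InstanceId dimension; adjust both to match your agent configuration.
    """
    return {
        "AlarmName": f"{instance_id}-{metric_name}-high",
        "Namespace": "CWAgent",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Missing data points are treated as healthy, so the alarm does not
        # fire while the instance is still bootstrapping.
        "TreatMissingData": "notBreaching",
    }


params = agent_alarm_params("i-0123456789abcdef0")  # example instance ID
# With boto3 this would be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```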

2. Lambda Lifecycle Responsibilities

Your assessment is exactly how the event-driven pattern should work:

  • Bulk-creation: You can trigger the Lambda manually once to scan all existing instances and create alarms for those with the matching tags.
  • Updates: PutMetricAlarm creates the alarm if it does not exist and otherwise overwrites the existing alarm of the same name, so calling it again with new thresholds or settings simply updates the alarm. You can trigger this whenever an instance is restarted or via a manual re-scan.
  • Deletion: You should definitely use an EventBridge rule for the terminated state. This triggers the Lambda to call DeleteAlarms, ensuring you don't leave "orphaned" alarms behind, which keeps your dashboard clean and avoids unnecessary costs.
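The lifecycle above can be sketched as a small Lambda handler. This is an illustration under assumptions, not the amazon-cloudwatch-auto-alarms implementation: it expects the EC2 Instance State-change Notification event from EventBridge, manages a single CPU alarm per instance, and uses a made-up naming scheme. The core logic takes the CloudWatch client as a parameter so it can be exercised without AWS access.

```python
def alarm_name(instance_id):
    # Hypothetical naming scheme; deletion relies on it being deterministic.
    return f"{instance_id}-cpu-high"


def handle_state_change(event, cloudwatch):
    """Create/update or delete the alarm based on an EC2 state-change event."""
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    state = detail.get("state")
    if not instance_id:
        return "ignored"

    if state == "running":
        # PutMetricAlarm creates the alarm or overwrites an existing one.
        cloudwatch.put_metric_alarm(
            AlarmName=alarm_name(instance_id),
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=2,
            Threshold=80.0,  # assumed threshold
            ComparisonOperator="GreaterThanThreshold",
            TreatMissingData="notBreaching",
        )
        return "upserted"

    if state == "terminated":
        cloudwatch.delete_alarms(AlarmNames=[alarm_name(instance_id)])
        return "deleted"

    return "ignored"


def lambda_handler(event, context):
    # boto3 is imported here so the logic above stays testable without it.
    import boto3

    return handle_state_change(event, boto3.client("cloudwatch"))
```

For the one-off bulk creation, the same handle_state_change function could be called in a loop with synthetic "running" events for each existing instance.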
EXPERT
answered a month ago
  • Thanks again for your helpful answer.

    One last question: I have some applications running on the EC2 instance that push custom metrics to CloudWatch as heartbeat signals. These apps are configured and start only after the instance reaches a running state, and in some cases may not begin until hours later.

    Given that setup, is keeping TreatMissingData set to notBreaching still considered best practice? Thank you.

  • That’s a crucial distinction. For heartbeats, notBreaching is actually risky because missing data is the failure signal you want to catch. If the app crashes, the metric stops, and notBreaching would keep the alarm green (a "silent failure").

    I would do the following:

    Don't use notBreaching: Use breaching (missing data counts as a failed check, which catches crashes) or missing (the alarm moves to INSUFFICIENT_DATA while no data arrives, including the time before the first signal).

    Delayed Alarm Creation: Since you're using the Lambda/EventBridge approach, don't create the heartbeat alarm when the instance starts. Instead, have your app trigger a "Ready" event or tag update when it finally goes live, and let your Lambda create the alarm then.

    Grace Period: Set a wider evaluation window via EvaluationPeriods (e.g., alert only if data is missing for 15+ minutes, such as 3 periods of 300 seconds) to avoid noise during minor lags.

    This way, you avoid false positives during the long bootstrap but stay protected once the app is live.
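    A minimal sketch of a heartbeat alarm along these lines, again as PutMetricAlarm arguments. The namespace MyApp, the metric name Heartbeat, and the helper name are placeholders; the grace period is converted into a number of evaluation periods by rounding up.

```python
def heartbeat_alarm_params(instance_id, namespace="MyApp", metric_name="Heartbeat",
                           period_seconds=300, grace_minutes=15):
    """Build PutMetricAlarm arguments for a heartbeat metric.

    Namespace and metric name are placeholders for whatever the app publishes.
    """
    # Ceiling division: number of periods needed to cover the grace window.
    periods = max(1, -(-grace_minutes * 60 // period_seconds))
    return {
        "AlarmName": f"{instance_id}-heartbeat-missing",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "SampleCount",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "DatapointsToAlarm": periods,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # A missing heartbeat is the failure signal, so count it as breaching.
        "TreatMissingData": "breaching",
    }


params = heartbeat_alarm_params("i-0123456789abcdef0")  # example instance ID
# Your Lambda would call put_metric_alarm(**params) only after the app
# signals that it is live, for example via a "Ready" event or tag update.
```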

0

For managing CloudWatch alarms at scale for EC2 instances with CloudFormation, the recommended best practice is to use an automated, tag-based approach rather than manually managing instance IDs in templates.

The amazon-cloudwatch-auto-alarms solution is specifically designed for this use case. This solution automatically creates a standard set of CloudWatch alarms for your EC2 instances based on EC2 instance tags. The key benefits include:

  • Eliminates manual alarm creation - No need to maintain lists of instance IDs in your templates
  • Tag-driven automation - Alarms are created automatically based on tags you apply to instances
  • Scalability - New instances automatically get alarms when tagged appropriately
  • Consistency - Ensures a standard set of alarms across all instances
  • Multi-region and multi-account support - Can be deployed using CloudFormation StackSets

This approach solves your core challenge: you won't need to update CloudFormation templates with new instance IDs each time you add instances. Instead, simply tag new instances appropriately and the alarms are created automatically.
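To illustrate the tag-driven idea in general terms, the sketch below picks out instance IDs that carry an opt-in tag from a DescribeInstances-style response. The tag key alarms-enabled and its value are made up for this example and are not necessarily the tag that amazon-cloudwatch-auto-alarms looks for; check that solution's documentation for its actual tag names.

```python
def tagged_instance_ids(describe_instances_response,
                        tag_key="alarms-enabled", tag_value="true"):
    """Return IDs of instances carrying the (hypothetical) opt-in tag.

    Expects the response shape of EC2 DescribeInstances:
    Reservations -> Instances -> Tags (list of {"Key", "Value"} dicts).
    """
    ids = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if tags.get(tag_key) == tag_value:
                ids.append(instance["InstanceId"])
    return ids


# Example with a hand-written response fragment:
sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-0aaa", "Tags": [{"Key": "alarms-enabled", "Value": "true"}]},
            {"InstanceId": "i-0bbb", "Tags": [{"Key": "Name", "Value": "scratch"}]},
            {"InstanceId": "i-0ccc"},  # untagged instance
        ]}
    ]
}
print(tagged_instance_ids(sample))
```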

The solution is particularly useful during large-scale migrations and removes the time-consuming, error-prone process of manually creating standard alarms. It's designed to work alongside CloudFormation and can be integrated into your existing infrastructure-as-code workflows.

This is a more maintainable and scalable approach than using Fn::ForEach with hardcoded instance IDs, as it decouples alarm creation from your stack deployment process and allows your alarm infrastructure to grow organically with your EC2 fleet.
Sources
Alarming options with CloudWatch - AWS Prescriptive Guidance

answered a month ago
EXPERT
reviewed a month ago
