SCENARIO:
I have a CloudWatch alarm whose action publishes to an SNS topic.
The alarm's metric comes from a metric filter that matches CRITICAL events in a Lambda log group.
The Lambda (invoked every 15 minutes) checks for CloudFormation stacks in 'error' states and logs a CRITICAL event for each stack in an error state.
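For illustration, the Lambda's logic looks roughly like the sketch below. This is a minimal sketch rather than the exact function: the list of statuses treated as "error", the `stack` field name, and the helper names are assumptions; only the `level` field is taken from the filter pattern.

```python
import json

# Assumed definition of "error" states; adjust to whatever the real check uses.
ERROR_STATUSES = [
    "CREATE_FAILED",
    "ROLLBACK_FAILED",
    "DELETE_FAILED",
    "UPDATE_ROLLBACK_FAILED",
]


def find_error_stacks(cloudformation):
    """Return the names of all stacks whose status is in ERROR_STATUSES."""
    paginator = cloudformation.get_paginator("list_stacks")
    names = []
    for page in paginator.paginate(StackStatusFilter=ERROR_STATUSES):
        for summary in page["StackSummaries"]:
            names.append(summary["StackName"])
    return names


def log_critical_events(stack_names, emit=print):
    """Emit one JSON log line per stack; the metric filter matches on $.level."""
    for name in stack_names:
        emit(json.dumps({"level": "CRITICAL", "stack": name}))


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tested without AWS

    stacks = find_error_stacks(boto3.client("cloudformation"))
    log_critical_events(stacks)
    return {"error_stacks": stacks}
```

Each stack in an error state produces its own CRITICAL log line, so the metric filter records one data point per stack per invocation.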
Logs::MetricFilter
  FilterPattern: '{$.level="CRITICAL"}'
  MetricValue: 1
CloudWatch::Alarm
  AlarmActions: Send to SNS Topic
  Period: 600
  TreatMissingData: notBreaching
  ComparisonOperator: GreaterThanOrEqualToThreshold
  Threshold: 1
  EvaluationPeriods: 1
  Statistic: Maximum
The CloudWatch alarm works as expected when 1 stack is in the error state:
- The metric filter picks up the CRITICAL event
- The alarm changes state to 'In Alarm'
- The SNS topic is triggered
CHALLENGE:
If another stack goes into error (say, 15 minutes later) while the initial stack is still in error, the alarm doesn't act on it, i.e. it doesn't trigger the SNS topic again.
I understand from research that this is normal behavior because "If your metric value is still in breach of your threshold, the alarm will remain in the ALARM state until it no longer breaches the threshold."
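To make that behavior concrete, here is a simplified model (not CloudWatch's actual evaluation logic) of an alarm with Threshold 1, EvaluationPeriods 1, Statistic Maximum and TreatMissingData notBreaching. It assumes the alarm starts in OK and that actions fire only on a transition into ALARM; the function name is made up for the example.

```python
from typing import List, Optional


def notification_periods(
    period_maxima: List[Optional[int]], threshold: int = 1
) -> List[int]:
    """Return the indices of periods in which the alarm action would fire.

    period_maxima holds the Maximum statistic for each period, or None
    when the period has no data (treated as not breaching).
    """
    state = "OK"  # assumption: the alarm starts out in OK
    fired = []
    for index, value in enumerate(period_maxima):
        breaching = value is not None and value >= threshold
        new_state = "ALARM" if breaching else "OK"
        # Actions run only when the state changes into ALARM.
        if new_state == "ALARM" and state != "ALARM":
            fired.append(index)
        state = new_state
    return fired


# Three consecutive breaching periods: only the first one notifies.
print(notification_periods([1, 1, 1]))
# A gap with no data returns the alarm to OK, so the next breach notifies again.
print(notification_periods([1, None, 1]))
```

In this model a second stack failing while the first is still failing never produces a new transition, which matches what I am seeing.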
I have also tested and confirmed this: I used boto3's set_alarm_state to set the alarm back to OK and invoked the Lambda manually; the alarm state changed back to 'In Alarm' and the SNS topic was triggered.
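The manual reset was along these lines. This is a sketch: the helper name, the alarm name in the usage comment, and the reason text are placeholders, while `set_alarm_state` and its `AlarmName`, `StateValue` and `StateReason` parameters are the boto3 CloudWatch client API.

```python
def reset_alarm_to_ok(cloudwatch, alarm_name, reason="Manual reset for testing"):
    """Force the alarm back to OK so the next breach is a fresh transition."""
    cloudwatch.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="OK",
        StateReason=reason,
    )


# Usage against a real account (placeholder alarm name):
#   import boto3
#   reset_alarm_to_ok(boto3.client("cloudwatch"), "my-critical-events-alarm")
```

The forced state is temporary: CloudWatch re-evaluates the alarm at its next evaluation, so the alarm goes back to ALARM if the metric is still breaching.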
QUESTION:
Is there any other suitable configuration or logic I can use to trigger the SNS topic for every stack in the error state?