I have a CloudWatch alarm whose action publishes to an SNS topic.
The alarm's metric comes from a metric filter that matches CRITICAL events in a Lambda log group.
The Lambda (invoked every 15 minutes) checks for CloudFormation stacks in 'error' states and logs a CRITICAL event for each stack in an error state.
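For context, here is a minimal sketch of what such a Lambda might look like. It is not my exact code: the list of statuses treated as 'error', the function name log_failed_stacks, and the "stack"/"status" field names are assumptions for illustration. The only part the filter depends on is the JSON field "level" being "CRITICAL".

```python
import json

# Assumption: the post does not say which statuses count as "error".
ERROR_STATUSES = [
    "CREATE_FAILED",
    "ROLLBACK_FAILED",
    "DELETE_FAILED",
    "UPDATE_ROLLBACK_FAILED",
]


def log_failed_stacks(cfn_client, emit=print):
    """Emit one JSON CRITICAL record per stack in an error status.

    Returns the names of the stacks that were logged.
    """
    failed = []
    paginator = cfn_client.get_paginator("list_stacks")
    for page in paginator.paginate(StackStatusFilter=ERROR_STATUSES):
        for summary in page["StackSummaries"]:
            record = {
                "level": "CRITICAL",  # matched by the metric filter
                "stack": summary["StackName"],  # hypothetical field name
                "status": summary["StackStatus"],  # hypothetical field name
            }
            # In Lambda, anything written to stdout lands in the log group.
            emit(json.dumps(record))
            failed.append(summary["StackName"])
    return failed


def lambda_handler(event, context):
    # boto3 is imported here so the helper above can be tested without it.
    import boto3

    return {"failed_stacks": log_failed_stacks(boto3.client("cloudformation"))}
```

Each matching log line contributes a value of 1 to the metric, so two failed stacks in one run still produce a Maximum of 1 for that period.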
FilterPattern: '{$.level="CRITICAL"}'
MetricValue: 1
AlarmActions: Send to SNS Topic
Period: 600
TreatMissingData: notBreaching
ComparisonOperator: GreaterThanOrEqualToThreshold
Threshold: 1
EvaluationPeriods: 1
Statistic: Maximum
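In case it helps to reproduce the setup, the settings above could be expressed with boto3 roughly as follows. This is a sketch under assumptions: the log group, namespace, metric name, alarm name, and topic ARN are placeholders supplied by the caller, and the filter name is made up; only the values listed above come from my actual configuration.

```python
def metric_filter_params(log_group, namespace, metric_name):
    """Keyword arguments for logs.put_metric_filter (names are placeholders)."""
    return {
        "logGroupName": log_group,
        "filterName": f"{metric_name}-filter",  # hypothetical filter name
        "filterPattern": '{$.level="CRITICAL"}',
        "metricTransformations": [
            {
                "metricName": metric_name,
                "metricNamespace": namespace,
                "metricValue": "1",  # the API expects a string here
            }
        ],
    }


def alarm_params(alarm_name, namespace, metric_name, topic_arn):
    """Keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Maximum",
        "Period": 600,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


def create_filter_and_alarm(logs_client, cloudwatch_client, log_group,
                            namespace, metric_name, alarm_name, topic_arn):
    """Create the metric filter, then the alarm that watches its metric."""
    logs_client.put_metric_filter(
        **metric_filter_params(log_group, namespace, metric_name)
    )
    cloudwatch_client.put_metric_alarm(
        **alarm_params(alarm_name, namespace, metric_name, topic_arn)
    )
```

The clients would be created with boto3.client("logs") and boto3.client("cloudwatch").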
The CloudWatch alarm works as expected when one stack is in the error state:
- Picks up the CRITICAL event
- The alarm changes state to 'In Alarm'
- The SNS topic is triggered
If another stack goes into error (say, 15 minutes later) while the initial stack is still in error, the alarm doesn't act on it, i.e. it doesn't trigger the SNS topic.
I understand from research that this is normal behavior because "If your metric value is still in breach of your threshold, the alarm will remain in the ALARM state until it no longer breaches the threshold."
I have also tested and confirmed this: I used boto3's set_alarm_state to set the alarm back to OK and invoked the Lambda manually; the alarm state changed back to 'In Alarm' and the SNS topic was triggered.
Is there any other suitable configuration or logic I can use to trigger the SNS topic for every stack in the error state?
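The manual reset in that test was roughly the following. This is a sketch rather than my exact script: the function name reset_alarm_to_ok and the reason text are made up, and the alarm name is a placeholder passed in by the caller.

```python
def reset_alarm_to_ok(cloudwatch_client, alarm_name):
    """Force the alarm back to OK so the next breach is a fresh transition."""
    cloudwatch_client.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="OK",
        # StateReason is required by the API; the text itself is arbitrary.
        StateReason="Manual reset for testing",
    )


if __name__ == "__main__":
    import boto3

    # "my-critical-alarm" is a placeholder alarm name.
    reset_alarm_to_ok(boto3.client("cloudwatch"), "my-critical-alarm")
```

As far as I understand, set_alarm_state only overrides the state temporarily: the alarm is re-evaluated against the metric afterwards, which is why the next CRITICAL log line moved it back to 'In Alarm' and fired the action again.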
This looks like a very viable solution. Thank you.