AWS CloudWatch alarm for infrequently executed process


I have a process that runs once every 24 hours (data pipeline). I want to monitor it for failures, but am having trouble defining an alarm that works adequately.

Since the process runs only once every 24 hours, there can be 1 failure every 24 hours at most.

If I define a short period (e.g. 5 minutes), then the alarm will flip back to OK status after 5 minutes, as there are no more errors.

If I define a period of 24 hours, then the alarm will be stuck in ALARM state until the period passes, even if I re-run the process manually and it succeeds, because "one error within a 24-hour period" is still true.
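
The two cases above can be sketched with a toy model. This is a simplified illustration, not CloudWatch's actual evaluation logic: it assumes a single-period alarm on a failure count, approximates the period as a sliding window, and treats minutes with no data as not breaching. The function name and the timeline values are made up for the example.

```python
from typing import Iterable


def in_alarm(now_min: int, failure_times: Iterable[int], period_min: int) -> bool:
    """Toy model: ALARM while any failure lies within the last `period_min` minutes.

    Successful runs are deliberately absent from the inputs, because an alarm
    on a failure-count metric never looks at them.
    """
    return any(now_min - period_min < t <= now_min for t in failure_times)


# Hypothetical timeline in minutes: the pipeline fails at t=0 and a manual
# re-run succeeds at t=60 (the success is not an input to the alarm).
failures = [0]

for period in (5, 24 * 60):
    for now in (3, 10, 90, 24 * 60 + 30):
        state = "ALARM" if in_alarm(now, failures, period) else "OK"
        print(f"period={period:>4} min  t={now:>4} min  ->  {state}")
```

In this model the 5-minute period is already back to OK at t=10, before anything has been fixed, while the 24-hour period is still in ALARM at t=90, after the successful re-run.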

How do I get an alarm on failure, but clear it once the process succeeds?

m0ltar
asked 2 years ago, 1727 views
2 Answers

Hi m0ltar,

A few questions:

  • Is this a custom metric or an AWS service metric, and does it emit values when the process is not running?
  • What value does it report when the process fails, and when it succeeds?
  • How long does the process run, and what timestamp does the metric carry for that run: the time it failed or succeeded, or the time it started?
  • Can you elaborate on "How do I get an alarm on failure, but clear it once the process succeeds?"

Generally, an "M out of N" alarm evaluation (M breaching datapoints out of N evaluation periods) fits this scenario, and the most important choice is a feasible Period for the alarm. You can also configure Treat Missing Data, or use the FILL() metric math function, to keep the alarm from flapping to Insufficient Data.
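
As a minimal sketch of the settings mentioned above, the following builds the keyword arguments for boto3's `put_metric_alarm` call without sending them. The alarm name, namespace, metric name, and the choice of 1 out of 24 hourly datapoints are assumptions for illustration, not values from this thread.

```python
def build_alarm_kwargs(namespace: str, metric_name: str) -> dict:
    """Build put_metric_alarm arguments for a '1 out of 24 hourly datapoints' alarm."""
    return {
        "AlarmName": "daily-pipeline-failed",   # hypothetical name
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 3600,                         # one datapoint per hour
        "EvaluationPeriods": 24,                # N: look back over 24 datapoints
        "DatapointsToAlarm": 1,                 # M: one breaching datapoint is enough
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",     # hours without data count as healthy
    }


# Placeholder namespace and metric name for a custom failure counter.
kwargs = build_alarm_kwargs("Custom/Pipeline", "Failures")

# To create the alarm (requires boto3 and AWS credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
print(kwargs)
```

With these values a single failure should keep the alarm in ALARM for as long as its hourly datapoint stays inside the 24-period evaluation range, so this shows the settings themselves rather than a way to clear the alarm on success.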

Happy to discuss further if you can provide more information about your use case.

AWS
SUPPORT ENGINEER
answered 2 years ago

Is this a custom metric or an AWS service metric, and does it emit values when the process is not running?

The alarm is for Step Functions execution failures.

What value does it report when the process fails, and when it succeeds?

The metric name is ExecutionsFailed.

How long does the process run?

Up to 10 minutes.

What timestamp does the metric carry for that run?

The time at which the Step Functions execution failed.

Can you elaborate on "How do I get an alarm on failure, but clear it once the process succeeds?"

  1. I run the step function and it fails
  2. Alarm is raised, notification is sent via SNS
  3. I go in and fix whatever caused the failure, and manually re-run the step function
  4. Step function execution succeeds
  5. The "In Alarm" screen should no longer list the alarm, since the execution has finally succeeded

The last step is what is not clear to me.

Thanks!

m0ltar
answered 2 years ago
  • According to https://docs.aws.amazon.com/step-functions/latest/dg/procedure-cw-metrics.html#cloudwatch-step-functions-execution-metrics, Step Functions emits the "ExecutionsFailed" metric when an execution fails and the "ExecutionsSucceeded" metric when an execution succeeds. These two metrics track opposite outcomes.

    A standard CloudWatch alarm monitors a single metric, or a single value produced by a metric math expression. Since you are monitoring the "ExecutionsFailed" metric, the alarm is raised when this metric reports a value greater than 0. Once the process runs again and succeeds, it reports 0 to "ExecutionsFailed" and a count of 1 to "ExecutionsSucceeded" at the same time. So the "ExecutionsSucceeded" value has no bearing on an alarm that monitors "ExecutionsFailed".

    Further, the alarm configuration controls how much time is evaluated to trigger the alarm, and therefore how long the alarm stays in ALARM state: it stays there while the breaching datapoint remains within the evaluation range. Once the breaching datapoint falls outside the evaluation range, the alarm transitions back to OK, provided the remaining criteria are met and the other configurations are in place.

    Step 5 needs more clarification.

  • Ok, I am sorry, but I am not following. Are you saying I need to create a math expression and add counts for ExecutionsFailed and ExecutionsSucceeded?
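
The thread does not answer this. As one possible interpretation only, a metric math alarm could combine the two metrics so that a success in the same period masks a failure. The sketch below is untested against a live account: the alarm name, the metric IDs, the one-day period, and the expression itself are assumptions, and the expression cannot tell whether the success came before or after the failure within a period. It builds the `put_metric_alarm` arguments for boto3 without sending them, and mirrors the expression in plain Python so the intended logic can be checked locally.

```python
# Assumed metric math expression: report 0 when the period saw a success,
# otherwise report the failure count. FILL(..., 0) replaces missing datapoints.
EXPRESSION = "IF(FILL(succeeded, 0) > 0, 0, FILL(failed, 0))"


def unresolved_failures(failed: float, succeeded: float) -> float:
    """Plain-Python mirror of EXPRESSION for one period's sums."""
    return 0 if succeeded > 0 else failed


def metric_query(query_id: str, metric_name: str, arn: str, period: int) -> dict:
    """One input metric for the expression; not itself returned to the alarm."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/States",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "StateMachineArn", "Value": arn}],
            },
            "Period": period,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }


def build_math_alarm_kwargs(arn: str, period: int = 86400) -> dict:
    return {
        "AlarmName": "daily-pipeline-unresolved-failure",  # hypothetical name
        "Metrics": [
            metric_query("failed", "ExecutionsFailed", arn, period),
            metric_query("succeeded", "ExecutionsSucceeded", arn, period),
            {"Id": "unresolved", "Expression": EXPRESSION,
             "Label": "Failures without a success", "ReturnData": True},
        ],
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Placeholder ARN; pass the result to boto3.client("cloudwatch").put_metric_alarm(**kwargs).
kwargs = build_math_alarm_kwargs("arn:aws:states:us-east-1:123456789012:stateMachine:example")
```

A known limitation of this idea is that a failure and the later successful re-run can land in different periods, in which case the failure's period would still breach until it leaves the evaluation range.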
