Alarm to scale out ASG stops getting set


We have an alarm that is supposed to trigger based on a metric which returns 1 if the number of instances in the group is less than the number of messages in an SQS queue, and 0 otherwise.

The alarm works almost all of the time. Occasionally, though, we see the metric go from 0 to 1 but the alarm does not trigger. This state can continue for hours: the metric stays at 1, yet the alarm never fires.

The way we resolve the issue is simply to re-save the alarm, without even making any changes.

Everything else appears to be working correctly though.

Basically, what we do is add one instance to the ASG if there are more items in the queue than instances running.

Note that this is constantly running, yet we only see the issue about once a month or so.
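For reference, a minimal sketch of how an alarm like this could be created with the AWS CLI, assuming a single queue and placeholder names (my-asg, my-queue, and a made-up scaling-policy ARN) rather than the actual resources. The math expression returns 1 when the in-service instance count is below the visible message count, and 0 otherwise:

# All resource names and the scaling-policy ARN below are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "scale-out-when-queue-backlog" \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --threshold 1 \
  --evaluation-periods 1 \
  --datapoints-to-alarm 1 \
  --treat-missing-data missing \
  --alarm-actions "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:11111111-2222-3333-4444-555555555555:autoScalingGroupName/my-asg:policyName/my-scale-out-policy" \
  --metrics '[
    {"Id": "instances",
     "MetricStat": {"Metric": {"Namespace": "AWS/AutoScaling",
                               "MetricName": "GroupInServiceInstances",
                               "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "my-asg"}]},
                    "Period": 60, "Stat": "Average"},
     "ReturnData": false},
    {"Id": "visible",
     "MetricStat": {"Metric": {"Namespace": "AWS/SQS",
                               "MetricName": "ApproximateNumberOfMessagesVisible",
                               "Dimensions": [{"Name": "QueueName", "Value": "my-queue"}]},
                    "Period": 60, "Stat": "Maximum"},
     "ReturnData": false},
    {"Id": "backlog",
     "Expression": "instances < visible",
     "Label": "ScaleOutNeeded",
     "ReturnData": true}
  ]'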

asked 5 months ago · 190 views
1 Answer

This is probably going to be tricky to resolve without being able to see the actual resources, since it sounds like there's some small nuance somewhere causing the issue. If you have a technical support plan, I'd recommend opening a case about this.

That said, I have a few clarifying questions:

  1. Does the alarm ever go into the ALARM state? If not, then the issue is something with the alarm settings themselves
  2. What kind of scaling policy are you using? If it's target tracking, the alarms are managed by the ASG, and editing them can cause problems; this metric also doesn't work correctly for target tracking, since it doesn't change proportionally to the instance count. From your description, though, I'm guessing you're using step scaling or simple scaling? With either of those, you own the alarms and can configure them however you want (that said, I almost always recommend against simple scaling, outside of a few extreme edge cases)
  3. If the alarm IS going into the ALARM state, then there's something on the ASG causing issues. But since you said just re-saving the alarm fixes it, I'm guessing this isn't the case and the state is never changing out of OK?

With those out of the way, and assuming the alarm is staying in the OK state:

  1. I'm guessing something is getting changed somewhere, and re-saving in the console restores a default that fixes it. Possibly a script is modifying the alarm, or some other dependency is changing, such as the scaling policy, the ASG, or the metrics.
  2. Next time this happens, can you go to the CLI and call the below command both before and after fixing the issue and share the (sanitized) outputs?
aws cloudwatch describe-alarms --alarm-names "myalarm"
answered 5 months ago (AWS)
  • The alarm normally goes into the ALARM state when the metric is equal to or greater than one. The issue we are having is that occasionally, even though the metric is showing a value that would normally trigger the alarm, the alarm stays in the OK state. This can last for hours, i.e. the metric shows a value of 1 for hours but the alarm doesn't trigger. Here is some additional info:

    Metric is the math expression: GroupInServiceInstances < (ApproxVisibleMessages_Q1 + ApproxVisibleMessages_Q2 + ApproxNonVisible_Q1 + ApproxNonVisible_Q2)
    Metric period: 1 minute

    Under "Additional configuration" Datapoints to alarm 1 out of 1 Missing data treatment Treat missing data as missing

    Auto Scaling action: EC2 Auto Scaling group scale-out (Add 1 instance)

  • Note that when this stops working, we haven't touched the ASG, the scaling policy or the metric itself.

  • Ahhhh, it's metric math. CloudWatch will normally look back extra periods for delayed/missing data, but with a math expression it will just treat the missing value as 0 as long as at least one of the metrics in the expression is reporting data. If you look very closely at the metric graph during the issue, the current minute is probably displaying the wrong data; then, when the delayed metric values come in, the graph is updated to look like it should have triggered. If the SQS metric that is high at the time is delayed by a minute, the alarm keeps seeing the expression as false at the moment it's evaluated. For an expression like yours, there are two good options to fix it: 1) use the FILL(REPEAT) math function, to tell the alarm to use the last known datapoint for each of the metrics (see the sketch after these comments); 2) set the period to 5 minutes and use the MAX statistic. The second option might result in occasional over-scaling if you have a short (<5 min) warmup period on the scaling policy.

  • @Shahad_C - Looking at the documentation for FILL(x, REPEAT), there is a note that this won't necessarily fix the issue if metrics are being published with a slight delay.

    Quote: Note When you use this function in an alarm, you can encounter an issue if your metrics are being published with a slight delay, and the most recent minute never has data. In this case, FILL replaces that missing data point with the requested value. That causes the latest data point for the metric to always be the FILL value, which can result in the alarm being stuck in either OK state or ALARM state. You can work around this by using a M out of N alarm. For more information, see "Evaluating an alarm"

  • For the specific metrics you're using, I don't think that should be an issue. Even if you're always 1 minute behind on them, it would result in at most a 1 minute delay in reacting to a spike, vs what's happening now where the values are just being treated as 0.

    M out of N is another option you could go with; however, the problem is that Auto Scaling won't be aware of the M/N. All N datapoints are sent to Auto Scaling, and Auto Scaling then aggregates all N together to come up with the single value to use for the scaling policy. If you're using step scaling, you can get around that by changing the MetricAggregationType to Maximum or Minimum (instead of Average) so that Auto Scaling only uses the highest/lowest of the N values sent (see the sketch after these comments): https://docs.aws.amazon.com/cli/latest/reference/autoscaling/put-scaling-policy.html
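To illustrate the FILL(REPEAT) suggestion from the comments above, here is a sketch of how the expression could be rewritten so that each metric repeats its last known value when the newest minute hasn't reported yet, instead of being treated as 0. The ids i, v1, v2, n1, n2 are placeholders for the five metric queries (GroupInServiceInstances and the four SQS queue metrics); note the documented caveat about delayed metrics quoted earlier:

    FILL(i, REPEAT) < (FILL(v1, REPEAT) + FILL(v2, REPEAT) + FILL(n1, REPEAT) + FILL(n2, REPEAT))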
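And a sketch of the step-scaling workaround from the last comment, assuming placeholder resource names (my-asg, my-scale-out-policy): setting the metric aggregation type to Maximum so Auto Scaling acts on the highest of the N datapoints it receives:

# Placeholder names; the step bounds would need to match the real policy.
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name my-scale-out-policy \
  --policy-type StepScaling \
  --adjustment-type ChangeInCapacity \
  --metric-aggregation-type Maximum \
  --step-adjustments MetricIntervalLowerBound=0,ScalingAdjustment=1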
