CloudWatch Alarm with 'Sum' over 1 day never stops training


I have a step function that pulls in data and creates related S3 objects. Additionally, I have a Lambda that is invoked on those S3 objects in order to process the data. The step function is invoked once per day and typically takes around 90 minutes to complete, so the Lambda invocations occur over that 90-minute window. Depending on the availability of the external service the data is being pulled from, the processing window could be longer. The number of objects/invocations varies from day to day, and I want to create an alarm that lets me know when the variance exceeds a threshold.

I have created a CloudWatch Alarm (via the console) that monitors the Lambda 'Invocations' metric, with 'Sum' as the statistic and '1 day' as the period. I created the alarm about 9 days ago, and it continues to show as 'In alarm' with "Analyses is outside the band (width: 1.1) for 1 datapoints within 1 day" as the explanation. When I view the alarm in the alarms panel, there is an exclamation triangle next to the statistic with "The anomaly detection model has not finished training, so the band is not yet available".
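
For reference, a minimal sketch of roughly how the console setup described above might look as parameters for the CloudWatch `put_metric_alarm` API. This is an illustration, not the exact alarm from the console: the helper `build_anomaly_alarm_params`, the alarm name, and the function name `my-processing-function` are hypothetical, and the sketch only builds the parameter dictionary so it can be inspected without AWS credentials.

```python
def build_anomaly_alarm_params(function_name, band_width=1.1):
    """Build keyword arguments for put_metric_alarm (hypothetical helper).

    Mirrors the setup in the question: Lambda 'Invocations', statistic 'Sum',
    period of 1 day, anomaly detection band of width 1.1.
    """
    return {
        # Hypothetical alarm name, chosen only for this sketch.
        "AlarmName": f"{function_name}-daily-invocations-anomaly",
        # Alarm when the value falls outside the band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        # The band expression below acts as the threshold.
        "ThresholdMetricId": "ad1",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Lambda",
                        "MetricName": "Invocations",
                        "Dimensions": [
                            {"Name": "FunctionName", "Value": function_name}
                        ],
                    },
                    "Period": 86400,  # 1 day, in seconds
                    "Stat": "Sum",
                },
            },
            {
                "Id": "ad1",
                "ReturnData": True,
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Invocations (expected)",
            },
        ],
    }


params = build_anomaly_alarm_params("my-processing-function")
# Assuming boto3 is installed and credentials are configured, this could be
# applied with: boto3.client("cloudwatch").put_metric_alarm(**params)
```

The second argument to `ANOMALY_DETECTION_BAND` corresponds to the "width: 1.1" shown in the alarm explanation.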

I can't choose a period shorter than 1 day because the process only runs once per day, and I don't want to use a hardcoded threshold because the data does vary by small amounts from day to day. I just want to be notified if there is a big variance, which might point to an external failure.

I'm not sure how long I should wait for the detection model, or if I can prime it somehow. This process has been running for months so there is historical data that can be used to determine the anomaly bounds. When creating the alarm it shows a preview of the bounds and that seems to work instantly, so I'm unclear on why I need to wait for training.

asked a month ago, 221 views
1 Answer

Hi Dave,

CloudWatch anomaly detection might take up to two weeks to fully train the model. The more data that is available, the sooner the model will be ready. Given the daily period and only 9 days of past data, it might take a few more days until the detection model is ready. Please see https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html#CloudWatch_Anomaly_Detection_Algorithm for more information:

When you enable anomaly detection for a metric, CloudWatch applies machine learning algorithms to the metric's past data to create a model of the metric's expected values. The model assesses both trends and hourly, daily, and weekly patterns of the metric. The algorithm trains on up to two weeks of metric data, but you can enable anomaly detection on a metric even if the metric does not have a full two weeks of data.

AWS
EXPERT
answered a month ago
  • After 14 days the model training seemed to finalise and I started getting alerts. However, the band is suddenly very narrow, unlike the one illustrated before. As a result, the alarms are going off, seemingly multiple times per day. I'm not sure if this is still a case of waiting longer.

    I tried editing the model and increasing the anomaly detection threshold (it was 1.1, and I tried increasing it to 2 as a preview), but the band is still very narrow. At present I get around 1100 invocations per 24 hours, and I'd ideally like an alert if there is more than a +/- 5-10% deviation, as that might be indicative of a problem with the external source.

    In the screenshot, you can see that the data source provides fairly consistent data from day to day, and that prior to the training completion the illustrated spread was far wider than expected. Now the width of the spread is about ideal, but the band sits higher than the trend, so it's causing an alarm:

    - Reason for State Change:    Thresholds Crossed: 1 out of the last 1 datapoints [1156.0 (08/04/24 06:48:00)] was less than the lower thresholds [1183.6088329700558] or greater than the upper thresholds [1211.9662714629724] (minimum 1 datapoint for OK -> ALARM transition).
    - Timestamp:                  Tuesday 09 April, 2024 06:48:00 UTC
    

    Screenshot of Alarm detail
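
To put numbers on "very narrow", here is a small worked sketch using only the figures from the state-change reason above. It compares the reported band with the +/- 5-10% tolerance mentioned earlier. The helper names `band_summary` and `outside_percent_band` are hypothetical, and the fixed-percentage check is a plain arithmetic illustration rather than anything CloudWatch anomaly detection does itself.

```python
def band_summary(lower, upper, value):
    """Return (centre, half-width in %, deviation of value from centre in %)."""
    centre = (lower + upper) / 2
    half_width_pct = (upper - lower) / 2 / centre * 100
    deviation_pct = (value - centre) / centre * 100
    return centre, half_width_pct, deviation_pct


def outside_percent_band(value, baseline, tolerance_pct):
    """True if value deviates from baseline by more than tolerance_pct percent."""
    return abs(value - baseline) > baseline * tolerance_pct / 100


# Figures copied from the alarm's "Reason for State Change".
LOWER = 1183.6088329700558
UPPER = 1211.9662714629724
VALUE = 1156.0

centre, half_width_pct, deviation_pct = band_summary(LOWER, UPPER, VALUE)
print(f"band centre:     {centre:.1f}")
print(f"band half-width: {half_width_pct:.2f}%")
print(f"datapoint is {deviation_pct:.2f}% from the band centre")

# Assumption: 1100 is used as the baseline because it is the approximate
# daily invocation count quoted in the comment above.
for tolerance in (5, 10):
    flagged = outside_percent_band(VALUE, 1100, tolerance)
    print(f"+/-{tolerance}% around 1100 -> alarm: {flagged}")
```

The reported band works out to roughly +/- 1.2% around its centre, and the 1156 datapoint sits about 3.5% below that centre. Against a baseline of 1100, the same datapoint is 56 above, which is just over 5% but well within 10%.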
