"Insufficient data" for alarm created for SageMaker Endpoint

I wanted to create a SageMaker endpoint that autoscales based on CPU usage. At first I created a StepScaling policy like this:

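import time

import boto3

# Application Auto Scaling client; put_scaling_policy is called on it below.
client = boto3.client("application-autoscaling")


def set_step_scaling(endpoint_name, variant_name):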
    policy_name = "step-scaling-{}".format(str(round(time.time())))
    resource_id = "endpoint/{}/variant/{}".format(endpoint_name, variant_name)
    response = client.put_scaling_policy(
        PolicyName=policy_name,
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "StepAdjustments": [
                {
                    "MetricIntervalLowerBound": 0,
                    "MetricIntervalUpperBound": 60,
                    "ScalingAdjustment": -1,
                },
                {"MetricIntervalLowerBound": 60, "ScalingAdjustment": 1},
            ],
            "MetricAggregationType": "Average",
            "Cooldown": 10,
        },
    )
    return policy_name, response

However, I found that the StepScaling policy alone would not scale the endpoint. (TargetTrackingScaling kind of works, but it's a managed policy, which doesn't allow editing the period etc.) I can't tell exactly what's going on, but I noticed one difference between StepScaling and TargetTrackingScaling: the latter creates an alarm for the endpoint and the former doesn't, so I suspected I needed to create one manually. So I did this:

_, response = set_step_scaling(endpoint_name=endpoint_name, variant_name=variant_name)

step_scaling_policy_arn = response['PolicyARN']

cw_client = boto3.client('cloudwatch')
response = cw_client.put_metric_alarm(
    AlarmName=f'step_scaling_policy_alarm_name_-{endpoint_name}',
    MetricName='CPUUtilization',
    Namespace='/aws/sagemaker/Endpoints',
    Statistic='Average',
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=70,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='missing',
    Dimensions=[
        { 'Name':'EndpointName', 'Value':endpoint_name },
    ],
    Period=60,
    Unit="Percent",
    AlarmActions=[step_scaling_policy_arn]
)

This does create an alarm for the endpoint (I can confirm it in the console), but the alarm never has any data: it always stays in the "Insufficient data" state.

BUT the alarm created through TargetTrackingScaling seems to be fine: it moves between the alarm and OK states, presumably because it has data (but as I mentioned, its period setup doesn't fit my needs). And if I create an alarm from scratch in the UI, I can see the CPUUtilization data too, but I can't associate it with an autoscaling action through the UI.

Any tips?


Update:

So I just made a hack: I copied the alarm that was auto-generated by TargetTrackingScaling and edited it to trigger the StepScaling policy (I don't know if this even makes sense). Anyway, the new alarm was able to get the CPU usage data and move between the 'In alarm' and 'OK' states as expected. The issue now is that I still don't see new instances being started even when the alarm is in the 'In alarm' state. Am I supposed to NOT create this alarm manually at all? Or how else should a StepScaling policy work?

2 Answers
Accepted Answer

A CloudWatch metric is a unique combination of a Namespace, MetricName, Dimension(s), and Unit.

Some metrics won't be pushed with any Dimensions or a Unit, but when they are, the full list must match exactly, or it's a different metric. SageMaker pushes these metrics with EndpointName + VariantName as 2 dimensions on the metric, so an alarm that specifies only EndpointName is watching a different metric, one that has no data: https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-endpoint-invocation
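To make the dimension point concrete, here is a minimal sketch of the alarm parameters with both dimensions set. The helper name `build_cpu_alarm_params` and the alarm naming scheme are made up for illustration, and leaving out `Unit` is my own assumption (so the alarm doesn't depend on a unit match), not something the question or the docs require. It only builds the keyword arguments; nothing is sent to AWS.

```python
def build_cpu_alarm_params(endpoint_name, variant_name, policy_arn, threshold=70.0):
    """Build kwargs for CloudWatch put_metric_alarm (hypothetical helper).

    Both EndpointName and VariantName are included so that the alarm
    matches the metric as SageMaker publishes it.
    """
    return {
        # Alarm name format is an arbitrary choice for this sketch.
        "AlarmName": f"step-scaling-cpu-high-{endpoint_name}-{variant_name}",
        "Namespace": "/aws/sagemaker/Endpoints",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        # Unit is deliberately omitted (assumption, see the note above).
        "AlarmActions": [policy_arn],
    }
```

You would then pass the result to the CloudWatch client from the question, e.g. `cw_client.put_metric_alarm(**build_cpu_alarm_params(endpoint_name, variant_name, step_scaling_policy_arn))`.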

With Step Scaling, you are in charge of creating the alarm for the policy. With target tracking, you define the policy and AutoScaling will create/manage the alarms for you. Make sure the alarm is going into the ALARM state, and that its action actually got created correctly (pointing at the step scaling policy).

Generally you'll create 2 alarms: one linked to a step scale-in policy, and a second linked to a step scale-out policy (whereas with target tracking, you create a single policy which handles both scale-in and scale-out). This is likely why you're not seeing scaling happen as expected: the alarm action only triggers when usage is above 70, and the policy is set to -1 when that happens. The +1 adjustment on that policy will never trigger, since it's set to happen only when CPU is at 130% or above (step upper/lower bounds are relative to the alarm threshold, so 70 + 60 = 130): https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html#as-scaling-steps
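As a rough illustration of how the relative bounds play out, the sketch below resolves which step adjustment applies for a given alarm threshold and metric value. `resolve_adjustment` is a hypothetical helper written for this answer, not an AWS API. It follows the documented rule that above the threshold the lower bound is inclusive and the upper bound exclusive, and below the threshold the reverse. The scale-in threshold of 30 is an arbitrary example value.

```python
import math


def resolve_adjustment(threshold, metric_value, step_adjustments):
    """Return the ScalingAdjustment whose interval contains the breach.

    Bounds are offsets from the alarm threshold; a missing bound is
    treated as unbounded. Returns None if no step matches.
    """
    breach = metric_value - threshold
    for step in step_adjustments:
        lower = step.get("MetricIntervalLowerBound", -math.inf)
        upper = step.get("MetricIntervalUpperBound", math.inf)
        if breach >= 0:
            matched = lower <= breach < upper   # above threshold
        else:
            matched = lower < breach <= upper   # below threshold
        if matched:
            return step["ScalingAdjustment"]
    return None


# Steps from the question, attached to an alarm with threshold 70.
question_steps = [
    {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 60, "ScalingAdjustment": -1},
    {"MetricIntervalLowerBound": 60, "ScalingAdjustment": 1},
]

# One possible split into two policies, each with its own alarm.
scale_out_steps = [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}]   # alarm: CPU > 70
scale_in_steps = [{"MetricIntervalUpperBound": 0, "ScalingAdjustment": -1}]   # alarm: CPU < 30

print(resolve_adjustment(70, 80, question_steps))    # CPU 80 removes an instance
print(resolve_adjustment(70, 135, question_steps))   # +1 only from 130 upward
print(resolve_adjustment(70, 80, scale_out_steps))   # split policy scales out
print(resolve_adjustment(30, 20, scale_in_steps))    # split policy scales in
```

With the original steps, a CPU value of 80 falls in the 70 to 130 band and removes an instance; with the split policies, the same value adds one.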

AWS
answered 8 months ago
  • Thanks! Following your advice, I was able to fix it by creating 2 alarms manually, each mapped to one policy.

    "Step upper/lower bounds are relative to the alarm threshold": I wasn't aware of this, apparently.


Hi, did you try a longer period or a higher number of datapoints to alarm? If the alarm remains in insufficient data although you can see the metric's graph on the alarm detail page, it may be that the metric is ingested with a slightly higher delay than you expect, so the alarm doesn't see fresh data when it evaluates. If that is the case, you can work around it either by using a longer evaluation period, or by wrapping the metric in a FILL(m1, REPEAT) metric math function (see https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html). Whether that helps you or not, I'd also suggest you raise this as an issue with support.
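For the FILL(m1, REPEAT) option, the alarm would be defined on a metric math expression instead of directly on the metric. A minimal sketch of the parameters, assuming the same namespace, metric and threshold as in the question: the helper name `build_filled_alarm_params` and the alarm name are hypothetical, and this only builds the keyword arguments for `put_metric_alarm` without calling AWS.

```python
def build_filled_alarm_params(endpoint_name, variant_name, policy_arn, threshold=70.0):
    """Build kwargs for an alarm on FILL(m1, REPEAT) (hypothetical helper)."""
    return {
        # Alarm name format is an arbitrary choice for this sketch.
        "AlarmName": f"step-scaling-cpu-high-filled-{endpoint_name}-{variant_name}",
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [policy_arn],
        # With Metrics, the metric details move into MetricStat and are
        # not given as top-level MetricName/Namespace/Statistic/Period.
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "/aws/sagemaker/Endpoints",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [
                            {"Name": "EndpointName", "Value": endpoint_name},
                            {"Name": "VariantName", "Value": variant_name},
                        ],
                    },
                    "Period": 60,
                    "Stat": "Average",
                },
                "ReturnData": False,  # raw metric is only an input
            },
            {
                "Id": "e1",
                "Expression": "FILL(m1, REPEAT)",  # repeat last value over gaps
                "Label": "CPUUtilization (gaps filled)",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
    }
```

Exactly one entry has `ReturnData` set to true, and that is the series the alarm watches.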

AWS
Jsc
answered 8 months ago
  • Hey, thanks for the reply. I just made a hack: I copied the alarm that was auto-generated by TargetTrackingScaling and edited it to trigger the StepScaling policy (I don't know if this even makes sense).

    Anyway, the new alarm was able to get the CPU usage data and move between the 'In alarm' and 'OK' states as expected.

    The issue now is that I still don't see new instances being started even when the alarm is in the 'In alarm' state. Am I supposed to NOT create this alarm manually at all? Or how else should a StepScaling policy work?
