CloudWatch alarm stays in the 'In Alarm' state even after the instance count comes down to zero


I have an AWS SageMaker asynchronous endpoint with auto scaling attached to it. The expected behavior is to scale in to zero instances when there are no requests and scale out when at least one request comes in. It is working as expected. However, when I check the CloudWatch alarms, I can see that the scale-in alarm remains in the 'In Alarm' state even after the instance count has reduced to zero. Is this normal behavior? How can I make its state OK once the instance count reaches 0? I suspect this is delaying the scale-out action when a new request comes in. My autoscaling configuration is as follows:

# application-autoscaling client
asg_client = boto3.client("application-autoscaling")

# This is the format in which application autoscaling references the endpoint
resource_id = f"endpoint/{async_predictor.endpoint_name}/variant/AllTraffic"

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = asg_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5,
)

response = asg_client.put_scaling_policy(
    PolicyName=f"HasBacklogWithoutCapacity-ScalingPolicy-{async_predictor.endpoint_name}",
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only instance count
    PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",  # Whether ScalingAdjustment is an absolute number or a percentage of current capacity
        "MetricAggregationType": "Average",  # The aggregation type for the CloudWatch metrics
        "Cooldown": 300,  # Seconds to wait for a previous scaling activity to take effect
        "StepAdjustments": [  # Adjustments that scale based on the size of the alarm breach
            {
                "MetricIntervalLowerBound": 0,
                "ScalingAdjustment": 1,
            }
        ],
    },
)

cw_client = boto3.client('cloudwatch')
step_scaling_policy_arn = response['PolicyARN']

response = cw_client.put_metric_alarm(
    AlarmName=f"step_scaling_policy_alarm_name-{async_predictor.endpoint_name}",
    MetricName="HasBacklogWithoutCapacity",
    Namespace="AWS/SageMaker",
    Statistic="Average",
    EvaluationPeriods=2,
    DatapointsToAlarm=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="missing",
    Dimensions=[
        {"Name": "EndpointName", "Value": async_predictor.endpoint_name},
    ],
    Period=60,
    AlarmActions=[step_scaling_policy_arn],
)

response_scalein = asg_client.put_scaling_policy(
    PolicyName=f"scaleinpolicy-{async_predictor.endpoint_name}",
    ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only instance count
    PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",  # Whether ScalingAdjustment is an absolute number or a percentage of current capacity
        "MetricAggregationType": "Average",  # The aggregation type for the CloudWatch metrics
        "Cooldown": 300,  # Seconds to wait for a previous scaling activity to take effect
        "StepAdjustments": [  # Adjustments that scale based on the size of the alarm breach
            {
                "MetricIntervalUpperBound": 0,
                "ScalingAdjustment": -1,
            }
        ],
    },
)

cw_client = boto3.client('cloudwatch')
stepin_scaling_policy_arn = response_scalein['PolicyARN']

response = cw_client.put_metric_alarm(
    AlarmName=f"step_scale-in_policy-{async_predictor.endpoint_name}",
    MetricName="ApproximateBacklogSizePerInstance",
    Namespace="AWS/SageMaker",
    Statistic="Average",
    EvaluationPeriods=2,
    DatapointsToAlarm=2,
    Threshold=0.5,
    ComparisonOperator="LessThanOrEqualToThreshold",
    TreatMissingData="missing",
    Dimensions=[
        {"Name": "EndpointName", "Value": async_predictor.endpoint_name},
    ],
    Period=60,
    AlarmActions=[stepin_scaling_policy_arn],
)
Neethu
Asked 1 year ago · 398 views
2 answers

Hello,

Regarding

"I doubt this is delaying the scale-out action when a new request comes"

If an alarm doesn't change states, it doesn't trigger Auto Scaling policies.

You may want to refer to https://repost.aws/knowledge-center/autoscaling-policy-cloudwatch-alarm

Also, check the Autoscaling tab for Details/Activity to verify if there is some inconsistency.

Evaluation periods, threshold values, and global timeouts can also be checked, as these factors influence when a CloudWatch alarm changes state.

I see you're using StepScaling; it could be adding delays, since no scaling event occurs during the cooldown period. Try adjusting the threshold, or use TargetTrackingScaling to maintain an average backlog (or another metric) as needed, and see if that helps produce faster transitions.
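To illustrate the TargetTrackingScaling suggestion, here is a minimal sketch against the boto3 Application Auto Scaling API. The endpoint name and target value are assumptions, not the asker's real values; with target tracking, Application Auto Scaling creates and manages the CloudWatch alarms itself.

```python
# Hypothetical endpoint name -- substitute your own.
endpoint_name = "my-async-endpoint"
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Keep ApproximateBacklogSizePerInstance near the target value;
# Application Auto Scaling adds instances when the backlog per
# instance rises above it and removes them when it falls below.
target_tracking_config = {
    "TargetValue": 5.0,  # illustrative: desired backlog items per instance
    "CustomizedMetricSpecification": {
        "MetricName": "ApproximateBacklogSizePerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "Statistic": "Average",
    },
    "ScaleInCooldown": 120,
    "ScaleOutCooldown": 60,
}

def register_target_tracking():
    # Requires AWS credentials and an already-registered scalable target.
    import boto3
    asg_client = boto3.client("application-autoscaling")
    return asg_client.put_scaling_policy(
        PolicyName=f"BacklogTargetTracking-{endpoint_name}",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=target_tracking_config,
    )
```

This would replace the hand-built step-scaling alarms with alarms that Application Auto Scaling keeps in sync on its own.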

HTH!

Answered 1 year ago

Yes, this behavior is expected. A CloudWatch alarm remains in the "In Alarm" state until its condition is no longer met, even if the instance count has reached zero. In this case, the scale-in alarm watches "ApproximateBacklogSizePerInstance" with a LessThanOrEqualToThreshold comparison, so as long as the backlog per instance stays at or below the threshold of 0.5 (which it does once the endpoint is idle), the alarm stays in "In Alarm"; it only transitions back to "OK" when the metric rises above the threshold.

To avoid delaying the scale-out action when a new request comes, you can reduce the cooldown period in the scaling policies. Currently the cooldown is set to 300 seconds, which means that after a scaling activity, Application Auto Scaling waits 300 seconds before performing another scaling activity of the same type. You can reduce this to a lower value, such as 60 seconds, to make scaling more responsive to changes in demand.
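The cooldown change described above can be sketched by re-issuing the scale-out policy with a shorter `Cooldown`. The policy and endpoint names follow the question's code; the 60-second value is illustrative:

```python
# Mirrors the question's scale-out policy, with a shorter cooldown.
scale_out_config = {
    "AdjustmentType": "ChangeInCapacity",
    "MetricAggregationType": "Average",
    "Cooldown": 60,  # was 300; shorter cooldown -> more responsive scaling
    "StepAdjustments": [
        {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1},
    ],
}

def shorten_scale_out_cooldown(endpoint_name):
    # Calling put_scaling_policy with an existing PolicyName overwrites
    # that policy, so this updates the cooldown in place.
    # Requires AWS credentials.
    import boto3
    asg_client = boto3.client("application-autoscaling")
    return asg_client.put_scaling_policy(
        PolicyName=f"HasBacklogWithoutCapacity-ScalingPolicy-{endpoint_name}",
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration=scale_out_config,
    )
```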

Answered 1 year ago
  • Thank you for the answer. So in that case, when a new request comes in even after the cooldown period, the CloudWatch alarm that monitors the backlog size first changes its state from 'In Alarm' to 'OK'. Only after that does the alarm monitoring HasBacklogWithoutCapacity change from 'OK' to 'In Alarm', and only then does the scale-out action occur. This is creating a delay in scaling up. Is there any workaround to give priority to the scale-out alarm?

  • Changing the cooldown shouldn't be needed here. A scale-in cooldown doesn't block a scale-out action: https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html#step-scaling-cooldown

    The two alarms aren't connected to each other; the scale-out alarm doesn't have to wait for the scale-in alarm to change states. However, the conditions for the scale-out alarm have to be met before it can change to the 'ALARM' state. Based on your alarm settings, that means waiting for 2 consecutive breaching minutes.
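To confirm that the two alarms transition independently, the recent state changes of each alarm can be inspected with the CloudWatch API. A minimal sketch; the alarm name follows the naming pattern in the question's code:

```python
def recent_alarm_transitions(alarm_name, max_records=10):
    # Returns recent (timestamp, summary) state transitions for one
    # alarm, newest first. Requires AWS credentials.
    import boto3
    cw = boto3.client("cloudwatch")
    history = cw.describe_alarm_history(
        AlarmName=alarm_name,
        HistoryItemType="StateUpdate",  # only state transitions
        MaxRecords=max_records,
        ScanBy="TimestampDescending",
    )
    return [
        (item["Timestamp"], item["HistorySummary"])
        for item in history["AlarmHistoryItems"]
    ]

# Example (hypothetical endpoint name):
# recent_alarm_transitions("step_scaling_policy_alarm_name-my-async-endpoint")
```

Comparing the timestamps of the scale-out alarm's OK-to-ALARM transition against the request arrival time shows whether the delay comes from the two-datapoint evaluation window rather than from the other alarm.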
