Autoscaling in ECS service does not behave as expected, causing target deregistration and service downtime


Hello,

I have an ECS service running an NLP model for inference. The service has the following scaling policies:

  ChatSpamV3ScalingPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "CPU", "at85", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ECSServiceAverageCPUUtilization"
        ScaleInCooldown: 40
        ScaleOutCooldown: 60
        TargetValue: 40

  ChatSpamV3ScalingRAMPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "RAM", "at", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ECSServiceAverageMemoryUtilization"
        ScaleInCooldown: 300
        ScaleOutCooldown: 100
        TargetValue: 90

  ChatSpamV3InternalTargetGroup:
    Type: "AWS::ElasticLoadBalancingV2::TargetGroup"
    Properties:
      HealthCheckIntervalSeconds: 40
      HealthCheckPath: "/ping"
      Port: 8080
      Protocol: "HTTP"
      HealthCheckPort: "traffic-port"
      HealthCheckProtocol: "HTTP"
      HealthCheckTimeoutSeconds: 15
      UnhealthyThresholdCount: 7
      TargetType: "ip"
      Matcher:
        HttpCode: "200"
      HealthyThresholdCount: 3
      VpcId: "vpc-ddc697ba"
      Name: !Join [ '-', [ 'ml', !Ref ChatSpamV3ServiceName ] ]
      HealthCheckEnabled: true
      TargetGroupAttributes:
        - Key: "stickiness.enabled"
          Value: "false"
        - Key: "deregistration_delay.timeout_seconds"
          Value: "30"
        - Key: "stickiness.type"
          Value: "lb_cookie"
        - Key: "stickiness.lb_cookie.duration_seconds"
          Value: "86400"
        - Key: "slow_start.duration_seconds"
          Value: "0"
        - Key: "load_balancing.algorithm.type"
          Value: "least_outstanding_requests"

As you can see, there are policies on CPU and memory, and I have also experimented with a request-count-per-target policy.
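For completeness, the request-count experiment looked roughly like the sketch below. Note that `ALBRequestCountPerTarget` requires a `ResourceLabel` identifying the load balancer and target group; the `ChatSpamV3LoadBalancer` resource name and the target value here are illustrative, not from my actual template:

```yaml
  ChatSpamV3RequestCountPolicy:
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "ReqCount", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ALBRequestCountPerTarget"
          # ResourceLabel format:
          #   app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
          ResourceLabel: !Join
            - "/"
            - - !GetAtt ChatSpamV3LoadBalancer.LoadBalancerFullName   # illustrative resource
              - !GetAtt ChatSpamV3InternalTargetGroup.TargetGroupFullName
        TargetValue: 1000   # requests per target per minute; tune to your workload
```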

Now the issue I face is that occasionally one of the tasks blows up. I am not sure why, but it is not a bug in the code; it seems to be an issue with the load (perhaps a memory leak?).

Based on the autoscaling configuration, I would expect the service to provision and register another task, or to scale out before the task blows up. Instead, the service never scales out: while the first task crashes, the extra traffic shifted onto the remaining tasks causes them to crash as well.

I end up with 0 healthy tasks running even though there is spare capacity for up to 20 tasks, which are never spun up. Does anyone know why this happens, why the task that crashes is not replaced soon enough to prevent a cascade of failures, and why the service does not scale out before tasks start failing?

1 Answer

Hello.

Have you set "AWS::ApplicationAutoScaling::ScalableTarget"?
https://docs.aws.amazon.com/ja_jp/AWSCloudFormation/latest/UserGuide/aws-resource-applicationautoscaling-scalabletarget.html
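If not, a minimal scalable target for an ECS service looks roughly like the sketch below. The cluster parameter name and the minimum capacity are illustrative; your template already references `ChatSpamV3AppAutoScalingScalableTarget`, and you mention headroom for up to 20 tasks:

```yaml
  ChatSpamV3AppAutoScalingScalableTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      ServiceNamespace: "ecs"
      ScalableDimension: "ecs:service:DesiredCount"
      # ResourceId format: service/<cluster-name>/<service-name>
      ResourceId: !Join [ "/", [ "service", !Ref ClusterName, !Ref ChatSpamV3ServiceName ] ]   # ClusterName is illustrative
      MinCapacity: 2
      MaxCapacity: 20
      RoleARN: !Sub "arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService"
```

Without a scalable target, the scaling policies have nothing to act on and the service will never scale out.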

Auto scaling copes well when the load increases gradually, but I think it struggles when the load spikes rapidly.
If you know in advance the times when load is most likely to occur, you may need to take measures such as increasing the number of tasks ahead of time.
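Such pre-scaling can be expressed directly on the scalable target via its `ScheduledActions` property. The cron expression and capacities below are illustrative, just to show the shape:

```yaml
      # Fragment of an AWS::ApplicationAutoScaling::ScalableTarget resource
      ScheduledActions:
        - ScheduledActionName: "scale-out-before-peak"
          # Six-field cron: minute hour day-of-month month day-of-week year (UTC by default)
          Schedule: "cron(0 8 * * ? *)"
          ScalableTargetAction:
            MinCapacity: 6
            MaxCapacity: 20
```

This raises the floor before the expected peak, so the service does not have to react to a sudden spike from a cold start.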

EXPERT
answered a month ago
