Autoscaling in an ECS service does not behave as expected, causing target deregistration and service downtime


Hello,

I have an ECS service running an NLP model for inference. The service has the following scaling policy:

  ChatSpamV3ScalingPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "CPU", "at85", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ECSServiceAverageCPUUtilization"
        ScaleInCooldown: 40
        ScaleOutCooldown: 60
        TargetValue: 40

  ChatSpamV3ScalingRAMPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "RAM", "at", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ECSServiceAverageMemoryUtilization"
        ScaleInCooldown: 300
        ScaleOutCooldown: 100
        TargetValue: 90

  ChatSpamV3InternalTargetGroup:
    Type: "AWS::ElasticLoadBalancingV2::TargetGroup"
    Properties:
      HealthCheckIntervalSeconds: 40
      HealthCheckPath: "/ping"
      Port: 8080
      Protocol: "HTTP"
      HealthCheckPort: "traffic-port"
      HealthCheckProtocol: "HTTP"
      HealthCheckTimeoutSeconds: 15
      UnhealthyThresholdCount: 7
      TargetType: "ip"
      Matcher:
        HttpCode: "200"
      HealthyThresholdCount: 3
      VpcId: "vpc-ddc697ba"
      Name: !Join [ '-', [ 'ml', !Ref ChatSpamV3ServiceName ] ]
      HealthCheckEnabled: true
      TargetGroupAttributes:
        - Key: "stickiness.enabled"
          Value: "false"
        - Key: "deregistration_delay.timeout_seconds"
          Value: "30"
        - Key: "stickiness.type"
          Value: "lb_cookie"
        - Key: "stickiness.lb_cookie.duration_seconds"
          Value: "86400"
        - Key: "slow_start.duration_seconds"
          Value: "0"
        - Key: "load_balancing.algorithm.type"
          Value: "least_outstanding_requests"

As you can see, there are policies on CPU and memory, and I have also experimented with a request-count-per-target policy (see the sketch below).
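For reference, the request-count-per-target variant looked roughly like the following. The ResourceLabel, cooldowns, and TargetValue below are placeholders, not the exact values I used:

  ChatSpamV3RequestCountScalingPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "ReqPerTarget", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ALBRequestCountPerTarget"
          # Placeholder ResourceLabel: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
          ResourceLabel: "app/my-alb/1234567890abcdef/targetgroup/ml-chat-spam-v3/0123456789abcdef"
        ScaleInCooldown: 300
        ScaleOutCooldown: 60
        TargetValue: 100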

Now the issue I face is that occasionally one of the tasks blows up. I am not sure why, but it does not look like a bug in the code; it appears to be load-related (perhaps a memory leak?).

Based on the autoscaling configuration, I would expect the service to provision and register another task, or to scale out before the task blows up. Instead, the service never scales out, and when the first task crashes, the extra traffic pushed onto the remaining tasks causes them to crash as well.

I end up with 0 healthy tasks running, even though there is spare capacity for up to 20 tasks that never gets used. Does anyone know why this happens, why the crashed task is not replaced soon enough to prevent a cascade of failures, and why the service does not scale out before tasks start failing?

1 Answer

Hello.

Have you set "AWS::ApplicationAutoScaling::ScalableTarget"?
https://docs.aws.amazon.com/ja_jp/AWSCloudFormation/latest/UserGuide/aws-resource-applicationautoscaling-scalabletarget.html
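If it is not defined, a minimal sketch of what it could look like for an ECS service is below. The MinCapacity/MaxCapacity values and the ClusterName parameter are assumptions, not values from your template:

  ChatSpamV3AppAutoScalingScalableTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      MinCapacity: 2
      MaxCapacity: 20
      # ResourceId format for ECS: service/<cluster-name>/<service-name>
      ResourceId: !Join [ "/", [ "service", !Ref ClusterName, !Ref ChatSpamV3ServiceName ] ]
      ScalableDimension: "ecs:service:DesiredCount"
      ServiceNamespace: "ecs"
      # Assumes the Application Auto Scaling service-linked role for ECS
      RoleARN: !Sub "arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService"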

Auto scaling copes well when the load increases gradually, but I think it has difficulty keeping up when the load spikes rapidly.
If you know in advance the times when the load is most likely to occur, I think you need to take measures such as increasing the number of tasks beforehand (for example with a scheduled action on the scalable target, sketched below).
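For predictable peaks, one option is a ScheduledActions entry on the same scalable target that raises MinCapacity before the busy window. The cron expression and capacities here are placeholder assumptions:

  ChatSpamV3AppAutoScalingScalableTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      # ... ResourceId, ScalableDimension, ServiceNamespace, MinCapacity, MaxCapacity as above ...
      ScheduledActions:
        - ScheduledActionName: "pre-scale-before-peak"
          # Placeholder: every day at 08:00 UTC, ahead of the expected traffic spike
          Schedule: "cron(0 8 * * ? *)"
          ScalableTargetAction:
            MinCapacity: 6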

EXPERT
answered 2 months ago
