Autoscaling in an ECS service does not behave as expected, causing target deregistration and service downtime


Hello,

I have an ECS service running an NLP model for inference. The service has the following scaling policies:

  ChatSpamV3ScalingPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "CPU", "at85", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ECSServiceAverageCPUUtilization"
        ScaleInCooldown: 40
        ScaleOutCooldown: 60
        TargetValue: 40

  ChatSpamV3ScalingRAMPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "RAM", "at", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ECSServiceAverageMemoryUtilization"
        ScaleInCooldown: 300
        ScaleOutCooldown: 100
        TargetValue: 90

  ChatSpamV3InternalTargetGroup:
    Type: "AWS::ElasticLoadBalancingV2::TargetGroup"
    Properties:
      HealthCheckIntervalSeconds: 40
      HealthCheckPath: "/ping"
      Port: 8080
      Protocol: "HTTP"
      HealthCheckPort: "traffic-port"
      HealthCheckProtocol: "HTTP"
      HealthCheckTimeoutSeconds: 15
      UnhealthyThresholdCount: 7
      TargetType: "ip"
      Matcher:
        HttpCode: "200"
      HealthyThresholdCount: 3
      VpcId: "vpc-ddc697ba"
      Name: !Join [ '-', [ 'ml', !Ref ChatSpamV3ServiceName ] ]
      HealthCheckEnabled: true
      TargetGroupAttributes:
        - Key: "stickiness.enabled"
          Value: "false"
        - Key: "deregistration_delay.timeout_seconds"
          Value: "30"
        - Key: "stickiness.type"
          Value: "lb_cookie"
        - Key: "stickiness.lb_cookie.duration_seconds"
          Value: "86400"
        - Key: "slow_start.duration_seconds"
          Value: "0"
        - Key: "load_balancing.algorithm.type"
          Value: "least_outstanding_requests"

As you can see, there are policies on CPU and memory, and I have also experimented with a request-count-per-target policy.
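For reference, the request-count experiment was along these lines; the policy name, resource label, and target value below are illustrative placeholders rather than my exact configuration:

  ChatSpamV3ScalingRequestPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "REQ", "at", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ALBRequestCountPerTarget"
          # ResourceLabel is the ALB ARN suffix plus the target group ARN suffix:
          # app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id> (placeholder below)
          ResourceLabel: "app/my-alb/0123456789abcdef/targetgroup/ml-chatspam-v3/0123456789abcdef"
        ScaleInCooldown: 300
        ScaleOutCooldown: 60
        TargetValue: 100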

Now, the issue I face is that occasionally one of the tasks blows up. I'm not sure why, but it does not appear to be a bug in the code; it seems to be an issue with the load (perhaps a memory leak?).

Based on the autoscaling policies, I would expect the service to provision and register another task, or to scale out before the task blows up. Instead, the service never scales out, and while the first task crashes, the extra traffic pushed onto the remaining tasks causes them to crash as well.

I end up with 0 healthy tasks running, even though there is spare capacity for up to 20 tasks, which is never used. Does anyone know why this happens, why the crashed task is not replaced soon enough to prevent a cascade of failures, and why the service does not scale out before tasks start failing?

1 Answer

Hello.

Have you defined an "AWS::ApplicationAutoScaling::ScalableTarget" for the service?
https://docs.aws.amazon.com/ja_jp/AWSCloudFormation/latest/UserGuide/aws-resource-applicationautoscaling-scalabletarget.html
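If it is missing, a minimal sketch of the scalable target for an ECS service could look like the following (the ClusterName parameter and the capacity values are placeholders; when RoleARN is omitted, the Application Auto Scaling service-linked role for ECS is used):

  ChatSpamV3AppAutoScalingScalableTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      ServiceNamespace: "ecs"
      ScalableDimension: "ecs:service:DesiredCount"
      # ResourceId format: service/<cluster-name>/<service-name>
      ResourceId: !Join [ "/", [ "service", !Ref ClusterName, !Ref ChatSpamV3ServiceName ] ]
      MinCapacity: 2
      MaxCapacity: 20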

Auto scaling copes well when the load increases gradually, but I think it struggles when the load spikes rapidly.
If you know in advance when the load is most likely to occur, I would suggest taking measures such as increasing the number of tasks ahead of time, for example with a scheduled scaling action, as sketched below.
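A scheduled action on the scalable target can raise the minimum task count before a known peak; the action name, cron expression, and capacities below are only an example:

  ChatSpamV3AppAutoScalingScalableTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      ServiceNamespace: "ecs"
      ScalableDimension: "ecs:service:DesiredCount"
      ResourceId: !Join [ "/", [ "service", !Ref ClusterName, !Ref ChatSpamV3ServiceName ] ]
      MinCapacity: 2
      MaxCapacity: 20
      ScheduledActions:
        # Raise the floor before the expected peak so extra tasks are already warm
        - ScheduledActionName: "pre-warm-for-evening-peak"
          # Format: cron(minute hour day-of-month month day-of-week year), evaluated in UTC by default
          Schedule: "cron(0 17 * * ? *)"
          ScalableTargetAction:
            MinCapacity: 6
            MaxCapacity: 20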

EXPERT
Answered 2 months ago
