Autoscaling in an ECS service does not behave as expected, causing target deregistration and service downtime

Hello,

I have an ECS service running an NLP model for inference. The service has the following scaling policies and target group:

  ChatSpamV3ScalingPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "CPU", "at85", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ECSServiceAverageCPUUtilization"
        ScaleInCooldown: 40
        ScaleOutCooldown: 60
        TargetValue: 40

  ChatSpamV3ScalingRAMPolicy:
    DeletionPolicy: "Delete"
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: !Join [ "-", [ "RAM", "at", !Ref ChatSpamV3ServiceName ] ]
      PolicyType: "TargetTrackingScaling"
      ScalingTargetId: !Ref ChatSpamV3AppAutoScalingScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: "ECSServiceAverageMemoryUtilization"
        ScaleInCooldown: 300
        ScaleOutCooldown: 100
        TargetValue: 90

  ChatSpamV3InternalTargetGroup:
    Type: "AWS::ElasticLoadBalancingV2::TargetGroup"
    Properties:
      HealthCheckIntervalSeconds: 40
      HealthCheckPath: "/ping"
      Port: 8080
      Protocol: "HTTP"
      HealthCheckPort: "traffic-port"
      HealthCheckProtocol: "HTTP"
      HealthCheckTimeoutSeconds: 15
      UnhealthyThresholdCount: 7
      TargetType: "ip"
      Matcher:
        HttpCode: "200"
      HealthyThresholdCount: 3
      VpcId: "vpc-ddc697ba"
      Name: !Join [ '-', [ 'ml', !Ref ChatSpamV3ServiceName ] ]
      HealthCheckEnabled: true
      TargetGroupAttributes:
        - Key: "stickiness.enabled"
          Value: "false"
        - Key: "deregistration_delay.timeout_seconds"
          Value: "30"
        - Key: "stickiness.type"
          Value: "lb_cookie"
        - Key: "stickiness.lb_cookie.duration_seconds"
          Value: "86400"
        - Key: "slow_start.duration_seconds"
          Value: "0"
        - Key: "load_balancing.algorithm.type"
          Value: "least_outstanding_requests"

As you can see, there are policies on CPU and memory, and I have also experimented with a request-count-per-target policy.

The issue I face is that occasionally one of the tasks crashes. I am not sure why, but it does not appear to be a bug in the code; it seems to be related to the load (perhaps a memory leak?).

Based on the autoscaling policies, I would expect the service either to scale out before the task fails or to provision and register a replacement task. Instead, the service never scales out, and once the first task crashes, the extra traffic that piles up on the remaining tasks causes them to crash as well.

I end up with 0 healthy tasks running, even though there is headroom to start up to 20 tasks, which never happens. Does anyone know why this is the case? Why is the crashed task not replaced soon enough to prevent a cascade of failures, and why does the service not scale out before tasks start failing?

1 Answer

Hello.

Have you defined an "AWS::ApplicationAutoScaling::ScalableTarget" resource for the service?
https://docs.aws.amazon.com/ja_jp/AWSCloudFormation/latest/UserGuide/aws-resource-applicationautoscaling-scalabletarget.html

Auto Scaling copes well when the load increases gradually, but I think it struggles when the load increases rapidly.
If you know in advance when the load is most likely to spike, you would need to take measures such as increasing the number of tasks ahead of time.
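
As a rough illustration of raising the task count ahead of a known peak, scheduled actions can be attached to the scalable target. The fragment below is only a sketch under assumptions: `ChatSpamV3ClusterName` is a hypothetical parameter, the capacities and cron times are made-up values, and `RoleARN` is left out on the assumption that the Application Auto Scaling service-linked role is used. Schedules are interpreted in UTC unless a time zone is set.

```yaml
  # Sketch only: capacities, schedules, and ChatSpamV3ClusterName are assumptions.
  ChatSpamV3AppAutoScalingScalableTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      ServiceNamespace: "ecs"
      ScalableDimension: "ecs:service:DesiredCount"
      # ResourceId format for ECS is service/<cluster-name>/<service-name>.
      ResourceId: !Join [ "/", [ "service", !Ref ChatSpamV3ClusterName, !Ref ChatSpamV3ServiceName ] ]
      MinCapacity: 2
      MaxCapacity: 20
      ScheduledActions:
        # Raise the floor shortly before the expected weekday peak.
        - ScheduledActionName: "pre-scale-before-peak"
          Schedule: "cron(30 8 ? * MON-FRI *)"
          ScalableTargetAction:
            MinCapacity: 6
            MaxCapacity: 20
        # Lower the floor again once the peak has passed.
        - ScheduledActionName: "relax-after-peak"
          Schedule: "cron(0 20 ? * MON-FRI *)"
          ScalableTargetAction:
            MinCapacity: 2
            MaxCapacity: 20
```

The scheduled actions only move the minimum and maximum bounds; the target tracking policies keep adjusting the desired count within them.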

EXPERT
answered 2 months ago
