Capacity Provider never scales container instances on AWS Batch Unmanaged ECS


I am trying to implement ECS Autoscaling with Capacity Provider in an AWS Batch Unmanaged Compute Environment.

The following CloudFormation template was used to create the environment. The initial Desired Capacity of AutoScalingGroup is 0.

I submitted a job to AWS Batch, but the Capacity Provider does not scale out any container instances, so the job stays stuck in the RUNNABLE state. If I then manually increase the Desired Capacity of the Auto Scaling group, container instances launch and the job runs.
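
The manual workaround mentioned above (raising the Desired Capacity by hand) could be scripted roughly as follows with boto3, the same SDK the Lambda function in the template uses. This is only a sketch: the group name `test-batch-asg` is a placeholder (the template lets CloudFormation generate the real name), and the helper names `build_desired_capacity_request` and `apply_desired_capacity` are hypothetical.

```python
def build_desired_capacity_request(asg_name, desired_capacity):
    """Build keyword arguments for the Auto Scaling SetDesiredCapacity call."""
    if desired_capacity < 0:
        raise ValueError("desired_capacity must be >= 0")
    return {
        "AutoScalingGroupName": asg_name,  # placeholder name, see note above
        "DesiredCapacity": desired_capacity,
        # Manual override, so do not wait for a cooldown period to finish.
        "HonorCooldown": False,
    }


def apply_desired_capacity(request):
    """Send the request to AWS; needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("autoscaling").set_desired_capacity(**request)
```

Calling `apply_desired_capacity(build_desired_capacity_request("test-batch-asg", 1))` would request one instance. It only works around the problem and does not make the Capacity Provider react to Batch jobs.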

Also, if I run an ECS task manually while the Desired Capacity of the Auto Scaling group is 0, the Capacity Provider raises the Desired Capacity and container instances are launched.
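
For comparison, a manual ECS task run of the kind described above might look like the sketch below. It assumes the task is started with an explicit capacity provider strategy rather than a launch type; the question does not say exactly how the task was launched, so treat that as an assumption. The task definition name and the helper names `build_run_task_request` and `run_task` are hypothetical.

```python
def build_run_task_request(cluster_arn, task_definition, capacity_provider):
    """Build keyword arguments for ecs.run_task using a capacity provider strategy."""
    return {
        "cluster": cluster_arn,
        "taskDefinition": task_definition,  # hypothetical task definition name
        "count": 1,
        # launchType is deliberately omitted: it must not be combined
        # with capacityProviderStrategy in the same RunTask request.
        "capacityProviderStrategy": [
            {"capacityProvider": capacity_provider, "weight": 1, "base": 0}
        ],
    }


def run_task(request):
    """Send the request to AWS; needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builder works without the SDK

    return boto3.client("ecs").run_task(**request)
```

The weight and base values mirror the `DefaultCapacityProviderStrategy` in the template, so the request should behave like a task that relies on the cluster default.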

What should I change so that submitting a job to AWS Batch causes the Capacity Provider to scale out container instances and run the job?

[CloudFormation Template]:

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  AWS Batch Unmanaged ECS Capacity Provider Test

Parameters:
  ServiceName:
    Type: String
    Default: "test-batch-unmanaged"
  AvailabilityZone:
    Type: String
    Default: "ap-northeast-1a"
  BatchInstanceAMI:
    Type: AWS::EC2::Image::Id
    Description: Batch ECS Instance AMI
    Default: ami-0049422eda1bb52a7 # ECS Optimized AMI

Resources:
  BatchVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.123.0.0/24
      EnableDnsSupport: true
      EnableDnsHostnames: true
      InstanceTenancy: default
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-vpc"

  BatchInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
                - spotfleet.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  
  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: "/"
      Roles:
        - !Ref BatchInstanceRole
  
  BatchInstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: !Ref BatchVPC
      GroupDescription: "Youtube Transcriber Batch Security Group"
      SecurityGroupIngress:
        - IpProtocol: "tcp"
          FromPort: "22"
          ToPort: "22"
          CidrIp: 0.0.0.0/0

  JobRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
  
  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref BatchVPC
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-public-route"
  
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref BatchVPC
      CidrBlock: 10.123.0.0/26
      AvailabilityZone: !Ref AvailabilityZone
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-public-subnet"

  PublicSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet
      RouteTableId: !Ref PublicRouteTable

  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-igw"

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref BatchVPC
      InternetGatewayId: !Ref InternetGateway
  
  PublicRoutes:
    Type: AWS::EC2::Route
    DependsOn: AttachGateway
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  FleetRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - spotfleet.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole
  
  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - batch.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

  ComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: UNMANAGED
      ServiceRole: !GetAtt BatchServiceRole.Arn
      ComputeEnvironmentName: !Sub "${ServiceName}-ce-${BatchInstanceAMI}"
      State: ENABLED

  EcsClusterArnOfCELambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        - arn:aws:iam::aws:policy/AWSBatchFullAccess

  EcsClusterArnOfCELambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: CustomResourceEcsClusterArnOfCE
      Handler: index.lambda_handler
      Runtime: python3.9
      Role: !GetAtt EcsClusterArnOfCELambdaRole.Arn
      MemorySize: 128
      Timeout: 300
      Code:
        ZipFile: |
          import boto3
          import logging
          from time import sleep  # needed by the polling loop in lambda_handler

          logger = logging.getLogger("EcsClusterArnOfCE")
          logger.setLevel(logging.INFO)
          batchClient = boto3.client('batch')

          def lambda_handler(event, context):
            logger.info(event)

            import cfnresponse
            try:
              if event['RequestType'] == 'Delete':
                cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Response': 'Success', 'EcsClusterArn': '' })
                return
              
              # Following Create or Update
              # CloudFormation passes property values as strings, so compare text instead of using bool()
              isWaitForValid = str(event['ResourceProperties'].get('WaitForValid', 'true')).lower() == 'true'
              ceName = event['ResourceProperties']['CEName']

              while True:
                response = batchClient.describe_compute_environments(
                  computeEnvironments = [
                    ceName
                  ]
                )
                logger.info(response)
                ce = response['computeEnvironments'][0]
                if not isWaitForValid or ce['status'] == 'VALID':
                  break
                logger.info('wait for status to valid')
                logger.info(ce)
                sleep(5)
              
              ecsClusterArn = ce['ecsClusterArn']
              if ecsClusterArn:
                cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Response': 'Success',  'EcsClusterArn': ecsClusterArn})
              else:
                logger.error("EcsClusterArn is null")
                cfnresponse.send(event, context, cfnresponse.FAILED, {'Response': 'Failure', 'EcsClusterArn': ''})
            
            except Exception as e:
              logger.error(e)
              cfnresponse.send(event, context, cfnresponse.FAILED, {'Response': 'Failure', 'EcsClusterArn': ''})

  EcsClusterArnOfCE:
    Type: Custom::EcsClusterArnOfCE
    Properties:
      ServiceToken: !GetAtt EcsClusterArnOfCELambda.Arn
      CEName: !Ref ComputeEnvironment
      WaitForValid: True

  BatchComputeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub "${ServiceName}-batch-launch-template"
      LaunchTemplateData:
        ImageId: !Ref BatchInstanceAMI
        IamInstanceProfile:
          Arn: !GetAtt BatchInstanceProfile.Arn
        InstanceType: t3.micro
        InstanceMarketOptions:
          MarketType: spot
          SpotOptions:
            SpotInstanceType: one-time
        EbsOptimized: True
        UserData:
          Fn::Base64:
            !Sub |
              #!/bin/bash
              cat <<'EOF' >> /etc/ecs/ecs.config
              ECS_CLUSTER=${EcsClusterArnOfCE.EcsClusterArn}
              EOF
  
  ASGCompute:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      CapacityRebalance: True
      MinSize: 0
      MaxSize: 5
      NewInstancesProtectedFromScaleIn: False
      LaunchTemplate:
        LaunchTemplateId: !Ref BatchComputeLaunchTemplate
        Version: !GetAtt BatchComputeLaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier:
        - !Ref PublicSubnet
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-batch-asg"
          PropagateAtLaunch: True
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: True

  BatchCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref ASGCompute
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 100
          MaximumScalingStepSize: 10
          MinimumScalingStepSize: 1
          InstanceWarmupPeriod: 60
        ManagedTerminationProtection: DISABLED
        ManagedDraining: ENABLED
  
  BatchCapacityProviderAssociations:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      CapacityProviders:
        - !Ref BatchCapacityProvider
      Cluster: !GetAtt EcsClusterArnOfCE.EcsClusterArn
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref BatchCapacityProvider
          Weight: 1
          Base: 0

  BatchJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: !Sub "${ServiceName}-job-queue"
      ComputeEnvironmentOrder:
        - ComputeEnvironment: !Ref ComputeEnvironment
          Order: 1
      Priority: 1
      State: ENABLED

  BatchJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      JobDefinitionName: !Sub "${ServiceName}-batch"
      Parameters:
        Param: 'test'
      ContainerProperties:
        Command:
          - echo
          - 'Ref::Param'
        ResourceRequirements:
          - Type: MEMORY
            Value: 256
          - Type: VCPU
            Value: 1
        JobRoleArn: !GetAtt JobRole.Arn
        Image: !Sub "busybox:latest"
      Timeout:
        AttemptDurationSeconds: 3600
      RetryStrategy:
        Attempts: 1

Outputs:
  BatchJobQueue:
    Value: !Ref BatchJobQueue
  BatchJobDefinition:
    Value: !Ref BatchJobDefinition

[Reproduction code (CLI)]:

STACK_NAME=batch-unmanaged-test

# create stack
STACK_ARN=$(aws cloudformation create-stack --stack-name $STACK_NAME --template-body file://`pwd`/batch-stack-template.yaml --capabilities CAPABILITY_NAMED_IAM | jq -r .StackId)


# wait for complete
aws cloudformation wait stack-create-complete --stack-name $STACK_ARN


# read parameter from stack outputs
BATCH_JOB_QUEUE=$(aws cloudformation describe-stacks --stack-name $STACK_ARN | jq -r '.Stacks[0].Outputs[] | select(.OutputKey == "BatchJobQueue") | .OutputValue')
BATCH_JOB_DEFINITION=$(aws cloudformation describe-stacks --stack-name $STACK_ARN | jq -r '.Stacks[0].Outputs[] | select(.OutputKey == "BatchJobDefinition") | .OutputValue')


# submit batch job (submission succeeds, but the job never runs because the capacity provider doesn't scale container instances)
aws batch submit-job --job-name batch-submit-test --job-queue $BATCH_JOB_QUEUE --job-definition $BATCH_JOB_DEFINITION


# delete stack
# aws cloudformation delete-stack --stack-name $STACK_NAME

  • Experiencing the same thing. AWS Batch is quite confusing when you try to understand how it interacts with Auto Scaling.

No Answers
