Capacity Provider never scales container instances on AWS Batch Unmanaged ECS
I am trying to implement ECS Autoscaling with Capacity Provider in an AWS Batch Unmanaged Compute Environment.
The following CloudFormation template was used to create the environment. The initial Desired Capacity of the AutoScalingGroup is 0.
I submitted a job to AWS Batch, but the Capacity Provider does not scale Container Instances, so the job is stuck in the Runnable state. In this state, if you manually increase the Desired Capacity of the AutoScalingGroup, the Container Instances will scale and the job will run.
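The manual workaround described above can be sketched in Python. This is only an illustration, not a fix for the underlying problem: the helper name `scale_out_asg` is hypothetical, and the client is assumed to be a boto3 Auto Scaling client (`boto3.client("autoscaling")`) passed in by the caller.

```python
def scale_out_asg(autoscaling_client, asg_name, desired_capacity):
    """Set the Auto Scaling group's desired capacity, capped at its MaxSize.

    autoscaling_client is assumed to be boto3.client("autoscaling");
    asg_name is the name of the group created by the ASGCompute resource.
    """
    groups = autoscaling_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"Auto Scaling group not found: {asg_name}")

    # Never ask for more instances than the group allows (MaxSize is 5 in the template).
    target = min(desired_capacity, groups[0]["MaxSize"])
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=target,
        HonorCooldown=False,
    )
    return target


# Example call (requires AWS credentials, so it is left commented out):
# import boto3
# scale_out_asg(boto3.client("autoscaling"), "<asg-name>", 1)
```

Once an instance registers with the cluster, the job that was stuck in Runnable should be placed, which matches the behavior I observed.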
Also, when the Desired Capacity of the AutoScalingGroup is 0, if you execute an ECS task manually, the Capacity Provider will change the Desired Capacity of the AutoScalingGroup and the Container Instances will be scaled.
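For comparison, running an ECS task manually against the same cluster might look like the sketch below. The helper name `run_task_with_capacity_provider` and its arguments are hypothetical; the client is assumed to be `boto3.client("ecs")`, and the task definition is assumed to be any EC2-compatible task definition registered separately from the Batch job definition.

```python
def run_task_with_capacity_provider(ecs_client, cluster_arn, task_definition, capacity_provider):
    """Run one ECS task using an explicit capacity provider strategy.

    ecs_client is assumed to be boto3.client("ecs"). Passing
    capacityProviderStrategy (instead of launchType) is what lets the
    capacity provider's managed scaling see the pending task.
    """
    response = ecs_client.run_task(
        cluster=cluster_arn,
        taskDefinition=task_definition,
        count=1,
        capacityProviderStrategy=[
            {"capacityProvider": capacity_provider, "weight": 1, "base": 0}
        ],
    )
    failures = response.get("failures", [])
    if failures:
        raise RuntimeError(f"run_task reported failures: {failures}")
    return [task["taskArn"] for task in response.get("tasks", [])]


# Example call (requires AWS credentials, so it is left commented out):
# import boto3
# run_task_with_capacity_provider(
#     boto3.client("ecs"), "<cluster-arn>", "<task-definition>", "<capacity-provider-name>")
```

A task started this way triggers scaling even when no container instance is running, whereas the Batch job does not.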
What should I change so that submitting a job to AWS Batch causes the Capacity Provider to scale out Container Instances and run the job?
[CloudFormation Template]:
AWSTemplateFormatVersion: '2010-09-09'
Description: >
  AWS Batch Unmanaged ECS Capacity Provider Test

Parameters:
  ServiceName:
    Type: String
    Default: "test-batch-unmanaged"
  AvailabilityZone:
    Type: String
    Default: "ap-northeast-1a"
  BatchInstanceAMI:
    Type: AWS::EC2::Image::Id
    Description: Batch ECS Instance AMI
    Default: ami-0049422eda1bb52a7 # ECS Optimized AMI

Resources:
  BatchVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.123.0.0/24
      EnableDnsSupport: true
      EnableDnsHostnames: true
      InstanceTenancy: default
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-vpc"

  BatchInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
                - spotfleet.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: "/"
      Roles:
        - !Ref BatchInstanceRole

  BatchInstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: !Ref BatchVPC
      GroupDescription: "Youtube Transcriber Batch Security Group"
      SecurityGroupIngress:
        - IpProtocol: "tcp"
          FromPort: "22"
          ToPort: "22"
          CidrIp: 0.0.0.0/0

  JobRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref BatchVPC
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-public-route"

  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref BatchVPC
      CidrBlock: 10.123.0.0/26
      AvailabilityZone: !Ref AvailabilityZone
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-public-subnet"

  PublicSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet
      RouteTableId: !Ref PublicRouteTable

  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-igw"

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref BatchVPC
      InternetGatewayId: !Ref InternetGateway

  PublicRoutes:
    Type: AWS::EC2::Route
    DependsOn: AttachGateway
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  FleetRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - spotfleet.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole

  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - batch.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

  ComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: UNMANAGED
      ServiceRole: !GetAtt BatchServiceRole.Arn
      ComputeEnvironmentName: !Sub "${ServiceName}-ce-${BatchInstanceAMI}"
      State: ENABLED

  EcsClusterArnOfCELambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        - arn:aws:iam::aws:policy/AWSBatchFullAccess

  EcsClusterArnOfCELambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: CustomResourceEcsClusterArnOfCE
      Handler: index.lambda_handler
      Runtime: python3.9
      Role: !GetAtt EcsClusterArnOfCELambdaRole.Arn
      MemorySize: 128
      Timeout: 300
      Code:
        ZipFile: |
          import boto3
          import logging
          from time import sleep  # needed by the polling loop below

          logger = logging.getLogger("EcsClusterArnOfCE")
          logger.setLevel(logging.INFO)
          batchClient = boto3.client('batch')

          def lambda_handler(event, context):
            logger.info(event)
            import cfnresponse
            try:
              if event['RequestType'] == 'Delete':
                cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Response': 'Success', 'EcsClusterArn': '' })
                return

              # Following Create or Update
              isWaitForValid = event['ResourceProperties']['WaitForValid']
              isWaitForValid = bool(isWaitForValid) if isWaitForValid else True
              ceName = event['ResourceProperties']['CEName']

              # Poll until the compute environment reports VALID
              while True:
                response = batchClient.describe_compute_environments(
                  computeEnvironments = [ ceName ]
                )
                logger.info(response)
                ce = response['computeEnvironments'][0]
                if not isWaitForValid or ce['status'] == 'VALID':
                  break
                logger.info('wait for status to valid')
                logger.info(ce)
                sleep(5)

              ecsClusterArn = ce['ecsClusterArn']
              if ecsClusterArn:
                cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Response': 'Success', 'EcsClusterArn': ecsClusterArn})
              else:
                logger.error("EcsClusterArn is null")
                cfnresponse.send(event, context, cfnresponse.FAILED, {'Response': 'Failure', 'EcsClusterArn': ''})
            except Exception as e:
              logger.error(e)
              cfnresponse.send(event, context, cfnresponse.FAILED, {'Response': 'Failure', 'EcsClusterArn': ''})

  EcsClusterArnOfCE:
    Type: Custom::EcsClusterArnOfCE
    Properties:
      ServiceToken: !GetAtt EcsClusterArnOfCELambda.Arn
      CEName: !Ref ComputeEnvironment
      WaitForValid: True

  BatchComputeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub "${ServiceName}-batch-launch-template"
      LaunchTemplateData:
        ImageId: !Ref BatchInstanceAMI
        IamInstanceProfile:
          Arn: !GetAtt BatchInstanceProfile.Arn
        InstanceType: t3.micro
        InstanceMarketOptions:
          MarketType: spot
          SpotOptions:
            SpotInstanceType: one-time
        EbsOptimized: True
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            cat <<'EOF' >> /etc/ecs/ecs.config
            ECS_CLUSTER=${EcsClusterArnOfCE.EcsClusterArn}
            EOF

  ASGCompute:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      CapacityRebalance: True
      MinSize: 0
      MaxSize: 5
      NewInstancesProtectedFromScaleIn: False
      LaunchTemplate:
        LaunchTemplateId: !Ref BatchComputeLaunchTemplate
        Version: !GetAtt BatchComputeLaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier:
        - !Ref PublicSubnet
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-batch-asg"
          PropagateAtLaunch: True
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: True

  BatchCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref ASGCompute
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 100
          MaximumScalingStepSize: 10
          MinimumScalingStepSize: 1
          InstanceWarmupPeriod: 60
        ManagedTerminationProtection: DISABLED
        ManagedDraining: ENABLED

  BatchCapacityProviderAssociations:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      CapacityProviders:
        - !Ref BatchCapacityProvider
      Cluster: !GetAtt EcsClusterArnOfCE.EcsClusterArn
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref BatchCapacityProvider
          Weight: 1
          Base: 0

  BatchJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: !Sub "${ServiceName}-job-queue"
      ComputeEnvironmentOrder:
        - ComputeEnvironment: !Ref ComputeEnvironment
          Order: 1
      Priority: 1
      State: ENABLED

  BatchJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      JobDefinitionName: !Sub "${ServiceName}-batch"
      Parameters:
        Param: 'test'
      ContainerProperties:
        Command:
          - echo
          - 'Ref::Param'
        ResourceRequirements:
          - Type: MEMORY
            Value: 256
          - Type: VCPU
            Value: 1
        JobRoleArn: !Ref JobRole
        Image: !Sub "busybox:latest"
      Timeout:
        AttemptDurationSeconds: 3600
      RetryStrategy:
        Attempts: 1

Outputs:
  BatchJobQueue:
    Value: !Ref BatchJobQueue
  BatchJobDefinition:
    Value: !Ref BatchJobDefinition
[Reproduction code (CLI)]:
STACK_NAME=batch-unmanaged-test

# create stack
STACK_ARN=$(aws cloudformation create-stack --stack-name $STACK_NAME --template-body file://`pwd`/batch-stack-template.yaml --capabilities CAPABILITY_NAMED_IAM | jq -r .StackId)

# wait for completion
aws cloudformation wait stack-create-complete --stack-name $STACK_ARN

# read parameters from stack outputs
BATCH_JOB_QUEUE=$(aws cloudformation describe-stacks --stack-name $STACK_ARN | jq -r '.Stacks[0].Outputs[] | select(.OutputKey == "BatchJobQueue") | .OutputValue')
BATCH_JOB_DEFINITION=$(aws cloudformation describe-stacks --stack-name $STACK_ARN | jq -r '.Stacks[0].Outputs[] | select(.OutputKey == "BatchJobDefinition") | .OutputValue')

# submit batch job (the submit succeeds, but the job never runs because the capacity provider does not scale container instances)
aws batch submit-job --job-name batch-submit-test --job-queue $BATCH_JOB_QUEUE --job-definition $BATCH_JOB_DEFINITION

# delete stack
# aws cloudformation delete-stack --stack-name $STACK_NAME
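To confirm that the submitted job really stays in Runnable, its status can be polled. The sketch below is one way to do that in Python; the helper name `wait_for_job`, the timeout values, and the injected `sleep` parameter are all hypothetical, and the client is assumed to be `boto3.client("batch")`. The job ID would come from the `jobId` field of the `submit-job` output.

```python
import time


def wait_for_job(batch_client, job_id, timeout_seconds=600, poll_seconds=15, sleep=time.sleep):
    """Poll a Batch job until it succeeds, fails, or the timeout expires.

    batch_client is assumed to be boto3.client("batch"). Returns the last
    observed status, e.g. "RUNNABLE" if the job never left the queue.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        jobs = batch_client.describe_jobs(jobs=[job_id])["jobs"]
        if not jobs:
            raise ValueError(f"Job not found: {job_id}")
        status = jobs[0]["status"]
        # SUCCEEDED and FAILED are the terminal job states.
        if status in ("SUCCEEDED", "FAILED"):
            return status
        if time.monotonic() >= deadline:
            return status
        sleep(poll_seconds)


# Example call (requires AWS credentials, so it is left commented out):
# import boto3
# print(wait_for_job(boto3.client("batch"), "<job-id>"))
```

With the stack above, I would expect this to return RUNNABLE after the timeout unless the Desired Capacity is raised by hand.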
I am experiencing the same thing. AWS Batch is quite confusing when you try to understand how it interacts with Auto Scaling.