How to run AWS Batch jobs that use a managed EC2 compute environment based on a custom AMI?

Expected use case: running batch jobs on EC2 machines that are based on a pre-created AMI. The job is the default ["echo", "hello world"].

Problem: job stuck on RUNNABLE.

Question: are my configurations correct? Are there additional steps I need to take?

A similar question has been posted here: https://repost.aws/questions/QUMDT469UpSnesqNRS4EubXg/aws-batch-job-stuck-in-runnable-state. But since that post has no answers as of yet, I decided to create this one.

I studied several articles to create my AMI and configure the compute environment (the exact commands I used are described below).

To create the AMI, I performed the following steps:

  1. Create an **EC2 instance** with the following setup:
{
    "MaxCount": 1,
    "MinCount": 1,
    "ImageId": "ami-05c13eab67c5d8861",
    "InstanceType": "r4.2xlarge",
    "KeyName": "...",
    "EbsOptimized": true,
    "UserData": ...,
    "BlockDeviceMappings": [
        {
            "DeviceName": "/dev/xvda",
            "Ebs": {
                "Encrypted": false,
                "DeleteOnTermination": true,
                "Iops": 3000,
                "SnapshotId": "snap-05a6245e68e6545b5",
                "VolumeSize": 60,
                "VolumeType": "gp3",
                "Throughput": 125
            }
        }
    ],
    "NetworkInterfaces": [
        {
            "AssociatePublicIpAddress": true,
            "DeviceIndex": 0,
            "Groups": [
                "sg-24b3286f"
            ]
        }
    ],
    "TagSpecifications": [
        {
            "ResourceType": "instance",
            "Tags": [...]
        }
    ],
    "IamInstanceProfile": {
        "Arn": "arn:aws:iam::627610429839:instance-profile/allow_all"
    },
    "PrivateDnsNameOptions": {
        "HostnameType": "ip-name",
        "EnableResourceNameDnsARecord": true,
        "EnableResourceNameDnsAAAARecord": false
    }
}
  2. Perform the following steps on the machine:
  • dnf install necessary packages
  • pip install python packages
  • sudo dnf install -y docker ecs-init
  • sudo systemctl enable docker
  • sudo systemctl enable ecs
  • sudo systemctl stop ecs
  • sudo rm -rf /var/lib/ecs/data/*
  • sudo rm -rf /var/lib/cloud/*
  3. Stop the machine
  4. Select the instance -> Actions -> Image and templates -> Create image
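
For reproducibility, these steps can also be scripted with the AWS CLI; this is a rough sketch, with the instance ID and image name as placeholders:

# Step 1: launch the builder instance from the JSON shown above
aws ec2 run-instances --cli-input-json file://builder-instance.json

# Step 3: stop the instance after the in-instance setup of step 2
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

# Step 4: create the AMI from the stopped instance
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name my-custom-batch-ami \
    --description "AL2023 with docker and ecs-init for AWS Batch"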

My AWS Batch compute environment has the following setup (it's a managed environment using EC2 Spot instances):

{
    "computeResources": {
        "type": "SPOT",
        "instanceTypes": [
            "r4.2xlarge"
        ],
        "minvCpus": 0,
        "desiredvCpus": 0,
        "maxvCpus": 256,
        "allocationStrategy": "SPOT_PRICE_CAPACITY_OPTIMIZED",
        "instanceRole": "arn:aws:iam::627610429839:instance-profile/ecsInstanceRole",
        "ec2KeyPair": ...,
        "ec2Configuration": [
            {
                "imageType": "ECS_AL2023",
                "imageIdOverride": <my AMI's id>
            }
        ],
        "launchTemplate": {},
        "bidPercentage": 55,
        "subnets": [...],
        "securityGroupIds": [...]
    },
    "serviceRole": null,
    "type": "MANAGED",
    "state": "ENABLED",
    "computeEnvironmentName": ...
}
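
For completeness, I created the compute environment from that JSON with the CLI, roughly like this (the file name is just what I used locally):

aws batch create-compute-environment --cli-input-json file://compute-environment.json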

My job definition has the following setup (it runs the default "hello world"):

{
    "containerProperties": {
        "command": [
            "echo",
            "hello world"
        ],
        "image": "public.ecr.aws/amazonlinux/amazonlinux:latest",
        "resourceRequirements": [
            {
                "type": "VCPU",
                "value": "1"
            },
            {
                "type": "MEMORY",
                "value": "2048"
            }
        ],
        "executionRoleArn": "arn:aws:iam::627610429839:role/ecsTaskExecutionRole",
        "jobRoleArn": "arn:aws:iam::627610429839:role/ecsTaskExecutionRole",
        "environment": [],
        "secrets": [],
        "linuxParameters": {
            "tmpfs": [],
            "devices": []
        },
        "mountPoints": [],
        "ulimits": [],
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {},
            "secretOptions": []
        }
    },
    "platformCapabilities": [
        "EC2"
    ],
    "type": "container",
    "jobDefinitionName": ...,
    "timeout": {},
    "retryStrategy": {},
    "parameters": {}
}
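
I registered the job definition and submitted the test job roughly like this (the job name, queue name, and file name are placeholders for my actual values):

# Register the job definition from the JSON above
aws batch register-job-definition --cli-input-json file://job-definition.json

# Submit the hello-world job
aws batch submit-job \
    --job-name hello-world-test \
    --job-queue my-job-queue \
    --job-definition my-job-definition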

Based on the configuration above, when I submit a job, the following happens:

  • The job is stuck on RUNNABLE
  • The EC2 Auto Scaling group spins up an instance
  • A new ECS cluster is created, but no task is scheduled
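
A sketch of how the job state can be confirmed from the CLI (the job ID is a placeholder):

aws batch describe-jobs --jobs <job-id> \
    --query 'jobs[0].{status:status,reason:statusReason}'

In my case, status stays RUNNABLE.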

Things I tried:

I tried to troubleshoot the problem based on this article (https://repost.aws/knowledge-center/batch-job-stuck-runnable-status) and here are my findings:

Making sure that the ecsInstanceRole and ecsTaskExecutionRole IAM roles are configured correctly (they are). They match the roles described in the AWS documentation.
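
For reference, this is roughly how I verified the attached policies (the expected managed policy names are based on the standard ECS setup):

# ecsInstanceRole should have AmazonEC2ContainerServiceforEC2Role attached
aws iam list-attached-role-policies --role-name ecsInstanceRole

# ecsTaskExecutionRole should have AmazonECSTaskExecutionRolePolicy attached
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole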

Making sure that the ECS service runs on boot (it does). This article led me to believe that I needed to take action to make sure the ECS service starts: https://docs.aws.amazon.com/batch/latest/userguide/create-batch-ami.html

  • Scheduling ECS and Docker to start on boot by adding User Data. I later realized that User Data only runs for one-time initialization of the EC2 instance by default, so I tried something else.
  • Creating a @reboot cron job to start both services (a sketch of such an entry appears after this list).
  • Creating an init.d script to start both services on boot.
  • Not starting ecs on boot at all, and instead simply enabling the ecs service before creating the AMI. As I later discovered, AWS Batch only needs the ecs service to be present and enabled; the service is started automatically when EC2 machines are created for the job (please let me know if I misunderstood).
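
A minimal sketch of such a @reboot entry (the file path and exact form here are illustrative, not a verified configuration):

# /etc/cron.d/start-ecs
@reboot root systemctl start docker ecs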

Making sure that the job runs with the Amazon Linux 2023 base AMI (it does). The machine specs are exactly the same, but when I create the compute environment based on the stock AL2023 image, the job runs perfectly fine. So I assume the problem has nothing to do with the machine resource configuration.

Making sure that the /etc/ecs/ecs.config file is created on the EC2 machine that spun up (it isn't). I took a look at the Batch Auto Scaling group's launch template and saw that it has user data that creates /etc/ecs/ecs.config and sets the cluster variables. So I was curious whether this config file is actually being created, since the ECS cluster doesn't seem to be receiving the jobs. Because the job is stuck on RUNNABLE, the EC2 instance stays active, so I SSHed in, checked /etc/ecs/, and found no ecs.config file. At that point I thought it might be a permission issue, so I tried "sudo vim /etc/ecs/ecs.config" and got a permission error. I saw this post, but since I don't even have an ecs.config, I don't think it addresses my issue: https://repost.aws/knowledge-center/ecs-iam-task-roles-config-errors.
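
For anyone reproducing this, the checks on the instance look roughly like this; the introspection call is an extra diagnostic idea, and it only responds if the agent is actually running:

ls /etc/ecs/                                  # no ecs.config present
sudo systemctl status ecs                     # is the agent service running?
curl -s http://localhost:51678/v1/metadata    # ECS agent introspection endpoint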

Please let me know if I need to provide more information.
