ResourceInitializationError when running a job in AWS Batch

I've created a Docker image, pushed it to a private ECR repository, and configured an AWS Batch cluster, queue, and job definition. When I submit a job, it immediately goes to the STARTING state and then fails with:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval
failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s):
RequestError: send request failed caused by: Post https://api.ecr.us-west-2.amazonaws.com/: dial
tcp 54.240.255.116:443: i/o timeout

This seems to be a problem with the container image not being pulled. My cluster has the following specs:

  • Fargate provision model
  • Default VPC
  • Default security group (allows all outbound traffic, but only inbound from the default SG)
  • Default subnets (4 subnets with a route to an internet gateway and a single ACL rule allowing all traffic)

The job definition has an execution role with the managed policy AmazonECSTaskExecutionRolePolicy and has the "Public IP" option disabled.

The network configuration seems to be enough to pull images from the internet, but I'm still getting the timeout error. Also, the IAM Role seems to have the relevant policies to authenticate with my private ECR. Can someone help me debug this?

1 Answer

I ran into the same problem while working on the AWS Financial Industry Quest: Grid computing for capital markets. My research pointed me to checking network connectivity and VPC endpoints, but those should not be a problem when working in an AWS-built and managed console. So weird.
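
For anyone who wants to run the network check mentioned above, here is a rough sketch in Python. It is not an official AWS tool, and the function name `diagnose_ecr_reachability` and its boolean parameters are made up for illustration. It relies on one assumption that I believe holds for Fargate: a task in a subnet whose default route is an internet gateway can only reach the ECR API if it has a public IP; otherwise it needs a NAT gateway route or ECR VPC endpoints. The dictionary shape follows the `containerProperties.networkConfiguration.assignPublicIp` field of a Batch job definition.

```python
def diagnose_ecr_reachability(job_definition,
                              subnet_has_igw_route,
                              subnet_has_nat_route=False,
                              has_ecr_vpc_endpoints=False):
    """Return (reachable, reason) for a Fargate job's path to ECR.

    Hypothetical helper: the three boolean arguments describe the subnet
    and VPC and must be supplied by the caller; nothing is looked up here.
    """
    container = job_definition.get("containerProperties", {})
    network = container.get("networkConfiguration", {})
    # Batch treats a missing assignPublicIp as DISABLED.
    assign_public_ip = network.get("assignPublicIp", "DISABLED")

    if has_ecr_vpc_endpoints:
        return True, "ECR is reachable privately through VPC endpoints"
    if subnet_has_nat_route:
        return True, "outbound traffic can leave through a NAT gateway"
    if subnet_has_igw_route and assign_public_ip == "ENABLED":
        return True, "task has a public IP and an internet gateway route"
    if subnet_has_igw_route:
        return False, ("internet gateway route exists, but the task has no "
                       "public IP, so requests to ECR will time out")
    return False, "no route from the subnet to ECR was found"


# The setup described in the question: default subnets with an internet
# gateway route, and the "Public IP" option disabled in the job definition.
question_job_definition = {
    "containerProperties": {
        "networkConfiguration": {"assignPublicIp": "DISABLED"}
    }
}

reachable, reason = diagnose_ecr_reachability(
    question_job_definition, subnet_has_igw_route=True
)
print(reachable, reason)
```

With the values from the question the check reports that ECR is not reachable, which would be consistent with the `i/o timeout` in the error. A real job definition in the same shape can be fetched with boto3's `describe_job_definitions` call on the Batch client and passed in instead of the hand-written dictionary.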

SST
answered 3 months ago
