Occasional "CannotPullContainerError: context cancelled" error for AWS Batch job

0

I have a container hosted on ECR, and when I try to use it with AWS Batch, I frequently get this container error. Other times it works just fine. Recently, I believe I had a 30 to 40% error rate for the jobs I was running.

How can I diagnose what is happening here? I know that the image address is valid because it works sometimes.

I specify the container like this in the job definition:

{  
    "containerProperties": {
        "image": "public.ecr.aws/s5z5a3q9/parliament2:latest", ... 
    }
}

EDIT: One new thing I noticed is that my ECR repository is in us-east-2 but everything else in my organization is us-east-1. Could that be an issue?

epowell
asked 10 months ago, 349 views
2 Answers
0

Hi, this page lists possible causes of this error: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_cannot_pull_image.html

Given what you describe, I suggest checking whether you are hitting one of the transient error cases it describes, since the pull works most of the time for you: an Amazon ECR endpoint connection issue, Docker Hub rate limiting, etc.
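
One way to narrow this down is to tally failed attempts per container instance. Below is a minimal Python sketch, assuming the job descriptions have the shape returned by the Batch DescribeJobs API (`attempts[].container.reason` and `attempts[].container.containerInstanceArn`). The function name and the sample data are made up for illustration. If failures cluster on a few instances, that would point at those instances' subnets or routes rather than at the image itself.

```python
from collections import Counter

PULL_ERROR = "CannotPullContainerError"


def tally_pull_failures(jobs):
    """Return {instance: (failed_pulls, total_attempts)}.

    `jobs` is assumed to be a list of dicts shaped like the "jobs" list
    in a Batch DescribeJobs response.
    """
    attempts_per_instance = Counter()
    failures_per_instance = Counter()
    for job in jobs:
        for attempt in job.get("attempts", []):
            container = attempt.get("container", {})
            instance = container.get("containerInstanceArn", "unknown")
            attempts_per_instance[instance] += 1
            # "reason" may be absent on attempts that ran normally.
            if PULL_ERROR in (container.get("reason") or ""):
                failures_per_instance[instance] += 1
    return {
        instance: (failures_per_instance[instance], total)
        for instance, total in attempts_per_instance.items()
    }


# Hypothetical sample data, not real output from the asker's account.
sample_jobs = [
    {"attempts": [{"container": {
        "containerInstanceArn": "instance-a",
        "reason": "CannotPullContainerError: context canceled"}}]},
    {"attempts": [{"container": {
        "containerInstanceArn": "instance-a",
        "reason": "CannotPullContainerError: context canceled"}}]},
    {"attempts": [{"container": {"containerInstanceArn": "instance-b"}}]},
]

for instance, (failed, total) in sorted(tally_pull_failures(sample_jobs).items()):
    print(f"{instance}: {failed}/{total} attempts failed to pull")
```

In practice the `jobs` list would come from the Batch API or CLI rather than from hand-written data.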

AWS
EXPERT
answered 10 months ago
EXPERT
reviewed 10 months ago
  • One new thing I noticed is that my ECR repository is in us-east-2 but everything else in my organization is us-east-1. Could that be an issue?

    =======

    Thank you for your comment, @Didier_AWS. Yes, I saw that page. My full error is "CannotPullContainerError: Context canceled", and for that it says:

    "The common cause for this error is because the VPC your task is using doesn't have a route to pull the container image from Amazon ECR."

    I don't understand what that is implying. Is "doesn't have a route" a transient issue? It doesn't sound like it.

0

In addition to Thomas's comments, you can also check the following links for possible solutions:

https://stackoverflow.com/questions/56914290/cannotpullcontainererror-aws-batch-job

There might be a problem with how you are specifying the image in the job definition. Instead of just the image name, you should be using the full repository URI. You can find this URI under ECR > Repositories in the AWS console. Ensure that you are using the full image URI, including the repository host (e.g., 0123456789.dkr.ecr.us-east-1.amazonaws.com/dockerimagename).
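
To make the URI point concrete, here is a small Python sketch that classifies an image reference by its registry host. It assumes the usual private ECR host format `<account-id>.dkr.ecr.<region>.amazonaws.com` (with a 12-digit account ID) and the `public.ecr.aws` host for ECR Public. The function name and the account ID `123456789012` are placeholders for illustration.

```python
import re

# Private ECR registry host: <12-digit account id>.dkr.ecr.<region>.amazonaws.com
PRIVATE_ECR = re.compile(
    r"^(\d{12})\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com(?:\.cn)?/"
)


def classify_image(image):
    """Return a short description of which registry an image reference uses."""
    match = PRIVATE_ECR.match(image)
    if match:
        account, region = match.groups()
        return f"private ECR (account {account}, region {region})"
    if image.startswith("public.ecr.aws/"):
        return "ECR Public"
    first = image.split("/", 1)[0]
    # Docker treats the first path component as a registry host only if it
    # contains a dot or a colon, or is exactly "localhost".
    if "/" in image and ("." in first or ":" in first or first == "localhost"):
        return f"other registry ({first})"
    return "no registry host (Docker Hub by default)"


for ref in (
    "public.ecr.aws/s5z5a3q9/parliament2:latest",  # image from the question
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/dockerimagename",
    "parliament2:latest",
):
    print(ref, "->", classify_image(ref))
```

By this check, the image in the question already carries a registry host (`public.ecr.aws`), so a bare image name would only be the culprit if the job definition differed from what was posted.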

https://www.reddit.com/r/aws/comments/hm00iv/cannotpullcontainererror_context_cancelled_from/

If the networking setup is correct, it's possible that there might be a problem with the IAM permissions. There might be certain permissions required that are not immediately obvious. Review your IAM permissions and roles to make sure that the role assigned to the task has the necessary permissions to pull from the ECR repository.
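
For a private ECR repository, pulling an image typically requires `ecr:GetAuthorizationToken`, `ecr:BatchCheckLayerAvailability`, `ecr:GetDownloadUrlForLayer`, and `ecr:BatchGetImage`. The Python sketch below checks a policy document for those actions. It is deliberately rough: it only looks at `Allow` statements, matches literal names and wildcards, and ignores resources, conditions, and explicit denies. The function name and the sample policy are hypothetical.

```python
from fnmatch import fnmatchcase

REQUIRED_PULL_ACTIONS = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
]


def missing_pull_actions(policy):
    """List the required pull actions that no Allow statement covers."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lowercase.
        patterns.extend(action.lower() for action in actions)
    return [
        action
        for action in REQUIRED_PULL_ACTIONS
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    ]


# Hypothetical policy used only to exercise the function.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ecr:GetAuthorizationToken", "ecr:BatchGet*"],
            "Resource": "*",
        }
    ],
}

print(missing_pull_actions(sample_policy))
```

A missing permission would normally fail every pull, not 30 to 40% of them, so this check mostly serves to rule IAM out.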

About your ECR repository being in a different region than the rest of your services (us-east-2 vs us-east-1): Amazon ECR is a regional service, and it's designed to let you deploy images flexibly. However, if you pull images from a different AWS Region than where your Docker cluster runs, you might experience additional latency and data transfer costs. Ideally, for the best performance, you should push/pull images in the same AWS Region where your Docker cluster runs. So, it's possible that having your ECR repository in a different region could be contributing to the issues you're experiencing.
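
A quick way to spot a cross-Region pull is to compare the Region embedded in a private ECR image URI with the Region of the compute environment. This is a minimal Python sketch, assuming the private registry host format `<account-id>.dkr.ecr.<region>.amazonaws.com`; the function name and the account ID are placeholders. It returns `None` for hosts such as `public.ecr.aws` that do not carry a Region in the URI.

```python
import re

# Captures the Region from a private ECR registry host.
ECR_HOST = re.compile(
    r"^\d{12}\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com(?:\.cn)?/"
)


def is_cross_region_pull(image, compute_region):
    """True/False for private ECR URIs, None when the URI has no Region."""
    match = ECR_HOST.match(image)
    if match is None:
        return None
    return match.group(1) != compute_region


# Placeholder URI: repository in us-east-2, compute in us-east-1.
print(is_cross_region_pull(
    "123456789012.dkr.ecr.us-east-2.amazonaws.com/parliament2:latest",
    "us-east-1",
))
```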

EXPERT
answered 10 months ago
  • I saw that SO link and deemed it irrelevant because the docs say I can specify the image as I am doing and it works sometimes. I don't think it could cause the occasional issue I'm experiencing.

    Neither do I think it is permission related.

    Thank you for your input on the regions.
