ECS Fargate task running in private subnet can't pull container from private ECR repository


UPDATE: Turns out the subnet ACLs didn't allow inbound traffic on the ephemeral ports. After adding a rule for that it started working. NAT gateway is enough, no need for the VPC endpoints.
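For anyone landing here with the same problem, the inbound rule from the update might look roughly like the sketch below. It only builds the parameters for the EC2 `create_network_acl_entry` call in boto3. The ACL ID, the rule number 200 and the port range 1024-65535 are assumptions for illustration, not values from my setup, so adjust them to your own ACL and to the ephemeral range your clients actually use.

```python
def ephemeral_inbound_rule(network_acl_id, rule_number=200,
                           cidr="0.0.0.0/0", port_from=1024, port_to=65535):
    """Build kwargs for an inbound TCP allow rule on the ephemeral ports.

    All defaults are illustrative assumptions; pick a free rule number
    and a port range that fits your environment.
    """
    return {
        "NetworkAclId": network_acl_id,
        "RuleNumber": rule_number,
        "Protocol": "6",          # IP protocol number 6 = TCP
        "RuleAction": "allow",
        "Egress": False,          # False means this is an inbound rule
        "CidrBlock": cidr,
        "PortRange": {"From": port_from, "To": port_to},
    }


# Hypothetical ACL ID, used only to show the shape of the request.
params = ephemeral_inbound_rule("acl-0123456789abcdef0")
print(params)

# To apply it (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("ec2").create_network_acl_entry(**params)
```

The actual API call is left commented out so the snippet runs without AWS access.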

I've got a private ECR repository and a VPC with a private and a public subnet. The public subnet has an internet gateway and a NAT gateway in it. From the private subnet 0.0.0.0/0 is routed through the NAT gateway. I've got endpoints defined for:

  • com.amazonaws.us-east-1.ecr.api
  • com.amazonaws.us-east-1.ecr.dkr
  • com.amazonaws.us-east-1.logs
  • com.amazonaws.us-east-1.s3 (gateway)

When I don't have the endpoints in the private subnet, this is what I get:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-east-1.amazonaws.com/"

The furthest I can get with pulling the image is when I have the endpoints inside the private subnet, but even then I receive this error:

CannotPullContainerError: ref pull has been retried 5 time(s): failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://prod-us-east-1-starport-layer-bucket.s3.us-east-1.amazonaws.com/

I tried all the variations for the endpoints, public subnet, private subnet, gateway endpoint in both or either.

The NAT policy allows everything. The security group of the task allows All for 0.0.0.0/0 inbound and outbound. The ecsTaskExecutionRole has AmazonECSTaskExecutionRolePolicy and I also tried allowing S3 and ECR list and get actions for all resources. Didn't help.

The only thing I can think of is that the network ACL of the private subnet restricts some inbound traffic that it should allow. It only allows a few ports from specific IP addresses. But then I don't know why moving the endpoints into the private subnet lets me get as far as attempting to pull the image.

Outbound traffic is allowed in each subnet for all traffic, 0.0.0.0/0. It's a Linux image.

I think it should work with either the NAT gateway (maybe not, because of the private ECR) or the VPC endpoints.

Has anyone faced the same issue? Is there something I'm failing to allow in the ACL?

  • Were you able to find a solution? I am facing the same issue.

1 Answer

You should definitely run a test without the NACLs in place to ensure that the network configuration is correct. Then you can try putting back the NACLs to see when things fail.

As a general note (and to try and help with your troubleshooting): NACLs are stateless - so you do need to add the ephemeral ports if you want to use NACLs.
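To illustrate what "stateless" means here, below is a toy model of inbound NACL evaluation, not AWS code. It assumes rules are checked in ascending rule-number order with an implicit deny at the end, and it ignores CIDR and protocol matching to keep things short. The rule numbers, the client port 49152 and the 1024-65535 range are made-up example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int
    port_from: int
    port_to: int
    action: str  # "allow" or "deny"


def evaluate(rules, port):
    """Return the action of the lowest-numbered rule matching the port.

    If no rule matches, the packet falls through to the implicit deny.
    """
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.action
    return "deny"


# Inbound rules that only admit HTTPS, similar to a tightly locked-down subnet.
inbound = [Rule(100, 443, 443, "allow")]

# The task opened a connection from an ephemeral source port (example value).
# The reply comes back addressed to that port, and the NACL has no memory
# of the outbound request, so the reply is judged as a fresh inbound packet.
client_port = 49152
before = evaluate(inbound, client_port)

# Adding an inbound allow rule for the ephemeral range lets the reply through.
inbound.append(Rule(200, 1024, 65535, "allow"))
after = evaluate(inbound, client_port)

print(before, after)
```

A security group would not need the second rule, because it tracks the connection and admits the reply automatically.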

But in this case, I would ask "why use NACL?" - because if most of your traffic is outbound (i.e. initiated from instances/containers in your VPC) from a private subnet then (a) NAT Gateway won't allow traffic to be initiated from the internet to your resources; and (b) security groups (which are stateful) are there to protect your resources.

The advice I normally give customers is: use security groups as much as possible because they are stateful and easy to manage. Use NACLs where you must, but only as a blunt instrument: for example, to stop two networks from communicating with each other completely. Trying to nail down ephemeral ports with NACLs is a lot of hard work for (probably) little benefit. Of course, every situation is different and NACLs are a useful tool, but only when used for the right reasons.

AWS EXPERT, answered a year ago
