Based on the information provided, it seems you've taken many of the correct steps to set up your ECS tasks with Fargate to pull images from a private ECR repository. However, there are a few potential issues and considerations to address:
- Network Configuration: For Fargate tasks in a public subnet to pull container images, you need to assign a public IP address to the task's elastic network interface. You've set assignPublicIp to True in your service configuration, which is correct.
- VPC Endpoints: You've set up the necessary VPC endpoints for ECR and ECS, which is good. Make sure these endpoints are associated with the correct route tables for your subnets.
- Security Group Configuration: While you've set up your security group to allow all traffic, it's worth double-checking that the security group associated with your VPC endpoints allows inbound traffic from the security group associated with your Fargate tasks.
- Task Execution Role: You've correctly given the necessary ECR permissions to your task execution role.
- Subnet Configuration: Ensure that your subnets have a route to the internet via an Internet Gateway (for public subnets) or a NAT Gateway (for private subnets). Even with VPC endpoints, some internet connectivity may still be required.
- DNS Resolution: Verify that your VPC has DNS hostnames and DNS resolution enabled, which you've mentioned you've done.
- ECR Repository Policy: Check your ECR repository policy to ensure it allows access from your task execution role.
- Fargate Platform Version: If you're not already, consider using the latest Fargate platform version, as it may include improvements in networking and image pulling.
- Image Size and Task Resources: Ensure your task has sufficient CPU and memory allocated to pull and run your container image. Fargate doesn't cache images, so the entire image is pulled each time a task runs.
- Region Consistency: You've mentioned everything is in the same region (us-east-1), which is correct and eliminates cross-region issues.
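As a rough illustration of the security group point, the sketch below lets the endpoint security group accept HTTPS only from the task security group. It assumes the tasks and the endpoints use two separate groups and AWS provider v5; the resource names and the `vpc_id` variable are placeholders, not taken from the question.

```hcl
variable "vpc_id" {
  type        = string
  description = "ID of the VPC that hosts the tasks and the endpoints (placeholder)"
}

# Security group attached to the Fargate tasks (hypothetical name).
resource "aws_security_group" "tasks" {
  name   = "example-fargate-tasks"
  vpc_id = var.vpc_id
}

# Security group attached to the interface VPC endpoints (hypothetical name).
resource "aws_security_group" "endpoints" {
  name   = "example-vpc-endpoints"
  vpc_id = var.vpc_id
}

# Interface endpoints are reached over HTTPS, so allow TCP 443 from the tasks.
resource "aws_vpc_security_group_ingress_rule" "endpoints_from_tasks" {
  security_group_id            = aws_security_group.endpoints.id
  referenced_security_group_id = aws_security_group.tasks.id
  from_port                    = 443
  to_port                      = 443
  ip_protocol                  = "tcp"
}

# Terraform-managed groups start without the default egress rule,
# so the tasks need explicit outbound HTTPS to reach the endpoints.
resource "aws_vpc_security_group_egress_rule" "tasks_https_out" {
  security_group_id = aws_security_group.tasks.id
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}
```

If a single shared group is used for both, the same idea applies with the group referencing itself.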
If you've verified all these points and are still experiencing issues, you may want to check the CloudWatch logs for your tasks to see if there are any more detailed error messages. Also, consider temporarily enabling VPC flow logs to analyze the network traffic and identify any potential connectivity issues.
If the problem persists, you might want to try creating a test task with a public ECR image to isolate whether the issue is specific to your private ECR setup or if it's a more general networking problem.
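If you manage the VPC with Terraform, temporary flow logs could be enabled along these lines. This is only a sketch: it assumes an S3 destination is acceptable, and the bucket prefix and the `vpc_id` variable are placeholders.

```hcl
variable "vpc_id" {
  type        = string
  description = "ID of the VPC to inspect (placeholder)"
}

# Throwaway bucket for the flow log files; force_destroy makes cleanup easy.
resource "aws_s3_bucket" "flow_logs" {
  bucket_prefix = "example-vpc-flow-logs-"
  force_destroy = true
}

# Capture accepted and rejected traffic for every network interface in the VPC.
resource "aws_flow_log" "debug" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = aws_s3_bucket.flow_logs.arn
}
```

Remove both resources once the investigation is done, since flow logs are billed by volume.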
- All ECS Services / Tasks, VPCs, and VPC endpoints are set up to use the same security group currently.
- We are using the latest version of Fargate
- The images are using Google's "Distroless", and are running rust, and don't do much right now. The image size is small, and so is the resource usage.
- The docs EXPLICITLY SAY that an Internet Gateway / NAT Gateway is not required for ECR.
- The ECR Policy allows both the Task and Execution roles full access to ECR resources.
- The security group allows all inbound and outbound traffic (for now).
Hello.
Judging from the error message, it appears that the connection to "com.amazonaws.us-east-1.ecr.dkr" failed.
Please check that the security group associated with the VPC endpoint allows HTTPS and that the NACL set for the subnet is not set to deny any communication.
<ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com/<REGISTRY>:latest: dial tcp <IP>: i/o timeout
For testing purposes, launch an EC2 instance in the same VPC as the Fargate tasks, connect to it, and use the dig command to check whether the ECR hostnames resolve to the private IP addresses of the VPC endpoint.
If the command returns public IP addresses instead, there may be an error in the VPC endpoint settings.
dig dkr.ecr.us-east-1.amazonaws.com
dig <ACCOUNT_ID>.dkr.ecr.us-east-1.amazonaws.com
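To have something to compare the dig output against, the endpoint's own settings can be read back with Terraform. A minimal sketch, assuming the endpoint already exists in the VPC; the `vpc_id` variable and the output names are placeholders.

```hcl
variable "vpc_id" {
  type        = string
  description = "ID of the VPC that contains the endpoint (placeholder)"
}

# Look up the existing ecr.dkr interface endpoint in that VPC.
data "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id       = var.vpc_id
  service_name = "com.amazonaws.us-east-1.ecr.dkr"
}

# If this is false, the regular ECR hostname keeps resolving to public IPs.
output "ecr_dkr_private_dns_enabled" {
  value = data.aws_vpc_endpoint.ecr_dkr.private_dns_enabled
}

# DNS names that belong to the endpoint itself.
output "ecr_dkr_dns_names" {
  value = [for entry in data.aws_vpc_endpoint.ecr_dkr.dns_entry : entry.dns_name]
}
```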
After re-reading the AWS PrivateLink post many times, and finding others' questions, it turns out the issues were just due to missing the route tables for the S3 gateway.
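For readers whose subnets use their own route table rather than the VPC's main one, the same fix might look like the sketch below. It is a standalone example with hypothetical names and CIDR blocks, not an excerpt of our config.

```hcl
resource "aws_vpc" "example" {
  cidr_block           = "10.0.0.0/24"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.example.id
  cidr_block = "10.0.0.0/26"
}

# Explicit route table for the private subnet.
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.example.id
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}

# The gateway endpoint only takes effect for route tables listed here.
# Every route table used by a task subnet has to appear in this list.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.example.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}
```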
For anyone that is interested, I've attached a Terraform File that demonstrates our final config, with confidential stuff removed.
# See https://search.opentofu.org/provider/opentofu/aws/latest
variable "aws_region" {
default = "us-east-1"
}
terraform {
required_providers {
aws = {
source = "hashicorp/aws",
version = "~> 5.0"
}
}
}
provider "aws" {
region = var.aws_region
}
###############################################################################
## VPC
###############################################################################
# See https://search.opentofu.org/provider/opentofu/aws/latest/docs/resources/vpc
resource "aws_vpc" "my_vpc" {
# This can be any valid cidr block
cidr_block = "10.0.0.0/24"
# For some services, like ECR / DKR, we need DNS names
enable_dns_support = true
enable_dns_hostnames = true
}
###############################################################################
## VPC Subnets
###############################################################################
# See https://search.opentofu.org/provider/opentofu/aws/latest/docs/resources/subnet
# Private Subnets
# Here we set up multiple subnets, for availability zones a-c.
# You don't have to do this, but you will need at least 1.
resource "aws_subnet" "my_subnets_private" {
for_each = {
a = "10.0.0.0/26"
b = "10.0.0.64/26"
c = "10.0.0.128/26"
}
vpc_id = aws_vpc.my_vpc.id
cidr_block = each.value
availability_zone = "${var.aws_region}${each.key}"
}
# Public Subnets
# A public subnet is how we will receive inbound traffic.
# Even if you have a public IP assigned, you will not receive traffic
# unless you have Security Groups and Subnets set up correctly.
# Using separate subnets for public and private allows us to isolate
# private from public traffic.
resource "aws_subnet" "my_subnets_public" {
vpc_id = aws_vpc.my_vpc.id
cidr_block = "10.0.0.192/26"
availability_zone = "${var.aws_region}d"
}
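# Sketch (assumption, not part of the config as posted): a subnet only behaves
# as "public" once its route table sends 0.0.0.0/0 to an Internet Gateway.
# The resource names below are hypothetical. Note that the S3 gateway endpoint
# further down is attached to the main route table only, so tasks placed in
# this subnet would reach S3 through the Internet Gateway instead.
resource "aws_internet_gateway" "my_internet_gateway" {
  vpc_id = aws_vpc.my_vpc.id
}
resource "aws_route_table" "my_route_table_public" {
  vpc_id = aws_vpc.my_vpc.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.my_internet_gateway.id
  }
}
resource "aws_route_table_association" "my_route_table_public" {
  subnet_id      = aws_subnet.my_subnets_public.id
  route_table_id = aws_route_table.my_route_table_public.id
}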
###############################################################################
## VPC Security Group + Rules
###############################################################################
# See https://search.opentofu.org/provider/opentofu/aws/latest/docs/resources/security_group
# See https://search.opentofu.org/provider/terraform-providers/aws/latest/docs/resources/vpc_security_group_egress_rule
# See https://search.opentofu.org/provider/terraform-providers/aws/latest/docs/resources/vpc_security_group_ingress_rule
resource "aws_security_group" "my_security_group_private" {
name = "my_security_group_private"
description = "Allows all requests on the private vpc"
vpc_id = aws_vpc.my_vpc.id
}
resource "aws_vpc_security_group_egress_rule" "private_allow_outbound_ipv4" {
security_group_id = aws_security_group.my_security_group_private.id
cidr_ipv4 = "0.0.0.0/0"
ip_protocol = "-1"
}
resource "aws_vpc_security_group_ingress_rule" "private_allow_inbound_all" {
security_group_id = aws_security_group.my_security_group_private.id
cidr_ipv4 = "0.0.0.0/0"
ip_protocol = "-1"
}
resource "aws_security_group" "my_security_group_public" {
name = "my_security_group_public"
description = "Allows public HTTP + HTTPS requests on the public vpc"
vpc_id = aws_vpc.my_vpc.id
}
resource "aws_vpc_security_group_ingress_rule" "public_allow_inbound_http" {
security_group_id = aws_security_group.my_security_group_public.id
cidr_ipv4 = "0.0.0.0/0"
from_port = 80
to_port = 80
ip_protocol = "tcp"
}
resource "aws_vpc_security_group_ingress_rule" "public_allow_inbound_https" {
security_group_id = aws_security_group.my_security_group_public.id
cidr_ipv4 = "0.0.0.0/0"
from_port = 443
to_port = 443
ip_protocol = "tcp"
}
###############################################################################
## Endpoints
###############################################################################
# See https://search.opentofu.org/provider/opentofu/aws/latest/docs/resources/vpc_endpoint
# See https://search.opentofu.org/provider/opentofu/aws/latest/docs/resources/vpc_endpoint_policy
# See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/vpc-endpoints.html
# See https://aws.amazon.com/blogs/compute/setting-up-aws-privatelink-for-amazon-ecs-and-amazon-ecr/
# See https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch-logs-and-interface-VPC.html
# Endpoint Policy Definitions
# You do not *have* to write your own policies.
# By default, if you don't add a policy, then AWS will attach an implicit policy
# that allows all access
locals {
my_endpoint_policies = {
ecr = <<JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ECRAccess",
"Effect": "Allow",
"Principal": "*",
"Resource": "*",
"Action": [
"ecr:*"
]
}
]
}
JSON
ecs = <<JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ECSAccess",
"Effect": "Allow",
"Principal": "*",
"Resource": "*",
"Action": [
"ecs:*"
]
}
]
}
JSON
s3 = <<JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3Access",
"Effect": "Allow",
"Principal": "*",
"Resource": "*",
"Action": [
"s3:*"
]
}
]
}
JSON
logs = <<JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "LogsAccess",
"Effect": "Allow",
"Principal": "*",
"Resource": "*",
"Action": [
"logs:*"
]
}
]
}
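JSON
# Assumption: the "elasticache" endpoint below references this key, which was
# missing from the posted config. It mirrors the other allow-all policies.
elasticache = <<JSON
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ElastiCacheAccess",
"Effect": "Allow",
"Principal": "*",
"Resource": "*",
"Action": [
"elasticache:*"
]
}
]
}
JSON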
}
}
locals {
endpoints = {
# NOTE: For the ECR and S3 registries, you'll need to use the region for those
# services, which may not be the same as the deployment region of the app.
"com.amazonaws.us-east-1.ecr.dkr" = {
# !!!! THIS IS IMPORTANT !!!!
# You must enable private dns for the ECR endpoint, otherwise requests
# will fail, and you'll get timeout errors when trying to deploy your service.
private_dns_enabled = true
endpoint_type = "Interface"
policy = local.my_endpoint_policies.ecr
}
"com.amazonaws.us-east-1.ecr.api" = {
private_dns_enabled = true
endpoint_type = "Interface"
policy = local.my_endpoint_policies.ecr
}
# S3 is needed, because that's how ECR stores image layers.
"com.amazonaws.us-east-1.s3" = {
private_dns_enabled = false
endpoint_type = "Gateway"
policy = local.my_endpoint_policies.s3
}
# ECS and CloudWatch logs are specific to deployment region
"com.amazonaws.${var.aws_region}.ecs-agent" = {
private_dns_enabled = false
endpoint_type = "Interface"
policy = local.my_endpoint_policies.ecs
}
"com.amazonaws.${var.aws_region}.ecs-telemetry" = {
private_dns_enabled = false
endpoint_type = "Interface"
policy = local.my_endpoint_policies.ecs
}
"com.amazonaws.${var.aws_region}.ecs" = {
private_dns_enabled = false
endpoint_type = "Interface"
policy = local.my_endpoint_policies.ecs
}
"com.amazonaws.${var.aws_region}.logs" = {
# See the comments on the DKR endpoint
private_dns_enabled = true
endpoint_type = "Interface"
policy = local.my_endpoint_policies.logs
},
"com.amazonaws.${var.aws_region}.elasticache" = {
private_dns_enabled = false
endpoint_type = "Interface"
policy = local.my_endpoint_policies.elasticache
}
}
}
# Create an endpoint and a policy for each of the endpoints we defined above.
resource "aws_vpc_endpoint" "my_interfaces" {
for_each = local.endpoints
vpc_id = aws_vpc.my_vpc.id
service_name = each.key
vpc_endpoint_type = each.value.endpoint_type
private_dns_enabled = each.value.private_dns_enabled
subnet_ids = each.value.endpoint_type != "Gateway" ? [for subnet in aws_subnet.my_subnets_private: subnet.id] : []
security_group_ids = each.value.endpoint_type == "Interface" ? [aws_security_group.my_security_group_private.id] : []
# !!!! THIS IS IMPORTANT !!!!
# For any Gateway endpoints (S3 is our only one currently), you must add a route table.
# If you don't, then requests will fail with timeouts.
# In my experience, the errors aren't obvious, and will suggest it's something with ecr.api or ecr.dkr,
# even though it's S3 that's broken
route_table_ids = each.value.endpoint_type == "Gateway" ? [aws_vpc.my_vpc.main_route_table_id] : []
}
resource "aws_vpc_endpoint_policy" "my_endpoint_policies" {
for_each = local.endpoints
vpc_endpoint_id = aws_vpc_endpoint.my_interfaces[each.key].id
policy = each.value.policy
}
###############################################################################
## Logs
###############################################################################
# See https://search.opentofu.org/provider/terraform-providers/aws/v5.98.0/docs/datasources/cloudwatch_log_groups
resource "aws_cloudwatch_log_group" "my_log_group" {
name = "my-log-group"
}
###############################################################################
# ECS
###############################################################################
# NOTE: About CPU / Memory
# AWS uses "CPU units", which are 1 / 1024 of a vCPU.
# E.G. 1 vCPU = 1024 units.
#
# AWS uses MiB for memory.
# E.G. 2 GiB = 2048 MiB
#
# Change these as necessary
# See https://search.opentofu.org/provider/terraform-providers/aws/latest/docs/resources/ecs_task_definition
resource "aws_ecs_task_definition" "my_ecs_task_definition" {
family = "my-task-family"
cpu = "1024"
memory = "2048"
# Add your Task and Execution roles here.
# We don't define these in this Terraform file, but they are fairly easy to set up.
# I had to make sure both had access to pull images from ECR and to push logs to CloudWatch.
# !!!! THIS IS IMPORTANT !!!!
# If you're using a private ECR Registry, you need to make sure
# your Task Role and Execution Role have access to it.
# You can do this by going to the AWS Console > ECR > Registry > Permissions, and attaching
# an appropriate policy.
execution_role_arn = "arn:aws:iam::123456789012:role/my-execution-role"
task_role_arn = "arn:aws:iam::123456789012:role/my-task-role"
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
runtime_platform {
cpu_architecture = "X86_64"
operating_system_family = "LINUX"
}
container_definitions = jsonencode([
{
name = "my-container-name"
image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-image:image-tag"
portMappings = [
{
containerPort = 80
}
]
essential = true
# If you're using FARGATE, then you need some log provider in order to actually
# view logs. You can use things like splunk, but we'll just use awslogs
logConfiguration = {
logDriver = "awslogs"
options = {
awslogs-group = "my-log-group"
awslogs-region = var.aws_region
awslogs-stream-prefix = "my-server"
}
}
}
])
}
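# Sketch (assumption, not part of the config as posted): the repository
# permissions mentioned in the task definition comments can also be managed
# here instead of through the console. "docker-image" mirrors the repository
# name in the image URI above, and the resource names are hypothetical.
# ecr:GetAuthorizationToken is not repository-scoped, so it still has to be
# granted through the roles' own IAM policies.
data "aws_iam_policy_document" "my_ecr_pull" {
  statement {
    sid    = "AllowPullFromEcsRoles"
    effect = "Allow"
    principals {
      type = "AWS"
      identifiers = [
        aws_ecs_task_definition.my_ecs_task_definition.execution_role_arn,
        aws_ecs_task_definition.my_ecs_task_definition.task_role_arn,
      ]
    }
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:BatchGetImage",
      "ecr:GetDownloadUrlForLayer",
    ]
  }
}
resource "aws_ecr_repository_policy" "my_ecr_pull" {
  repository = "docker-image"
  policy     = data.aws_iam_policy_document.my_ecr_pull.json
}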
# See https://search.opentofu.org/provider/terraform-providers/aws/latest/docs/resources/ecs_cluster
resource "aws_ecs_cluster" "my_ecs_cluster" {
name = "my-cluster"
}
# See https://search.opentofu.org/provider/terraform-providers/aws/latest/docs/resources/ecs_service
resource "aws_ecs_service" "my_ecs_service" {
name = "my-service"
desired_count = 1
cluster = aws_ecs_cluster.my_ecs_cluster.id
task_definition = aws_ecs_task_definition.my_ecs_task_definition.id
launch_type = "FARGATE"
platform_version = "LATEST"
network_configuration {
assign_public_ip = true
subnets = concat(
[for subnet in aws_subnet.my_subnets_private: subnet.id],
[aws_subnet.my_subnets_public.id]
)
security_groups = [
aws_security_group.my_security_group_private.id,
aws_security_group.my_security_group_public.id
]
}
depends_on = [
aws_cloudwatch_log_group.my_log_group
]
}
