EC2 user-data installs NVIDIA drivers successfully but ECS agent update only works when run manually


I’m launching an ECS-optimized Amazon Linux 2 instance with this user-data script. The NVIDIA driver and runtime install without error at boot, but my ECS agent update never takes effect until I SSH in and run those commands by hand.

#!/bin/bash -xe

# --- NVIDIA setup (works on boot) ---
yum install -y wget gcc kernel-devel-$(uname -r) dkms

wget https://us.download.nvidia.com/tesla/535.154.05/NVIDIA-Linux-x86_64-535.154.05.run
chmod +x NVIDIA-Linux-x86_64-535.154.05.run
./NVIDIA-Linux-x86_64-535.154.05.run --no-opengl-files -s

# add NVIDIA Docker repo & install toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo \
  | tee /etc/yum.repos.d/nvidia-docker.repo
yum clean expire-cache
yum install -y nvidia-container-runtime nvidia-container-toolkit

# configure ECS for GPU & restart
echo "ECS_ENABLE_GPU_SUPPORT=true" | tee -a /etc/ecs/ecs.config
systemctl restart ecs

# configure Docker default runtime
cat > /etc/docker/daemon.json <<EOF
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
systemctl restart docker

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# --- ECS agent update (failing at boot) ---
docker pull amazon/amazon-ecs-agent:latest

mkdir -p /var/cache/ecs
docker save amazon/amazon-ecs-agent:latest \
  -o /var/cache/ecs/ecs-agent.tar

systemctl stop ecs || true
docker rm -f ecs-agent 2>/dev/null || true
rm -rf /var/lib/ecs/data

systemctl daemon-reload
systemctl enable --now ecs

sleep 5
curl -s http://localhost:51678/v1/metadata \
  || echo "Metadata endpoint unavailable."

If I SSH in and paste these exact commands by hand, they work perfectly: the agent is pulled, saved, and restarted, and the metadata endpoint responds. But when EC2 runs the same script at launch (judging by /var/log/cloud-init-output.log), the agent is never updated and the metadata endpoint never becomes available. What's different about user-data execution, and how can I make this same sequence of commands succeed reliably at boot?

asked 19 days ago · 56 views
1 Answer

The issue is most likely due to timing and environment differences between user-data execution at boot and manual execution after SSH login. Here are a few reasons this can happen, followed by changes that make the ECS agent update reliable during instance launch:

  1. Service dependencies: At boot time, some required services might not be fully initialized when your script runs. The ECS agent and Docker service may need more time to start up properly.

  2. Network availability: The instance might not have full network connectivity when the script runs at boot, which could affect operations that require internet access.

  3. Environment variables: The user-data script runs as root in a minimal, non-interactive environment (no login profile, a reduced PATH) compared to your SSH session, which can change how some commands behave. The sketch below shows a quick way to capture the boot-time environment for comparison.
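
To see exactly what differs, you can dump the boot-time environment from user data and diff it against your SSH session. This is only a diagnostic sketch; the log path is arbitrary:

    # dump the non-interactive boot environment for later comparison
    {
      echo "=== env ===";  env
      echo "=== id ===";   id
      echo "=== PATH ==="; echo "$PATH"
    } > /var/log/user-data-env.log 2>&1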

To address these issues and make your ECS agent update more reliable during instance launch, try the following modifications:

  1. Add a delay before attempting to update the ECS agent. This gives time for other services to initialize:

    sleep 30  # crude but effective; gives dependent services time to finish initializing
    
  2. Stop (and temporarily disable) the ECS service through systemd instead of directly manipulating the container; it is re-enabled in step 4:

    systemctl stop ecs
    systemctl disable ecs
    
  3. Ensure Docker is running before attempting to pull or save images:

    systemctl is-active --quiet docker || systemctl start docker
    
  4. After updating the agent, use systemd to start and enable the ECS service:

    systemctl enable ecs
    systemctl start ecs
    
  5. Add more robust error handling and logging. Because your script runs with bash -xe, the first command that fails aborts everything after it, so explicit checks make the failure point obvious in /var/log/cloud-init-output.log:

    if ! docker pull amazon/amazon-ecs-agent:latest; then
      echo "Failed to pull latest ECS agent image" >&2
      exit 1
    fi
    
  6. Consider using cloud-init's runcmd module instead of a bare shell script for better integration with the instance launch process (see the cloud-config sketch after this list).

  7. Add a final check and retry mechanism:

    for i in {1..5}; do
      if curl -s http://localhost:51678/v1/metadata > /dev/null; then  # only the exit status matters here
        echo "ECS agent updated successfully"
        break
      else
        echo "Waiting for ECS agent to become available... (attempt $i)"
        sleep 30
      fi
    done
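
For item 6, here is a minimal cloud-config sketch, with the commands abbreviated from your script (runcmd entries run once, at first boot):

    #cloud-config
    runcmd:
      - systemctl is-active --quiet docker || systemctl start docker
      - docker pull amazon/amazon-ecs-agent:latest
      - systemctl stop ecs
      - rm -rf /var/lib/ecs/data
      - systemctl enable --now ecs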
    

By implementing these changes, you should see more consistent results when updating the ECS agent during instance launch via user-data. Remember to test thoroughly, as the exact behavior can vary depending on instance type and other factors.
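
Putting the pieces together, the agent-update tail of your user data might look something like this. This is only a sketch combining the suggestions above, keeping your original image name, container name, and paths:

    # make sure Docker is up before touching images (item 3)
    until systemctl is-active --quiet docker; do sleep 5; done

    # pull the new agent, failing loudly if the network isn't ready (item 5)
    if ! docker pull amazon/amazon-ecs-agent:latest; then
      echo "Failed to pull latest ECS agent image" >&2
      exit 1
    fi

    # swap the agent via systemd rather than by hand (items 2 and 4)
    systemctl stop ecs
    docker rm -f ecs-agent 2>/dev/null || true
    rm -rf /var/lib/ecs/data
    systemctl enable --now ecs

    # confirm the new agent is serving its metadata endpoint (item 7)
    for i in {1..5}; do
      if curl -s http://localhost:51678/v1/metadata > /dev/null; then
        echo "ECS agent updated successfully"
        break
      fi
      echo "Waiting for ECS agent... (attempt $i)"
      sleep 30
    done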
Sources
Updating the Amazon ECS container agent on an Amazon ECS-optimized AMI - Amazon Elastic Container Service
Manually updating the Amazon ECS container agent (for non-Amazon ECS-Optimized AMIs) - Amazon Elastic Container Service

answered 19 days ago by AWS (Expert) · reviewed 3 days ago
