EMR ECR auto-login failing starting with emr-6.9.0


Hi all, I was wondering if someone has encountered the following and if this is a known problem (with, hopefully, a simple workaround).

We are using Docker images from our private ECR repo in Spark jobs on EMR, following the documentation here. Prior to emr-6.1.0 we had a script to log in to ECR, and from emr-6.1.0 through emr-6.8.0 we relied on the auto-login capability. This has worked fine.
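
For context, the pre-emr-6.1.0 approach was roughly the following node-level script (a minimal sketch, not our exact script; the account ID and region are placeholders):

#!/bin/bash
# Sketch: manual ECR login we ran on cluster nodes before emr-6.1.0.
# ACCOUNT_ID and REGION are placeholders; adjust to your environment.
ACCOUNT_ID=xxxx
REGION=region
aws ecr get-login-password --region "$REGION" \
  | docker login --username AWS --password-stdin "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com"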

Recently, we tried to upgrade to emr-6.13, but realized (after some trial and error) that the auto-login no longer seems to work starting with emr-6.9.0. Note that during these tests we changed no cluster configuration other than the release version.

Drilling into some container logs:

emr-6.8.0:

2023-10-20 11:55:50,957 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler (NM ContainerManager dispatcher): Starting container [container_1697802855871_0001_01_000001]
2023-10-20 11:55:50,989 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl (NM ContainerManager dispatcher): Container container_1697802855871_0001_01_000001 transitioned from SCHEDULED to RUNNING
2023-10-20 11:55:50,989 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (NM ContainerManager dispatcher): Starting resource-monitoring for container_1697802855871_0001_01_000001
2023-10-20 11:55:52,512 INFO org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider (ContainersLauncher #0): Got token from AmazonECR
2023-10-20 11:56:03,954 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime (Container Monitor): container_1697802855871_0001_01_000001 : docker inspect output ,ip-172-28-0-65
 
2023-10-20 11:56:03,954 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime (Container Monitor): Docker inspect output for container_1697802855871_0001_01_000001: ,ip-172-28-0-65

emr-6.9.0:

2023-10-20 12:10:41,962 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler (NM ContainerManager dispatcher): Starting container [container_1697803709136_0001_01_000001]
2023-10-20 12:10:41,996 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl (NM ContainerManager dispatcher): Container container_1697803709136_0001_01_000001 transitioned from SCHEDULED to RUNNING
2023-10-20 12:10:41,996 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (NM ContainerManager dispatcher): Starting resource-monitoring for container_1697803709136_0001_01_000001
2023-10-20 12:10:42,009 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.volume.csi.ContainerVolumePublisher (ContainersLauncher #0): Initiate container volume publisher, containerID=container_1697803709136_0001_01_000001, volume local mount rootDir=/mnt/yarn/usercache/hadoop/filecache/csivolumes/application_1697803709136_0001/container_1697803709136_0001_01_000001
2023-10-20 12:10:42,009 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.volume.csi.ContainerVolumePublisher (ContainersLauncher #0): publishing volumes
2023-10-20 12:10:42,009 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.volume.csi.ContainerVolumePublisher (ContainersLauncher #0): Found 0 volumes to be published on this node
2023-10-20 12:10:42,353 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor (ContainersLauncher #0): Exit code from container container_1697803709136_0001_01_000001 is : 7
2023-10-20 12:10:42,353 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor (ContainersLauncher #0): Exception from container-launch with container ID: container_1697803709136_0001_01_000001 and exit code: 7
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:907)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:178)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:606)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:521)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:585)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:373)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Exception from container-launch.
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Container id: container_1697803709136_0001_01_000001
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Exit code: 7
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Exception message: Launch container failed
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Shell error output: Unable to find image 'xxxxx.dkr.ecr.region.amazonaws.com/yyy:v367' locally
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): docker: Error response from daemon: Head "https://xxxx.dkr.ecr.region.amazonaws.com/v2/yyy/manifests/v367": no basic auth credentials.

In the second log, I don't see any line like:

2023-10-20 11:55:52,512 INFO org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider (ContainersLauncher #0): Got token from AmazonECR

Both clusters have the following configuration set:

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,xxxx.dkr.ecr.region.amazonaws.com",
                "docker.privileged-containers.registries": "local,xxxx.dkr.ecr.region.amazonaws.com"
            }
        }
    ]
  }
]
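
For completeness, we supply this classification at cluster creation, roughly like the following (a sketch; the file name and the remaining create-cluster options are placeholders, not our exact command):

# Sketch: passing the classification JSON above at cluster creation.
# configurations.json contains the block shown above; other options are placeholders.
aws emr create-cluster \
  --name "docker-on-yarn" \
  --release-label emr-6.13.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://configurations.json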

Note: I suspect the move to Hadoop 3.3.3 (from 3.1.0) might be a factor here, but I can't find anything in the release notes indicating that I need to change anything.

Does anyone have any ideas and/or workarounds (other than reverting to how we were doing it pre-emr-6.1.0)? Thanks for your help!

1 Answer
Accepted Answer

Hello,

It looks like yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled is not automatically set to true in the later releases, even though it is expected to be.

As a workaround, you can add the following configuration to the yarn-site classification:

[    
    {
        "classification": "yarn-site",
        "properties": {
            "yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled": "true",
            "yarn.nodemanager.runtime.linux.docker.docker-client-credential-provider.class": "org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider"
        },
        "configurations": []
    }
]
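
If you prefer a single configuration document, this yarn-site block can simply sit alongside the container-executor classification from your question, for example (placeholders from your question retained):

[
  {
    "Classification": "container-executor",
    "Configurations": [
      {
        "Classification": "docker",
        "Properties": {
          "docker.trusted.registries": "local,xxxx.dkr.ecr.region.amazonaws.com",
          "docker.privileged-containers.registries": "local,xxxx.dkr.ecr.region.amazonaws.com"
        }
      }
    ]
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled": "true",
      "yarn.nodemanager.runtime.linux.docker.docker-client-credential-provider.class": "org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider"
    }
  }
]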

Another approach is to pass the value of DOCKER_CLIENT_CONFIG in the spark-submit command, like below:

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
  --num-executors 2 main.py -v

Note: hdfs:///user/hadoop/config.json is uploaded from ~/.docker/config.json after running aws ecr get-login-password | docker login --username AWS --password-stdin against your ECR registry, as sketched below. Let me know if you have any issues.
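
A minimal sketch of that preparation step might look like the following (the account ID, region, and HDFS path are placeholders taken from this thread):

# Sketch: generate a Docker client config with ECR credentials and stage it in HDFS.
# Replace xxxx and region with your account ID and AWS region.
aws ecr get-login-password --region region \
  | docker login --username AWS --password-stdin xxxx.dkr.ecr.region.amazonaws.com

# Upload the resulting client config so YARN can find it via DOCKER_CLIENT_CONFIG.
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put -f ~/.docker/config.json /user/hadoop/config.json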

— AWS Support Engineer
  • Thanks for the answer. I did have the first setting ("yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled": "true") explicitly set, but not the second ("yarn.nodemanager.runtime.linux.docker.docker-client-credential-provider.class": "org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider"). Adding that second property seems to have fixed the issue. It would be really great to add this to the documentation to avoid this question in the future!

    I was also aware of the second method, which is what we did before, but I wanted to avoid that.

  • Sure. I will cascade your feedback internally.
