EMR ECR auto-login failing starting with emr-6.9.0


Hi all, I was wondering if someone has encountered the following and if this is a known problem (with, hopefully, a simple workaround).

We are using Docker images from our private ECR repo in Spark jobs on EMR, following the documentation here. Prior to emr-6.1.0 we had a script to log in to ECR, and from emr-6.1.0 until emr-6.8.0 we made use of the auto-login capability. This has worked fine.

Recently, we tried to upgrade to emr-6.13, but realized (after trial and error) that the auto-login no longer seems to work, starting with emr-6.9.0. Note that during these tests we changed no cluster configuration other than the release version.

Drilling into some container logs:

emr-6.8.0:

2023-10-20 11:55:50,957 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler (NM ContainerManager dispatcher): Starting container [container_1697802855871_0001_01_000001]
2023-10-20 11:55:50,989 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl (NM ContainerManager dispatcher): Container container_1697802855871_0001_01_000001 transitioned from SCHEDULED to RUNNING
2023-10-20 11:55:50,989 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (NM ContainerManager dispatcher): Starting resource-monitoring for container_1697802855871_0001_01_000001
2023-10-20 11:55:52,512 INFO org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider (ContainersLauncher #0): Got token from AmazonECR
2023-10-20 11:56:03,954 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime (Container Monitor): container_1697802855871_0001_01_000001 : docker inspect output ,ip-172-28-0-65
 
2023-10-20 11:56:03,954 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime (Container Monitor): Docker inspect output for container_1697802855871_0001_01_000001: ,ip-172-28-0-65

emr-6.9.0:

2023-10-20 12:10:41,962 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler (NM ContainerManager dispatcher): Starting container [container_1697803709136_0001_01_000001]
2023-10-20 12:10:41,996 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl (NM ContainerManager dispatcher): Container container_1697803709136_0001_01_000001 transitioned from SCHEDULED to RUNNING
2023-10-20 12:10:41,996 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (NM ContainerManager dispatcher): Starting resource-monitoring for container_1697803709136_0001_01_000001
2023-10-20 12:10:42,009 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.volume.csi.ContainerVolumePublisher (ContainersLauncher #0): Initiate container volume publisher, containerID=container_1697803709136_0001_01_000001, volume local mount rootDir=/mnt/yarn/usercache/hadoop/filecache/csivolumes/application_1697803709136_0001/container_1697803709136_0001_01_000001
2023-10-20 12:10:42,009 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.volume.csi.ContainerVolumePublisher (ContainersLauncher #0): publishing volumes
2023-10-20 12:10:42,009 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.volume.csi.ContainerVolumePublisher (ContainersLauncher #0): Found 0 volumes to be published on this node
2023-10-20 12:10:42,353 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor (ContainersLauncher #0): Exit code from container container_1697803709136_0001_01_000001 is : 7
2023-10-20 12:10:42,353 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor (ContainersLauncher #0): Exception from container-launch with container ID: container_1697803709136_0001_01_000001 and exit code: 7
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:907)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:178)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:606)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:521)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:585)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:373)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Exception from container-launch.
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Container id: container_1697803709136_0001_01_000001
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Exit code: 7
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Exception message: Launch container failed
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): Shell error output: Unable to find image 'xxxxx.dkr.ecr.region.amazonaws.com/yyy:v367' locally
2023-10-20 12:10:42,354 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #0): docker: Error response from daemon: Head "https://xxxx.dkr.ecr.region.amazonaws.com/v2/yyy/manifests/v367": no basic auth credentials.

In the emr-6.9.0 log, I don't see any line like:

2023-10-20 11:55:52,512 INFO org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider (ContainersLauncher #0): Got token from AmazonECR

Both clusters have the following configuration set:

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,xxxx.dkr.ecr.region.amazonaws.com",
                "docker.privileged-containers.registries": "local,xxxx.dkr.ecr.region.amazonaws.com"
            }
        }
    ]
  }
]

Note, I suspect that the move to Hadoop 3.3.3 (from 3.1.0) might have done something here, but I can't find anything in the release notes that indicates I need to change anything.

Does anyone have any ideas and/or workarounds (other than reverting to how we were doing it pre-emr-6.1.0)? Thanks for your help!

asked 7 months ago · 401 views
1 Answer
Accepted Answer

Hello,

yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled might not be automatically set to true in the later releases, even though it is expected to be.

As a workaround, you can add the configuration below in the yarn-site classification:

[
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled": "true",
            "yarn.nodemanager.runtime.linux.docker.docker-client-credential-provider.class": "org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider"
        },
        "Configurations": []
    }
]
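One way to supply this classification is at cluster creation time via the AWS CLI. A minimal sketch, assuming the cluster name, release label, instance settings, and the file name yarn-site-config.json (which would hold the JSON above) are all placeholders to adjust for your environment:

```shell
# Hypothetical create-cluster call; names, roles, and sizes are placeholders.
# yarn-site-config.json contains the yarn-site classification shown above.
aws emr create-cluster \
  --name "spark-docker-cluster" \
  --release-label emr-6.13.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://yarn-site-config.json
```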

Another approach is to pass the value of DOCKER_CLIENT_CONFIG in the spark-submit command, like below:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG="hdfs:///user/hadoop/config.json" \
  --num-executors 2 \
  main.py -v

Note, hdfs:///user/hadoop/config.json is uploaded from ~/.docker/config.json after running "aws ecr get-login-password | docker login --username AWS --password-stdin xxxx.dkr.ecr.region.amazonaws.com". Let me know if you have any issues.
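The manual steps described above could look roughly like this (a sketch; the region, account ID, and HDFS path are placeholders matching those used earlier in the thread, and "AWS" is the literal username ECR expects):

```shell
# Authenticate Docker against the private ECR registry; this writes
# the resulting credentials to ~/.docker/config.json.
aws ecr get-login-password --region region \
  | docker login --username AWS --password-stdin xxxx.dkr.ecr.region.amazonaws.com

# Upload the Docker client config to HDFS so YARN containers can reference it
# via YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG.
hdfs dfs -put -f ~/.docker/config.json hdfs:///user/hadoop/config.json
```

Note the ECR authorization token expires (after 12 hours), so with this approach the config.json on HDFS has to be refreshed periodically.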

AWS
SUPPORT ENGINEER
answered 7 months ago
EXPERT
reviewed 7 months ago
  • Thanks for the answer. So I did have the first setting ("yarn.nodemanager.runtime.linux.docker.ecr-auto-authentication.enabled": "true") explicitly set, but not the second ("yarn.nodemanager.runtime.linux.docker.docker-client-credential-provider.class": "org.apache.hadoop.security.authentication.EcrDockerClientCredentialProvider"). Adding that second property seemed to fix the issue. It would be really great to add that to the documentation to avoid this question in the future!

    I was also aware of the second method, which is what we did before, but I wanted to avoid that.

  • Sure. I will cascade your feedback internally.
