Why did my Amazon EMR cluster terminate with an "application provisioning failed" error?


My Amazon EMR cluster terminated with an "application provisioning failed" error. What does this error mean and how can I fix it?

Resolution

The "application provisioning failed" error occurs when Amazon EMR can't install, configure, or start the specified software while launching an EMR cluster. The following sections show you how to find and review the provisioning logs. They also describe common error types and the steps that you can take to resolve them.

Review Amazon EMR provisioning logs stored in Amazon S3

Amazon EMR provisioning logs are stored in an Amazon Simple Storage Service (Amazon S3) bucket specified at cluster launch. The storage location of the logs uses the following Amazon S3 URI syntax:

s3://example-log-location/example-cluster-ID/node/example-primary-node-ID/provision-node/apps-phase/0/example-UUID/puppet.log.gz

Note: Replace example-log-location, example-cluster-ID, example-primary-node-ID, and example-UUID with your system's naming.

  1. Open the Amazon EMR console. In the navigation pane, choose Clusters. Then, choose the failed EMR cluster to see the cluster details.
  2. In the Summary section, choose "Terminated with errors" and note the primary node ID included in the error message.
  3. In the Cluster logs section, choose the Amazon S3 location URL to be redirected to the cluster logs in the Amazon S3 console.
  4. Navigate to your UUID folder by following this path: node/example-primary-node-ID/provision-node/apps-phase/0/example-UUID/.
    Note: Replace example-primary-node-ID and example-UUID with your system's naming.
  5. In the resulting list, select puppet.log.gz, and then choose Open to view the provisioning log in a new browser tab. You can also download and read the log programmatically, as shown in the sketch that follows these steps.
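
If you prefer to review the log outside of the console, the following is a minimal sketch that downloads and prints puppet.log.gz with the AWS SDK for Python (Boto3). The bucket, cluster ID, primary node ID, and UUID values are placeholders for your system's naming.

import gzip
import boto3

# Placeholders: replace with your log bucket, cluster ID, primary node ID, and UUID
bucket = "example-log-location"
key = ("example-cluster-ID/node/example-primary-node-ID/"
       "provision-node/apps-phase/0/example-UUID/puppet.log.gz")

s3 = boto3.client("s3")
response = s3.get_object(Bucket=bucket, Key=key)

# The log is gzip-compressed, so decompress it before printing
log_text = gzip.decompress(response["Body"].read()).decode("utf-8", errors="replace")
print(log_text)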

Identify the reasons for failures in provisioning logs

Unsupported configuration parameters, incorrect hostnames or passwords, and general operating system issues can all cause provisioning errors. Search the logs for related keywords, such as "error" or "fail."
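
For example, the following is a minimal sketch that scans a downloaded copy of puppet.log.gz for lines that contain those keywords. The file name is a placeholder for your local copy of the log.

import gzip

# Placeholder: path to a local copy of the provisioning log
log_path = "puppet.log.gz"
keywords = ("error", "fail")

with gzip.open(log_path, "rt", errors="replace") as log:
    for number, line in enumerate(log, start=1):
        # Print any line that mentions an error or failure, with its line number
        if any(keyword in line.lower() for keyword in keywords):
            print(f"{number}: {line.rstrip()}")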

The following is a list of common error types:

  • Issues connecting to an external metastore with an Amazon Relational Database Service (Amazon RDS) instance.
  • Issues connecting to an external key distribution center (KDC).
  • Issues when starting services, such as YARN ResourceManager and Hadoop NameNode.
  • Issues when downloading or installing applications.
  • S3 logs aren't available.

Issues connecting to an external metastore with an Amazon RDS instance

Some Amazon EMR applications, such as Hive, Hue, or Oozie, can be configured to store data in an external database, such as an Amazon RDS instance. When there's an issue with this connection, an error message appears in the provisioning log.

The following is an example error message from Hive:

2022-11-26 02:59:36 +0000 /Stage[main]/Hadoop_hive::Init_metastore_schema/Exec[init hive-metastore schema]/returns (notice): org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
2022-11-26 02:59:36 +0000 /Stage[main]/Hadoop_hive::Init_metastore_schema/Exec[init hive-metastore schema]/returns (notice): Underlying cause: java.sql.SQLNonTransientConnectionException : Could not connect to address=(host=hostname)(port=3306)(type=master) : Socket fail to connect to host:hostname, port:3306. hostname
2022-11-26 02:59:36 +0000 /Stage[main]/Hadoop_hive::Init_metastore_schema/Exec[init hive-metastore schema]/returns (notice): SQL Error code: -1

To resolve this type of error:

  • Verify that the RDS instance hostname, user, password, and database are correct.
  • Verify that the RDS instance security group inbound rules allow connections from the Amazon EMR primary node security group (see the connectivity check sketch that follows this list).
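
The following is a minimal sketch that tests TCP connectivity from the EMR primary node to the RDS endpoint on port 3306. The hostname is a placeholder for your own RDS instance endpoint. A failed connection usually points to an incorrect hostname or to a security group rule that blocks the primary node.

import socket

# Placeholder: replace with your RDS instance endpoint
host = "example-rds-instance.abcdefgh1234.us-east-1.rds.amazonaws.com"
port = 3306

try:
    # Attempt a TCP connection with a short timeout
    with socket.create_connection((host, port), timeout=5):
        print(f"Connected to {host}:{port}")
except OSError as error:
    print(f"Could not connect to {host}:{port}: {error}")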

Issues connecting to an external KDC

Amazon EMR lets you configure an external KDC to add an additional layer of security. You can also create a trust relationship with an Active Directory server. When there's an issue contacting the KDC or joining the domain, an error message appears in the provisioning log.

The following is an example error message from Puppet:

2022-11-26 03:02:01 +0000 Puppet (err): 'echo "${AD_DOMAIN_JOIN_PASSWORD}" | realm join -v -U "${AD_DOMAIN_JOIN_USER}"@"${CROSS_REALM_TRUST_REALM}" "${CROSS_REALM_TRUST_DOMAIN}"' returned 1 instead of one of [0]
2022-11-26 03:02:01 +0000 /Stage[main]/Kerberos::Ad_joiner/Exec[realm_join]/returns (err): change from 'notrun' to ['0'] failed: 'echo "${AD_DOMAIN_JOIN_PASSWORD}" | realm join -v -U "${AD_DOMAIN_JOIN_USER}"@"${CROSS_REALM_TRUST_REALM}" "${CROSS_REALM_TRUST_DOMAIN}"' returned 1 instead of one of [0]

To resolve this type of error:

  • Verify that the Kerberos realm is spelled correctly.
  • Verify that the KDC administrative password is spelled correctly.
  • Verify that the Active Directory join user and password are spelled correctly.
  • Verify that the Active Directory join user exists in Active Directory and has the correct permissions.
  • If the KDC and Active Directory servers are on Amazon EC2, verify that the KDC and Active Directory security group inbound rules allow connections from the Amazon EMR primary node security group (a sketch for reviewing these rules follows this list).
  • If the KDC and Active Directory servers aren't on Amazon EC2, verify that the KDC and Active Directory allow connections from the EMR cluster virtual private cloud (VPC) and subnet.
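
If the KDC or Active Directory server runs on Amazon EC2, the following sketch lists the inbound rules of its security group with the AWS SDK for Python (Boto3) so that you can confirm that the Amazon EMR primary node security group is allowed. The security group ID is a placeholder.

import boto3

# Placeholder: security group attached to the KDC or Active Directory server
kdc_security_group_id = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2")
response = ec2.describe_security_groups(GroupIds=[kdc_security_group_id])

# Print each inbound rule: port range, allowed security groups, and allowed CIDR ranges
for rule in response["SecurityGroups"][0]["IpPermissions"]:
    sources = [pair["GroupId"] for pair in rule.get("UserIdGroupPairs", [])]
    cidrs = [ip_range["CidrIp"] for ip_range in rule.get("IpRanges", [])]
    print(rule.get("FromPort"), rule.get("ToPort"), sources, cidrs)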

Issues when starting services, such as YARN ResourceManager, Hadoop NameNode, or Spark History Server

Amazon EMR allows custom configuration of all applications at EMR cluster launch. But sometimes these configurations prevent services from starting. When an issue prevents a service from starting, an error message appears in the provisioning log.

The following is an example error message from Spark History Server:

2022-11-26 03:34:13 +0000 Puppet (err): Systemd start for spark-history-server failed!
journalctl log for spark-history-server:
-- Logs begin at Sat 2022-11-26 03:27:57 UTC, end at Sat 2022-11-26 03:34:13 UTC. --
Nov 26 03:34:10 ip-192-168-1-32 systemd[1]: Starting Spark history-server...
Nov 26 03:34:10 ip-192-168-1-32 spark-history-server[1076]: Starting Spark history-server (spark-history-server):[OK]
Nov 26 03:34:10 ip-192-168-1-32 su[1112]: (to spark) root on none
Nov 26 03:34:13 ip-192-168-1-32 systemd[1]: spark-history-server.service: control process exited, code=exited status=1
Nov 26 03:34:13 ip-192-168-1-32 systemd[1]: Failed to start Spark history-server.
Nov 26 03:34:13 ip-192-168-1-32 systemd[1]: Unit spark-history-server.service entered failed state.
Nov 26 03:34:13 ip-192-168-1-32 systemd[1]: spark-history-server.service failed.
2022-11-26 03:34:13 +0000 /Stage[main]/Spark::History_server/Service[spark-history-server]/ensure (err): change from 'stopped' to 'running' failed: Systemd start for spark-history-server failed!
journalctl log for spark-history-server:

To resolve this type of error:

  • Identify the service that failed to start, and verify that the configurations that you provided for it are spelled correctly.
  • Navigate to the following Amazon S3 path to review the service log and investigate the reason for the failure: s3://example-log-location/example-cluster-ID/node/example-primary-node-ID/applications/example-failed-application/example-failed-service.gz. A sketch for listing these logs follows this list.
    Note: Replace example-log-location, example-cluster-ID, example-primary-node-ID, example-failed-application, and example-failed-service with your system's naming.
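
The following is a minimal sketch that lists the per-service application logs under that prefix with the AWS SDK for Python (Boto3) so that you can find the log for the failed service. The bucket, cluster ID, and primary node ID values are placeholders.

import boto3

# Placeholders: replace with your log bucket, cluster ID, and primary node ID
bucket = "example-log-location"
prefix = "example-cluster-ID/node/example-primary-node-ID/applications/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Print the key of every application log written for the primary node
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"])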

Issues when downloading or installing applications

Amazon EMR can install many applications. Sometimes an application can't be downloaded or installed, and this causes the EMR cluster to fail. When this failure happens, the provisioning logs are incomplete. Instead, review the stderr.gz log for messages caused by failed yum installations.

The following is an example error message from stderr.gz:

stderr.gz
Error Summary
-------------
Disk Requirements:
  At least 2176MB more space needed on the / filesystem.
  
2022-11-26 03:18:44,662 ERROR Program: Encountered a problem while provisioning
java.lang.RuntimeException: Amazon-linux-extras topics enabling or yum packages installation failed.

To resolve this type of error, increase the size of the root Amazon Elastic Block Store (Amazon EBS) volume when you launch the EMR cluster.
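
The following is a minimal sketch that launches a cluster with a larger root volume by setting the EbsRootVolumeSize parameter (in GiB) of the RunJobFlow API through the AWS SDK for Python (Boto3). The cluster name, release label, log location, instance settings, and roles are example values; adjust them to your environment.

import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.9.0",      # example release label
    EbsRootVolumeSize=50,          # root EBS volume size in GiB
    LogUri="s3://example-log-location/",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])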

S3 logs aren't available

If Amazon EMR fails to provision applications and no logs are generated in Amazon S3, then a network error likely caused S3 logging to fail.

To resolve this type of error:

  • Verify that the Logging option is turned on during the EMR cluster launch. For more information, see Configure cluster logging and debugging. A sketch for checking a cluster's log location follows this list.
  • When using a custom AMI, verify that there are no firewall rules interfering with the required Amazon EMR network settings. For more information, see Working with Amazon EMR-managed security groups.
  • When using a custom AMI, check for failed primary nodes. Open the Amazon EMR console, and in the navigation pane, choose Hardware to determine whether the cluster couldn't launch any primary nodes.
  • When using a custom AMI, verify that you're following best practices. For more information, see Using a custom AMI.
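
To confirm programmatically that logging was configured, the following sketch checks the LogUri attribute of the cluster with the AWS SDK for Python (Boto3). The cluster ID is a placeholder.

import boto3

emr = boto3.client("emr")

# Placeholder: replace with your cluster ID
cluster = emr.describe_cluster(ClusterId="j-EXAMPLE1234ABC")["Cluster"]

# LogUri is present only if logging was configured at launch
print(cluster.get("LogUri", "Logging was not configured for this cluster"))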

Related information

EMR cluster failed to provision
