EMR Cluster failure with "Failed to start the job flow due to an internal error"

Content level: Advanced

This article helps you investigate an EMR cluster that terminated with the error "Failed to start the job flow due to an internal error", especially when the cluster uses a custom AMI.

When EMR cluster provisioning fails with this error, check the system log of the EC2 instance, if one was created. Alternatively, enable development mode so that you can log in to the primary node and investigate the underlying cause directly.




If cluster provisioning takes longer than usual, and you see that EMR tried to create the primary instance several times before terminating with a “time out occurred during bootstrap” exception, retrieve the system log of the instance. In the EMR console, click the terminated EC2 instance ID under Instance groups to open the EC2 console. Then select the terminated instance and choose “Actions” -> “Monitor and Troubleshoot” -> “Get system logs”.

The system log should contain cloud-init output for the various package installations. Check it for failures such as a service that was not found or a service that failed to start.
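If the system log is long, a small script can help narrow down the lines worth reading. The following is a minimal sketch, assuming you have saved the system log text to a local file. The function name `find_failures` and the two patterns are illustrative assumptions based on the failure types mentioned above; they are not an official list of cloud-init messages, so adjust them for your AMI.

```python
import re
from typing import List, Tuple

# Assumed failure phrases, modeled on the "not found" and "failed to start"
# cases described in this article. Extend the list for your own AMI.
FAILURE_PATTERNS = [
    re.compile(r"failed to start", re.IGNORECASE),
    re.compile(r"\bnot[- ]found\b", re.IGNORECASE),
]


def find_failures(system_log: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that match any failure pattern."""
    hits = []
    for number, line in enumerate(system_log.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

To use it, read the saved log with `open(path).read()`, pass the text to `find_failures`, and print the returned pairs. The matches are only a starting point; read the surrounding lines in the full log to understand which package or service was involved.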


For EMR 7.x releases, custom AMIs must be based on Amazon Linux 2023. If you try to provision an EMR 7.x cluster with an AMI based on Amazon Linux 2, provisioning fails. Conversely, EMR releases lower than 7.x do not support Amazon Linux 2023 AMIs. So first check that the operating system version of your AMI is compatible with your EMR release, and make sure that the AMI meets the custom AMI considerations.
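The compatibility rule above can be expressed as a quick pre-flight check before you launch a cluster. This is a minimal sketch that encodes only the rule stated in this article. The function names and the `"AL2"` / `"AL2023"` labels are hypothetical conventions chosen for the example, and the check does not cover the other custom AMI considerations.

```python
import re


def emr_major_version(release_label: str) -> int:
    """Extract the major version from a release label such as 'emr-7.1.0'."""
    match = re.fullmatch(r"emr-(\d+)\.\d+\.\d+", release_label)
    if match is None:
        raise ValueError(f"Unexpected release label: {release_label!r}")
    return int(match.group(1))


def ami_is_compatible(release_label: str, ami_os: str) -> bool:
    """Apply the rule from this article.

    ami_os is a label you assign yourself (an assumption of this sketch):
    'AL2023' for Amazon Linux 2023, 'AL2' for Amazon Linux 2.
    """
    if emr_major_version(release_label) >= 7:
        # EMR 7.x requires an Amazon Linux 2023 based AMI.
        return ami_os == "AL2023"
    # Releases lower than 7.x do not support Amazon Linux 2023.
    return ami_os != "AL2023"
```

For example, `ami_is_compatible("emr-7.1.0", "AL2")` returns `False`, which matches the provisioning failure described above.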

If the EMR daemon installation steps do not complete successfully, cluster logs such as daemon logs and provisioning logs are not available in the configured S3 log location. In that case, either log in to the primary node manually and review the logs, or use your own bootstrap action (BA) shell script to upload the logs from the local log locations (/mnt/var/log/ and /emr/) to your S3 bucket for investigation and troubleshooting. Refer to the EMR documentation for details of the log locations.
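As an illustration of the upload approach, the sketch below builds the `aws s3 cp --recursive` commands for the two local log locations. It assumes that the AWS CLI is installed on the node and that the instance role can write to your bucket. The function name, the bucket, the prefix, and the key layout are hypothetical choices for this example, not something EMR prescribes.

```python
from typing import List

# Local log directories named in this article.
LOG_DIRS = ["/mnt/var/log/", "/emr/"]


def build_upload_commands(bucket: str, prefix: str, instance_id: str) -> List[List[str]]:
    """Build one 'aws s3 cp --recursive' command per local log directory."""
    commands = []
    for directory in LOG_DIRS:
        # Keep each source directory under its own key, grouped by instance.
        destination = (
            f"s3://{bucket}/{prefix.strip('/')}/{instance_id}/{directory.strip('/')}/"
        )
        commands.append(["aws", "s3", "cp", directory, destination, "--recursive"])
    return commands
```

On the node, each command can be run with `subprocess.run(command, check=False)`, so that a failure to copy one directory does not stop the other from uploading. The same two `aws s3 cp` commands can also be written directly in a bootstrap action shell script.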

If you do not find the logs in S3 or in the system log, set development mode on the cluster so that you can log in to the primary node and troubleshoot the issue. When starting your EMR cluster, set the "--additional-info" parameter to

'{"clusterType":"development"}'

When this flag is set and the primary node fails to provision, the EMR service keeps the cluster alive for some time before decommissioning it. This is very useful for examining the various log files before the cluster is terminated. After the investigation is complete, you can terminate the EC2 instance to avoid additional EC2 charges. Note that this parameter can only be set through the AWS CLI or an AWS SDK; it is not available in the EMR console. Refer to the EMR create-cluster CLI reference for how to set up development mode.
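To show where the parameter fits in a full command, the following sketch assembles an `aws emr create-cluster` invocation with development mode enabled. The function name and every value passed to it (cluster name, AMI ID, log bucket, instance type, and instance count) are placeholders assumed for illustration. A real cluster typically also needs options such as EC2 attributes and applications, which are omitted here.

```python
import json
import shlex
from typing import List


def build_create_cluster_command(
    name: str,
    release_label: str,
    custom_ami_id: str,
    log_uri: str,
    instance_type: str = "m5.xlarge",
    instance_count: int = 3,
) -> List[str]:
    """Assemble an 'aws emr create-cluster' command with development mode set."""
    # Compact JSON, identical to the value shown in this article.
    additional_info = json.dumps({"clusterType": "development"}, separators=(",", ":"))
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--custom-ami-id", custom_ami_id,
        "--log-uri", log_uri,
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--additional-info", additional_info,
    ]


# Placeholder values only; replace them with your own before running.
command = build_create_cluster_command(
    name="debug-cluster",
    release_label="emr-7.1.0",
    custom_ami_id="ami-0123456789abcdef0",
    log_uri="s3://amzn-s3-demo-bucket/emr-logs/",
)
print(shlex.join(command))
```

Building the command as a list and printing it with `shlex.join` keeps the JSON value correctly quoted for the shell, which is an easy place to make mistakes when typing the command by hand.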

AWS Support Engineer
published a month ago