EMR Cluster failure with "Master instance startup failed due to an internal error"

Content level: Advanced

This article helps you investigate an EMR cluster that terminated with the error "Master instance startup failed due to an internal error", especially when the cluster uses a custom AMI.

When EMR cluster provisioning fails with this error, review the system log of the EC2 instance, provided that the instance was created during provisioning. The system log is discarded some time after the instance terminates. To keep it for later troubleshooting, download the system log and store it in a permanent location.


To view the system log, click the terminated EC2 instance ID under Instance groups in the EMR console. This opens the EC2 console, where you select the terminated instance -> click “Actions” -> select “Monitor and troubleshoot” -> “Get system log”.
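
If you prefer the command line, the same system log can be retrieved with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is configured with permission to call ec2:GetConsoleOutput. The function name save_system_log, the instance ID, and the output file name are illustrative placeholders.

```shell
#!/bin/sh
# Sketch: save the EC2 system log (console output) of an instance to a local file.
# save_system_log is a hypothetical helper name, not an AWS-provided command.
save_system_log() {
    instance_id="$1"   # for example, the terminated primary node's instance ID
    out_file="$2"      # local file in which to keep the log
    # --query Output with --output text prints only the log text.
    aws ec2 get-console-output \
        --instance-id "$instance_id" \
        --query Output \
        --output text > "$out_file"
}

# Example call (placeholder values):
# save_system_log i-0123456789abcdef0 primary-node-system.log
```

Copy the saved file to a permanent location, because the system log is available only for a limited time after the instance terminates.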

After the cloud-init step, which records its activity in the system log, the setup-devices step starts. During this step, the contents of the /tmp, /var, and /emr directories (if they are present on the Amazon Machine Image) are moved to /mnt/tmp, /mnt/var, and /mnt/emr respectively at system startup.

In this example, the system log shows that the setup-devices service failed during initialization. The step could not complete because the "rsync" and "nfs-utils" packages were missing from the custom AMI. The setup-devices step relies on operating system utilities such as these to preserve files while it moves them. If a large amount of data has to be moved between directories, system startup can also take longer than expected.

The system log entries indicate that the setup-devices step failed. To confirm this, examine the log files in the cluster's S3 log location, provided that the logs were delivered to that destination. For the terminated node, look for the log prefixes under S3-Bucket/<EMR-Cluster>/node/<Primary node EC2 InstanceID>/
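
The node's log location can also be browsed from the command line. The sketch below only builds the S3 prefix from the layout described above. The helper name node_log_prefix and the bucket, cluster ID, and instance ID values are hypothetical placeholders, and the commented aws s3 commands assume the AWS CLI has read access to the log bucket.

```shell
#!/bin/sh
# Sketch: build the S3 prefix that holds one node's logs, using the layout
# S3-Bucket/<EMR-Cluster>/node/<InstanceID>/ described above.
# node_log_prefix is a hypothetical helper name.
node_log_prefix() {
    bucket="$1"       # S3 bucket (plus any key prefix) from the cluster's log URI
    cluster_id="$2"   # EMR cluster ID
    instance_id="$3"  # EC2 instance ID of the primary node
    printf 's3://%s/%s/node/%s/\n' "$bucket" "$cluster_id" "$instance_id"
}

# Example calls (placeholder values):
# prefix=$(node_log_prefix my-emr-logs j-EXAMPLECLUSTER i-0123456789abcdef0)
# aws s3 ls "$prefix"                                        # list the log folders
# aws s3 cp --recursive "${prefix}setup-devices/" ./setup-devices/
```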

Go to the setup-devices folder, open the “setup_tmp_dir.2024-04-09-16-20-05.log.gz” file, and check whether any exception occurred. If the rsync package is missing, the log reports a “command not found” error.
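
Once the file is downloaded, the check can be scripted. This is a minimal sketch, assuming gzip and grep are available locally. The helper name scan_setup_log is hypothetical, and the only pattern it searches for is the “command not found” text mentioned above.

```shell
#!/bin/sh
# Sketch: print every line (with its line number) of a gzip-compressed log
# that reports a missing command. scan_setup_log is a hypothetical helper name.
scan_setup_log() {
    log_gz="$1"   # path to a downloaded .log.gz file
    gzip -dc "$log_gz" | grep -n 'command not found'
}

# Example call, using the file name from this article:
# scan_setup_log setup_tmp_dir.2024-04-09-16-20-05.log.gz
```

The function returns a non-zero exit status when no matching line is found, which makes it easy to use in a script condition.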

In this case, install the missing packages directly in the custom AMI instead of adding them in a bootstrap action (BA) script. The BA script runs only after the setup-devices step completes, so installing the packages there does not fix the issue.

If nfs-utils is not available in the custom AMI, you can confirm this from the setup-devices logs stored in S3 as well. Refer to the “setup_var_lib_dir.2024-04-09-16-58-24.log.gz” file under the setup-devices directory in S3.

If nfs-utils is missing, the setup_var_lib_dir.2024-04-09-16-58-24.log.gz log file reports the resulting exception.

Install this package in the custom AMI as well so that the setup-devices step can complete successfully:

yum install rsync
yum install nfs-utils
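
Before creating the AMI, it can help to confirm that both packages are present. The following is a minimal sketch, assuming an RPM-based image (which the yum commands above imply). The helper name missing_packages is hypothetical.

```shell
#!/bin/sh
# Sketch: print each named package that rpm does not report as installed.
# missing_packages is a hypothetical helper name.
missing_packages() {
    for pkg in "$@"; do
        rpm -q "$pkg" > /dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

# Example call on the instance used to build the custom AMI:
# missing_packages rsync nfs-utils
```

Empty output means that rpm reports every listed package as installed.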

If the logs are not available in the S3 log destination, enable development mode on the cluster so that you can log in to the primary node and troubleshoot the issue there. When you launch the EMR cluster, set the "--additional-info" parameter to:

'{"clusterType":"development"}'

When this flag is set and the primary node fails to provision, the EMR service keeps the cluster alive for some time before it decommissions it. This gives you time to inspect the log files on the node before the cluster is terminated. After the investigation is complete, you can terminate the EC2 instance to avoid additional EC2 charges. Note that this parameter can be set only through the AWS CLI or an AWS SDK and is not available in the EMR console. For details on setting up development mode, refer to the EMR create-cluster CLI reference.
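
As an illustration, a development-mode launch with the AWS CLI might look like the sketch below. The helper name launch_dev_cluster is hypothetical, and the cluster name, release label, instance type, AMI ID, key pair, and log URI are placeholders to replace with your own values. Only the --additional-info value comes from this article.

```shell
#!/bin/sh
# Sketch: launch a one-node EMR cluster from a custom AMI in development mode.
# launch_dev_cluster is a hypothetical helper; the cluster name, release label,
# and instance type are placeholders to adjust for your environment.
launch_dev_cluster() {
    ami_id="$1"     # custom AMI ID
    key_name="$2"   # EC2 key pair used to SSH into the primary node
    log_uri="$3"    # S3 location for cluster logs
    aws emr create-cluster \
        --name "custom-ami-debug" \
        --release-label emr-6.15.0 \
        --custom-ami-id "$ami_id" \
        --instance-type m5.xlarge \
        --instance-count 1 \
        --use-default-roles \
        --ec2-attributes "KeyName=$key_name" \
        --log-uri "$log_uri" \
        --additional-info '{"clusterType":"development"}'
}

# Example call (placeholder values):
# launch_dev_cluster ami-0123456789abcdef0 my-key-pair s3://my-emr-logs/
```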

AWS
SUPPORT ENGINEER
published a month ago
1 Comment

This is a nice step to troubleshoot EMR Cluster failure

EXPERT
replied a month ago