My Amazon EMR cluster terminated unexpectedly.
Resolution
Amazon EMR stores cluster logs in an Amazon Simple Storage Service (Amazon S3) bucket that you specify at cluster launch. The log path looks similar to s3://example-log-location/example-cluster-ID/node/example-EC2-instance-ID/.
To identify why your Amazon EMR cluster terminated, review the Amazon EMR provisioning logs stored in Amazon S3.
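As a small illustration of the log layout described above, the following Python sketch builds the S3 prefix for one node's logs. The function name node_log_prefix and its arguments are hypothetical; only the path pattern comes from the example in this article.

```python
def node_log_prefix(log_uri: str, cluster_id: str, instance_id: str) -> str:
    """Return the S3 prefix that holds the logs for one cluster node.

    Follows the pattern shown in this article:
    <log location>/<cluster ID>/node/<EC2 instance ID>/
    """
    base = log_uri.rstrip("/")  # tolerate a trailing slash on the log location
    return f"{base}/{cluster_id}/node/{instance_id}/"


if __name__ == "__main__":
    # Placeholder values taken from the example path in this article.
    print(node_log_prefix("s3://example-log-location/",
                          "example-cluster-ID",
                          "example-EC2-instance-ID"))
```

You can pass the resulting prefix to the S3 tool of your choice to list or download the node's logs.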
"SHUTDOWN_STEP_FAILED (USER_ERROR)" error
When you submit a step job in your Amazon EMR cluster, you can specify the step failure behavior in the ActionOnFailure parameter. If you select TERMINATE_CLUSTER or TERMINATE_JOB_FLOW for the ActionOnFailure parameter, then the Amazon EMR cluster terminates and you might see the following error message from AWS CloudTrail:
{
    "severity": "ERROR",
    "actionOnFailure": "TERMINATE_JOB_FLOW",
    "stepId": "s-2I0G########",
    "name": "Example Step",
    "clusterId": "j-2YJ#######",
    "state": "FAILED",
    "message": "Step s-2I0G####### (Example Step) in Amazon EMR cluster j-2YJ####### failed at 202#-1#-0# 0#:## UTC."
}
To avoid this error, use the CONTINUE or CANCEL_AND_WAIT option in the ActionOnFailure parameter when you submit the step job.
For more information, see StepConfig.
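The following Python sketch shows one way to build a step definition that can't terminate the cluster on failure. The helper build_step is hypothetical, and the use of command-runner.jar is an assumption about how the step is launched. The dictionary keys follow the StepConfig structure, so the result should be usable as one element of the Steps list in a call such as the boto3 EMR client's add_job_flow_steps. That call isn't made here.

```python
# ActionOnFailure values that shut the cluster down when a step fails.
TERMINATING_ACTIONS = {"TERMINATE_CLUSTER", "TERMINATE_JOB_FLOW"}
# ActionOnFailure values that keep the cluster running.
NON_TERMINATING_ACTIONS = {"CONTINUE", "CANCEL_AND_WAIT"}


def build_step(name, args, action_on_failure="CONTINUE"):
    """Build a StepConfig-shaped dict that refuses cluster-terminating behavior.

    Hypothetical helper: it only assembles and validates the dictionary.
    """
    if action_on_failure in TERMINATING_ACTIONS:
        raise ValueError(
            f"{action_on_failure} terminates the cluster when the step fails"
        )
    if action_on_failure not in NON_TERMINATING_ACTIONS:
        raise ValueError(f"Unknown ActionOnFailure value: {action_on_failure}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # assumption: step runs through command-runner
            "Args": list(args),
        },
    }


if __name__ == "__main__":
    # The script path is a placeholder, not a real location.
    step = build_step("Example Step",
                      ["spark-submit", "s3://example-bucket/example-job.py"])
    print(step["ActionOnFailure"])
```

Rejecting the terminating values in code, rather than relying on a default, makes an accidental TERMINATE_CLUSTER setting fail at submission time instead of at step failure.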
"NO_SLAVES_LEFT (SYSTEM_ERROR)" error
You receive the "NO_SLAVES_LEFT" error when the following conditions are true:
- You turned off termination protection in the Amazon EMR cluster.
- All core nodes exceed disk storage capacity as specified by a maximum utilization threshold in the yarn-site configuration classification. The default maximum utilization threshold is 90%.
- The CORE instance is a Spot Instance, and the Spot Instance is TERMINATED_BY_SPOT_DUE_TO_NO_CAPACITY.
For more information on Spot Instance termination, see Why did Amazon EC2 interrupt my Spot Instance?
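To check how close a node is to the disk threshold described above, you can compare a volume's utilization with the 90% default. The following Python sketch is a minimal check that assumes it runs on the node itself. The path "/" is a placeholder for whichever volume holds your YARN local directories, and the function names are hypothetical. In yarn-site, the threshold is usually controlled by the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage property.

```python
import shutil

# Default maximum utilization threshold stated in this article.
DEFAULT_MAX_UTILIZATION_PCT = 90.0


def utilization_pct(used_bytes, total_bytes):
    """Return disk utilization as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def exceeds_threshold(used_bytes, total_bytes,
                      threshold_pct=DEFAULT_MAX_UTILIZATION_PCT):
    """Return True when utilization is above the configured threshold."""
    return utilization_pct(used_bytes, total_bytes) > threshold_pct


if __name__ == "__main__":
    # Assumption: "/" stands in for the volume that you want to inspect.
    usage = shutil.disk_usage("/")
    pct = utilization_pct(usage.used, usage.total)
    print(f"utilization: {pct:.1f}%  over threshold: "
          f"{exceeds_threshold(usage.used, usage.total)}")
```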
To resolve this error, take the following actions:
"502 Bad Gateway" error
When Amazon EMR internal systems can't reach the primary node for a period of time, you receive the "502 Bad Gateway" error. If you turn off termination protection, then Amazon EMR terminates the cluster.
When the instance-controller service is down, check the latest instance-controller logs and instance state logs. If the instance-controller standard output shows that insufficient memory terminated the service, then the primary node lacks adequate memory.
The following is an example error message from the instance state log:
# dump instance controller stdout
tail -n 100 /emr/instance-controller/log/instance-controller.out
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fb46c7c8000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid16110.log
# whats memory usage look like
free -m
              total        used        free      shared  buff/cache   available
Mem:          15661       15346         147           0         167          69
Swap:             0           0           0
To avoid the preceding error, launch an Amazon EMR cluster with a larger instance type to leverage more memory for your cluster's requirements. Also, clean up disk space to avoid memory outages in long running clusters. For more information, see How do I troubleshoot primary node failure with error "502 Bad Gateway" or "504 Gateway Time-out" in Amazon EMR?
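To spot memory pressure like that shown in the preceding log, you can parse the output of free -m and look at the available column. The following Python sketch reads text in that layout. The function names are hypothetical, and the sample values are copied from the log excerpt in this article.

```python
def parse_free_output(text):
    """Parse `free -m` style text and return the Mem: row as a dict of ints."""
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "total":
            header = fields            # column names: total, used, free, ...
        elif fields[0] == "Mem:" and header is not None:
            values = [int(v) for v in fields[1:]]
            return dict(zip(header, values))
    raise ValueError("no Mem: row found in input")


def available_pct(mem):
    """Return available memory as a percentage of total memory."""
    return 100.0 * mem["available"] / mem["total"]


# Values copied from the instance state log shown in this article.
SAMPLE = """\
total used free shared buff/cache available
Mem: 15661 15346 147 0 167 69
Swap: 0 0 0
"""

if __name__ == "__main__":
    mem = parse_free_output(SAMPLE)
    print(f"available: {mem['available']} MiB ({available_pct(mem):.2f}% of total)")
```

For the sample values, less than 1% of memory is available, which is consistent with the allocation failure in the instance-controller output.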
"KMS_ISSUE (USER_ERROR)" error
When you use an Amazon EMR security configuration to encrypt an Amazon EBS root device and storage volumes, the Amazon EMR service role must have the required AWS KMS permissions. If the necessary permissions are missing, then you receive the following error message in AWS CloudTrail:
"The EMR Service Role must have the kms:GenerateDataKey* and kms:ReEncrypt* permission for the KMS key configuration when you enabled EBS encryption by default. You can retrieve that KMS key's ID by using the ec2:GetEbsDefaultKmsKeyId API."
To avoid the preceding error, make sure that security configurations that you used to encrypt the Amazon EBS root device and storage volumes have the necessary permissions. Also make sure that the Amazon EMR service role (EMR_DefaultRole_V2) has permissions to use the specified AWS Key Management Service (AWS KMS) key.
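As a rough pre-check, you can scan a policy document for the two permissions named in the error message. The following Python sketch is a simplified illustration and not a full IAM evaluation: it looks only at Allow statements and their Action entries, and it ignores Resource, Condition, NotAction, and explicit Deny. The function names and the list of concrete actions used to test each wildcard are assumptions made for this example.

```python
from fnmatch import fnmatchcase

# Permissions named in the error message, each mapped to concrete KMS
# actions used to test whether a policy pattern covers the wildcard.
REQUIRED_ACTIONS = {
    "kms:GenerateDataKey*": ["kms:GenerateDataKey",
                             "kms:GenerateDataKeyWithoutPlaintext"],
    "kms:ReEncrypt*": ["kms:ReEncryptFrom", "kms:ReEncryptTo"],
}


def _as_list(value):
    """IAM allows a single item or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allowed_action_patterns(policy):
    """Collect lowercased Action patterns from all Allow statements."""
    patterns = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") == "Allow":
            patterns.extend(a.lower() for a in _as_list(statement.get("Action")))
    return patterns


def missing_kms_permissions(policy):
    """Return the required permissions that no Allow statement covers."""
    patterns = allowed_action_patterns(policy)
    missing = []
    for required, probes in REQUIRED_ACTIONS.items():
        covered = all(
            any(fnmatchcase(probe.lower(), pattern) for pattern in patterns)
            for probe in probes
        )
        if not covered:
            missing.append(required)
    return missing


if __name__ == "__main__":
    # Hypothetical policy that lacks the required permissions.
    policy = {"Statement": [{"Effect": "Allow", "Action": "kms:Decrypt",
                             "Resource": "*"}]}
    print(missing_kms_permissions(policy))
```

An empty result means only that matching Allow entries exist. The key policy and any conditions still determine whether the service role can use the key.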
"Terminated with errors, The master node was terminated by user" error
When the Amazon EMR cluster primary node stops for any reason, the cluster terminates with the "The master node was terminated by user" error.
You receive the following error message in AWS CloudTrail:
"eventTime": "2023-01-18T08:07:02Z",
"eventSource": "ec2.amazonaws.com",
"eventName": "StopInstances",
"awsRegion": "us-east-1",
"sourceIPAddress": "52.##.##.##",
"userAgent": "AWS Internal",
"requestParameters": {
    "instancesSet": {
        "items": [
            {
                "instanceId": "i-##f6c5###########"
            }
        ]
    },
    "force": false
},
Because stopping the Amazon EMR primary or all core nodes leads to cluster termination, don't stop or reboot cluster nodes.
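To find out which instances a StopInstances call affected, you can read the instance IDs from the CloudTrail event. The following Python sketch assumes that the event is already loaded as a dictionary with the field layout shown in the preceding excerpt. The function name and the instance ID in the sample are hypothetical placeholders.

```python
import json


def stopped_instance_ids(event):
    """Return the instance IDs from a CloudTrail StopInstances event.

    Returns an empty list for any other event name.
    """
    if event.get("eventName") != "StopInstances":
        return []
    # requestParameters can be absent or null, so fall back to empty dicts.
    params = event.get("requestParameters") or {}
    items = (params.get("instancesSet") or {}).get("items") or []
    return [item["instanceId"] for item in items if "instanceId" in item]


# Hypothetical event with a placeholder instance ID.
SAMPLE_EVENT = json.loads("""
{
    "eventSource": "ec2.amazonaws.com",
    "eventName": "StopInstances",
    "requestParameters": {
        "instancesSet": {"items": [{"instanceId": "i-0123456789abcdef0"}]},
        "force": false
    }
}
""")

if __name__ == "__main__":
    print(stopped_instance_ids(SAMPLE_EVENT))
```

You can then compare the returned IDs with the cluster's primary and core node instance IDs to confirm whether the stop request caused the termination.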
Note: Your Amazon EMR clusters might terminate for reasons other than those included in this article. For more information, see Resource errors during Amazon EMR cluster operations.