
How do I resolve "Exit status: -100. Diagnostics: Container released on a lost node" error in Amazon EMR?


My Amazon EMR job fails with a "Container released on a lost node" error.

Short description

When Amazon EMR terminates a core or task node because of high disk space utilization, you might receive the following error:

"ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Container marked as failed: container_1572839353552_0008_01_000002 on host: ip-##-###-##-## Exit status: -100. Diagnostics: Container released on a lost node"

You might also receive the preceding error when a node becomes unresponsive because of prolonged high CPU utilization or low available memory.

The following resolution provides steps to resolve the error that you receive when you run out of disk space and the MRUnhealthyNodes metric shows unhealthy nodes.

Resolution

When disk usage on a core or task node disk (for example, /mnt or /mnt1) exceeds 90%, the disk is marked unhealthy. If fewer than 25% of a node's disks are healthy, then the YARN ResourceManager gracefully decommissions the node. To resolve this issue, add more Amazon Elastic Block Store (Amazon EBS) capacity to the EMR cluster.

Determine the root cause

To determine the cause of the error, check the MRUnhealthyNodes and MRLostNodes Amazon CloudWatch metrics for the EMR cluster.

If the MRUnhealthyNodes metric shows an unhealthy node, then insufficient disk space caused the issue.

If the MRLostNodes metric shows a lost node, then a hardware failure caused the node loss, or Amazon EMR can't reach the node because of high CPU or memory usage.
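You can also check these metrics from the command line with the AWS CLI. The following is a minimal sketch; the cluster ID is a placeholder, and the `date -d` syntax assumes GNU date on Linux:

```shell
# Sketch: query the MRUnhealthyNodes metric for the last hour.
# Replace j-XXXXXXXXXXXXX with your cluster ID. To check for lost
# nodes instead, use --metric-name MRLostNodes.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElasticMapReduce \
  --metric-name MRUnhealthyNodes \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Maximum
```

A nonzero Maximum value during the failure window indicates that unhealthy nodes caused the error.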

Add more Amazon EBS capacity for new clusters

To add more Amazon EBS capacity when you launch an Amazon EMR cluster, choose a larger Amazon Elastic Compute Cloud (Amazon EC2) instance type. For more information, see Default EBS storage for instances. You can also modify the volume size or add more volumes when you create the cluster.
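For example, you can specify the EBS volume size and count per instance in the instance-group definition when you create the cluster with the AWS CLI. The instance types, counts, volume sizes, release label, and application list below are illustrative assumptions, not recommendations:

```shell
# Sketch: create a cluster with two 256 GiB gp3 volumes per core node.
# Adjust types, counts, and sizes for your workload.
aws emr create-cluster \
  --name "my-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-groups '[
    {"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","InstanceCount":1},
    {"InstanceGroupType":"CORE","InstanceType":"m5.xlarge","InstanceCount":2,
     "EbsConfiguration":{"EbsBlockDeviceConfigs":[
       {"VolumeSpecification":{"VolumeType":"gp3","SizeInGB":256},
        "VolumesPerInstance":2}]}}]'
```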

Add more core or task nodes for new or running clusters

Choose a larger number of core or task nodes when you launch a new cluster. Or add more core or task nodes to a running cluster.
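To resize a running cluster from the AWS CLI, you can increase the instance count of a core or task instance group. The cluster ID, instance-group ID, and target count below are placeholders:

```shell
# Sketch: find the instance-group IDs for a running cluster, then
# grow one group. Replace the IDs and count with your own values.
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.InstanceGroups[*].[Id,InstanceGroupType,RequestedInstanceCount]'

aws emr modify-instance-groups \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=6
```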

Add more Amazon EBS volumes for running clusters

If larger Amazon EBS volumes don't resolve the issue, then attach more Amazon EBS volumes to a running cluster.

Complete the following steps:

  1. Attach more Amazon EBS volumes to the core and task nodes.

  2. Format and mount the attached volumes. Make sure to use the correct mount point numbering. For example, use /mnt1 or /mnt2 instead of /data.

  3. Use SSH to connect to the node.

  4. Add the path /mnt1/yarn inside the yarn.nodemanager.local-dirs property of /etc/hadoop/conf/yarn-site.xml.
    Example:

    <property>  
        <name>yarn.nodemanager.local-dirs</name>
        <value>/mnt/yarn,/mnt1/yarn</value>
    </property>
  5. To stop the node manager service, run the following command:

    sudo stop hadoop-yarn-nodemanager
  6. To start the node manager service, run the following command:

    sudo start hadoop-yarn-nodemanager
  7. Turn on termination protection.
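On a node, steps 2 through 6 might look like the following sketch. The device name /dev/xvdf and the mount point /mnt1 are assumptions; confirm the device name with lsblk before you format anything:

```shell
# Sketch: format, mount, and register a newly attached EBS volume
# for YARN. Run on each affected core/task node over SSH.
sudo mkfs -t xfs /dev/xvdf          # format the new volume (destroys data!)
sudo mkdir -p /mnt1                 # create the mount point
sudo mount /dev/xvdf /mnt1          # mount the volume
sudo mkdir -p /mnt1/yarn            # directory for NodeManager local data
sudo chown yarn:yarn /mnt1/yarn     # YARN must be able to write here

# Add /mnt1/yarn to yarn.nodemanager.local-dirs in
# /etc/hadoop/conf/yarn-site.xml, then restart the NodeManager:
sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager
```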

If you still have disk space issues, then take the following actions:

  • Remove unnecessary files.
  • Increase the disk utilization threshold from 90% to 99%. To do this, modify the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage property in yarn-site.xml on all nodes. Then, restart the hadoop-yarn-nodemanager service.
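For example, the threshold property might look like the following in the nodes' YARN configuration (the value 99 is the relaxed threshold described above):

```
<property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>99</value>
</property>
```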

Related information

Amazon EMR cluster terminates with NO_SLAVE_LEFT and core nodes FAILED_BY_MASTER

Why does the core node in my Amazon EMR cluster run out of disk space?

AWS OFFICIAL · Updated 5 months ago