
How do I turn off Safemode for the NameNode service on my Amazon EMR cluster?


The NameNode service goes into Safemode when I try to run an Apache Hadoop or Apache Spark job on an Amazon EMR cluster. I turned off Safemode, but it immediately comes back on.

Short description

When you run an Apache Hadoop or Apache Spark job on an Amazon EMR cluster, you might receive one of the following error messages:

  • "Cannot create file/user/test.txt._COPYING_. Name node is in safe mode."
  • "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /user/hadoop/.sparkStaging/application_15########_0001. Name node is in safe mode. It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:ip-###-##-##-##.ec2.internal"

After the DataNodes report that most file system blocks are available, the NameNode automatically leaves Safemode. However, the NameNode might enter Safemode again for the following reasons:

  • Available space is less than the amount of space that's required for the NameNode storage directory.
  • The NameNode can't load the FsImage and EditLog into memory.
  • The NameNode didn't receive the block report from the DataNode.
  • Some nodes in the cluster might be down and the blocks on the nodes become unavailable.
  • Some blocks might be corrupt.

Resolution

Important: In some cases, you might have data loss when you manually turn off Safemode.

To manually turn off Safemode, run the following command:

sudo -u hdfs hdfs dfsadmin -safemode leave

If Safemode automatically turns back on, then check the NameNode log at /var/log/hadoop-hdfs/ to determine the root cause. After you determine the cause, use the following troubleshooting steps to resolve the issue.
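To narrow down the root cause quickly, you can filter the NameNode log for Safemode-related messages. The following is a minimal sketch; the helper name is hypothetical, and the exact log file name depends on your cluster (on EMR it usually looks like /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log):

```shell
# Sketch: scan a NameNode log for Safemode-related messages.
scan_safemode_events() {
  log_file="$1"
  # Case-insensitive match on "safe mode"/"safemode" messages and the
  # NameNodeResourceChecker warnings shown later in this article;
  # print only the most recent 20 hits.
  grep -ihE 'safe ?mode|NameNodeResourceChecker' "$log_file" | tail -20
}

# On the primary node you would run something like:
# scan_safemode_events /var/log/hadoop-hdfs/hadoop-hdfs-namenode-*.log
```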

Switch to a cluster with multiple primary nodes

The checkpoint process isn't automatic in clusters with a single primary node, so Hadoop Distributed File System (HDFS) doesn't periodically merge edit logs into a new snapshot (FsImage) and remove them. If your cluster has only one primary node, then edit logs might use all the disk space in /mnt. To resolve this issue, launch a cluster with multiple primary nodes. Clusters with multiple primary nodes support high availability for the HDFS NameNode.

Remove unnecessary files from /mnt

The dfs.namenode.resource.du.reserved parameter specifies the minimum disk space that must remain available for the NameNode storage directory on /mnt. The default value is 100 MB (104857600 bytes). When the available disk space on /mnt drops below this threshold, the NameNode enters Safemode. While Safemode is on, the NameNode blocks all modifications to the file system. To resolve this issue, remove unnecessary files from /mnt.
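If freeing space alone isn't enough, the threshold itself is configurable in hdfs-site.xml. A minimal sketch follows; the 1 GB value is only an example, not a recommendation:

```xml
<!-- hdfs-site.xml: reserve 1 GB (value is in bytes) of free space for the
     NameNode storage directories. The default is 104857600 (100 MB). -->
<property>
  <name>dfs.namenode.resource.du.reserved</name>
  <value>1073741824</value>
</property>
```

On Amazon EMR, you would typically apply this through the hdfs-site configuration classification at cluster launch rather than by editing the file directly.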

When available disk space first drops below the reserved amount, the logs look similar to the following example:

2020-08-28 19:14:43,540 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker (org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@5baaae4c): Space available on volume '/dev/xvdb2' is 76546048, which is below the configured reserved amount 104857600

If the NameNode is already in Safemode because of low disk space, then the logs look similar to the following example:

2020-09-28 19:14:43,540 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem (org.apache.hadoop.hdfs.server.namenode.FSNamesystem$NameNodeResourceMonitor@5baaae4c): NameNode low on available disk space. Already in safe mode.

To remove unnecessary files, complete the following steps:

  1. Use SSH to connect to the primary node.

  2. To confirm that the NameNode is still in Safemode, run the following command:

    [root@ip-###-##-##-### mnt]# hdfs dfsadmin -safemode get

    Example output:

    Safe mode is ON
  3. Delete unnecessary files from /mnt.
    Note: If the /mnt/namenode/current directory uses a large amount of space on a cluster with one primary node, then first create a new snapshot (FsImage). Then, remove the old edit logs.

  4. Check the amount of available disk space in /mnt. If the available space is more than 100 MB, then run the following command to check the status of Safemode again:

    [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode get

    Example output:

    Safe mode is ON
  5. Run the following command to turn off Safemode:

    [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode leave

    Example output:

    Safe mode is OFF
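The space check in step 4 can be scripted so that you leave Safemode only when /mnt is actually above the threshold. The following is a minimal sketch; the function name is hypothetical, and the 100 MB default mirrors dfs.namenode.resource.du.reserved:

```shell
# Sketch: report whether a mount point has more free space than the
# reserved threshold (100 MB by default) before leaving Safemode.
check_free_space() {
  mount_point="$1"
  reserved_mb="${2:-100}"
  # df -P gives stable POSIX output; -m reports megabytes; the 4th
  # column of the second line is the available space.
  avail_mb=$(df -Pm "$mount_point" | awk 'NR==2 {print $4}')
  if [ "$avail_mb" -gt "$reserved_mb" ]; then
    echo "OK: ${avail_mb} MB free on ${mount_point}; safe to run: sudo -u hdfs hdfs dfsadmin -safemode leave"
  else
    echo "LOW: only ${avail_mb} MB free on ${mount_point}; remove more files first"
  fi
}

# On the primary node you would call:
# check_free_space /mnt
```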

If /mnt still has less than 100 MB of available space, then perform one or more of the following actions:

Remove more files

Complete the following steps:

  1. Use SSH to connect to the primary node.

  2. Run the following command to navigate to the /mnt directory:

    cd /mnt
  3. Run the following command to determine the folders that use the most disk space:

    sudo du -hsx * | sort -rh | head -10
  4. Run the following command to check the largest subfolders within the folders that use the most disk space:

    cd /mnt/var
    sudo du -hsx * | sort -rh | head -10

    Note: The preceding command checks for the largest subfolders in the /mnt/var folder. To check another folder, replace /mnt/var with the folder that you want to check.

  5. Delete the largest files first. Make sure that you delete only files that you no longer need. The Amazon Simple Storage Service (Amazon S3) logging bucket already stores backup copies of compressed log files from /mnt/var/log/hadoop-hdfs/ and /mnt/var/log/hadoop-yarn/. You can safely delete these log files.

  6. After you delete the unnecessary files, run the following command to check the status of Safemode again:

    [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode get

    Example output:

    Safe mode is ON
  7. Run the following command to turn off Safemode:

    [hadoop@ip-###-##-##-### ~]$ hdfs dfsadmin -safemode leave

    Example output:

    Safe mode is OFF
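The drill-down in steps 2-4 can be wrapped in a small helper so that you can point it at any directory. This is a sketch; the function name is hypothetical:

```shell
# Sketch: list the largest entries directly under a directory,
# biggest first, staying on one file system (-x) like the steps above.
largest_dirs() {
  dir="$1"
  count="${2:-10}"
  # -h human-readable sizes, -s summarize each entry, -x one file system
  (cd "$dir" && du -hsx -- * 2>/dev/null | sort -rh | head -"$count")
}

# Example on the primary node:
# largest_dirs /mnt
# largest_dirs /mnt/var
```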

Check for corrupt or missing blocks and files

Complete the following steps:

  1. To check the health of the cluster, run the following command:
    hdfs fsck /
    Note: The output report also provides the percentage of under-replicated blocks and a count of missing replicas.
  2. To locate the DataNode for each block of the file, run the following command for each file in the list:
    hdfs fsck example_file_name -locations -blocks -files
    Note: Replace example_file_name with your file name.
    Example output:
    0. BP-762523015-192.168.0.2-1480061879099:blk_1073741830_1006 len=134217728 MISSING!
    1. BP-762523015-192.168.0.2-1480061879099:blk_1073741831_1007 len=134217728 MISSING!
    2. BP-762523015-192.168.0.2-1480061879099:blk_1073741832_1008 len=70846464 MISSING!
    The preceding example output shows that the 192.168.0.2 DataNode stores the block. You can check the DataNode's logs for errors that are related to the specific block ID (blk_##).
    Note: Missing blocks often occur because nodes terminate unexpectedly.
  3. To delete the corrupted files, exit Safemode and then run the following command:
    hdfs dfs -rm example_file_name
    Note: Replace example_file_name with your file name.

Use CloudWatch metrics to monitor the health of HDFS

Use the following Amazon CloudWatch metrics to identify why the NameNode enters Safemode:

  • To identify the percentage of HDFS storage that’s used, review HDFSUtilization.
  • To identify the number of blocks where HDFS has no replicas, review MissingBlocks. These might be corrupt blocks.
  • To identify the number of blocks that need replication, review UnderReplicatedBlocks.
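You can also pull these metrics with the AWS CLI. The sketch below only builds and prints the commands; the cluster ID and time window are placeholders. EMR publishes cluster metrics in the AWS/ElasticMapReduce namespace under the JobFlowId dimension:

```shell
# Sketch (not run against AWS here): build aws cloudwatch queries for the
# three EMR HDFS metrics discussed above.
CLUSTER_ID="j-XXXXXXXXXXXXX"          # placeholder cluster ID
START="2024-01-01T00:00:00Z"          # placeholder time window
END="2024-01-01T01:00:00Z"

metric_query() {
  printf 'aws cloudwatch get-metric-statistics --namespace AWS/ElasticMapReduce --metric-name %s --dimensions Name=JobFlowId,Value=%s --statistics Maximum --period 300 --start-time %s --end-time %s\n' \
    "$1" "$CLUSTER_ID" "$START" "$END"
}

for m in HDFSUtilization MissingBlocks UnderReplicatedBlocks; do
  metric_query "$m"
done
```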

Related information

HDFS Users Guide on the Apache Hadoop website

AWS OFFICIAL · Updated 4 months ago