Decoding the instance-state log in Amazon EMR

15 minute read
Content level: Advanced

The instance-state log available in Amazon EMR on EC2 provides valuable information for troubleshooting application failures or investigating system details. This article describes the various system-level statistics and metrics collected in the instance-state log and how they help when investigating performance issues.

Introduction

The instance-state log captures operating system statistics every 15 minutes and uploads the data to the S3 log location, organized by cluster ID and EC2 instance ID (S3-Bucket/<EMR-Cluster>/node/<EC2 InstanceID>/daemons/instance-state/instance-state.log-timestamp.gz). The same log files can also be accessed locally on each EMR node under the /emr/instance-state directory via SSH. This automated, regular collection of granular instance health data makes the instance-state log a useful resource when examining performance issues or gathering detailed system information for an EMR cluster.
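
For reference, here is a quick way to pull these snapshots for review (a minimal sketch; the bucket, cluster ID, instance ID, and timestamp placeholders are illustrative and follow the path pattern above):

# List the instance-state snapshots for one node in the S3 log bucket
aws s3 ls s3://<S3-Bucket>/<EMR-Cluster>/node/<EC2-InstanceID>/daemons/instance-state/

# Or, over SSH on the node itself, browse the local copies
ls /emr/instance-state/
zless /emr/instance-state/instance-state.log-<timestamp>.gz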

There are some differences between the instance-state logs from core/task nodes and those from the primary node: the primary node log captures additional information such as heap statistics for the ResourceManager and NameNode components, HDFS disk usage details, and so on. It is important to understand that the instance-state logs provide system snapshots at 15-minute intervals; any activity or metric occurring at shorter intervals will not be captured. For more granular monitoring at the minute or second level, other solutions can be implemented, such as Amazon CloudWatch, Amazon Managed Service for Prometheus and Grafana, or Amazon CodeGuru Profiler. The instance-state logs provide periodic, broad system statistics, while these other tools are better suited for fine-grained or real-time monitoring of EMR cluster performance and processes.

Below is an overview of the various instance statistics captured in the instance-state log.

uptime

First, the log contains information such as the AMI type (for example, AL2023 or AL2) and the system uptime, which indicates how long the EC2 instance has been continuously running. The three load average values show the average number of processes in the run queue over the last 1, 5, and 15 minutes respectively. By examining the uptime data point and comparing it with the load average metrics also captured in the log, you can analyze system load over time.

OS is AL2023: /usr/bin/ruby

# how long have we been up
uptime
 19:21:48 up 58 min,  1 user,  load average: 0.19, 0.39, 0.43
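
The load averages are easier to interpret when compared against the number of vCPUs on the instance; a sustained load above the vCPU count suggests processes are queuing for CPU. A minimal check you can run yourself over SSH (not part of the instance-state snapshot):

# vCPU count on this instance
nproc
# 1-, 5-, and 15-minute load averages (same values reported by uptime)
cat /proc/loadavg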

TOP CPU & Memory

Next, we can see the top CPU-consuming and top memory-consuming processes, as shown below:

#Top CPU users
ps auxwww --sort -%cpu | head -20
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yarn       10275  1.7  3.1 4857880 514376 ?      Sl   18:28   0:55 /usr/lib/jvm/jre-17/bin/java -Dproc_resourcemanager -Djava.net.preferIPv4Stack=true -server -XX:+ExitOnOutOfMemoryError --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
...
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
#Top memory users
ps auxwww --sort -rss | head -20
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yarn       10275  1.7  3.1 4857880 514376 ?      Sl   18:28   0:55 /usr/lib/jvm/jre-17/bin/java -Dproc_resourcemanager -Djava.net.preferIPv4Stack=true -server -XX:+ExitOnOutOfMemoryError --add-exports=jdk.compiler/com.sun.tools.javac.util=ALL-UNNAMED
...
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager

dmesg

Then, we get the tail output of the dmesg command. dmesg messages provide useful diagnostic information about the system and its hardware devices and can be used to troubleshoot hardware issues such as faulty devices, installation problems, and driver failures. Any issue with detecting or communicating with hardware typically leaves traces in the kernel logs shown by dmesg. Checking dmesg is one of the first steps when investigating odd system behavior, kernel panics, or device problems.

# hows the kernel looking
dmesg | tail -n 25
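
Because only the last 25 lines are captured per snapshot, it can be worth running dmesg yourself when you suspect memory pressure. An illustrative check for kernel OOM-killer activity (run manually over SSH; not part of the snapshot):

# Look for out-of-memory kills recorded by the kernel
dmesg -T | grep -iE "out of memory|oom-killer|killed process"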

Tail output of the instance-controller & logpusher logs

The instance-state logs also contain the most recent entries from the instance-controller log at the time of capture. Examining these instance-controller logs allows one to identify any errors that may have occurred in the operations of the instance-controller daemon. Additionally, the logpusher logs are included, which provide visibility into the process responsible for transmitting logs from the EC2 instance to the Amazon S3 storage bucket.

# dump instance controller log
tail -n 100 /emr/instance-controller/log/instance-controller.log

# dump log pusher log
tail -n 100 /emr/logpusher/log/logpusher.log
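
If the 100-line tail captured in the snapshot is not enough, you can search the full logs on the node directly. An illustrative check for recent errors:

# Search the full instance-controller and logpusher logs for errors (run over SSH)
grep -iE "error|exception" /emr/instance-controller/log/instance-controller.log | tail -n 20
grep -iE "error|exception" /emr/logpusher/log/logpusher.log | tail -n 20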

Thread dumps of Java processes

Within the instance-state log, we can refer to the thread dumps of key JVM-based components running on an EMR cluster, including the ResourceManager, NameNode, and Spark executors. The thread dumps are captured at the time of the instance snapshot (every 15 minutes). A thread dump enumerates all active threads running within a Java virtual machine (JVM), along with stack traces and execution details for each thread. Examining these thread dumps allows one to understand the active operations and current internal state of critical cluster services.

The data enables identifying problematic code paths, stalled processes, resource contention issues, and other behavior deviating from normal execution. Common thread states visible in the dumps, such as RUNNABLE, BLOCKED, WAITING, and TIMED_WAITING, provide insight into the health and progress of both individual threads and their enclosing components. Example output appears in the following format:

============== Begin threads dump for process 4406 of hadoop ==============
<thread dump>
============== End threads dump for process 4406 of hadoop ==============


============= Begin threads dump for process 6211 of emr-notebook =============
<thread dump>
============= End threads dump for process 6211 of emr-notebook =============
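
Because the snapshots are 15 minutes apart, you may also want an on-demand thread dump while an issue is in progress. A minimal sketch, assuming the JDK tools (jps, jstack) are available on the node and the commands are run as the process owner (yarn for the ResourceManager, hdfs for the NameNode); the PID placeholder is illustrative:

# Find the JVM PID (for example, the ResourceManager) and take a thread dump
sudo -u yarn jps -l
sudo -u yarn jstack <pid> > /tmp/threads-$(date +%s).txt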

Heap dump output

For EMR cluster primary nodes, the instance-state logs contain additional heap statistics for critical cluster services, including the ResourceManager and NameNode processes. These statistics summarize how much Java virtual machine (JVM) memory each component is using, broken down across the generation pools such as the young generation (eden, from, and to spaces) and the old generation.

ResourceManager heap stats:
Attaching to process ID 24677, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.402-b08

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
...

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 499122176 (476.0MB)
   used     = 223128976 (212.79237365722656MB)
   free     = 275993200 (263.20762634277344MB)
   44.70428018008961% used
From Space:
   capacity = 17301504 (16.5MB)
   used     = 7347352 (7.006980895996094MB)
   free     = 9954152 (9.493019104003906MB)
   42.46655088482481% used
To Space:
   capacity = 17301504 (16.5MB)
   used     = 0 (0.0MB)
   free     = 17301504 (16.5MB)
   0.0% used
PS Old Generation
   capacity = 210239488 (200.5MB)
   used     = 25669712 (24.480545043945312MB)
   free     = 184569776 (176.0194549560547MB)
   12.209748151593672% used

36227 interned Strings occupying 3931112 bytes.
Namenode heap stats:
Attaching to process ID 22132, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.402-b08

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
...

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 304611328 (290.5MB)
   used     = 129299672 (123.30977630615234MB)
   free     = 175311656 (167.19022369384766MB)
   42.447427299880324% used
From Space:
   capacity = 32505856 (31.0MB)
   used     = 9662904 (9.215263366699219MB)
   free     = 22842952 (21.78473663330078MB)
   29.726656021610385% used
To Space:
   capacity = 31457280 (30.0MB)
   used     = 0 (0.0MB)
   free     = 31457280 (30.0MB)
   0.0% used
PS Old Generation
   capacity = 230162432 (219.5MB)
   used     = 47760976 (45.54841613769531MB)
   free     = 182401456 (173.9515838623047MB)
   20.750986850886246% used

By assessing the memory consumption trends and patterns within these statistics, administrators can identify situations indicating potential memory leaks, excessive garbage collection pressure, or other suboptimal JVM memory conditions at that point in time.
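
The per-pool statistics above are the kind of output the JDK's jmap/jhsdb utilities report. If you need a fresh view between snapshots, here is a minimal sketch (run as the process owner; the PID placeholder is illustrative, and the right command depends on the JDK version in use):

# Per-pool heap usage on demand
sudo -u yarn jmap -heap <pid>               # JDK 8
sudo -u yarn jhsdb jmap --heap --pid <pid>  # JDK 9 and later

# Histogram of live objects, useful when a memory leak is suspected
sudo -u yarn jmap -histo:live <pid> | head -20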

HDFS Usage

Further, the Hadoop Distributed File System (HDFS) usage section provides the output of the HDFS report, which presents an overview of the used capacity and available disk space within the HDFS cluster. This report also facilitates the identification of any missing or corrupted HDFS blocks. Note that this information is captured only within the instance-state log of the primary node.

HDFS usage:
Configured Capacity: 62245027840 (57.97 GB)
Present Capacity: 61468073984 (57.25 GB)
DFS Remaining: 61467828224 (57.25 GB)
DFS Used: 245760 (240 KB)
DFS Used%: 0.00%
Replicated Blocks:
	Under replicated blocks: 0
	Blocks with corrupt replicas: 0
	Missing blocks: 0
	Missing blocks (with replication factor 1): 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0

Live datanodes (1):

Name: xx.xx.xx.xx:9866 (ip-xx-xx-xx-xx.ec2.internal)
Hostname: ip-xx-xx-xx-xx.ec2.internal
Decommission Status : Normal
Configured Capacity: 62245027840 (57.97 GB)
DFS Used: 245760 (240 KB)
Non DFS Used: 776953856 (740.96 MB)
DFS Remaining: 61467828224 (57.25 GB)
DFS Used%: 0.00%
DFS Remaining%: 98.75%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Fri Mar 08 14:37:42 UTC 2024
Last Block Report: Fri Mar 08 13:50:56 UTC 2024
Num of Blocks: 1213
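
The report above corresponds to what the HDFS admin tooling produces, so it can be refreshed between snapshots or extended when block issues are suspected. An illustrative example to run on the primary node (assumes the hdfs client is on the PATH):

# Regenerate the HDFS utilization report on demand
hdfs dfsadmin -report

# Check filesystem health and list any corrupt file blocks
hdfs fsck / -list-corruptfileblocks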

traceroute output

The next captured information is the traceroute output. The node's ability to establish connectivity with Amazon S3 (Simple Storage Service) over the network can be verified by analyzing the traceroute output, which provides diagnostic information about the network path and potential routing issues.

# Now traceroute it
traceroute -T --sport=17241 -p 443 -w 3 -n -m 10 elasticmapreduce.s3.amazonaws.com
traceroute to elasticmapreduce.s3.amazonaws.com (xx.xx.xx.xx), 10 hops max, 60 byte packets

The list of recently logged-in users on the node can also be verified:

# listing of last logged in users
last -w -n 25
hadoop   pts/0        xx.xx.xx.ip    Fri Mar  8 13:48   still logged in   

iostat output

Further, we can refer to the iostat output to check for disk I/O bottlenecks, if any. The output of iostat provides various metrics related to CPU utilization, device I/O activity, and other system performance indicators.

# whats io usage look like
iostat -x 1 5
Linux 4.14.320-243.544.amzn2.x86_64 (ip-xx-xx-xx-xx) 	03/08/24 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.80    0.02    2.45    0.33    0.01   87.39

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     4.96    0.14   24.67     3.82   978.89    79.21     0.03    1.59    0.64    1.59   0.25   0.63
nvme2n1           0.00     0.03    0.07    0.30     1.42    49.73   272.84     0.00    3.26    0.52    3.91   0.60   0.02
nvme0n1           0.02     5.40   15.59    5.39   405.11   923.92   126.67     0.03    1.90    0.77    5.17   0.32   0.67
  1. CPU utilization:
  • %user: Percentage of CPU time spent executing user-level processes.
  • %nice: Percentage of CPU time spent executing nice (low-priority) processes.
  • %system: Percentage of CPU time spent executing system (kernel) processes.
  • %iowait: Percentage of time the CPU was idle while waiting for I/O operations to complete.
  • %steal: Percentage of time the virtual CPU waited while the hypervisor serviced other virtual machines.
  • %idle: Percentage of time the CPU was idle and not processing any tasks.
  2. Device I/O statistics (shown in the basic iostat output, without -x):
  • tps (Transfers per second): Number of I/O transfers per second issued to the device.
  • Blk_read/s (Blocks read per second): Number of blocks read from the device per second.
  • Blk_wrtn/s (Blocks written per second): Number of blocks written to the device per second.
  • Blk_read (Total blocks read): Total number of blocks read from the device.
  • Blk_wrtn (Total blocks written): Total number of blocks written to the device.
  3. Extended device statistics (shown with -x, as in the output above):
  • rrqm/s (Merged read requests per second): Number of read requests merged per second.
  • wrqm/s (Merged write requests per second): Number of write requests merged per second.
  • r/s (Read requests per second): Number of read requests issued to the device per second.
  • w/s (Write requests per second): Number of write requests issued to the device per second.
  • rkB/s (Read kilobytes per second): Number of kilobytes read from the device per second.
  • wkB/s (Write kilobytes per second): Number of kilobytes written to the device per second.
  • avgrq-sz (Average request size): Average size of the requests issued to the device.
  • avgqu-sz (Average queue length): Average number of requests waiting to be serviced by the device.
  • await (Average wait time): Average time (in milliseconds) for I/O requests to be served, including time spent in the queue.
  • svctm (Average service time): Average service time (in milliseconds) for I/O requests.
  • %util (Device utilization): Percentage of elapsed time during which the device was actively servicing requests.
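
When the iostat output points at a busy device, it can also help to attribute the I/O to individual processes. A minimal sketch, assuming the sysstat package (which provides iostat) and its pidstat tool are installed on the node:

# Per-process disk I/O, sampled every second for 5 samples
pidstat -d 1 5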

Free memory

Next, we see the result of the "free -m" command. By analyzing the output, you can determine whether the system has enough available memory for running additional processes or whether you need to consider adding more memory.

# whats memory usage look like
free -m
              total        used        free      shared  buff/cache   available
Mem:          15563        5740        4893           0        4929        9503
Swap:             0           0           0
  1. Total: The total amount of physical memory (RAM) available to the instance.
  2. Used: The amount of memory currently being used by processes.
  3. Free: The amount of physical memory that is currently unused and available for new processes or data.
  4. Shared: The amount of memory used by tmpfs and shared memory; it is also reflected in the "Buff/Cache" column.
  5. Buff/Cache: The amount of memory used by the kernel for buffering and caching data. This memory can be reclaimed by the system if needed.
  6. Available: An estimate of how much memory is available for starting new applications without swapping, roughly the sum of "Free" and the reclaimable portion of "Buff/Cache".

vmstat & netstat output

We can review the vmstat and netstat output to understand CPU usage at the user and system level, check memory availability, and see network statistics. For example, if the us (user processes) and sy (system processes) values are consistently high, it may indicate that the system is heavily loaded with processes competing for CPU time. If the wa (I/O wait) value is high, it could indicate an I/O bottleneck or slow disk performance.

By monitoring the free column, you can determine the amount of available memory on the system. If the free value is consistently low, it may indicate that the system is running out of available memory, which could lead to performance issues or potential system instability.

Additionally, the buff and cache columns represent the memory used for buffering and caching data, respectively. This memory can be reclaimed by the system if needed, providing a potential source of additional free memory.

# trend memory
vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 5009972   6244 5041764    0    0   104   491 1126 2086 10  2 87  0  0
 1  1      0 5001312   6244 5044664    0    0  2812    68 11849 22997  8 21 67  4  0
 0  1      0 4999212   6244 5047536    0    0  2788     6 10674 19939 10 16 69  4  0
 2  0      0 4994512   6244 5050300    0    0  2688     0 10163 19464  7 16 73  4  0
 1  0      0 4993344   6244 5052656    0    0  2072     0 12244 23934 17 22 58  3  0

# dump network statistics
netstat -s -e

The netstat -s -e output displays cumulative, per-protocol network statistics for the node. This output provides valuable insight into the system's network activity and can be used for troubleshooting and performance monitoring.
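
For example, when you suspect network problems between nodes or towards S3, an illustrative way to spot TCP retransmissions and resets in these statistics:

# Filter the protocol statistics for retransmissions and resets
netstat -s | grep -iE "retrans|reset"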

Disk utilization

If a running application occupies a significant portion of the attached Amazon EBS volumes, it is advisable to monitor the disk utilization of the node from this output. If the disk usage on a node exceeds the configured threshold (90% by default, controlled by yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage), YARN marks the node as unhealthy.

# amount of disk free
df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.6G     0  7.6G   0% /dev
tmpfs           7.6G     0  7.6G   0% /dev/shm
tmpfs           7.6G  632K  7.6G   1% /run
tmpfs           7.6G     0  7.6G   0% /sys/fs/cgroup
/dev/nvme0n1p1   10G  8.0G  2.1G  80% /
/dev/nvme1n1p1  5.0G   85M  5.0G   2% /emr
/dev/nvme1n1p2   27G  1.3G   26G   5% /mnt
/dev/nvme2n1     32G  263M   32G   1% /mnt1
tmpfs           1.6G     0  1.6G   0% /run/user/995
tmpfs           1.6G     0  1.6G   0% /run/user/0
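
If disk usage does cross the threshold, you can confirm how YARN sees the node from the primary node. An illustrative check (the node ID placeholder comes from the node list output, and the health report typically mentions bad local-dirs or log-dirs):

# List all nodes and their state as reported by YARN
yarn node -list -all

# Detailed status, including the health report, for a specific node
yarn node -status <node-id>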

By analyzing the output below, we can also gain insight into the top 10 largest files or folders contributing to disk usage, including application logs and HDFS blocks. This information helps identify potential areas of concern related to disk usage, ensuring optimal system performance and preventing issues caused by excessive disk utilization.

# Top 10 largest files and folders in /mnt/var/*
du -hsx /mnt/var/* | sort -rh | head -10
516M	/mnt/var/cache
115M	/mnt/var/lib
72M	/mnt/var/log
0	/mnt/var/tmp
0	/mnt/var/run
0	/mnt/var/em

# Top 20 largest files and folders including subdirectories in /mnt/var/log
du -hx /mnt/var/log | sort -rh | head -20
72M	/mnt/var/log
57M	/mnt/var/log/journal/ec2f6eb1863ad90f4a074f23addb33bf
57M	/mnt/var/log/journal

In this article, I have highlighted the most important statistics that the instance-state log captures in EMR, and I hope this analysis serves as a valuable reference when troubleshooting or investigating initial or potential performance-related concerns. As mentioned, for more granular monitoring at the minute or second level, other solutions such as Amazon CloudWatch, Amazon Managed Service for Prometheus, Grafana, or Amazon CodeGuru Profiler can be used.
