EMR cluster suddenly stopped showing CloudWatch metrics

I have an EMR cluster running for MWAA, and it runs perfectly fine. But two days ago it stopped showing any metrics on the Monitoring tab, as well as CloudWatch metrics on the console. It doesn't seem to be a configuration problem, because it was fine before. Even though the CloudWatch metrics don't show anything, the EMR cluster runs the tasks just fine. Any idea why this might be happening? How can I check and fix it?

1 Answer

Hello,

There might be an issue with the metrics-collector daemon running on the primary node. Please check whether the process is running on the primary node and what its status is.

ps -ef | grep metrics-collector

-or-

sudo systemctl status metricscollector.service

If it is down, please start the service again. If the primary node runs out of memory, CPU, or disk capacity, the daemon might fail to collect the metric data. Please check these resources using the commands below:

free -m
ps auxwww --sort -%cpu | head -20
df -h
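
The checks above can be combined into one small script. This is only a sketch: the function names and the 90% disk threshold are illustrative assumptions, and the unit name is the one used earlier in this answer.

```shell
#!/usr/bin/env bash
# Sketch: health check for the EMR metrics collector on the primary node.
# Function names and the 90% threshold are assumptions for illustration.

SERVICE="metricscollector.service"   # unit name from the answer above
DISK_LIMIT=90                        # assumed threshold, percent used

# Reads `df -P` output on stdin and prints "<mount> <use%>" for every
# filesystem whose usage is above the limit passed as $1.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 > limit) print $6 " " $5
  }'
}

# Starts the collector if it is not active, then reports memory and
# any filesystems above the threshold.
check_collector() {
  if systemctl is-active --quiet "$SERVICE"; then
    echo "$SERVICE is running"
  else
    echo "$SERVICE is not running; starting it"
    sudo systemctl start "$SERVICE"
  fi
  free -m
  df -P | full_filesystems "$DISK_LIMIT"
}
```

After sourcing the snippet on the primary node, run check_collector. No output from the last step means no filesystem is above the threshold.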

Besides, you can also check the instance-controller log and the instance-state log on the primary node to see whether any issue is blocking this daemon. Please refer to the EMR log locations.
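
As a rough sketch of that log check, the snippet below searches for collector-related lines. The two directories are an assumption about where these logs usually live on the primary node, so please confirm them against the EMR log locations for your release. The function name is hypothetical, and rotated logs that have been compressed will not be searched.

```shell
#!/usr/bin/env bash
# Assumed log directories on the primary node; verify for your EMR release.
LOG_DIRS="/emr/instance-controller/log /emr/instance-state"

# Prints the last 50 lines that mention the metrics collector in the
# given files or directories (searched recursively, case-insensitive).
collector_log_lines() {
  grep -riE 'metrics-?collector' "$@" 2>/dev/null | tail -n 50
}
```

On the primary node this could be run as collector_log_lines $LOG_DIRS (left unquoted on purpose so that both directories are passed as separate arguments).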

AWS
SUPPORT ENGINEER
answered a year ago
  • ps -ef | grep metrics-collector
    root     18501  7371  0 19:41 pts/0    00:00:00 grep --color=auto metrics-collector
    root     29193     1  2 Sep18 ?        09:07:59 /usr/bin/java -Xmx1024m -Xms300m -XX:OnOutOfMemoryError=kill -9 %p -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/emr-metrics-collector/lib/*:/home/hadoop/conf -Dlog4j.defaultInitOverride emr.metricscollector.Main
    

    and

    sudo systemctl status metricscollector.service
    ● metricscollector.service - EMR metrics collector daemon
       Loaded: loaded (/etc/systemd/system/metricscollector.service; static; vendor preset: disabled)
       Active: active (running) since Mon 2023-09-18 16:55:47 UTC; 2 weeks 3 days ago
      Process: 29128 ExecStart=/usr/bin/metricscollector (code=exited, status=0/SUCCESS)
     Main PID: 29193 (java)
        Tasks: 75
       Memory: 586.4M
       CGroup: /system.slice/metricscollector.service
               └─29193 /usr/bin/java -Xmx1024m -Xms300m -XX:OnOutOfMemoryError=kill -9 %p -XX:MinHeapFreeRatio=10 -server -cp /usr/share/aws/emr/emr-metrics-collector/lib/*:/home/hadoop/conf...
    
    Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
    
  • I restarted the service with the commands you mentioned above, but that does not seem to fix the problem; I am still not seeing any metrics.

  • There should be some clue in the instance-controller log and the instance-state log on the primary node. Please check the metricscollector-related entries in those logs. Also check whether any IAM permissions in the service role have changed, as it is possible that the metric data is not reaching CloudWatch because of a permission issue.
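
One way to approach the CloudWatch and service-role check from a machine with the AWS CLI configured is sketched below. The cluster ID is a placeholder and the function names are hypothetical. The sketch assumes the cluster's metrics are published in the AWS/ElasticMapReduce namespace under the JobFlowId dimension.

```shell
#!/usr/bin/env bash
# Placeholder cluster ID; replace it with your own.
CLUSTER_ID="j-XXXXXXXXXXXXX"

# Builds the CloudWatch dimension filter for one cluster.
emr_dimension() {
  printf 'Name=JobFlowId,Value=%s' "$1"
}

# Lists the cluster's metrics that received data in the last three hours.
list_recent_emr_metrics() {
  aws cloudwatch list-metrics \
    --namespace AWS/ElasticMapReduce \
    --dimensions "$(emr_dimension "$1")" \
    --recently-active PT3H
}

# Prints the service role attached to the cluster so that its
# policies can be reviewed in IAM.
show_service_role() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.ServiceRole' --output text
}
```

If list_recent_emr_metrics "$CLUSTER_ID" returns an empty list while the collector is running, no data has reached CloudWatch recently, and the role printed by show_service_role "$CLUSTER_ID" is a reasonable place to look for changed permissions.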
