How do I publish and monitor an Amazon EMR application status with CloudWatch integration?

I want to integrate Amazon CloudWatch with Amazon EMR to publish and monitor the statuses of applications that I installed on my cluster. I want CloudWatch to alert me when applications are down.

Short description

When you integrate CloudWatch with Amazon EMR, you can track critical statuses for applications that you installed, such as HiveServer2 and YARN ResourceManager. Then, you can publish the statuses to CloudWatch custom metrics and configure alerts for service unavailability. To track additional applications, you can modify the application list as needed.

Resolution

Prerequisites:

  • Amazon EMR version 5.30.0 or later
  • Amazon EMR instance profile role or an AWS Identity and Access Management (IAM) user role with cloudwatch:PutMetricData permissions
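The cloudwatch:PutMetricData permission from the prerequisites could be granted with an inline policy such as the one in the following minimal sketch. The policy name EmrServiceStatusPutMetricData and the temporary file location are hypothetical, and EMR_EC2_DefaultRole is only assumed to be the cluster's instance profile role, so substitute your own role name. The cloudwatch:namespace condition limits the permission to the EMR/ServiceStatus namespace that the example script in this article publishes to.

```shell
#!/bin/bash
# Sketch: write a least-privilege policy document for cloudwatch:PutMetricData.
# The file location and the policy name are illustrative assumptions.
policy_file="$(mktemp)"

cat > "$policy_file" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "cloudwatch:namespace": "EMR/ServiceStatus" }
      }
    }
  ]
}
EOF

echo "Policy written to $policy_file"

# To attach the policy, run a command similar to the following.
# ROLE_NAME is an assumption: use your cluster's instance profile role.
# aws iam put-role-policy \
#     --role-name "${ROLE_NAME:-EMR_EC2_DefaultRole}" \
#     --policy-name EmrServiceStatusPutMetricData \
#     --policy-document "file://$policy_file"
```

The attach command is left commented out so that the sketch changes nothing in your account until you review the role name.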

Create a script to monitor your Amazon EMR applications

The following example script, check_process.sh, monitors YARN ResourceManager and HiveServer2 on the primary node and YARN NodeManager on core and task nodes. To monitor additional applications, modify the services and service_names lists under the # Monitor specific services section in the script.

To configure the following script to include additional applications, see Create bootstrap actions to install additional software with an Amazon EMR cluster.

Example script:

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this
software and associated documentation files (the "Software"), to deal in the Software
without restriction, including without limitation the rights to use, copy, modify,
merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.  
  
#!/bin/bash

# Set up logging
LOG_FILE="/var/log/hadoop/service-monitor-detailed.log"
LOG_STATUS_FILE="/var/log/hadoop/service-monitor-status.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
CLUSTERID=$(jq -r ".jobFlowId" < /emr/instance-controller/lib/info/extraInstanceData.json)
INSTANCEID=$(ec2-metadata -i | cut -d " " -f 2)
HOSTIP=$(hostname -i)
NODETYPE=$(jq -r '.instanceRole' /mnt/var/lib/instance-controller/extraInstanceData.json | awk '{print toupper(substr($0,1,1)) tolower(substr($0,2))}')

# Function to log messages
log_message() {
    echo "$TIMESTAMP - $1" >> "$LOG_FILE"
    echo "$TIMESTAMP - $1"
}

log_status_message() {
    echo "$TIMESTAMP - $1" >> "$LOG_STATUS_FILE"
}

# Function to send metric to CloudWatch
send_to_cloudwatch() {
    local host_ip=$1
    local service_name=$2
    local status=$3

    aws cloudwatch put-metric-data \
        --namespace "EMR/ServiceStatus" \
        --metric-name "ServiceStatus" \
        --value "$status" \
        --unit "Count" \
        --dimensions "ClusterId=$CLUSTERID,NodeServiceName=$service_name,InstanceId=$INSTANCEID,NodeType=$NODETYPE" \
        --timestamp "$(date -u +"%Y-%m-%dT%H:%M:%SZ")" \
        --region "${AWS_REGION:-us-east-1}" || {
            log_message "ERROR: Failed to send metric for service $service_name"
            return 1
        }

    log_message "Successfully sent metric for service: $service_name (Status: $status)"
}

# Create log file if it doesn't exist
touch "$LOG_FILE"
touch "$LOG_STATUS_FILE"

log_message "Starting service monitoring..."

# Monitor specific services
services=(
    "hive-server2"
    "hadoop-yarn-resourcemanager"
    "hadoop-yarn-nodemanager"
)
service_names=(
    "HiveServer2"
    "YARN-ResourceManager"
    "YARN-NodeManager"
)

for i in "${!services[@]}"; do

    # Check if service is disabled as not all services are running on all nodes
    if systemctl is-enabled "${services[$i]}" 2>/dev/null | grep -q "disabled"; then
        log_message "$CLUSTERID $INSTANCEID $HOSTIP $NODETYPE ${service_names[$i]}-Status DISABLED (ignored)"
        continue
    fi
    
    # Get service status
    status_output=$(systemctl status "${services[$i]}" 2>/dev/null)

    # Extract the process status
    process_status=$(echo "$status_output" | grep "Active:" | sed -E 's/Active: ([^ ]+) .*/\1/' | xargs)

    # Log message
    log_message "$CLUSTERID $INSTANCEID $HOSTIP $NODETYPE ${service_names[$i]}-Status $process_status"
    log_status_message "$CLUSTERID $INSTANCEID $HOSTIP $NODETYPE ${service_names[$i]}-Status $process_status"

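    # Convert status to a numeric value for CloudWatch: 0 = active, 1 = not active
    status_value=0
    if [ "$process_status" != "active" ]; then
        status_value=1
        # Publish a data point only when the service isn't active;
        # healthy services send nothing to CloudWatch
        send_to_cloudwatch "$HOSTIP" "${service_names[$i]}" "$status_value"
    fi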

done

log_message "Service monitoring completed."

exit 0

Important: Before you run the script in a production environment, it's a best practice to test the script in a test environment.

The preceding script publishes custom metrics to CloudWatch. AWS prorates all custom metric charges by the hour and meters them only when the script sends the metrics to CloudWatch. For more information, see Amazon CloudWatch pricing.
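To monitor an additional application, the two lists in the # Monitor specific services section of check_process.sh can be extended in parallel, as in the following sketch. The spark-history-server unit name and the Spark-HistoryServer display name are assumptions for illustration only. Confirm the actual systemd unit name on a cluster node before you use it.

```shell
#!/bin/bash
# Sketch: the two arrays from check_process.sh with one extra application.
# "spark-history-server" is an assumed systemd unit name; confirm it on a node
# with: systemctl list-unit-files | grep -i spark

# systemd unit names that the script passes to systemctl
services=(
    "hive-server2"
    "hadoop-yarn-resourcemanager"
    "hadoop-yarn-nodemanager"
    "spark-history-server"
)
# Matching display names, used for the NodeServiceName dimension and the logs
service_names=(
    "HiveServer2"
    "YARN-ResourceManager"
    "YARN-NodeManager"
    "Spark-HistoryServer"
)

# The script pairs entries by index, so both arrays must have the same length.
if [ "${#services[@]}" -ne "${#service_names[@]}" ]; then
    echo "ERROR: services and service_names differ in length" >&2
    exit 1
fi

# Print each unit name next to its display name as a quick check
for i in "${!services[@]}"; do
    echo "${services[$i]} -> ${service_names[$i]}"
done
```

Because the loop in the script skips units that systemctl reports as disabled, the same lists can be deployed to primary, core, and task nodes.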

Configure service monitoring on your Amazon EMR cluster

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

To implement automated service monitoring, use a bootstrap action script.

Complete the following steps:

  1. To prepare the script, run the following cp AWS CLI command to upload the script to an Amazon Simple Storage Service (Amazon S3) bucket that your Amazon EMR cluster can access:

    aws s3 cp check_process.sh s3://your-bucket/monitoring/check_process.sh
  2. To copy the script to each cluster node and use crontab to schedule the script, create a bootstrap action script that's similar to the following example:

    #!/bin/bash
    
    # Copy monitoring script from S3
    aws s3 cp s3://your-bucket/monitoring/check_process.sh /home/hadoop/
    chmod +x /home/hadoop/check_process.sh
    
    # Add to crontab
    (crontab -l 2>/dev/null; echo "*/5 * * * * /home/hadoop/check_process.sh") | crontab -  

    Note: Modify the crontab schedule to meet your requirements. The preceding example runs the script every 5 minutes.

  3. Add the bootstrap action script to the Amazon EMR cluster configuration file.
    Note: The cluster's instance profile role must have the minimum required permissions to read the scripts from the S3 bucket.

  4. Launch the cluster.
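The crontab line in step 2 appends a new entry each time that it runs, so running the bootstrap commands on a node more than once would schedule the monitor twice. The following is a minimal sketch of one way to avoid that. The add_cron_entry helper is hypothetical and isn't part of the original scripts. It reads the existing crontab on standard input and prints it with the entry present exactly once.

```shell
#!/bin/bash
# Sketch: add a cron entry only if an identical line isn't already present.
# add_cron_entry is a hypothetical helper, not part of the original script.

CRON_ENTRY="*/5 * * * * /home/hadoop/check_process.sh"

# Reads an existing crontab on stdin and prints it with the entry
# appended only when no identical line exists.
add_cron_entry() {
    local entry=$1
    local current
    current=$(cat)
    if [ -n "$current" ]; then
        printf '%s\n' "$current"
    fi
    # -F: fixed string (the * characters are not a pattern); -x: whole line
    if ! printf '%s\n' "$current" | grep -qxF -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# On a cluster node, the helper would be used like this:
#   crontab -l 2>/dev/null | add_cron_entry "$CRON_ENTRY" | crontab -
```

The helper only transforms text, so you can check its output on any machine before you pipe it into crontab on a cluster node.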

After you launch the cluster, run the following command on your cluster nodes to confirm that Amazon EMR correctly copied the script:

ls -l /home/hadoop/check_process.sh

To confirm that you correctly configured crontab, run the following command on the cluster nodes:

crontab -l

Review the logs

The script generates detailed logs and status logs on cluster nodes. To verify that the script works correctly, review both logs.

Detailed logs

The /var/log/hadoop/service-monitor-detailed.log file provides comprehensive logs with timestamps, cluster ID, instance ID, host IP address, node type, and service status.

Example file:

2025-05-06 23:07:01 - Starting service monitoring...
2025-05-06 23:07:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master HiveServer2-Status inactive
2025-05-06 23:07:01 - Successfully sent metric for service: HiveServer2 (Status: 1)
2025-05-06 23:07:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master YARN-ResourceManager-Status active
2025-05-06 23:07:01 - Service monitoring completed.

Status logs

The /var/log/hadoop/service-monitor-status.log file contains only the service status records, without the start, completion, and metric delivery messages.

Example file:

2025-05-06 23:07:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master HiveServer2-Status inactive
2025-05-06 23:07:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master YARN-ResourceManager-Status active
2025-05-06 23:08:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master HiveServer2-Status inactive
2025-05-06 23:08:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master YARN-ResourceManager-Status failed
2025-05-06 23:09:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master HiveServer2-Status inactive
2025-05-06 23:09:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master YARN-ResourceManager-Status failed
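To get a quick overview of the status log, the records can be summarized per service. The following sketch counts the records whose status isn't active. The summarize_status_log helper is hypothetical, and the field positions are assumed from the example records above. The demo runs against a temporary copy of those example records.

```shell
#!/bin/bash
# Sketch: count non-active records per service in the status log.
# summarize_status_log is a hypothetical helper name.
# Assumed fields: date time - clusterId instanceId hostIp nodeType service-Status status

summarize_status_log() {
    local log_file=$1
    # A record with no status field (NF == 8) is also counted as not active,
    # which matches how check_process.sh treats an empty status.
    awk '
        NF >= 8 && $9 != "active" { down[$8]++ }
        END { for (s in down) printf "%s %d\n", s, down[s] }
    ' "$log_file" | sort
}

# Demo with the example records from this article, written to a temporary file
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
2025-05-06 23:07:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master HiveServer2-Status inactive
2025-05-06 23:07:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master YARN-ResourceManager-Status active
2025-05-06 23:08:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master HiveServer2-Status inactive
2025-05-06 23:08:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master YARN-ResourceManager-Status failed
2025-05-06 23:09:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master HiveServer2-Status inactive
2025-05-06 23:09:01 - j-1O1234567890 i-0a6871234567890 111.xx.xx.92 Master YARN-ResourceManager-Status failed
EOF

summarize_status_log "$sample_log"
```

On a cluster node, pass the real log path instead: summarize_status_log /var/log/hadoop/service-monitor-status.log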

Use CloudWatch to monitor application metrics

The script sends metrics to CloudWatch when an application is down.

To monitor the metrics, complete the following steps:

  1. Open the CloudWatch console.
  2. In the navigation pane, under Metrics, choose All metrics.
  3. Under Metrics, choose EMR/ServiceStatus, and then select the ServiceStatus metric.
  4. Filter the metrics by the available dimensions: ClusterId, InstanceId, NodeServiceName, and NodeType.
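Because the script publishes a data point with a value of 1 only when a service isn't active, an alarm on this metric can treat missing data as healthy. The following is a minimal sketch under stated assumptions, not a definitive configuration. The alarm name, the cluster ID and instance ID (reused from the example logs), and the Amazon SNS topic ARN are placeholders. The dimensions must match all four dimensions that the script publishes. By default, the sketch only prints the command. To run the command, set DRY_RUN=0.

```shell
#!/bin/bash
# Sketch: alarm when HiveServer2 on one instance reports "down" (value 1).
# All identifiers below are placeholders; replace them with your own values.
CLUSTER_ID="${CLUSTER_ID:-j-1O1234567890}"
INSTANCE_ID="${INSTANCE_ID:-i-0a6871234567890}"
SNS_TOPIC_ARN="${SNS_TOPIC_ARN:-arn:aws:sns:us-east-1:111122223333:emr-service-alerts}"

alarm_args=(
    --alarm-name "emr-${CLUSTER_ID}-HiveServer2-down"
    --namespace "EMR/ServiceStatus"
    --metric-name "ServiceStatus"
    --dimensions
        "Name=ClusterId,Value=${CLUSTER_ID}"
        "Name=NodeServiceName,Value=HiveServer2"
        "Name=InstanceId,Value=${INSTANCE_ID}"
        "Name=NodeType,Value=Master"
    --statistic Maximum
    --period 300
    --evaluation-periods 1
    --threshold 1
    --comparison-operator GreaterThanOrEqualToThreshold
    # No data point means the service was active, so missing data is healthy
    --treat-missing-data notBreaching
    --alarm-actions "$SNS_TOPIC_ARN"
)

if [ "${DRY_RUN:-1}" = "0" ]; then
    aws cloudwatch put-metric-alarm "${alarm_args[@]}"
else
    # Print the command instead of running it
    printf 'aws cloudwatch put-metric-alarm'
    printf ' %q' "${alarm_args[@]}"
    printf '\n'
fi
```

The 300-second period matches the 5-minute crontab schedule from the bootstrap example. If you change the schedule, adjust the period to match.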

Related information

Create a CloudWatch alarm based on a static threshold

View and restart Amazon EMR and application processes (daemons)

AWS OFFICIAL
Updated 18 days ago