How do I send NVIDIA GPU metrics from my EC2 Linux instances to CloudWatch?


I want to send NVIDIA GPU metrics from Linux Amazon Elastic Compute Cloud (Amazon EC2) instances to Amazon CloudWatch.

Short description

Use the CloudWatch agent to collect NVIDIA GPU metrics for your Amazon EC2 Linux instances. Add the nvidia_gpu field inside the metrics_collected section in the CloudWatch agent configuration file. For more information, see Collect NVIDIA GPU metrics.

The instance must have an NVIDIA driver installed. For more information, see Recommended GPU instances. NVIDIA drivers are preinstalled on some Amazon Machine Images (AMIs). If the instance doesn't have an NVIDIA driver, then manually install one. You can download a public NVIDIA driver, download a driver from Amazon Simple Storage Service (Amazon S3), or launch from an AMI that has the driver preinstalled. For more information, see Install NVIDIA drivers on Linux instances.
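For example, before you configure the agent, you can check whether a driver is already present on the instance. The following shell commands are a minimal sketch. The S3 bucket shown applies to the GRID driver that AWS publishes; adjust the download method for your driver type.

    # Check whether an NVIDIA driver is already installed.
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Print the installed driver version.
        nvidia-smi --query-gpu=driver_version --format=csv,noheader
    else
        echo "No NVIDIA driver found. Install a driver before you enable GPU metrics."
        # Example download of the AWS-provided GRID driver from Amazon S3 (requires the AWS CLI):
        # aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
    fi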

Resolution

Download the CloudWatch agent and create an IAM role

Complete the following steps:

  1. Download and configure the CloudWatch agent for your EC2 instances.
  2. Install the CloudWatch agent (an example install command follows this list).
  3. Verify that your instances have outbound internet access to send data to CloudWatch.
  4. Create an AWS Identity and Access Management (IAM) role to use with the CloudWatch agent.

Note: The IAM role must have the AmazonS3ReadOnlyAccess AWS managed policy attached.
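For example, on Amazon Linux 2 or Amazon Linux 2023, you can install the agent from the distribution's package repository. The following commands are a minimal sketch; package names and download URLs vary by operating system and architecture.

    # Install the CloudWatch agent from the Amazon Linux repository.
    sudo yum install -y amazon-cloudwatch-agent

    # On other distributions, download the package that AWS publishes, for example (Ubuntu x86_64):
    # wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
    # sudo dpkg -i ./amazon-cloudwatch-agent.deb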

Create or edit the CloudWatch agent configuration file and start the agent

Complete the following steps:

  1. Manually create or edit the CloudWatch agent configuration file. Make sure that you specify the GPU metrics that you want to collect in the nvidia_gpu field under the metrics_collected section.
    Example CloudWatch agent configuration file:

    {
        "agent": {
            "metrics_collection_interval": 60,
            "run_as_user": "root"
        },
        "metrics": {
            "metrics_collected": {
                "nvidia_gpu": {
                    "measurement": [
                        "utilization_gpu",
                        "memory_total",
                        "memory_used",
                        "memory_free"
                    ]
                }
            }
        }
    }
  2. Run the following command to start the CloudWatch agent from the command line:

     sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:configuration-file-path -s

    Note: Replace configuration-file-path with your configuration file path.
    If the CloudWatch agent fails to start and you receive the following error message, then the agent can't locate the nvidia-smi file:
    "[telegraf] Error running agent: validate input plugin nvidia_smi failed because of Cannot get file's stat /usr/bin/nvidia-smi: no such file or directory"

  3. To verify that the NVIDIA driver is installed correctly, run the following command:

    [ec2-user ~]$ nvidia-smi -q | head

The output lists the installed version of the NVIDIA driver and details about the GPUs.

If the NVIDIA driver isn't installed correctly, then reinstall the driver for your EC2 Linux instance type.
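After the agent starts, you can also confirm that it's running and that GPU metrics reach CloudWatch. The following commands are a minimal sketch that assumes the default setup, where the agent publishes metrics under the CWAgent namespace.

    # Check the status of the CloudWatch agent.
    sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status

    # Review the agent log if metrics don't appear.
    sudo tail -n 50 /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log

    # List metrics in the CWAgent namespace (requires the AWS CLI). GPU metrics
    # typically appear with an nvidia_smi_ prefix, such as nvidia_smi_utilization_gpu.
    aws cloudwatch list-metrics --namespace CWAgent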

Related information

How do I troubleshoot Xid errors on my NVIDIA GPU-accelerated EC2 Linux instance?

Amazon CloudWatch agent adds support for NVIDIA GPU metrics
