
How to monitor and collect GPU metrics for Windows EC2 instances using Amazon CloudWatch?


Hi, are there ways to collect GPU metrics on EC2 Windows instances and send them to Amazon CloudWatch for monitoring? NVIDIA GPU performance metrics are documented for Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing instances running Linux, but I could not find clear documentation for instances running Windows.

1 Answer

There are two ways to collect GPU metrics on Amazon EC2 Windows instances and push them to Amazon CloudWatch for monitoring: (1) Amazon CloudWatch custom metrics, and (2) Telegraf, a third-party tool.

The following sections describe both methods. The key difference is scope: the CloudWatch custom metrics script below publishes a small set of GPU and GPU-memory metrics (temperature, memory used and free, and utilization), while Telegraf can expose additional metrics such as memory clock speeds, encoder statistics, and more. Which tool to use therefore depends largely on the requirements of your use case.

Method 1: Using CloudWatch Custom Metrics

Prerequisites:

  • AWS CLI
  • NVIDIA driver
  • AWS Tools for PowerShell modules
  • An IAM role attached to the instance that allows cloudwatch:PutMetricData
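Before scheduling anything, it can help to confirm that the prerequisites are actually present on the instance. The following is a minimal sketch, not part of the original answer: it assumes nvidia-smi lives at the path used by the script in this method (newer drivers may install it elsewhere), and the variable names $nvidiaSmi and $checks are chosen here for illustration only.

```powershell
# Sketch: check the Method 1 prerequisites and print a short report.
# Assumption: nvidia-smi is at the path used by the collection script; adjust if needed.
$nvidiaSmi = 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'

$checks = [ordered]@{
    'nvidia-smi present'      = Test-Path -Path $nvidiaSmi
    'AWS CLI on PATH'         = [bool](Get-Command -Name 'aws' -ErrorAction SilentlyContinue)
    'Get-EC2InstanceMetadata' = [bool](Get-Command -Name 'Get-EC2InstanceMetadata' -ErrorAction SilentlyContinue)
}

# Print one line per prerequisite
foreach ($name in $checks.Keys) {
    $status = if ($checks[$name]) { 'OK' } else { 'MISSING' }
    Write-Host ('{0,-26} {1}' -f $name, $status)
}
```

This only checks that the tools can be found; it does not verify that the instance role is allowed to call cloudwatch:PutMetricData.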

The following script registers a Windows scheduled task that runs every minute, collects the metrics, and pushes them to Amazon CloudWatch with the EC2 instance ID as a dimension. For further automation, the script can instead be deployed once through AWS Systems Manager (SSM).

# Create a scheduled task that runs every minute
$taskName = 'Collect GPU Stats'
$description = 'Collect GPU Stats and Pass to Cloudwatch Custom Metrics'
$taskAction = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-File C:\Scripts\GPUStats.ps1'
$principal = New-ScheduledTaskPrincipal -UserID "Administrator" -LogonType S4U -RunLevel Highest
$taskTrigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 1)
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Minutes 2)

Register-ScheduledTask -TaskName $taskName -Action $taskAction -Trigger $taskTrigger -Description $description -Settings $settings -Principal $principal

# Write the collection script that the scheduled task runs
mkdir C:\Scripts
Set-Content -Path 'C:\Scripts\GPUStats.ps1' -Value @'
# Get stats from nvidia-smi (adjust the path if your driver installs it elsewhere)
$STATS = & 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe' --query-gpu=temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory --format=csv,nounits

# Convert the CSV output to a PowerShell object
$object = ConvertFrom-Csv -InputObject $STATS -Delimiter ','

# Get the EC2 instance ID
$instanceID = Get-EC2InstanceMetadata -Path '/instance-id'

# Put the metrics in CloudWatch
aws cloudwatch put-metric-data --metric-name Temperature --namespace GPUStats --value $object.'temperature.gpu' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryUsed --namespace GPUStats --unit Megabytes --value $object.'memory.used [MiB]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryFree --namespace GPUStats --unit Megabytes --value $object.'memory.free [MiB]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name GPUUtilization --namespace GPUStats --unit Percent --value $object.'utilization.gpu [%]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryUtilization --namespace GPUStats --unit Percent --value $object.'utilization.memory [%]' --dimensions InstanceId=$instanceID

echo $object
exit
'@
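The script above reads one value per metric, which fits a single-GPU instance. On an instance with several GPUs, nvidia-smi returns one CSV row per GPU, so each row would need its own data point. The following is a hedged sketch of one way to handle that, not something the original answer covers. The sample rows, the placeholder instance ID, and the GpuIndex dimension name are all assumptions made for illustration, and the commands are printed rather than executed so the sketch runs without a GPU or AWS credentials.

```powershell
# Sketch: turn per-GPU CSV rows into one put-metric-data call per GPU.
# Assumption: text shaped like the output of
#   nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu --format=csv,nounits
# The values below are made up for a hypothetical two-GPU instance.
$sample = @(
    'index, temperature.gpu, utilization.gpu [%]',
    '0, 41, 12',
    '1, 44, 87'
)
$instanceId = 'i-0123456789abcdef0'   # placeholder, not a real instance

# Skip the header row and supply simple column names instead
$gpus = @($sample | Select-Object -Skip 1 |
    ConvertFrom-Csv -Header 'Index', 'Temperature', 'Utilization')

foreach ($gpu in $gpus) {
    # Trim defensively: nvidia-smi separates fields with a comma and a space
    $index       = $gpu.Index.Trim()
    $temperature = [int]$gpu.Temperature.Trim()
    $utilization = [int]$gpu.Utilization.Trim()

    # GpuIndex is a hypothetical extra dimension that keeps the GPUs apart.
    # Quote the value so PowerShell does not split it at the comma.
    $dimensions = "InstanceId=$instanceId,GpuIndex=$index"

    # Print the commands instead of running them
    Write-Host "aws cloudwatch put-metric-data --namespace GPUStats --metric-name Temperature --value $temperature --dimensions `"$dimensions`""
    Write-Host "aws cloudwatch put-metric-data --namespace GPUStats --metric-name GPUUtilization --unit Percent --value $utilization --dimensions `"$dimensions`""
}
```

Adding a dimension creates a separate CloudWatch metric for each GPU, so any alarms or dashboards would need to reference the same dimension set.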

Method 2: Using Telegraf with CloudWatch (external answer version)

The steps to install and run Telegraf are outlined below:

  1. Create an IAM user that has S3 access and the cloudwatch:PutMetricData permission.
  2. Download the NVIDIA GRID drivers:

$Bucket = "ec2-windows-nvidia-drivers"
$KeyPrefix = "latest"
$LocalPath = "$home\Desktop\NVIDIA"
$Objects = Get-S3Object -BucketName $Bucket -KeyPrefix $KeyPrefix -Region us-east-1
foreach ($Object in $Objects) {
    $LocalFileName = $Object.Key
    if ($LocalFileName -ne '' -and $Object.Size -ne 0) {
        $LocalFilePath = Join-Path $LocalPath $LocalFileName
        Copy-S3Object -BucketName $Bucket -Key $Object.Key -LocalFile $LocalFilePath -Region us-east-1
    }
}

  3. Download Telegraf (https://portal.influxdata.com/downloads/) for Windows, then install it as a service:

Invoke-WebRequest -Uri https://dl.influxdata.com/telegraf/releases/telegraf-1.21.4_windows_amd64.zip -UseBasicParsing -OutFile telegraf-1.21.4_windows_amd64.zip
Expand-Archive .\telegraf-1.21.4_windows_amd64.zip -DestinationPath 'C:\Program Files\InfluxData\telegraf'

& 'C:\Program Files\InfluxData\telegraf\telegraf-1.21.4\telegraf.exe' --service install --config 'C:\Program Files\InfluxData\telegraf\telegraf-1.21.4\telegraf.conf'

  4. Update the configuration file (telegraf.conf) in two places: the cloudwatch output plugin in the OUTPUT PLUGINS section, and the INPUT PLUGINS section.

First, in the cloudwatch output plugin, set the following (use your own Region; the keys belong to the IAM user created in step 1):

region = "us-east-1"
access_key = "yourIAMUserAccessKey"
secret_key = "yourIAMUserSecretKey"

Then update the INPUT PLUGINS section as follows:

  • Add [[inputs.nvidia_smi]] with bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe" (backslashes must be doubled inside a double-quoted TOML string)
  • Comment out [[inputs.diskio]] (it was enabled by default for me, which caused an error)
  • Comment out [[inputs.processes]] (it was enabled by default for me, which caused an error)
  5. Start the service: open ‘Services’ in Windows, find the Telegraf service, and choose ‘Start’ or ‘Restart’.
  6. Open the CloudWatch console and look for the metrics under InfluxData/Telegraf.
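
The configuration changes described in the steps above (the CloudWatch output plus the nvidia_smi input) can be sketched as one fragment. This is an illustrative sketch, not a complete telegraf.conf: the angle-bracket values are placeholders to replace, the namespace line assumes Telegraf's sample default of InfluxData/Telegraf, and the scratch file name is hypothetical. The fragment is written to a temporary file so it can be reviewed before being merged into the real configuration by hand.

```powershell
# Sketch: assemble the telegraf.conf changes as a single fragment for review.
# Placeholders in angle brackets are NOT real values and must be replaced.
# A single-quoted here-string keeps the doubled backslashes exactly as TOML needs them.
$fragment = @'
[[outputs.cloudwatch]]
  region = "us-east-1"
  namespace = "InfluxData/Telegraf"
  access_key = "<your-access-key-id>"
  secret_key = "<your-secret-access-key>"

[[inputs.nvidia_smi]]
  bin_path = "C:\\Program Files\\NVIDIA Corporation\\NVSMI\\nvidia-smi.exe"
'@

# Write to a scratch file (hypothetical name) instead of touching telegraf.conf
$scratchPath = Join-Path ([System.IO.Path]::GetTempPath()) 'telegraf-gpu-fragment.conf'
Set-Content -Path $scratchPath -Value $fragment
Write-Host "Fragment written to $scratchPath"
```

Keeping the fragment separate avoids overwriting the rest of telegraf.conf, which still needs the diskio and processes inputs commented out as noted in the steps above.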
answered 6 months ago
