How to monitor and collect GPU metrics for Windows EC2 instances using Amazon CloudWatch?


Hi, are there ways to collect GPU metrics for EC2 Windows instances and send them to Amazon CloudWatch for monitoring? NVIDIA GPU performance metrics are available for Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing instances running Linux, but I could not find clear documentation for instances running Windows.

AWS
asked 2 years ago, 2257 views
2 Answers

There are two ways to collect GPU metrics on Amazon EC2 Windows instances and push them to Amazon CloudWatch for monitoring: (1) using Amazon CloudWatch custom metrics, or (2) using Telegraf, a third-party tool.

The following sections describe both methods. The key difference is that with Amazon CloudWatch custom metrics you monitor a small set of GPU and memory metrics, such as utilization, that you choose yourself. Telegraf can expose additional metrics such as memory clock speed, encoder stats, and more. Which tool to use therefore depends largely on the requirements of your use case.

Method 1: Using CloudWatch Custom Metrics

Prerequisites:

  • AWS CLI
  • NVIDIA driver
  • AWS Tools for PowerShell modules
  • An IAM role assigned to the instance that allows cloudwatch:PutMetricData
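For the IAM role prerequisite, a minimal policy sketch might look like the following. The namespace condition is an optional assumption added here to limit the role to the GPUStats namespace used by the script in this method; adjust it if you publish to a different namespace (for example GPUMetrics), or remove the Condition block entirely.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "cloudwatch:namespace": "GPUStats" }
      }
    }
  ]
}
```

PutMetricData does not support resource-level permissions, which is why Resource is "*" and the restriction is expressed through the condition key instead.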

The following script configures a scheduled task in Windows that runs every minute, collects the metrics, and pushes them to Amazon CloudWatch using the EC2 instance ID as a dimension. Alternatively, the script can be installed once using AWS Systems Manager (SSM) for further automation.

#Create Scheduled Task to run every minute
$taskName = 'Collect GPU Stats'
$description = 'Collect GPU Stats and Pass to Cloudwatch Custom Metrics'
$taskAction = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-File C:\Scripts\GPUStats.ps1'
$principal = New-ScheduledTaskPrincipal -UserID "Administrator" -LogonType S4U -RunLevel Highest
$taskTrigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 1)
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Minutes 2)

Register-ScheduledTask -TaskName $taskName -Action $taskAction -Trigger $taskTrigger -Description $description -Settings $settings -Principal $principal

#Write the collection script that the scheduled task runs
mkdir C:\Scripts
Set-Content -Path 'C:\Scripts\GPUStats.ps1' -Value @'
#Get Stats from NVIDIA-SMI
$STATS = & 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe' --query-gpu=temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory --format=csv,nounits

#Convert to PS Object
$object = ConvertFrom-Csv -InputObject $STATS -Delimiter ','

#Get EC2 Instance ID
$instanceID = Get-EC2InstanceMetadata -Path '/instance-id'

#Put Metrics in Cloudwatch
aws cloudwatch put-metric-data --metric-name Temperature --namespace GPUStats --value $object.'temperature.gpu' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryUsed --namespace GPUStats --unit Megabytes --value $object.'memory.used [MiB]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryFree --namespace GPUStats --unit Megabytes --value $object.'memory.free [MiB]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name GPUUtilization --namespace GPUStats --unit Percent --value $object.'utilization.gpu [%]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryUtilization --namespace GPUStats --unit Percent --value $object.'utilization.memory [%]' --dimensions InstanceId=$instanceID

echo $object
exit
'@
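The Method 1 script assumes a single GPU. On a multi-GPU instance nvidia-smi prints one row per GPU, so $object becomes an array and each --value would receive several numbers. The following is a minimal sketch of one way to handle that, not part of the original answer: Get-GpuMetricData is a hypothetical helper, GpuIndex is a dimension name chosen here for illustration, the sample rows are made-up values, and the sketch assumes "index" has been added to the --query-gpu field list.

```powershell
# Hypothetical helper: turns nvidia-smi CSV lines (header + one row per GPU)
# into one object per GPU and metric, ready to pass to put-metric-data.
function Get-GpuMetricData {
    param(
        [Parameter(Mandatory)][string[]]$CsvLines,
        [Parameter(Mandatory)][string]$InstanceId
    )
    # Map CSV column names to CloudWatch metric names and units.
    $metrics = @(
        @{ Column = 'temperature.gpu';     Name = 'Temperature';    Unit = 'None' }
        @{ Column = 'utilization.gpu [%]'; Name = 'GPUUtilization'; Unit = 'Percent' }
    )
    # Strip the spaces around commas so the column names match exactly.
    $rows = $CsvLines -replace '\s*,\s*', ',' | ConvertFrom-Csv
    foreach ($row in $rows) {
        foreach ($m in $metrics) {
            [pscustomobject]@{
                MetricName = $m.Name
                Unit       = $m.Unit
                Value      = $row.($m.Column)
                # GpuIndex is an assumed extra dimension that keeps the GPUs apart.
                Dimensions = "InstanceId=$InstanceId,GpuIndex=$($row.index)"
            }
        }
    }
}

# Made-up sample of the query output with "index" added to --query-gpu.
$sample = @(
    'index, temperature.gpu, utilization.gpu [%]'
    '0, 27, 2'
    '1, 31, 40'
)

Get-GpuMetricData -CsvLines $sample -InstanceId 'i-0123456789abcdef0' | ForEach-Object {
    # Printed rather than executed, so the sketch runs without AWS access.
    "aws cloudwatch put-metric-data --namespace GPUStats --metric-name $($_.MetricName) --unit $($_.Unit) --value $($_.Value) --dimensions $($_.Dimensions)"
}
```

To publish for real, replace the printed string with the actual aws call and pass the real nvidia-smi output and instance ID instead of the sample values.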

Method 2: Using Telegraf with CloudWatch (external answer version)

The steps to install and run Telegraf are outlined below:

  1. Create an IAM user that has S3 access and the cloudwatch:PutMetricData permission.
  2. Download the NVIDIA GRID drivers:

$Bucket = "ec2-windows-nvidia-drivers"
$KeyPrefix = "latest"
$LocalPath = "$home\Desktop\NVIDIA"
$Objects = Get-S3Object -BucketName $Bucket -KeyPrefix $KeyPrefix -Region us-east-1
foreach ($Object in $Objects) {
    $LocalFileName = $Object.Key
    if ($LocalFileName -ne '' -and $Object.Size -ne 0) {
        $LocalFilePath = Join-Path $LocalPath $LocalFileName
        Copy-S3Object -BucketName $Bucket -Key $Object.Key -LocalFile $LocalFilePath -Region us-east-1
    }
}

  3. Download Telegraf (https://portal.influxdata.com/downloads/) for Windows, then install it as a service:

Invoke-WebRequest -Uri https://dl.influxdata.com/telegraf/releases/telegraf-1.21.4_windows_amd64.zip -UseBasicParsing -OutFile telegraf-1.21.4_windows_amd64.zip
Expand-Archive .\telegraf-1.21.4_windows_amd64.zip -DestinationPath 'C:\Program Files\InfluxData\telegraf'

& 'C:\Program Files\InfluxData\telegraf\telegraf-1.21.4\telegraf.exe' --service install --config 'C:\Program Files\InfluxData\telegraf\telegraf-1.21.4\telegraf.conf'

  4. Update the configuration file (telegraf.conf) in two places.

First, in the OUTPUT PLUGINS section, make the following changes under the cloudwatch output plugin:

region = "us-east-1"  # replace with your Region
access_key = "yourIAMUserAccessKey"
secret_key = "yourIAMUserSecretKey"

Then update the INPUT PLUGINS section as follows:

  • Add [[inputs.nvidia_smi]] with bin_path = 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe' (single quotes, so that TOML does not interpret the backslashes as escape sequences)
  • Comment out [[inputs.diskio]] (it was not commented out by default for me, which caused an error)
  • Comment out [[inputs.processes]] (it was not commented out by default for me, which caused an error)

  5. Start the service: open 'Services' in Windows, find the Telegraf service, and choose 'Start' or 'Restart'.
  6. Open the CloudWatch console and look for the metrics under the InfluxData/Telegraf namespace.
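Put together, the edited parts of telegraf.conf might look like the sketch below. The key values are placeholders, and the namespace and timeout lines are assumptions taken from the plugins' documented options rather than from the steps above; the [[outputs.cloudwatch]] header may need to be uncommented in your copy of the file.

```toml
# OUTPUT PLUGINS: send metrics to CloudWatch (placeholder credentials).
[[outputs.cloudwatch]]
  region = "us-east-1"
  namespace = "InfluxData/Telegraf"
  access_key = "yourIAMUserAccessKey"
  secret_key = "yourIAMUserSecretKey"

# INPUT PLUGINS: read GPU stats through nvidia-smi.
# Single quotes make this a TOML literal string, so backslashes are kept as-is.
[[inputs.nvidia_smi]]
  bin_path = 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'
  timeout = "5s"
```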
AWS
answered 2 years ago
EXPERT
reviewed 11 days ago

AWS-User-0753981 has a working solution, but the code did not paste cleanly. I have worked through 'Method 1: Using CloudWatch Custom Metrics' and reproduce it below. The solution is for Windows GPU servers. If you are using a Linux server, note that you can now use the CloudWatch agent to collect GPU metrics.

First we need to set up the query using nvidia-smi.exe, which should be available with a default installation. On my server this was in C:\Windows\System32, but it may be found in a folder of the form C:\Windows\System32\DriverStore\FileRepository\nv* per https://stackoverflow.com/questions/57100015/how-do-i-run-nvidia-smi-on-windows. In what follows, make sure the path of nvidia-smi.exe matches your server.
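If you are unsure where nvidia-smi.exe lives on a given server, a small search along these lines may help. This is a sketch only: Find-NvidiaSmi is a hypothetical helper, and the candidate list simply contains the locations mentioned in this thread.

```powershell
# Hypothetical helper: return the first nvidia-smi.exe found, or $null.
function Find-NvidiaSmi {
    param(
        # Fixed locations mentioned in this thread, checked in order.
        [string[]]$Candidates = @(
            'C:\Windows\System32\nvidia-smi.exe',
            'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'
        ),
        # Driver store that holds the nv* folders.
        [string]$DriverStore = 'C:\Windows\System32\DriverStore\FileRepository'
    )
    foreach ($path in $Candidates) {
        if (Test-Path -LiteralPath $path -PathType Leaf) { return $path }
    }
    # Fall back to a recursive search of the driver store, which can be slow.
    if (Test-Path -LiteralPath $DriverStore) {
        $hit = Get-ChildItem -Path $DriverStore -Filter 'nvidia-smi.exe' -Recurse -ErrorAction SilentlyContinue |
            Select-Object -First 1
        if ($hit) { return $hit.FullName }
    }
    return $null
}

$nvidiaSmi = Find-NvidiaSmi
if ($nvidiaSmi) { "Found: $nvidiaSmi" } else { 'nvidia-smi.exe not found in the usual locations' }
```

The returned path can then be used in place of the hard-coded one in the commands that follow.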

First check that the GPU monitoring works correctly on the server.

#Get Stats from NVIDIA-SMI 
$STATS = & 'C:\Windows\System32\nvidia-smi.exe' --query-gpu=temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory --format=csv,nounits

#Convert to PS Object 
$object = ConvertFrom-Csv -InputObject $STATS -Delimiter ','

You can verify this is working correctly by printing the variables and confirming the output for $STATS:

echo $STATS

expected output (following the CSV header line):

27, 116, 22616, 2, 0

and for $object:

echo $object

expected output:

temperature.gpu : 27
memory.used [MiB] : 116
memory.free [MiB] : 22616
utilization.gpu [%] : 2
utilization.memory [%] : 0

Now we need to turn these commands into a script that pushes the stats to CloudWatch. Create a folder, e.g. C:\Scripts:

mkdir C:\Scripts

and create a new file GPUMetrics.ps1 in it with the content below. Note that the value of the variable $namespace will be the name of the custom namespace where you will find the metrics in Amazon CloudWatch.

#Get Stats from NVIDIA-SMI 
$STATS = & 'C:\Windows\System32\nvidia-smi.exe' --query-gpu=temperature.gpu,memory.used,memory.free,utilization.gpu,utilization.memory --format=csv,nounits

#Convert to PS Object 
$object = ConvertFrom-Csv -InputObject $STATS -Delimiter ','

#Get EC2 Instance ID 
$instanceID = Get-EC2InstanceMetadata -Path '/instance-id'

#Set value of Custom Namespace
$namespace = 'GPUMetrics'

#Put Metrics in Cloudwatch 
aws cloudwatch put-metric-data --metric-name Temperature --namespace $namespace --value $object.'temperature.gpu' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryUsed --namespace $namespace --unit Megabytes --value $object.'memory.used [MiB]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryFree --namespace $namespace --unit Megabytes --value $object.'memory.free [MiB]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name GPUUtilization --namespace $namespace --unit Percent --value $object.'utilization.gpu [%]' --dimensions InstanceId=$instanceID
aws cloudwatch put-metric-data --metric-name MemoryUtilization --namespace $namespace --unit Percent --value $object.'utilization.memory [%]' --dimensions InstanceId=$instanceID

echo $object 

If you run this script, you should see the first set of metrics appear as GPUMetrics under Custom namespaces in CloudWatch. If they are not showing, first check that you are viewing the same Region as the one where the EC2 instance is located.
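The same check can be sketched from the command line. This assumes the credentials in use also allow cloudwatch:ListMetrics, which is not among the permissions listed in this thread, and us-east-1 is a placeholder Region:

```powershell
# Assumption: AWS CLI installed, credentials allow cloudwatch:ListMetrics.
# Replace us-east-1 with the Region of the EC2 instance.
aws cloudwatch list-metrics --namespace GPUMetrics --region us-east-1
```

A newly created metric can take several minutes to show up in this listing, so an empty result right after the first run does not necessarily mean the push failed.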

Finally, we can set up a scheduled task to push the metrics to CloudWatch at regular intervals. If you have named or located your script differently from above, make sure to update the '-File' parameter in the $taskAction definition accordingly.

#Create Scheduled Task to run every minute 
$taskName = 'Push GPU Metrics to CloudWatch' 
$description = 'Collect GPU Metrics and Pass to Cloudwatch Custom Metrics'
$taskAction = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-File C:\Scripts\GPUMetrics.ps1'
$principal = New-ScheduledTaskPrincipal -UserID "Administrator" -LogonType S4U -RunLevel Highest
$taskTrigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 1)
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Minutes 2)

#schedule the task to run
Register-ScheduledTask -TaskName $taskName -Action $taskAction -Trigger $taskTrigger -Description $description -Settings $settings -Principal $principal

If this completes successfully the final output should confirm the Task has been created:

TaskPath         TaskName                          State
--------         --------                          -----
\                Push GPU Metrics to CloudWatch      Ready
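Once the task has had a minute or two to run, its last result can be inspected as sketched below. This assumes the task name used above; a LastTaskResult of 0 normally indicates that the last run completed successfully.

```powershell
# Assumes the task name used when registering the task.
Get-ScheduledTaskInfo -TaskName 'Push GPU Metrics to CloudWatch' |
    Select-Object LastRunTime, LastTaskResult, NextRunTime
```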

If you open the GPUMetrics custom namespace, you can explore the metrics pushed to CloudWatch and view them as time series.

AWS
answered 7 months ago
EXPERT
reviewed 11 days ago
