Looking for Neuron metrics to include in the CloudWatch agent config file "config.json" to monitor Neuron stats from Amazon CloudWatch.

I am running a Deep Learning AMI Neuron (Ubuntu 22.04) instance on AWS. From the command line I created the CloudWatch agent configuration file "config.json", and successfully started the CloudWatch agent.

I am looking for Neuron-specific metrics for Trn1 or Trn2 that I can include in the "config.json" file so that I can monitor Neuron health from Amazon CloudWatch. Examples would be a great help.

Thank you!

1 Answer

To monitor Neuron-specific metrics for Trn1 or Trn2 instances in Amazon CloudWatch, you can use the neuron-monitor tool, which is part of the aws-neuronx-tools package. It collects metrics and stats from Neuron applications running on your system and can be configured to stream this data to CloudWatch.

Here's an example of how you can set up your CloudWatch agent configuration file (config.json) to include Neuron-specific metrics:

  1. First, create a neuron-monitor configuration file (e.g., monitor.conf) with the following content:

{
  "period": "1s",
  "neuron_runtimes": [
    {
      "tag_filter": ".*",
      "metrics": [
        { "type": "neuroncore_counters" },
        { "type": "memory_used" },
        { "type": "neuron_runtime_vcpu_usage" },
        { "type": "execution_stats" }
      ]
    }
  ],
  "system_metrics": [
    { "type": "vcpu_usage" },
    { "type": "memory_info" },
    { "period": "2s", "type": "neuron_hw_counters" }
  ]
}

  2. Then, in your CloudWatch agent configuration file (config.json), you can include a section for the Neuron metrics and for the neuron-monitor log file. Here's an example:

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "metrics_collected": {
      "neuron": {
        "measurement": [
          "neuroncore_utilization",
          "neuron_memory_used",
          "neuron_runtime_vcpu_usage",
          "execution_errors",
          "latency_stats"
        ]
      }
    },
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/neuron-monitor.log",
            "log_group_name": "neuron-monitor-logs",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

  3. After updating your configuration files, start neuron-monitor and pipe its output to neuron-monitor-cloudwatch.py, which publishes the metrics to CloudWatch:

/opt/aws/neuron/bin/neuron-monitor -c /path/to/monitor.conf | /opt/aws/neuron/bin/neuron-monitor-cloudwatch.py --namespace neuron_monitor --region your-region &
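
The monitor.conf from step 1 can also be generated and sanity-checked from a script before neuron-monitor is started. The following is a minimal sketch, not part of the Neuron tooling: the helper names build_monitor_conf and write_monitor_conf are hypothetical, and the metric group names are simply copied from the example in step 1.

```python
import json

# Metric group names copied from the monitor.conf example in step 1.
RUNTIME_METRICS = [
    "neuroncore_counters",
    "memory_used",
    "neuron_runtime_vcpu_usage",
    "execution_stats",
]
SYSTEM_METRICS = ["vcpu_usage", "memory_info", "neuron_hw_counters"]


def build_monitor_conf(period="1s", hw_counters_period="2s", tag_filter=".*"):
    """Return the neuron-monitor configuration as a plain dict (hypothetical helper)."""
    system = []
    for name in SYSTEM_METRICS:
        if name == "neuron_hw_counters":
            # The example gives the hardware counters their own, slower period.
            system.append({"period": hw_counters_period, "type": name})
        else:
            system.append({"type": name})
    return {
        "period": period,
        "neuron_runtimes": [
            {
                "tag_filter": tag_filter,
                "metrics": [{"type": name} for name in RUNTIME_METRICS],
            }
        ],
        "system_metrics": system,
    }


def write_monitor_conf(path, conf):
    """Write the configuration, then read it back to confirm it parses as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(conf, fh, indent=2)
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Calling write_monitor_conf("monitor.conf", build_monitor_conf()) produces a file equivalent to the one in step 1, which can then be passed to neuron-monitor with -c.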

This setup will allow you to monitor various Neuron-specific metrics in CloudWatch, including NeuronCore utilization, memory usage, vCPU usage, execution stats, and hardware counters. You can then create custom dashboards and alarms in CloudWatch to monitor the health and performance of your Neuron instances.
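
To make the data flow in step 3 more concrete, here is a rough Python sketch of the kind of transformation a bridge script performs: it takes one JSON report from neuron-monitor's standard output and turns the NeuronCore utilization values into CloudWatch-style metric entries. This is not the code of neuron-monitor-cloudwatch.py. The report field names (neuron_runtime_data, report, neuroncore_counters, neuroncores_in_use, neuroncore_utilization) are assumed from the neuron-monitor user guide and should be checked against the output on your instance; the function name and the "NeuronCore" dimension are hypothetical.

```python
import json


def neuroncore_utilization_datums(report_line, instance_id):
    """Convert one neuron-monitor JSON report into CloudWatch-style metric datums.

    ASSUMPTION: the report layout follows the neuron-monitor user guide, with
    per-core values under neuron_runtime_data[].report.neuroncore_counters.
    """
    report = json.loads(report_line)
    datums = []
    for runtime in report.get("neuron_runtime_data") or []:
        counters = (runtime.get("report") or {}).get("neuroncore_counters") or {}
        cores = counters.get("neuroncores_in_use") or {}
        for core_id, values in sorted(cores.items()):
            datums.append(
                {
                    "MetricName": "neuroncore_utilization",
                    "Dimensions": [
                        {"Name": "InstanceId", "Value": instance_id},
                        # Hypothetical dimension name, chosen for illustration.
                        {"Name": "NeuronCore", "Value": str(core_id)},
                    ],
                    "Value": float(values["neuroncore_utilization"]),
                    "Unit": "Percent",
                }
            )
    return datums
```

The returned list has the shape that boto3's CloudWatch client accepts as the MetricData argument of put_metric_data, so a custom publisher could send it with the same namespace used in step 3 (neuron_monitor).
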
Sources
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_sources/tools/neuron-sys-tools/neuron-monitor-user-guide.rst.txt
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_sources/tools/neuron-sys-tools/neuron-monitor-user-guide.rst
Whisper audio transcription powered by AWS Batch and AWS Inferentia | AWS HPC Blog

answered 16 days ago
AWS EXPERT reviewed 16 days ago
  • I appreciate the reply. I followed the suggestions and updated the config.json file as recommended.

    However, when I start neuron-monitor and pipe its output to CloudWatch as suggested, I get the following output, with a warning on the second line:

    2025/04/30 18:58:54 I! Valid Json input schema.
    2025/04/30 18:58:54 W! Ignoring unrecognized input neuron
    2025/04/30 18:58:54 D! ec2tagger processor required because append_dimensions is set

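The second log entry in the comment above shows the CloudWatch agent ignoring the "neuron" input, which means that section of config.json is not collected by the agent. A small sketch for picking such warnings out of the agent's startup output, assuming the message wording stays as quoted in the comment (the function name is hypothetical):

```python
import re

# Matches agent warnings of the form quoted in the comment above.
UNRECOGNIZED = re.compile(r"W! Ignoring unrecognized input (\S+)")


def unrecognized_inputs(log_text):
    """Return the names of config sections the agent reported as unrecognized."""
    return UNRECOGNIZED.findall(log_text)
```

Running it on the three log entries quoted in the comment returns a list containing only "neuron".
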