To monitor Neuron-specific metrics on Trn1 or Trn2 instances with Amazon CloudWatch, you can use neuron-monitor, which is part of the aws-neuron-tools package. It collects metrics and statistics from the Neuron applications running on your system and can be configured to stream this data to CloudWatch.
Here's an example of how you can set up your CloudWatch agent configuration file (config.json) to include Neuron-specific metrics:
- First, create a neuron-monitor configuration file (e.g., monitor.conf) with the following content:
{
  "period": "1s",
  "neuron_runtimes": [
    {
      "tag_filter": ".*",
      "metrics": [
        { "type": "neuroncore_counters" },
        { "type": "memory_used" },
        { "type": "neuron_runtime_vcpu_usage" },
        { "type": "execution_stats" }
      ]
    }
  ],
  "system_metrics": [
    { "type": "vcpu_usage" },
    { "type": "memory_info" },
    { "period": "2s", "type": "neuron_hw_counters" }
  ]
}
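If you would rather generate monitor.conf than hand-edit it, the same configuration can be built from Python and printed as JSON. This is only a sketch: the function name build_monitor_config and its parameters are hypothetical, and the metric types are exactly the ones listed in the configuration above.

```python
import json


def build_monitor_config(period="1s", hw_period="2s"):
    """Return the neuron-monitor configuration shown above as a dict.

    `period` and `hw_period` are illustrative parameter names, not part of
    neuron-monitor itself.
    """
    runtime_metrics = [
        "neuroncore_counters",
        "memory_used",
        "neuron_runtime_vcpu_usage",
        "execution_stats",
    ]
    return {
        "period": period,
        "neuron_runtimes": [
            {
                "tag_filter": ".*",
                "metrics": [{"type": name} for name in runtime_metrics],
            }
        ],
        "system_metrics": [
            {"type": "vcpu_usage"},
            {"type": "memory_info"},
            # This entry overrides the global period for hardware counters.
            {"period": hw_period, "type": "neuron_hw_counters"},
        ],
    }


# Print the configuration; redirect stdout to a file to create monitor.conf.
print(json.dumps(build_monitor_config(), indent=2))
```

Saved under a file name of your choice, the script's output can be redirected to monitor.conf. Round-tripping through json.dumps also catches quoting or bracket mistakes that are easy to make when editing the file by hand.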
- Then, in your CloudWatch agent configuration file (config.json), you can include a section to run neuron-monitor and pipe its output to neuron-monitor-cloudwatch.py. Here's an example:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "metrics_collected": {
      "neuron": {
        "measurement": [
          "neuroncore_utilization",
          "neuron_memory_used",
          "neuron_runtime_vcpu_usage",
          "execution_errors",
          "latency_stats"
        ]
      }
    },
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/neuron-monitor.log",
            "log_group_name": "neuron-monitor-logs",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
- After updating your configuration files, you can start neuron-monitor and pipe its output to CloudWatch using the following command:
/opt/aws/neuron/bin/neuron-monitor -c /path/to/monitor.conf | /opt/aws/neuron/bin/neuron-monitor-cloudwatch.py --namespace neuron_monitor --region your-region &
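Before sending anything to CloudWatch, it can help to look at what neuron-monitor itself emits. The sketch below rests on two assumptions that you should verify against your installed version: that neuron-monitor writes one JSON report per line on stdout, and that per-core utilization sits under neuron_runtime_data[].report.neuroncore_counters.neuroncores_in_use, as described in the user guide listed under Sources. The function names and the sample report are illustrative only.

```python
import json


def core_utilization(report):
    """Map (runtime tag, core index) to utilization percent for one report.

    The field names are assumed from the neuron-monitor user guide; missing
    fields are skipped rather than treated as errors.
    """
    result = {}
    for runtime in report.get("neuron_runtime_data") or []:
        tag = runtime.get("neuron_runtime_tag", "")
        counters = (runtime.get("report") or {}).get("neuroncore_counters") or {}
        for core, stats in (counters.get("neuroncores_in_use") or {}).items():
            value = stats.get("neuroncore_utilization")
            if value is not None:
                result[(tag, int(core))] = float(value)
    return result


def format_utilization(lines):
    """Turn an iterable of JSON report lines into readable summary strings."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            report = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        for (tag, core), pct in sorted(core_utilization(report).items()):
            out.append(f"{tag} core {core}: {pct:.2f}%")
    return out


# Hypothetical sample report, shaped after the assumed layout above.
SAMPLE_LINE = json.dumps({
    "neuron_runtime_data": [{
        "neuron_runtime_tag": "my_app",
        "report": {"neuroncore_counters": {"neuroncores_in_use": {
            "0": {"neuroncore_utilization": 42.5},
            "1": {"neuroncore_utilization": 0.0},
        }}},
    }]
})

for summary in format_utilization([SAMPLE_LINE]):
    print(summary)
```

To use it on live data, pass sys.stdin to format_utilization instead of the sample and pipe neuron-monitor into the script. If nothing is printed, either no Neuron application is running or the report layout differs from the one assumed here.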
This setup lets you monitor Neuron-specific metrics in CloudWatch, including NeuronCore utilization, memory usage, vCPU usage, execution stats, and hardware counters. You can then create custom CloudWatch dashboards and alarms to track the health and performance of your Neuron instances.
Sources
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_sources/tools/neuron-sys-tools/neuron-monitor-user-guide.rst.txt
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_sources/tools/neuron-sys-tools/neuron-monitor-user-guide.rst
Whisper audio transcription powered by AWS Batch and AWS Inferentia | AWS HPC Blog
I appreciate the reply. I followed the provided suggestions and updated the config.json file as recommended.
However, when I start neuron-monitor and pipe its output to CloudWatch as suggested, I get the following error in the second line below:
"2025/04/30 18:58:54 I! Valid Json input schema. 2025/04/30 18:58:54 W! Ignoring unrecognized input neuron 2025/04/30 18:58:54 D! ec2tagger processor required because append_dimensions is set"