What's the best way to monitor NVIDIA GPU utilization on Linux (Ubuntu) during model training?


Which tool is recommended for monitoring NVIDIA GPU utilization on a Linux (Ubuntu) Amazon EC2 instance? I'm currently training custom TensorFlow ML models and using the NVIDIA System Management Interface (nvidia-smi) to track memory usage, GPU utilization, and the temperature of my NVIDIA GPU devices.
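For reference, the kind of tracking described above can be scripted around `nvidia-smi`'s CSV query output. The sketch below is one possible approach (the field list and helper names are illustrative, not from the original post); it polls utilization, memory, and temperature per GPU:

```python
import csv
import io
import subprocess

QUERY_FIELDS = "utilization.gpu,memory.used,temperature.gpu"

def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output
    into a list of (utilization %, memory MiB, temperature C) tuples, one per GPU."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        util, mem, temp = (field.strip() for field in row)
        rows.append((int(util), int(mem), int(temp)))
    return rows

def sample_gpu_stats():
    """Run nvidia-smi once and return parsed per-GPU stats.

    Requires an NVIDIA driver on the instance (e.g. an EC2 GPU instance)."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)

if __name__ == "__main__":
    for i, (util, mem, temp) in enumerate(sample_gpu_stats()):
        print(f"GPU {i}: {util}% util, {mem} MiB used, {temp} C")
```

Running this in a loop (or via `watch`) gives a lightweight live view while a training job is active.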

asked 2 years ago · 103 views
1 Answer
Accepted Answer

In addition to nvidia-smi, you can use the Amazon SageMaker Debugger Profiling Report to capture system metrics while your training job runs.

The report provides information on the following:

  • System usage statistics
  • Framework metrics
  • Rule evaluation results
  • Step durations
  • GPU utilization
  • Batch size
  • CPU bottlenecks
  • I/O bottlenecks
  • Workload balancing
  • GPU memory
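As a rough sketch of how this profiling could be enabled when launching a SageMaker training job (assuming the `sagemaker` Python SDK; the entry point, IAM role, instance type, and framework versions below are placeholders, not values from the original post):

```python
# Sketch: enabling SageMaker Debugger system and framework profiling
# on a TensorFlow training job. All resource identifiers are placeholders.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.tensorflow import TensorFlow

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,          # sample CPU/GPU/memory every 500 ms
    framework_profile_params=FrameworkProfile(), # collect framework (TensorFlow) metrics
)

estimator = TensorFlow(
    entry_point="train.py",                       # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.2xlarge",                # a GPU instance type
    framework_version="2.11",
    py_version="py39",
    profiler_config=profiler_config,
)
estimator.fit()  # the profiling report is written to the job's S3 output location
```

With profiling enabled, the generated report covers the categories listed above for the duration of the training job.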
answered 2 years ago
