Why does CloudWatch show that my Amazon SageMaker endpoint's CPU or GPU utilization is greater than 100%?
The Amazon CloudWatch CPU or GPU utilization metric for my Amazon SageMaker endpoint is greater than 100%.
The CloudWatch CPUUtilization and GPUUtilization metrics show the percentage of CPU or GPU units that the containers are using. Each CPU or GPU contributes up to 100%, so the maximum value is 100% multiplied by the number of CPUs or GPUs on the instance. This is why the value can be greater than 100%.
Here are some examples:
For a non-GPU instance such as ml.m4.xlarge, CPUUtilization can range from 0 to 400% because the instance has four vCPUs.
For a GPU instance such as ml.p3.8xlarge, CPUUtilization can range from 0 to 3200% and GPUUtilization can range from 0 to 400%. This is because the instance has 32 vCPUs and 4 GPUs.
For multiple instances, the default view in CloudWatch shows the average CPU or GPU utilization across all instances, so the range doesn't grow with the number of instances. For example, if you have five ml.m4.xlarge instances, CPUUtilization still ranges from 0 to 400% because each instance has four vCPUs.
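
The arithmetic in the examples above can be sketched in a few lines of Python. This is an illustrative sketch only: the INSTANCE_SPECS table and the max_utilization and normalize helpers are hypothetical names, not part of any AWS SDK, and the vCPU and GPU counts are copied from the two instance types mentioned in this article.

```python
# Hypothetical lookup table; the counts come from the examples in this article.
# Add other instance types with the values from the instance documentation.
INSTANCE_SPECS = {
    "ml.m4.xlarge": {"vcpus": 4, "gpus": 0},
    "ml.p3.8xlarge": {"vcpus": 32, "gpus": 4},
}


def max_utilization(instance_type):
    """Return the upper bound (in percent) of each metric for one instance."""
    spec = INSTANCE_SPECS[instance_type]
    return {
        "CPUUtilization": 100 * spec["vcpus"],
        "GPUUtilization": 100 * spec["gpus"],  # 0 means the instance has no GPU
    }


def normalize(reported_percent, units):
    """Rescale a reported value to a 0-100% scale by dividing by the unit count."""
    if units <= 0:
        raise ValueError("units must be a positive number of vCPUs or GPUs")
    return reported_percent / units


if __name__ == "__main__":
    print(max_utilization("ml.m4.xlarge"))
    print(max_utilization("ml.p3.8xlarge"))
    # A reported CPUUtilization of 250% on ml.m4.xlarge (4 vCPUs)
    print(normalize(250.0, INSTANCE_SPECS["ml.m4.xlarge"]["vcpus"]))
```

Dividing the reported value by the number of vCPUs or GPUs gives a figure on the familiar 0 to 100% scale, which can be easier to compare across instance types.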