Why is ssm-agent-worker using 100% of CPU?

I have a t2.small instance (1 vCPU, 2 GB RAM) that ran smoothly for 18 months, averaging 20% CPU usage (see graph below), but became unresponsive today. After some investigation I found that ssm-agent-worker was running at 100% CPU. I've switched to a t2.medium (2 vCPUs, 4 GB RAM) so that if this happens again there is a second CPU to handle my workload, but I'd prefer not to double my costs just to work around an AWS bug (if that is what it is). Any advice?

[Graph: CPU usage for 7 days]

asked a year ago · 1443 views
2 Answers
Hi there,

Did you perform any Systems Manager operations during this time, such as deploying a package or running a script or command? I would start by checking the SSM Agent logs to see what was happening at that point. See https://docs.aws.amazon.com/systems-manager/latest/userguide/troubleshooting-ssm-agent.html#systems-manager-ssm-agent-log-files
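As a rough starting point for that log check, here is a minimal Python sketch that scans the agent's log files and prints the lines that look relevant. The two paths are the default Linux locations described in the troubleshooting page linked above; the keyword list and the function name `interesting_lines` are my own assumptions for illustration, not anything the agent defines, so adjust them to what you are looking for.

```python
from pathlib import Path

# Default Linux locations of the SSM Agent logs.
SSM_LOGS = [
    Path("/var/log/amazon/ssm/amazon-ssm-agent.log"),
    Path("/var/log/amazon/ssm/errors.log"),
]

# Substrings treated as "interesting". This list is an assumption.
KEYWORDS = ("ERROR", "WARN", "hibernate")


def interesting_lines(path, keywords=KEYWORDS):
    """Return (line_number, text) pairs for lines containing any keyword."""
    hits = []
    with open(path, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if any(word in line for word in keywords):
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for log in SSM_LOGS:
        try:
            hits = interesting_lines(log)
        except OSError as exc:  # file missing, or insufficient permissions
            print(f"{log}: {exc}")
            continue
        for number, text in hits:
            print(f"{log}:{number}: {text}")
```

The log directory is normally readable only by root, so the script probably needs to be run with sudo. If it prints nothing for the period in question, that in itself is useful to know: the agent was not logging while it was busy.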

Matt-B (AWS, EXPERT)
answered a year ago
  • I did not log in to the server at all for well over a week. The problem began around midnight this morning. There is no error log, and amazon-ssm-agent.log has no entries for the six weeks before today, when I rebooted the machine and got some startup messages, including "Entering SSM Agent hibernate - EC2RoleRequestError: no EC2 instance role found caused by: EC2MetadataError: failed to make EC2Metadata request".

I have seen numerous cases where the SSM agent freaks out and hits 100% CPU, often disrupting the actual service running on the machine and causing health monitors to kill the machine, so it is very difficult to catch the issue while it is happening.

When this happens, it usually takes out all machines in an auto scaling group at once, so you basically have complete loss of service.

I just had this happen in front of my eyes, and it's a huge problem: I had customers complaining about it. I managed to keep one affected system from dying, and after investigating it for a bit I can see that both the SSM agent and the Ubuntu snap daemon are thrashing.

I can also see that the machine had hit its EBS maximum throughput, which is either the cause of the problem or a result of the SSM agent misbehaving.
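If you do manage to keep an affected machine alive, a quick way to confirm which processes are burning CPU is to sample /proc twice and compare. The following is a minimal sketch, assuming a Linux host; the function names (`cpu_ticks`, `snapshot`, `busiest`) and the 2-second sampling window are hypothetical choices for illustration. It only measures CPU time, so it will not tell you whether disk throughput is the underlying cause.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return (name, utime + stime) parsed from a /proc/<pid>/stat line."""
    # The process name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ")" instead of whitespace.
    head, _, tail = stat_line.rpartition(")")
    name = head.split("(", 1)[1]
    fields = tail.split()
    # tail starts at field 3 (state); utime and stime are fields 14 and 15.
    return name, int(fields[11]) + int(fields[12])


def snapshot(proc="/proc"):
    """Map pid -> (name, ticks) for every process that can be read."""
    result = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as handle:
                result[int(entry)] = cpu_ticks(handle.read())
        except OSError:
            continue  # process exited between listdir and open
    return result


def busiest(before, after, top=5):
    """Rank processes by CPU ticks consumed between two snapshots."""
    deltas = []
    for pid, (name, ticks) in after.items():
        if pid in before:
            deltas.append((ticks - before[pid][1], pid, name))
    return sorted(deltas, reverse=True)[:top]


if __name__ == "__main__" and os.path.isdir("/proc"):
    first = snapshot()
    time.sleep(2)  # sampling window; an arbitrary choice
    for delta, pid, name in busiest(first, snapshot()):
        print(f"{delta:6d} ticks  pid={pid}  {name}")
```

Note that the kernel truncates the name in /proc/<pid>/stat to 15 characters, so ssm-agent-worker shows up as "ssm-agent-worke" in the output.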

guss77
answered 2 months ago