How do I troubleshoot my SSM Agent error log for crash reasons?


I want to troubleshoot my AWS Systems Manager Agent (SSM Agent) error log for crash reasons.

Short description

Systems Manager commands might fail on target instances during Run Command, Association, Automation, or Session Manager operations. These failures cause an error similar to the following one:

"document process failed unexpectedly: document worker timed out; check [ssm-document-worker]/[ssm-session-worker] log for crash reason."

This error occurs for the following reasons:

  • Insufficient resources
  • Out of memory (OOM)
  • Not enough space on the disk
  • Too many open files

For more troubleshooting, review the SSM Agent logs on the instance.

Note: For better performance, security, and access to the latest features, update SSM Agent to the latest version. Also, subscribe to the amazon-ssm-agent RELEASENOTES.md file on the GitHub website to receive notifications about SSM Agent updates.

Resolution

Insufficient resources

If a session with a target instance experiences resource limitations, such as exceeded memory or disk space, then the session might crash and cause system malfunctions. Make sure that your instances have enough resources to manage the workload that's running on the target instances.
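For example, on a Linux instance you can take a quick snapshot of memory, disk, and process usage with standard tools. This is a minimal sketch; acceptable thresholds depend on your workload:

# Check available memory and swap
$ free -h

# Check free disk space on each mounted file system
$ df -h

# List the processes that consume the most memory
$ ps aux --sort=-%mem | head -n 10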

OOM

OOM errors occur when running processes are using all the available memory, and a program or operating system (OS) can't allocate space. The affected system can't load additional programs, and the associated processes stop functioning properly. The OS then turns off processes that it determines to be low priority.

To check if insufficient memory is causing this error, see Troubleshoot an unreachable instance. For Linux OOM issues, see Out of memory: end process.
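For example, on a Linux instance you can check whether the kernel OOM killer stopped a process by searching the kernel ring buffer and the system journal. This is a minimal sketch; the exact message text varies by kernel version:

# Search the kernel ring buffer for OOM killer activity
$ sudo dmesg | grep -i "out of memory"

# On systemd-based distributions, search the kernel messages in the journal
$ sudo journalctl -k | grep -i oom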

Not enough space on the disk

This error occurs on a Linux system when you're trying to write data or save files, but lack sufficient space. To resolve this error, identify what's consuming disk space, remove files that you no longer need, and expand the volume if the instance still requires more space.
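For example, the following commands show one way to find where the space is going on a Linux instance. This is a minimal sketch; adjust the path to the file system that's full on your instance:

# Show disk usage for each mounted file system
$ df -h

# Show the largest directories under /var, a common location for log growth
$ sudo du -xh /var | sort -rh | head -n 10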

Too many open files

When Linux instances stop processing Systems Manager Run Command because of document worker crashes, you might receive a too many open files error. When too many files are open, SSM Agent can't start command processors and reports this error in the agent logs. This error occurs in the following scenarios:

  • The SSM Agent process reached the maximum number of open files for the root user.
  • The total number of open files across the system reached the system-wide limit of maximum open files.
  • The system reached the limit for the inotify subsystem in the kernel (see the checks after this list).
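To check the limits that the last two scenarios refer to, compare the current counts with the configured maximums. The following commands are a minimal sketch for a Linux instance; parameter names can vary between kernel versions:

# Allocated file handles, unused handles, and the system-wide maximum
$ cat /proc/sys/fs/file-nr

# Current inotify limits for instances and watches
$ sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches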

To troubleshoot this error, complete the following steps:

1.     Identify the PID of the SSM Agent process:

$ sudo ps -C amazon-ssm-agent -o pid=

2.    Identify the open file limits for the PID. The first number is the soft limit, and the second number is the hard limit. Replace PID with the process ID from step 1:

$ sudo cat /proc/PID/limits | grep "Max open files"

3.    Identify the total number of files that the Systems Manager process has open:

$ sudo lsof -p PID | wc -l

4.    Compare the results between steps 2 and 3. If the total number of open files is close to the hard limit, then this might be preventing new files from opening. To resolve this issue, take one of the following actions:

  • Restart SSM Agent.
  • Set a higher value for the hard limit in the SSM Agent startup files:

Note: Replace all example strings with your values.

Upstart: Amazon Linux 1, Ubuntu 14.04, and Ubuntu 16.04 with the .deb package

$ echo "limit nofile example-soft-limit example-hard-limit" | sudo tee -a /etc/init/amazon-ssm-agent.override

Systemd: Amazon Linux 2, RHEL 7.x, and RHEL 8.x 

$ sudo systemctl edit amazon-ssm-agent

Add the following lines to the override file, and then save the file:

[Service]
LimitNOFILE=example-hard-limit

Systemd: Ubuntu 22.04 LTS, 20.10 STR & 20.04, and 18.04 (using snap)

$ sudo systemctl edit snap.amazon-ssm-agent.amazon-ssm-agent

Add the following lines to the override file, and then save the file:

[Service]
LimitNOFILE=example-hard-limit

5.    Restart the SSM Agent service.
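For example, on systemd-based distributions you can restart and verify the agent with the following commands. This is a minimal sketch; on snap-based installations, use the snap.amazon-ssm-agent.amazon-ssm-agent service name:

# Reload unit files so that the override takes effect, and then restart the agent
$ sudo systemctl daemon-reload
$ sudo systemctl restart amazon-ssm-agent

# Confirm that the agent is running
$ sudo systemctl status amazon-ssm-agent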

Note: When updating the hard limit, review your application's requirements, including the number of concurrent users, network connections, and file operations. To prevent abuse and resource exhaustion, default hard limits are set low. Make sure to stress test the new hard limit. Monitor the hard limit and adjust it, if needed.

Check the worker logs

To review worker logs, complete the following steps:

1.    View the SSM Agent logs (see the example after this list). SSM Agent maintains information in the following files:

Note: To understand the available log files and their purpose, it's a best practice to refer to the official documentation for the OS that you use.

  • Linux - /var/log/amazon/ssm/amazon-ssm-agent.log
  • Linux - /var/log/amazon/ssm/errors.log
  • Windows - %PROGRAMDATA%\Amazon\SSM\Logs\amazon-ssm-agent.log
  • Windows - %PROGRAMDATA%\Amazon\SSM\Logs\errors.log
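For example, on a Linux instance you can check the agent logs for the worker crash messages that the error refers to. This is a minimal sketch; adjust the paths for Windows:

# Show the most recent errors reported by SSM Agent
$ sudo tail -n 50 /var/log/amazon/ssm/errors.log

# Search the main agent log for document worker or session worker crashes
$ sudo grep -iE "ssm-document-worker|ssm-session-worker|panic" /var/log/amazon/ssm/amazon-ssm-agent.log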

2.    Check the OS-level logs for software or kernel issues (see the example after this list):

  • Windows - C:\Windows\System32\winevt\Logs
  • Ubuntu/Debian - /var/log/syslog
  • Amazon Linux/CentOS/RHEL - /var/log/messages
  • SUSE - /var/log/messages
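For example, on an Amazon Linux, CentOS, or RHEL instance you can search the system log for kernel or agent-related problems around the time of the failure. This is a minimal sketch; use /var/log/syslog on Ubuntu and Debian:

# Search the system log for SSM Agent, OOM, or disk space messages
$ sudo grep -iE "amazon-ssm-agent|out of memory|no space left" /var/log/messages | tail -n 50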

3.    Update the seelog.xml file to allow SSM Agent debug logging.

Note: SSM Agent debug logging generates large amounts of log data that might affect system storage. After you finish troubleshooting, it's a best practice to turn off debug logging.
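For example, on a Linux instance you can create seelog.xml from the template that ships with the agent and raise the log level. This is a minimal sketch that assumes the default /etc/amazon/ssm installation path and the default minlevel="info" setting in the template:

# Create the active logging configuration from the template
$ sudo cp /etc/amazon/ssm/seelog.xml.template /etc/amazon/ssm/seelog.xml

# Change the minimum log level from "info" to "debug"
$ sudo sed -i 's/minlevel="info"/minlevel="debug"/' /etc/amazon/ssm/seelog.xml

# Restart the agent so that the change takes effect
$ sudo systemctl restart amazon-ssm-agent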

Related information

Troubleshooting SSM Agent
