My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance becomes unresponsive due to over-utilization of resources. How can I prevent this?
An instance commonly becomes unresponsive for the following reasons:
Memory: EC2 instances don't have allocated swap space by default. Running out of memory can invoke the Linux Out Of Memory (OOM) manager, commonly called the OOM killer. The OOM manager frees memory by terminating processes, which can include a database, a web server, or the SSH service.
Networking: Without networking, your system can't answer ARP requests from status checks. When this occurs, your instance fails to communicate with other hosts.
Amazon Elastic Block Store (Amazon EBS): When disk I/O is unavailable, read and write operations stall. Examples include creating temporary files, reading system libraries, and running database queries.
CPU: All the preceding tasks require CPU time to work. 100% CPU usage for a prolonged time prevents the kernel from performing normal operating system operations.
These issues can also compound into a snowball effect. For example, you run out of memory and the OOM manager terminates an important process. A second process that depends on the terminated process then starts consuming far more CPU cycles, for example by retrying repeatedly. If that work is disk related, then it can also exhaust the I/O capacity of the EBS volume. The issue can also spread to a different instance that is expecting communication from the unresponsive instance.
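To make the memory cause above concrete, the following minimal sketch reads /proc/meminfo and reports whether swap is configured and whether available memory is low. The function names and the 10% threshold are illustrative assumptions, not values from AWS guidance. The MemAvailable field is present on reasonably recent Linux kernels.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; a few (such as HugePages counters) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(meminfo, min_available_pct=10.0):
    """Summarize memory pressure. The 10% default is an arbitrary example threshold."""
    total = meminfo.get("MemTotal", 0)
    available = meminfo.get("MemAvailable", 0)
    swap_total = meminfo.get("SwapTotal", 0)
    available_pct = 100.0 * available / total if total else 0.0
    return {
        "available_pct": round(available_pct, 1),
        "has_swap": swap_total > 0,
        "low_memory": available_pct < min_available_pct,
    }


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")
    if meminfo_path.exists():
        print(memory_summary(parse_meminfo(meminfo_path.read_text())))
```

A result with has_swap set to False and low_memory set to True suggests that the instance is at risk of invoking the OOM manager.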
If your system often becomes unresponsive due to over-utilization of resources, do the following:
- Use a monitoring tool such as Amazon CloudWatch to observe trends and patterns of high resource utilization.
- If you have multiple services and aren't sure which one is over-utilizing resources, then install a utility such as atop.
- Review your application and operating system logs. These logs are usually located in /var/log/.
- Review the history of commands to see if there was human error. The command history is usually located in the ~/.bash_history file.
- Review scheduled cron jobs by running the crontab -l command, which lists the jobs for the current user.
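As one way to carry out the log review step above, the following sketch scans log text for the kernel messages that the OOM manager writes when it terminates a process. The regular expression assumes the common "Out of memory: Kill process" and "Out of memory: Killed process" message formats, which can differ between kernel versions. The log file to scan is passed on the command line because its name under /var/log/ depends on the distribution.

```python
import re
import sys

# Assumed message formats; older kernels log "Kill process", newer ones "Killed process".
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) tuples for each OOM termination found."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


if __name__ == "__main__":
    # Usage: python find_oom.py <log file> [<log file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for pid, name in find_oom_kills(handle.read()):
                print(f"{path}: OOM manager terminated {name} (pid {pid})")
```

If the same process name appears repeatedly, that service is a reasonable first candidate for the memory benchmark described later in this article.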
Act based on the acquired data
Prevent future over-utilization
- Before deploying a new application in production, create a test environment and benchmark the application to determine the compute, memory, EBS, and network resources that it needs.
- Deploy according to your benchmarks, while building for fault tolerance. For more information, see the following:
Design interactions in a distributed system to prevent failures
Tutorial: Set up a scaled and load-balanced application
- Continue monitoring your instances, and create alarms for certain resource usage thresholds.
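The alarm step above can be sketched with the AWS SDK for Python (boto3), assuming that boto3 is installed and that credentials are configured. The helper names, the alarm name, and the threshold of 80% over three 5-minute periods are illustrative assumptions. Choose values based on your own benchmarks.

```python
def cpu_alarm_params(instance_id, threshold_pct=80.0, periods=3):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The threshold and period values are example assumptions, not recommendations.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,  # seconds per evaluation period
        "EvaluationPeriods": periods,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_cpu_alarm(instance_id, **kwargs):
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported here so cpu_alarm_params works without the SDK installed

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**cpu_alarm_params(instance_id, **kwargs))
```

This sketch doesn't attach any alarm actions, so the alarm only changes state. To receive notifications, you would also pass an AlarmActions list, for example with an Amazon SNS topic ARN. Memory and disk usage metrics aren't published for EC2 instances by default, so alarms on those resources typically require the CloudWatch agent.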