EC2 server instances unresponsive

Question

On two successive days I found that the same two server instances (both of them t4g.small) had become unresponsive overnight and needed to be Stopped and Started using the Amazon Console. I believe that a Stop/Start leads to the instance being launched on different underlying server hardware. The server logs are not accessible while the instances are unresponsive, so I have no idea why the instances are hanging, but I fear there may be an underlying hardware issue.

How can I handle this problem? What could be the reasons for the same two instances hanging repeatedly? The message on the Instances view of the Management Console was "1/2 Checks passed" in red. Is there a way that a server instance can be automatically Stopped and then Started if a Health check remains in a failed state for a certain length of time?

Answer

Thanks for your response. I have now created CloudWatch Alarms for each of the failing instances and selected Reboot as the associated Action (because there was no option to Stop and then Start an instance). But I am not sure that Reboot will work when the instance freezes. I have also had to associate Elastic IP addresses with these two instances because the Public IP address of each instance changes whenever I Stop and then Start it. Is it normal for the Public IP address of an instance to change like this after a Stop/Start, or is this peculiar to specific Availability Zones/Regions?

Answer

Unfortunately, from time to time, an EC2 instance will fail - and sometimes more than one at once, if there is an incident that impacts an entire rack of servers or more.

There are multiple strategies for performing automated recovery of failed EC2 instances that are [discussed in our documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html). Simplified automatic recovery works in many - but not all - circumstances. We recommend configuring CloudWatch Alarms to detect and recover when system status checks fail.
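As a rough illustration of the alarm-based approach, here is a minimal Python sketch that builds the parameters for a CloudWatch alarm on an instance's status check metric. The helper names (`build_status_check_alarm`, `create_alarm`), the alarm name format, and the two-minute evaluation window are assumptions made for this example, not values from the documentation. The sketch pairs the `recover` action with `StatusCheckFailed_System` and the `reboot` action with `StatusCheckFailed_Instance`; whether recovery is available depends on the instance's configuration, so treat this as a starting point rather than a definitive setup.

```python
def build_status_check_alarm(instance_id, region, action="recover", minutes=2):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Hypothetical helper: the alarm name and the default evaluation
    window are illustrative assumptions.
    """
    # Recover is paired with system status checks; reboot with
    # instance status checks.
    metric_by_action = {
        "recover": "StatusCheckFailed_System",
        "reboot": "StatusCheckFailed_Instance",
    }
    if action not in metric_by_action:
        raise ValueError(f"unsupported action: {action}")
    return {
        "AlarmName": f"ec2-{action}-{instance_id}",  # assumed naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": metric_by_action[action],
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                  # one-minute periods
        "EvaluationPeriods": minutes,  # consecutive failing periods
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }


def create_alarm(instance_id, region, action="recover"):
    """Create the alarm. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the builder works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(
        **build_status_check_alarm(instance_id, region, action)
    )
```

Calling `create_alarm("i-0123456789abcdef0", "eu-west-1")` (placeholder instance ID and Region) would create a recover alarm for that instance. Keeping the parameter builder separate from the API call makes it possible to inspect the alarm definition before anything is created.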

If you have not terminated the failed instance - or if you have terminated it, but opted to preserve the EBS volumes associated with it when you created it - you may be able to locate the original EBS volume and attach it to an instance to examine the logs for troubleshooting purposes.
