Instance Status Failed

We had an instance failure on 25/03/2019 at 01:45. However, I cannot see anything listed under "events" on the AWS support dashboards that would indicate an issue.

The instance stopped responding and looking at the CloudWatch metrics I can see that the StatusCheckFailed_System metric went from 0 to 1 between 01:44 and 01:45 on 25/03/2019.

No health issues have been reported, nor have we received communication from AWS that the instance is running on any degraded hardware (like we have received in the past when an instance failed during the night).

Can AWS advise of any issues experienced between 01:40 and 02:10 on 25/03/2019 that would affect the below instance? Metrics for the EC2 instance and also the EBS volume between these times are blank in CloudWatch which would indicate to me that there was an issue that caused the outage.

instance-id: i-017d39167c95d214c

asked 3 years ago, 7 views
1 Answer
Accepted Answer

Hello matthalion,

I am sorry to hear about the issue with your instance i-017d39167c95d214c.

I have checked the instance and can see that the underlying physical host on which it was running experienced hardware-related issues during the times you mention. This caused your instance to become unresponsive and to fail its status checks.

In the future, you can check whether an instance was affected by a hardware-related event by looking at its system status checks [1]. The history of these checks is also available in Amazon CloudWatch under the StatusCheckFailed_System metric [2,3].
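As an illustration of the CloudWatch check described above, here is a minimal Python sketch that looks up StatusCheckFailed_System for a time window. It assumes the boto3 SDK is installed and credentials are configured. The function names, the region argument, and the 60-second period are my own choices for this example and are not taken from the thread.

```python
from datetime import datetime


def status_check_query(instance_id: str, start: datetime, end: datetime,
                       period: int = 60) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period,  # seconds per datapoint (assumed value)
        "Statistics": ["Maximum"],
    }


def failed_minutes(datapoints: list) -> list:
    """Return the sorted timestamps of datapoints where the check failed."""
    return sorted(p["Timestamp"] for p in datapoints if p["Maximum"] >= 1)


def fetch_failures(instance_id: str, start: datetime, end: datetime,
                   region: str) -> list:
    """Query CloudWatch; requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    response = cloudwatch.get_metric_statistics(
        **status_check_query(instance_id, start, end)
    )
    return failed_minutes(response["Datapoints"])
```

Calling fetch_failures with the instance ID, the start and end of the window, and the instance's region should return the minutes in which the system status check reported a failure. Minutes with no datapoint at all simply do not appear in the result.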

Please accept our apologies for the above issue and for any inconvenience caused by it.

Please note that your instance is still running on the same physical host. Although the host is healthy right now, you may want to stop and then start your instance. A stop / start moves the instance to another healthy physical host [4] that was not affected by the hardware issues described above (note: a simple 'Reboot' does not do this).
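If you prefer to script the stop / start rather than use the console, a minimal sketch with boto3 might look like the following. The function name is hypothetical, and the code assumes you pass in an EC2 client such as boto3.client("ec2", region_name=...) for the instance's region.

```python
def stop_then_start(ec2, instance_id: str) -> None:
    """Stop an instance, wait until it is stopped, then start it again.

    `ec2` is expected to be a boto3 EC2 client. Data on instance store
    volumes does not survive a stop, so check for that before running this.
    """
    ec2.stop_instances(InstanceIds=[instance_id])
    # Block until EC2 reports the instance as fully stopped.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.start_instances(InstanceIds=[instance_id])
    # Block until the instance is running again.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

The waiters matter here: issuing the start call before the instance has fully stopped would fail, and a reboot call is not a substitute because it keeps the instance on the same host.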

I would also suggest taking a look at the Auto Recovery feature for Amazon EC2. You can create an Amazon CloudWatch alarm that monitors an EC2 instance and automatically recovers it if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair. In short, you set up a CloudWatch alarm that triggers when the system status check fails, and attach an EC2 action such as "Recover this instance" to it [5,6].
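For reference, a recovery alarm of this kind could be created with boto3 roughly as sketched below. The alarm name, the two one-minute evaluation periods, and the threshold are assumptions chosen for illustration, so adjust them to your needs. The cloudwatch argument is expected to be a boto3 CloudWatch client for the instance's region.

```python
def recover_alarm_params(instance_id: str, region: str) -> dict:
    """Build put_metric_alarm arguments for an EC2 recover action."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds (assumed value)
        "EvaluationPeriods": 2,  # consecutive failing periods (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action ARN that asks EC2 to recover the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recover_alarm(cloudwatch, instance_id: str, region: str) -> str:
    """Create the alarm and return its name."""
    params = recover_alarm_params(instance_id, region)
    cloudwatch.put_metric_alarm(**params)
    return params["AlarmName"]
```

The alarm watches only the system status check, so it should trigger on host-level problems like the one described in this thread and not on issues inside the operating system. Not every instance type or configuration supports the recover action, so please check [5] for the current requirements.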

Please let us know if you need any further help.

Links:
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html#types-of-instance-status-checks
[2] https://aws.amazon.com/blogs/aws/ec2-instance-status-metrics/
[3] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ec2-metricscollected.html#ec2-metrics
[4] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html#instance_stop
[5] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html#AddingRecoverActions
[6] https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/

Regards,
awstomas

answered 3 years ago