My instance randomly shut down (i-0c9d6ec62a98addd4)


Hello,

At around 8:31am Pacific this morning, a number of services started to fail.

I narrowed the problem down to my m4.xlarge instance (i-0c9d6ec62a98addd4). It apparently shut down suddenly. The event log message upon reboot was:

"The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly."

I'm trying to get a better idea of what happened. Any indicators on your side? Is it a power problem or something else entirely? If it was a planned outage, is there some way I can be notified ahead of time so I could at least do a clean shutdown?

thx

Asked 5 years ago · 905 views
2 answers

Hello,

I am sorry to hear about the issue with your instance i-0c9d6ec62a98addd4.

I have checked the instance and I could see that the underlying physical host, on top of which your instance was hosted, started experiencing hardware related issues at 2019-01-25T16:33:00.000Z. This caused your instance to reboot.

Please note that in the future you can check whether an instance was affected by a hardware-related event by checking its 'System Status Checks' [1]. The history of these checks can also be viewed in Amazon CloudWatch by looking at the StatusCheckFailed_System metric [2,3].
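As a rough illustration of reading that history programmatically (not part of the original answer), the sketch below builds a CloudWatch `get_metric_statistics` query for StatusCheckFailed_System and lists the periods in which the check failed. The helper names `system_check_query` and `failed_periods`, the 24-hour window, and the 300-second period are assumptions chosen for the example; `cloudwatch` is expected to be a boto3 CloudWatch client for the instance's region.

```python
from datetime import datetime, timedelta, timezone


def system_check_query(instance_id, end, hours=24):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # seconds per datapoint (assumed granularity)
        "Statistics": ["Maximum"],  # a value of 1 means the check failed
    }


def failed_periods(cloudwatch, instance_id, end=None, hours=24):
    """Return sorted timestamps of periods where the system check failed.

    `cloudwatch` is assumed to be a boto3 CloudWatch client, e.g.
    boto3.client("cloudwatch", region_name=<your region>).
    """
    end = end or datetime.now(timezone.utc)
    query = system_check_query(instance_id, end, hours)
    response = cloudwatch.get_metric_statistics(**query)
    return sorted(
        point["Timestamp"]
        for point in response["Datapoints"]
        if point["Maximum"] >= 1
    )
```

Calling `failed_periods(client, "i-0c9d6ec62a98addd4")` would then return the timestamps, within the last 24 hours, at which the system status check reported a failure.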

Please accept our apologies for the above issue and for any inconvenience caused by it.

I have now checked the instance and I can see that it is back up and running again.

I would suggest taking a look at the Auto Recovery feature for Amazon EC2. You can create an Amazon CloudWatch alarm that monitors an Amazon EC2 instance and automatically recovers it if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair. In short, you set up a CloudWatch alarm that triggers when the System Status check fails, and that alarm can in turn trigger an EC2 action such as "Recover this instance" [4,5].
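For readers who prefer to script this rather than use the console, here is a minimal sketch (not part of the original answer) of such an alarm created through boto3's `put_metric_alarm`. The function names, the alarm name, and the period and evaluation settings are assumptions for illustration only; the region must match the instance's region, which is not stated in this thread.

```python
def recover_alarm_params(instance_id, region):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds (assumed)
        "EvaluationPeriods": 2,  # consecutive failing periods (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # EC2 recover action for the given region
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recover_alarm(cloudwatch, instance_id, region):
    """Create the alarm using a boto3 CloudWatch client passed in by the caller."""
    cloudwatch.put_metric_alarm(**recover_alarm_params(instance_id, region))
```

With these assumed settings the alarm fires after the system status check has failed in two consecutive one-minute periods. Tune the numbers to your own tolerance for false positives.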

We also advise our customers to design their applications so that there is no single point of failure in their environment. Please refer to our white paper on Building Fault-Tolerant Applications in the AWS Cloud [6] for more information.

Please let us know if you need any further help.

Links:
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html#types-of-instance-status-checks
[2] https://aws.amazon.com/blogs/aws/ec2-instance-status-metrics/
[3] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ec2-metricscollected.html#ec2-metrics
[4] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html#AddingRecoverActions
[5] https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/
[6] https://aws.amazon.com/whitepapers/designing-fault-tolerant-applications/

Regards,
awstomas

AWS
Answered 5 years ago

Aye, thanks for the help. I set up some alerts per your recommendation.

Answered 5 years ago
