Multiple system check failures over past 18 months - how to automatically stop/start instance?


I've experienced 3 system check failures on my LightSail Windows Server instance over the past 18 months. Each failure lasts for many hours. According to the FAQ page (https://aws.amazon.com/premiumsupport/knowledge-center/lightsail-instance-failed-status-check/), this basically means that the underlying hardware has failed and the instance needs to be stopped and started again to migrate it to new hardware. I do this manually, but it leads to a lot of downtime (especially since LightSail takes up to 1 hour to stop the instance when there's a system check failure).

The resource utilization up to the point of failure is extremely low (see image): https://www.dropbox.com/s/a882b3nbu8roauu/Lightsail.jpg

  1. Does anyone know why my LightSail instances keep having system check failures and what I can do to avoid them?
  2. More importantly, is there any way to have Amazon automatically stop and start the failed instance if the system check failure continues for more than X minutes/hours?
  • Instance status checks might also fail due to over-utilization of resources. Have you ruled this out?

  • @RoB Yes. This instance averages under 0.5% CPU utilization right up to the point of failure, with similarly low network and disk activity; it's an IIS server handling a few requests per minute. Whenever it has a system check failure, it stays in that state for 3+ hours, and the only solution is to stop it (which takes up to 45 minutes whenever it fails) and then start it. After that it's good for the next 6 months or so until the next system check failure.

RBoy
asked 2 years ago, 244 views
1 Answer

Hello RBoy,

I understand that your Lightsail instance fails system checks from time to time, and that each time you have to manually stop and start the instance to migrate it to a new host. You would like to automate stopping and starting the instance when a system check failure happens.

I suggest you look at your system logs to check what is causing your Lightsail instance to fail system checks. The logs may reveal an error that can help you troubleshoot the issue.

To automatically stop and start your instance, you can use a Lambda function with CloudWatch Events to trigger these actions. CloudWatch automatically manages a variety of metrics for standard EC2 instances, but the metrics collected in Lightsail are not visible in the CloudWatch dashboard by default. To get your Lightsail metrics into CloudWatch, you will have to do the following:

  1. Create an IAM user with the necessary permissions to submit the CloudWatch metrics data collected from the Lightsail instance.
  2. Install the CloudWatch Agent on your Lightsail instance.
  3. Configure the CloudWatch Agent to use the IAM user when submitting data to CloudWatch.

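Below is sample code you can use in a Lambda function to schedule the stop of the instance:

import boto3

region = 'us-west-1'
# Pass the variable itself, not the literal string 'region'.
client = boto3.client('lightsail', region_name=region)

def lambda_handler(event, context):
    # The Lightsail API identifies instances by name (instanceName).
    client.stop_instance(instanceName='ID-OF-YOUR-LIGHTSAIL-INSTANCE')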

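Sample code you can use in a second Lambda function to schedule the start of the instance:

import boto3

region = 'us-west-1'
# Pass the variable itself, not the literal string 'region'.
client = boto3.client('lightsail', region_name=region)

def lambda_handler(event, context):
    # The Lightsail API identifies instances by name (instanceName).
    client.start_instance(instanceName='ID-OF-YOUR-LIGHTSAIL-INSTANCE')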

For region, replace 'us-west-1' with the AWS Region that your instance is in, and replace 'ID-OF-YOUR-LIGHTSAIL-INSTANCE' with the name of the specific instance that you want to stop and start (the Lightsail API identifies instances by name).
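
The two scheduled functions above stop and start the instance unconditionally. As a rough sketch of the "only if the failure lasts more than X minutes" behaviour asked about in the question, a single function could read the StatusCheckFailed_System metric through the Lightsail API and act on it. The names INSTANCE_NAME, WINDOW_MINUTES, MIN_POINTS and decide_action are hypothetical and chosen for illustration, and the sketch assumes the function is invoked on a schedule (for example every few minutes) with permission to call the Lightsail actions it uses. It is untested against a real failed instance.

```python
from datetime import datetime, timedelta, timezone

REGION = "us-west-1"                               # assumption: your instance's Region
INSTANCE_NAME = "NAME-OF-YOUR-LIGHTSAIL-INSTANCE"  # hypothetical placeholder
WINDOW_MINUTES = 30                                # hypothetical threshold "X"
MIN_POINTS = 25                                    # tolerate a few missing 1-minute datapoints


def decide_action(state_name, datapoints, min_points=MIN_POINTS):
    """Return 'start', 'stop' or 'none' from the instance state and metric data."""
    if state_name == "stopped":
        # A previous run asked for a stop and it has completed.
        return "start"
    if state_name == "running" and len(datapoints) >= min_points:
        # Stop only if every datapoint in the window reports a failed system check.
        if all(p.get("maximum", 0) >= 1 for p in datapoints):
            return "stop"
    return "none"


def lambda_handler(event, context):
    import boto3  # imported here so decide_action can be tested without AWS access

    client = boto3.client("lightsail", region_name=REGION)
    state = client.get_instance_state(instanceName=INSTANCE_NAME)["state"]["name"]
    end = datetime.now(timezone.utc)
    resp = client.get_instance_metric_data(
        instanceName=INSTANCE_NAME,
        metricName="StatusCheckFailed_System",
        period=60,
        startTime=end - timedelta(minutes=WINDOW_MINUTES),
        endTime=end,
        unit="Count",
        statistics=["Maximum"],
    )
    action = decide_action(state, resp.get("metricData", []))
    if action == "stop":
        client.stop_instance(instanceName=INSTANCE_NAME)
    elif action == "start":
        client.start_instance(instanceName=INSTANCE_NAME)
    return action
```

Note that this sketch starts the instance whenever it finds it stopped, including after a stop you performed on purpose, so you may want to disable the schedule during planned maintenance.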

I hope that this information will be helpful.

Cebi
answered 2 years ago
  • Thanks @Cebi. How does one pull the system logs for LightSail instances? The link you provided is for EC2 instances.
