Multiple system check failures over past 18 months - how to automatically stop/start instance?

I've experienced three system check failures on my Lightsail Windows Server instance over the past 18 months. Each failure lasts for many hours. According to the knowledge-center article (https://aws.amazon.com/premiumsupport/knowledge-center/lightsail-instance-failed-status-check/), a system check failure basically means that the underlying hardware has failed and the instance needs to be stopped and started again to migrate it to new hardware. I do this manually, but it leads to a lot of downtime, especially since Lightsail takes up to 1 hour to stop the instance when there's a system check failure.

The resource utilization up to the point of failure is extremely low (see image): https://www.dropbox.com/s/a882b3nbu8roauu/Lightsail.jpg

  1. Does anyone know why my Lightsail instance keeps having system check failures and what I can do to avoid them?
  2. More importantly, is there any way to have Amazon automatically stop and start the failed instance if the system check failure continues for more than X minutes/hours?
  • Instance status checks might also fail due to over-utilization of resources. Have you ruled this out?

  • @RoB yes, this instance averages under 0.5% CPU utilization right up to the point of failure, with similarly low network and disk activity. It's an IIS server handling a few requests per minute. Whenever it has a system check failure, it stays in that state for 3+ hours, and the only solution is to stop it (which takes up to 45 minutes whenever it fails) and then start it. After that it's good for the next 6 months or so, until the next system check failure.

RBoy
Asked 2 years ago, 244 views
1 answer

Hello RBoy,

I understand that your Lightsail instance fails system checks from time to time, and that you then have to stop and start it manually to migrate it to a new host. You would like to automate this stop and start whenever a system check failure occurs.

I suggest you look at your system logs to check what is causing your Lightsail instance to fail the system check. The logs may reveal an error that can help you troubleshoot the issue.

To stop and start your instance automatically, you can use a Lambda function triggered by CloudWatch Events. CloudWatch automatically collects a variety of metrics for standard EC2 instances; however, the metrics collected in Lightsail are not visible in the CloudWatch dashboard by default. To get your Lightsail metrics into CloudWatch, you will have to do the following:

  1. Create an IAM user with the necessary permissions to submit the CloudWatch metrics data collected from the Lightsail instance.
  2. Install the CloudWatch Agent on your Lightsail instance.
  3. Configure the CloudWatch Agent to use the IAM user when submitting data to CloudWatch.
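
As a possible alternative to the agent route, the Lightsail API itself exposes instance metrics through `get_instance_metric_data`, including a `StatusCheckFailed_System` metric, so a Lambda function may be able to read the check result directly. The following is a minimal sketch under those assumptions, not a tested production setup. The helper names `failed_minutes` and `system_check_failing`, the 15-minute threshold, and the 5-minute slack are illustrative choices, and `client` is assumed to be a `boto3.client('lightsail', ...)` object.

```python
from datetime import datetime, timedelta, timezone


def failed_minutes(datapoints):
    """Count datapoints whose 'sum' shows at least one failed system check."""
    return sum(1 for point in datapoints if point.get("sum", 0) > 0)


def system_check_failing(client, instance_name, threshold_minutes=15):
    """Return True if the system check failed for at least
    `threshold_minutes` one-minute datapoints in the recent window.

    `client` is assumed to be boto3.client("lightsail", region_name=...).
    """
    end = datetime.now(timezone.utc)
    # Look back slightly further than the threshold (5 minutes of slack,
    # an arbitrary choice) to tolerate late or missing datapoints.
    start = end - timedelta(minutes=threshold_minutes + 5)
    response = client.get_instance_metric_data(
        instanceName=instance_name,
        metricName="StatusCheckFailed_System",
        period=60,               # seconds per datapoint
        startTime=start,
        endTime=end,
        unit="Count",
        statistics=["Sum"],
    )
    datapoints = response.get("metricData", [])
    return failed_minutes(datapoints) >= threshold_minutes
```

Counting failed minutes rather than reacting to a single datapoint is meant to approximate the "failure continues for more than X minutes" condition from the question; the threshold can be raised if brief blips should be ignored.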

Below is sample code you can use to schedule a stop of the instance:

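import boto3

# Pass the region variable itself, not the literal string 'region'.
region = 'us-west-1'
client = boto3.client('lightsail', region_name=region)

def lambda_handler(event, context):
    # The Lightsail API identifies instances by name (instanceName).
    client.stop_instance(instanceName='ID-OF-YOUR-LIGHTSAIL-INSTANCE')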

Sample code you can use to schedule a start of the instance:

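import boto3

# Pass the region variable itself, not the literal string 'region'.
region = 'us-west-1'
client = boto3.client('lightsail', region_name=region)

def lambda_handler(event, context):
    # The Lightsail API identifies instances by name (instanceName).
    client.start_instance(instanceName='ID-OF-YOUR-LIGHTSAIL-INSTANCE')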

For region, replace 'us-west-1' with the AWS Region that your instance is in, and replace 'ID-OF-YOUR-LIGHTSAIL-INSTANCE' with the name of the specific instance that you want to stop and start (the Lightsail API identifies instances by name rather than by an EC2-style instance ID).
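
Because a stop can take 45 minutes or more during a system check failure, while a single Lambda invocation is limited to 15 minutes, one function cannot simply stop the instance, wait, and start it again. A possible approach is one function on a short schedule that performs a single step per invocation, based on the current instance state. The sketch below is an assumption-laden illustration: `next_action`, `recover_step`, the `check_failing` event key, and the `REGION` and `INSTANCE_NAME` values are hypothetical names, and `force=True` is the `stop_instance` option documented for instances stuck in the stopping state.

```python
REGION = "us-west-1"            # assumption: replace with your Region
INSTANCE_NAME = "my-instance"   # assumption: replace with your instance name


def next_action(state_name, check_failing):
    """Pick one step from the instance state and the status check result."""
    if state_name == "running" and check_failing:
        return "stop"
    if state_name == "stopping":
        # Assumes any stop still pending at the next run is stuck.
        return "force_stop"
    if state_name == "stopped":
        # Caution: this also restarts an instance you stopped on purpose.
        return "start"
    return "wait"


def recover_step(client, instance_name, check_failing):
    """Run one recovery step and return the action that was taken."""
    state = client.get_instance_state(instanceName=instance_name)["state"]["name"]
    action = next_action(state, check_failing)
    if action == "stop":
        client.stop_instance(instanceName=instance_name)
    elif action == "force_stop":
        client.stop_instance(instanceName=instance_name, force=True)
    elif action == "start":
        client.start_instance(instanceName=instance_name)
    return action


def lambda_handler(event, context):
    # Imported here so the logic above can be exercised without AWS access.
    import boto3

    client = boto3.client("lightsail", region_name=REGION)
    # 'check_failing' is a hypothetical event field set by whatever
    # detects the failure (for example, an alarm-driven trigger).
    return recover_step(client, INSTANCE_NAME, bool(event.get("check_failing")))
```

Each invocation is idempotent: running it every few minutes moves the instance from running to stopping to stopped to running again, without any single invocation having to wait for the stop to finish.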

I hope that this information will be helpful.

Cebi
Answered 2 years ago
  • Thanks @Cebi. How does one pull the system logs for LightSail instances? The link you provided is for EC2 instances.
