How can I trigger automatic recovery when my Amazon EC2 instance fails a status check?

I want to trigger an automatic recovery action when my Amazon Elastic Compute Cloud (Amazon EC2) instance fails a status check.

Short description

Automatic recovery can recover an Amazon EC2 instance when it fails a system status check. A failed system status check usually indicates a problem with the underlying AWS hardware. However, automatic recovery can't recover an instance that fails an instance status check. For more information on these checks, see Types of status checks.

The following instance types support automatic recovery actions:

  • General purpose: A1, M3, M4, M5, M5a, M5n, M5zn, M6a, M6g, M6i, M6in, M7g, T1, T2, T3, T3a, T4g
  • Compute-optimized: C3, C4, C5, C5a, C5n, C6a, C6g, C6gn, C6i, C6in, C7g, Hpc6a
  • Memory-optimized: R3, R4, R5, R5a, R5b, R5n, R6a, R6g, R6i, R6in, R7g, u-3tb1, u-6tb1, u-9tb1, u-12tb1, u-18tb1, u-24tb1, X1, X1e, X2iezn
  • Accelerated computing: G3, G3s, G5g, Inf1, P2, P3, VT1
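
As a rough illustration, the list above can be turned into a quick lookup. The following Python sketch copies the families from this article; the names SUPPORTED_FAMILIES and supports_auto_recovery are hypothetical, and the list is a snapshot that can go out of date.

```python
# Families copied from the list in this article (lowercased).
# This is a snapshot for illustration, not an authoritative source.
SUPPORTED_FAMILIES = {
    "general_purpose": {
        "a1", "m3", "m4", "m5", "m5a", "m5n", "m5zn", "m6a", "m6g", "m6i",
        "m6in", "m7g", "t1", "t2", "t3", "t3a", "t4g",
    },
    "compute_optimized": {
        "c3", "c4", "c5", "c5a", "c5n", "c6a", "c6g", "c6gn", "c6i", "c6in",
        "c7g", "hpc6a",
    },
    "memory_optimized": {
        "r3", "r4", "r5", "r5a", "r5b", "r5n", "r6a", "r6g", "r6i", "r6in",
        "r7g", "u-3tb1", "u-6tb1", "u-9tb1", "u-12tb1", "u-18tb1", "u-24tb1",
        "x1", "x1e", "x2iezn",
    },
    "accelerated_computing": {"g3", "g3s", "g5g", "inf1", "p2", "p3", "vt1"},
}


def supports_auto_recovery(instance_type: str) -> bool:
    """Return True if the family of a type such as 'm5.large' is in the list."""
    # The family is the part before the first dot, for example 'm5' or 'u-6tb1'.
    family = instance_type.split(".", 1)[0].lower()
    return any(family in families for families in SUPPORTED_FAMILIES.values())
```

For example, supports_auto_recovery("m5.large") returns True, and supports_auto_recovery("i3.large") returns False because I3 isn't in the list above.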

Resolution

There are two methods that you can use to recover your instance automatically:

  • Simplified automatic recovery based on instance configuration
  • Amazon CloudWatch action based recovery

Simplified automatic recovery based on instance configuration

When you use this method, note the following prerequisites and limitations:

  • Your instance must use the default or Dedicated Instance tenancy.
  • By default, all instances that support simplified automatic recovery are configured to recover failed instances.
  • Simplified automatic recovery doesn't take place during AWS Health Dashboard service events or any other AWS events that affect the underlying hardware.
  • The following steps show you how to set the automatic recovery behavior to default or turn off automatic recovery. You can do this during instance launch or after you launch your instance.
  • This process works only on Running instances.

Use the following steps to turn off simplified automatic recovery during instance launch:

  1. Open the Amazon EC2 console.
  2. Choose Launch instance.
  3. In the Advanced details section, turn off Instance auto-recovery.
  4. Configure your settings, and then launch the instance.

Use the following steps to turn off simplified automatic recovery for instances in the running or stopped states:

  1. Open the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the instance, and then choose Actions.
  4. Choose Instance settings, and then for Change auto-recovery behavior, choose Off.
  5. Choose Save.

Use the following steps to set the automatic recovery behavior to default for instances in the running or stopped states:

  1. Open the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the instance, and then choose Actions.
  4. Choose Instance settings, and then for Change auto-recovery behavior, choose Default (On).
  5. Choose Save.
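
The same setting can also be changed programmatically. The following is a minimal sketch that assumes boto3 is installed and credentials are configured; it uses the EC2 ModifyInstanceMaintenanceOptions API, whose AutoRecovery parameter accepts "default" or "disabled". The function names and the instance ID in the usage note are hypothetical.

```python
def auto_recovery_request(instance_id: str, enabled: bool) -> dict:
    """Build the request parameters for ModifyInstanceMaintenanceOptions."""
    return {
        "InstanceId": instance_id,
        # "default" turns simplified automatic recovery on for supported
        # instance types; "disabled" turns it off.
        "AutoRecovery": "default" if enabled else "disabled",
    }


def set_auto_recovery(instance_id: str, enabled: bool, region_name=None) -> dict:
    """Apply the setting to an existing instance (makes a real API call)."""
    import boto3  # imported here so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    return ec2.modify_instance_maintenance_options(
        **auto_recovery_request(instance_id, enabled)
    )
```

For example, set_auto_recovery("i-0123456789abcdef0", False) would turn off simplified automatic recovery for that instance, and passing True would set it back to the default behavior.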

Review the results of a simplified automatic recovery on the Health Dashboard event. See the following example notifications:

  • Failed events: AWS_EC2_SIMPLIFIED_AUTO_RECOVERY_FAILURE
  • Successful events: AWS_EC2_SIMPLIFIED_AUTO_RECOVERY_SUCCESS
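
If you prefer to look up these events programmatically, one possible approach is the AWS Health API. The following sketch assumes boto3 is installed and that your account has a support plan that includes access to the AWS Health API. The function names are hypothetical, and pagination is omitted for brevity.

```python
FAILURE_CODE = "AWS_EC2_SIMPLIFIED_AUTO_RECOVERY_FAILURE"
SUCCESS_CODE = "AWS_EC2_SIMPLIFIED_AUTO_RECOVERY_SUCCESS"


def recovery_event_filter() -> dict:
    """Build a DescribeEvents filter that matches only the two recovery events."""
    return {"eventTypeCodes": [FAILURE_CODE, SUCCESS_CODE]}


def classify_recovery_event(event_type_code: str):
    """Map an event type code to 'success' or 'failure'; None for other codes."""
    if event_type_code == SUCCESS_CODE:
        return "success"
    if event_type_code == FAILURE_CODE:
        return "failure"
    return None


def list_recovery_events() -> list:
    """Return (event ARN, outcome) pairs (makes a real API call)."""
    import boto3  # imported here so the helpers above work without boto3

    # The AWS Health API is served from a global endpoint in us-east-1.
    health = boto3.client("health", region_name="us-east-1")
    response = health.describe_events(filter=recovery_event_filter())
    return [
        (event["arn"], classify_recovery_event(event["eventTypeCode"]))
        for event in response["events"]
    ]
```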

CloudWatch action based recovery

You can use CloudWatch action based recovery to choose when you want to recover your instance. When an event triggers the StatusCheckFailed_System alarm, the recover action initiates. Then, the Amazon Simple Notification Service (Amazon SNS) topic triggers the notification that you chose when you created the alarm and associated it with the recovery action.

As part of instance recovery, the instance is migrated during an instance reboot, and any data that is in-memory is lost. When the process is complete, information is published to the SNS topic that you configured for the alarm. Subscribers to the SNS topic receive an email notification that includes the status of the recovery attempt and any further instructions. You can then observe an instance reboot on the recovered instance.

There are a number of reasons that a system check might fail. See the following examples:

  • Loss of network connectivity
  • Loss of system power
  • Software issues on the physical host
  • Hardware issues on the physical host that affect network reachability

When you use this method, note the following prerequisites and limitations:

  • CloudWatch action based recovery also supports instance types that have instance store volumes. This includes M3, C3, R3, X1, X1e, X2idn, and X2iedn, as well as the instances supported by simplified automatic recovery.
  • This method doesn't support EC2 instances with dedicated tenancy or metal instances.
  • CloudWatch doesn't allow you to add a recovery action to an alarm for an instance that doesn't support recovery actions.

Use the following steps to configure automatic recovery on your instance:

Step 1: Create an alarm

  1. Open the Amazon EC2 console.

  2. In the navigation pane, choose Instances.

  3. Select the instance that you want to configure.

  4. Choose Actions, and then choose Monitor and troubleshoot.

  5. Choose Manage CloudWatch alarms, and then choose Create an alarm.

    Note: To create an alarm, you must have AWS Identity and Access Management (IAM) permissions to stop and start the specified instance.

  6. For Alarm notification, choose an SNS topic. You can also create a new topic.

    Note: To receive notifications when an event triggers an alarm, you must be subscribed to the SNS topic.

  7. Toggle on Alarm action, and then choose Recover.

  8. For Group samples by and Type of data to sample, choose a statistic and metric for your use case.

  9. For Consecutive period and Period, enter the evaluation period for the alarm.

  10. A default alarm name is created automatically. Optionally, modify the Alarm name.

  11. Choose Create.
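
The console steps above can also be approximated in code. The following is a minimal sketch that assumes boto3 is installed and that the instance runs in the standard "aws" partition. It builds parameters for the CloudWatch PutMetricAlarm API on the StatusCheckFailed_System metric with the EC2 recover action. The alarm name, the default period, the number of evaluation periods, and the function names are illustrative choices, not values taken from this article.

```python
def recovery_alarm_params(
    instance_id: str,
    region: str,
    sns_topic_arn=None,
    period_seconds: int = 60,
    evaluation_periods: int = 2,
) -> dict:
    """Build PutMetricAlarm parameters for a system status check recovery alarm."""
    # The recover action is expressed as a special "automate" ARN.
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    if sns_topic_arn:
        # Also notify subscribers of the SNS topic when the alarm fires.
        actions.append(sns_topic_arn)
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        # The metric is 0 when the check passes and 1 when it fails.
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


def create_recovery_alarm(instance_id: str, region: str, sns_topic_arn=None) -> dict:
    """Create the alarm (makes a real API call)."""
    import boto3  # imported here so the helper above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    return cloudwatch.put_metric_alarm(
        **recovery_alarm_params(instance_id, region, sns_topic_arn)
    )
```

With the defaults shown, the alarm fires after the system status check fails for two consecutive 60-second periods.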

Step 2: Set alarm for reboot

  1. Open the CloudWatch console.
  2. In the navigation pane, choose All Alarms.
  3. Select the alarm that you created. Choose Actions, and then choose Edit.
  4. In the Additional Configuration section, choose Treat missing data as bad (breaching threshold).
  5. Choose Save.
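
The missing-data setting corresponds to the TreatMissingData parameter of the CloudWatch PutMetricAlarm API, where "breaching" treats missing data points as bad. Because PutMetricAlarm overwrites the whole alarm configuration, the following sketch reads the existing alarm and resubmits it with that one change. It assumes boto3 is installed and a simple single-metric alarm. The function names and the list of copied keys are illustrative and might not cover every alarm.

```python
# Fields that DescribeAlarms returns and PutMetricAlarm accepts for a simple
# single-metric alarm. Read-only fields such as AlarmArn are left out.
PUT_KEYS = (
    "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
    "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
    "Statistic", "Dimensions", "Period", "Unit", "EvaluationPeriods",
    "DatapointsToAlarm", "Threshold", "ComparisonOperator",
)


def as_breaching_on_missing_data(alarm: dict) -> dict:
    """Copy the writable fields of an alarm and treat missing data as breaching."""
    params = {key: alarm[key] for key in PUT_KEYS if key in alarm}
    params["TreatMissingData"] = "breaching"
    return params


def treat_missing_data_as_bad(alarm_name: str, region_name=None) -> dict:
    """Update an existing alarm in place (makes real API calls)."""
    import boto3  # imported here so the helper above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    alarm = cloudwatch.describe_alarms(AlarmNames=[alarm_name])["MetricAlarms"][0]
    return cloudwatch.put_metric_alarm(**as_breaching_on_missing_data(alarm))
```
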
AWS OFFICIAL | Updated 9 months ago