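When an EC2 instance fails the "instance reachability" check (the instance status check), it typically indicates a problem inside the instance itself, such as a kernel or boot failure, exhausted memory, a full or corrupted file system, or misconfigured networking. Problems with the underlying hardware or hypervisor show up as a failed system status check instead. Either failure can cause the application running on the instance to stop working, as you mentioned.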
Steps to Diagnose and Fix the Issue
Check EC2 Status Checks:
**Instance Reachability Check:** This test verifies that the EC2 instance is reachable from AWS. A failure here often means there's an issue with the instance itself, like a kernel issue, exhausted memory, or a network misconfiguration.
System Status Check: This test checks the underlying hardware and hypervisor hosting your instance.
You can view these checks in the AWS Management Console under the EC2 Dashboard -> Instances -> Status Checks tab. Look at the status checks to see which one failed.
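The same information is available from the AWS CLI via `aws ec2 describe-instance-status --instance-ids <instance-id>`. The sketch below is one possible way to summarize that command's JSON output; it assumes the documented response shape (`InstanceStatuses`, each with `SystemStatus.Status` and `InstanceStatus.Status`), and the helper name `failed_checks` is made up for illustration.

```python
import json


def failed_checks(status_json: str) -> dict:
    """Map each instance ID to the list of checks reported as impaired.

    `status_json` is assumed to be the raw JSON text printed by
    `aws ec2 describe-instance-status`.
    """
    result = {}
    for entry in json.loads(status_json).get("InstanceStatuses", []):
        failing = []
        # System status check: underlying hardware and hypervisor.
        if entry.get("SystemStatus", {}).get("Status") == "impaired":
            failing.append("system")
        # Instance status check: reachability of the instance itself.
        if entry.get("InstanceStatus", {}).get("Status") == "impaired":
            failing.append("instance")
        result[entry["InstanceId"]] = failing
    return result
```

A result of `["instance"]` points toward the OS-level diagnostics described here, while `["system"]` points toward a stop/start or auto-recovery.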
Review System Logs:
Instance Console Logs:
In the AWS Management Console, go to your EC2 instance, then open the Actions dropdown and select Monitor and troubleshoot -> Get system log (in older versions of the console this was under Instance Settings -> Get System Log). Review the system log for kernel panics, out-of-memory errors, or other system-level issues that might explain why the instance became unreachable.
CloudWatch Logs:
If you have configured CloudWatch Logs for your EC2 instance, review them for any application-specific errors or unusual system activity leading up to the failure.
/var/log/messages or /var/log/syslog:
SSH into the instance (if it’s accessible) and check the /var/log/messages or /var/log/syslog files for any logs that could indicate issues, such as hardware errors, disk space issues, or network interface problems.
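Reading a long log by eye is error-prone, so a small filter can help. The following is a minimal sketch, not an exhaustive detector: the pattern list is an assumption chosen to cover the failure types mentioned above (kernel panic, out-of-memory, full disk, I/O errors), and `scan_log` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; extend them for your distribution and kernel.
PATTERNS = {
    "kernel_panic": re.compile(r"kernel panic", re.IGNORECASE),
    "out_of_memory": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "disk_full": re.compile(r"no space left on device", re.IGNORECASE),
    "io_error": re.compile(r"i/o error", re.IGNORECASE),
}


def scan_log(lines):
    """Return {label: [matching lines]} for every pattern that matched."""
    hits = {}
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(line.rstrip("\n"))
    return hits
```

It accepts any iterable of lines, so it works on an open file such as `/var/log/syslog` as well as on the text returned by Get System Log after splitting it into lines.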
Check Application Logs:
If your Node.js application generates logs, check those logs (usually in /var/log/ or a custom log directory) for any errors that could explain why the application stopped working.
Check Disk Space:
Sometimes, a full disk can cause the instance to fail to respond properly.
Run df -h to check disk usage. If the root volume is full, clean up unnecessary files.
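If you would rather check this from a script (for example, a cron job that warns before the disk fills up), a rough equivalent using only the Python standard library might look like this. The function names and the 90% threshold are arbitrary choices for illustration, and the percentage can differ slightly from the `Use%` column of `df` because `df` excludes blocks reserved for root.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the file system at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """True when usage at `path` is at or above `threshold` percent."""
    return disk_usage_percent(path) >= threshold
```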
Analyze Memory and CPU Usage:
Use commands like top, htop, or free -m to check for high CPU or memory usage that might have led to the instance becoming unresponsive. Look for memory leaks or processes consuming excessive resources.
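On Linux, `free` gets its numbers from `/proc/meminfo`, so the same figures can be read programmatically. This is a small sketch under the assumption that the kernel exposes the `MemAvailable` field (present on kernels 3.14 and later); both function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   <number> [kB]" lines.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_memory_percent(text):
    """Return MemAvailable as a percentage of MemTotal."""
    info = parse_meminfo(text)
    return 100.0 * info["MemAvailable"] / info["MemTotal"]
```

Calling `available_memory_percent(open("/proc/meminfo").read())` periodically and logging the result can show whether memory shrinks steadily before each failure, which would suggest a leak.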
Review Network Configuration:
Check the security group rules and network ACLs associated with the instance to ensure that the instance isn't inadvertently blocked by restrictive rules. Use ifconfig or ip a to ensure network interfaces are configured correctly.
Recovery Steps
Reboot the Instance:
If the instance remains unreachable, try rebooting it. This might clear transient hardware issues or reinitialize the network interfaces.
AWS Management Console -> EC2 Dashboard -> Select the instance -> Actions -> Instance State -> Reboot Instance.
Stop and Start the Instance:
If a reboot doesn’t fix the issue, stopping and then starting the instance (not just rebooting) might help, as this usually moves an EBS-backed instance to a different physical host.
Be aware that the public IP address will change if it’s not an Elastic IP, and that any data on instance store volumes is lost when the instance stops.
Create an AMI and Launch a New Instance:
If the issue persists, consider creating an AMI (Amazon Machine Image) from the instance, then launching a new instance from that AMI. This helps to determine if the issue is with the instance itself or something in the environment.
Review AWS Health Dashboard:
Check the AWS Health Dashboard for any ongoing issues with the EC2 service in your region.
Restore from Backup or Snapshot:
If the instance is completely unresponsive, and you can’t resolve the issue, restoring from a previous snapshot or backup might be necessary.
Preventive Measures
Enable CloudWatch Alarms:
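Set up CloudWatch alarms to monitor CPU usage, status check failures, and other vital metrics, so you can take action before the instance becomes unresponsive. Note that memory and disk space are not reported by EC2 by default; you need to install the CloudWatch agent on the instance to publish those metrics.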
Regular Instance Maintenance:
Regularly review and update the instance's software, including the OS and applications, to prevent issues caused by outdated or buggy software.
Auto-Recovery Feature:
Consider enabling the EC2 Auto-Recovery feature, which automatically recovers an instance when a system status check fails.
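One way to set this up is a CloudWatch alarm on the `StatusCheckFailed_System` metric with the EC2 recover action. The sketch below only builds the `aws cloudwatch put-metric-alarm` command line; the alarm name, the 60-second period, and the two evaluation periods are assumptions you should adjust, and `recovery_alarm_command` is a hypothetical helper. Keep in mind that recovery responds to the system status check, not the instance reachability check, and is only available on supported instance types.

```python
def recovery_alarm_command(instance_id, region):
    """Build an `aws cloudwatch put-metric-alarm` argv for EC2 auto recovery."""
    return [
        "aws", "cloudwatch", "put-metric-alarm",
        "--region", region,
        # Hypothetical naming scheme; any unique alarm name works.
        "--alarm-name", f"auto-recover-{instance_id}",
        "--namespace", "AWS/EC2",
        "--metric-name", "StatusCheckFailed_System",
        "--dimensions", f"Name=InstanceId,Value={instance_id}",
        "--statistic", "Maximum",
        # Assumed timing: alarm after 2 consecutive failed 60-second periods.
        "--period", "60",
        "--evaluation-periods", "2",
        "--threshold", "1",
        "--comparison-operator", "GreaterThanOrEqualToThreshold",
        # Built-in action that asks EC2 to recover the instance.
        "--alarm-actions", f"arn:aws:automate:{region}:ec2:recover",
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where the AWS CLI is installed and configured with credentials.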
Use Elastic Load Balancing (ELB):
For critical applications, use an ELB to distribute traffic across multiple instances, so that if one instance fails, traffic is redirected to the others.
Hi,
The official documentation explains what to check in order to resolve this failed status check: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstances.html
Best,
Didier