We are experiencing an issue with multiple EC2 instances. Several instances are showing as "impaired" in the AWS EC2 console. We use code to stop and start instances, and the impairment generally happens at stop time.

Can anyone help? We didn't find any error in the 2/2 or 3/3 status checks, and the region was not affected. Suppose we are stopping 100 servers; only 25 were affected. Was this issue caused by AWS hardware or by something on our side?

asked a year ago · 124 views
2 Answers
The issue you're experiencing with multiple EC2 instances showing as "impaired" during the stop process is concerning and could be caused by various factors. Since the problem is affecting only a subset of your instances (25 out of 100) and not all instances are failing status checks, it's less likely to be a widespread AWS hardware or regional issue.

Given that the impairment occurs during the stop process initiated by your code, there are a few possibilities to consider:

  1. Code-related issues: Your code that stops the instances might have some inconsistencies or race conditions that affect some instances but not others. Review your code to ensure it's handling the stop process correctly for all instances.

  2. Instance-specific problems: Some instances might have underlying issues that only become apparent during the shutdown process. This could be related to the operating system, installed software, or instance configuration.

  3. Resource constraints: If you're stopping many instances simultaneously, there might be API throttling or resource constraints that cause some operations to fail or timeout.

  4. Transient network issues: Temporary network glitches could affect the communication between your code and the AWS API for some instances.

To troubleshoot this issue:

  1. Implement detailed logging in your code to capture any errors or unexpected behaviors during the stop process.

  2. Check the system logs of the affected instances for any error messages or warnings that occur just before or during the stop process.

  3. Consider implementing a more gradual or staggered approach to stopping instances to reduce the likelihood of hitting API limits or resource constraints.

  4. Monitor CloudWatch metrics for the affected instances to see if there are any patterns or anomalies in resource utilization or network activity leading up to the impairment.

  5. If the issue persists, you may want to engage AWS Support for a more in-depth investigation, providing them with detailed logs, timestamps, and instance IDs of the affected instances.
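Steps 1 and 3 above (detailed logging plus a staggered stop) can be sketched roughly as follows. This is a minimal illustration, not a drop-in fix: it assumes boto3 with valid AWS credentials, and the batch size and pause are placeholder values you would tune for your environment.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ec2-stop")


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def stop_in_batches(instance_ids, batch_size=10, pause_seconds=30):
    """Stop instances in small batches, logging every API response.

    Requires boto3 and configured AWS credentials. The batch size and
    pause are illustrative defaults, not recommended values.
    """
    import boto3  # imported here so the pure helper above needs no AWS SDK
    ec2 = boto3.client("ec2")
    for batch in chunked(instance_ids, batch_size):
        try:
            resp = ec2.stop_instances(InstanceIds=batch)
            # Log the state transition reported for each instance,
            # so failed or stuck transitions are visible later.
            for change in resp["StoppingInstances"]:
                log.info("%s: %s -> %s",
                         change["InstanceId"],
                         change["PreviousState"]["Name"],
                         change["CurrentState"]["Name"])
        except Exception as exc:  # e.g. throttling or permission errors
            log.error("Batch %s failed: %s", batch, exc)
        time.sleep(pause_seconds)  # stagger batches to reduce API pressure
```

Spacing the batches out also makes the logs easier to correlate with CloudWatch metrics, since each group of impairments (if any) lines up with a known timestamp.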

While it's not impossible for this to be an AWS hardware issue, the selective nature of the problem (affecting only some instances) suggests it's more likely related to your specific setup or code. Continue monitoring the situation and gathering data to identify any patterns that might help pinpoint the root cause.
Sources
EC2 failing health check ( Instance reachability check failed ) | AWS re:Post
Connectivity Issue to EC2 Instances | AWS re:Post
Troubleshoot unhealthy instances in Amazon EC2 Auto Scaling - Amazon EC2 Auto Scaling

answered a year ago
When multiple EC2 instances are showing as "impaired" during the stop process—especially if it's affecting only a portion of them (like 25 out of 100)—it points more towards an issue related to how the stop command is being executed rather than a broad AWS infrastructure failure.

Since you’ve confirmed that all status checks (2/2 and 3/3) are passing and there's no regional issue, we can narrow the possible causes to the following:

  1. Code or automation issue: If you're using scripts or automation tools to stop the instances, there could be logic flaws, timing issues, or race conditions affecting only some instances. Even a minor inconsistency in how the stop command is handled can lead to unpredictable results.
  2. Instance-specific configurations: Some instances may have unique OS-level settings, running processes, or software that doesn’t respond well to shutdown signals, causing the instance to go into an "impaired" state temporarily.
  3. API rate limits or throttling: Trying to stop a large number of instances at once can sometimes lead to throttling by the EC2 API. This could result in failed or delayed stop actions for some instances.
  4. Short-lived networking issues: Occasionally, temporary networking glitches between your automation system and AWS services can affect API calls for a small number of instances.
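The throttling risk in point 3 can be reduced with client-side retries and exponential backoff. Here is a hedged sketch: it assumes boto3/botocore, and the function name, retry count, and delay values are illustrative. `RequestLimitExceeded` is the error code EC2 returns when API requests are throttled.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Exponential backoff schedule (before jitter): 1s, 2s, 4s, ... capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def stop_with_retry(ec2, instance_id, max_retries=5):
    """Retry a stop_instances call when the EC2 API throttles the request.

    `ec2` is a boto3 EC2 client. Only RequestLimitExceeded is retried;
    any other error is re-raised so real failures surface immediately.
    """
    from botocore.exceptions import ClientError
    for delay in backoff_delays(max_retries):
        try:
            return ec2.stop_instances(InstanceIds=[instance_id])
        except ClientError as err:
            if err.response["Error"]["Code"] != "RequestLimitExceeded":
                raise  # not throttling: surface the real error
            # Jitter avoids all retrying clients hammering the API in sync.
            time.sleep(delay + random.uniform(0, 1))
    raise RuntimeError(f"{instance_id}: still throttled after {max_retries} retries")
```

Note that boto3 also has built-in retry modes configurable via `botocore.config.Config`, which may be preferable to hand-rolled retries in production code.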

What You Can Do to Troubleshoot:

  1. Add detailed logging to your stop script or tool so you can track exactly which API calls succeed or fail and what responses are returned.
  2. Review the instance logs (via CloudWatch Logs or EC2 system logs) to check if there are any shutdown-related errors on the OS level.
  3. Try a staggered stop approach — instead of stopping all 100 instances at once, break them into smaller groups to reduce the load and potential rate-limiting issues.
  4. Monitor CloudWatch metrics (like CPU, memory, and network) for any trends or anomalies that show up just before stopping the instances.
  5. If the issue continues, contact AWS Support with a detailed report (instance IDs, timestamps, logs) to get help investigating at the infrastructure level.
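Before contacting support, it helps to programmatically list exactly which instances are impaired so your report includes precise instance IDs. A rough sketch using `describe_instance_status` (boto3 and AWS credentials assumed; the helper names are illustrative):

```python
def impaired_ids(statuses):
    """Pure filter: pick IDs whose system or instance status is 'impaired'.

    `statuses` is a list of entries shaped like the InstanceStatuses
    items returned by the EC2 DescribeInstanceStatus API.
    """
    return [s["InstanceId"] for s in statuses
            if s["SystemStatus"]["Status"] == "impaired"
            or s["InstanceStatus"]["Status"] == "impaired"]


def find_impaired(instance_ids):
    """Query EC2 for the given instances and return the impaired ones."""
    import boto3  # imported here so the pure filter above needs no AWS SDK
    ec2 = boto3.client("ec2")
    statuses = []
    paginator = ec2.get_paginator("describe_instance_status")
    # IncludeAllInstances=True also returns stopped instances,
    # which matters here since the problem occurs at stop time.
    for page in paginator.paginate(InstanceIds=instance_ids,
                                   IncludeAllInstances=True):
        statuses.extend(page["InstanceStatuses"])
    return impaired_ids(statuses)
```

The system status reflects AWS-side (hardware/host) health while the instance status reflects OS-level reachability, so recording which of the two failed for each instance directly answers the original "AWS hardware or user side?" question.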
answered a year ago
