AutoScaling Issue: Instances Inaccessible via SSH and Experiencing Failures as Shown in Attached Image

Hello AWS Community,

I am experiencing a persistent issue with my AutoScaling setup. The instances become inaccessible via SSH and exhibit failures, as detailed in the logs shown in the attached image. The errors do not result in an 'instance failure' state, but they prevent standard access and operation.

url_helper.py[WARNING]: Calling 'http://169.254.169.254/latest/api/token' failed [112/120s]: request error [('Connection aborted.', BadStatusLine('No status line received - the server has closed the connection',))]

To address this problem, I have attempted the following troubleshooting steps:

  • Reconfigured the AMI and launch template to ensure they are set up correctly.
  • Experimented with both IMDSv1 and IMDSv2 to rule out metadata service issues.
  • Changed the Availability Zone from C to A, and tried other variations such as C to B, to check whether the issue is zone-specific.

Despite these efforts, the problem persists. The security groups and network ACLs are configured correctly for SSH access. The AMI works flawlessly in a non-AutoScaling environment, and IAM roles and policies are appropriately assigned.

I am seeking insights into the potential causes of this issue and any further troubleshooting steps or solutions that might be recommended. Any advice on resolving the inability to access instances via SSH in this scenario would be immensely valuable.

Thank you for your help!

2 Answers
Hello.

Is it possible to connect via the serial console to the EC2 instance that is experiencing the issue?
If so, you can use the serial console to check the network settings of the OS.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-to-serial-console.html

Also, can instance metadata be accessed from the EC2 instance that the AMI was created from?
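One way to test metadata access from inside an instance is to repeat the token request that fails in the log. The following is a minimal sketch, assuming Python 3 is available on the instance and IMDSv2 is in use; the function names `build_token_request` and `check_imds` are made up for illustration.

```python
import http.client
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def build_token_request(ttl_seconds: int = 60) -> urllib.request.Request:
    """Build the IMDSv2 session-token request (an HTTP PUT with a TTL header)."""
    return urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def check_imds(timeout: float = 2.0) -> str:
    """Return the instance ID from IMDS, or a short description of the failure."""
    try:
        # Step 1: obtain a session token, the same call that cloud-init retries.
        with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
            token = resp.read().decode()
        # Step 2: use the token to read one metadata key.
        id_request = urllib.request.Request(
            f"{IMDS_BASE}/meta-data/instance-id",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(id_request, timeout=timeout) as resp:
            return "IMDS reachable, instance-id: " + resp.read().decode()
    except (OSError, http.client.HTTPException) as exc:
        # URLError and socket timeouts are OSError subclasses; BadStatusLine
        # (as seen in the log) is an HTTPException.
        return f"IMDS not reachable: {exc}"
```

Running `print(check_imds())` on both the source instance and a failing ASG instance (for example through the serial console) should show whether the two behave differently.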

EXPERT
answered 5 months ago
AWS
EXPERT
reviewed 5 months ago

I've commonly seen that error once or twice at the start of userdata logs while the OS is booting up, but it usually resolves once the network stack is fully running. IMDS should be accessible regardless of security group/NACL/Route Table settings, so those shouldn't be an issue.

  • Is the AMI + UserData used inside and outside the ASG identical (or for simplicity, is the same launch template used to launch the non-ASG test instance)?
  • Are there any errors seen outside the instance?
  • Is there anything else running on startup which might be affecting the local network stack of the OS? IPTables being configured or something similar?
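For the last question, a rough check is to scan the firewall rules for anything that explicitly drops traffic to the metadata address. This is only a sketch, assuming `iptables` is installed and the script runs as root; `rules_blocking_imds` and `current_rules` are hypothetical helper names. It will not catch broader rules, such as a default DROP policy, that block the address without naming it.

```python
import subprocess

IMDS_ADDRESS = "169.254.169.254"


def rules_blocking_imds(iptables_rules: str) -> list:
    """Return rules (in `iptables -S` format) that DROP or REJECT the IMDS address."""
    blocking = []
    for line in iptables_rules.splitlines():
        tokens = line.split()
        if "-j" not in tokens:
            continue  # policy lines (-P) and chain definitions (-N) have no jump target
        jump_index = tokens.index("-j")
        target = tokens[jump_index + 1] if jump_index + 1 < len(tokens) else ""
        mentions_imds = any(tok.startswith(IMDS_ADDRESS) for tok in tokens)
        if target in ("DROP", "REJECT") and mentions_imds:
            blocking.append(line.strip())
    return blocking


def current_rules() -> str:
    """Dump the active filter-table rules; requires root privileges."""
    result = subprocess.run(
        ["iptables", "-S"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the instance, `print(rules_blocking_imds(current_rules()))` printing an empty list would suggest that no rule names the metadata address directly.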
AWS
answered 5 months ago
  • Hello Shahad, Thank you for your response.

    • Is the AMI + UserData used inside and outside the ASG identical (or for simplicity, is the same launch template used to launch the non-ASG test instance)?

    The template used to start the non-ASG test instance is identical to the ASG template.

    • Are there any errors seen outside the instance?

    If by "outside the instance" you mean the system log, then yes, the error appears there.

    • Is there anything else running on startup which might be affecting the local network stack of the OS? IPTables being configured or something similar?

    I have not made any separate changes to the local network configuration.

  • If you have a Premium Support plan, it will probably be simpler to troubleshoot if someone can see the actual instances being launched.

    There shouldn't be anything special about instances launched inside an ASG. AutoScaling is just calling RunInstances or CreateFleet to EC2, the same as you would from the EC2 console.

    From that very last line in your screenshot, can you check on one of the non-working ASG instances whether IMDS has been disabled? https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html If it's been disabled, you should be able to see it in the results of this command: https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-instances.html - maybe it's getting disabled post-launch by some sort of automation?
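The same check can be scripted. The sketch below assumes boto3 is installed and credentials are configured; `imds_status` and `fetch_imds_status` are illustrative names, not part of any AWS library. It reads the `MetadataOptions` block that DescribeInstances returns for each instance.

```python
def imds_status(describe_instances_response: dict) -> dict:
    """Map instance ID -> MetadataOptions summary from a DescribeInstances response."""
    status = {}
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            options = instance.get("MetadataOptions", {})
            status[instance["InstanceId"]] = {
                # "enabled" or "disabled"
                "HttpEndpoint": options.get("HttpEndpoint"),
                # "optional" (IMDSv1 and v2) or "required" (IMDSv2 only)
                "HttpTokens": options.get("HttpTokens"),
                # "pending" or "applied"
                "State": options.get("State"),
            }
    return status


def fetch_imds_status(instance_id: str) -> dict:
    """Call EC2 DescribeInstances for one instance and summarize its IMDS options."""
    import boto3  # imported here so imds_status() works without boto3 installed

    ec2 = boto3.client("ec2")
    return imds_status(ec2.describe_instances(InstanceIds=[instance_id]))
```

If `HttpEndpoint` shows `disabled` on an ASG instance but `enabled` on the standalone one, something is changing the metadata options after launch.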

  • After accessing the problematic Amazon Linux 2 instance and checking the logs, I found an error stating 'NET: dhclient: Locked /run/dhclient/resolv.lock'. Are you aware of what this might be?

  • That looks like a message stating the DHCP config file is being locked so something can edit it. It shouldn't necessarily be a problem. Is there a lifecycle hook, or anything else on the ASG, which might be triggering additional deployments to these instances that do not happen on the standalone instance? Maybe there's something like CodeDeploy trying to deploy additional software/updates at the same time as the userdata is running, which is causing conflicts/race conditions?

  • Regarding Shahad_C's question: our Auto Scaling group has lifecycle hooks set with a start time of 600s and a deletion time of 30s. We do not use CI/CD methods for deployment (e.g., CodePipeline, Jenkins, etc.), and we do not experience conflicts during deployment.
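To confirm exactly which hooks are attached to the group, the hook configuration can be listed through the API. This is a sketch, assuming boto3 and configured credentials; `summarize_hooks` and `fetch_hooks` are hypothetical helpers. It also assumes the 600s and 30s values above are the hooks' heartbeat timeouts, which the thread does not confirm.

```python
def summarize_hooks(describe_lifecycle_hooks_response: dict) -> list:
    """Return (name, transition, heartbeat timeout, default result) for each hook."""
    return [
        (
            hook["LifecycleHookName"],
            hook["LifecycleTransition"],   # e.g. autoscaling:EC2_INSTANCE_LAUNCHING
            hook.get("HeartbeatTimeout"),  # seconds before DefaultResult is applied
            hook.get("DefaultResult"),     # CONTINUE or ABANDON
        )
        for hook in describe_lifecycle_hooks_response.get("LifecycleHooks", [])
    ]


def fetch_hooks(asg_name: str) -> list:
    """Call Auto Scaling DescribeLifecycleHooks for one group and summarize it."""
    import boto3  # imported here so summarize_hooks() works without boto3 installed

    autoscaling = boto3.client("autoscaling")
    return summarize_hooks(
        autoscaling.describe_lifecycle_hooks(AutoScalingGroupName=asg_name)
    )
```

A launching hook keeps new instances in the Pending:Wait state until it is completed or times out, so it is worth checking what, if anything, acts on the instance during that window.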
