EC2 instance suddenly unreachable via SSH


I was working on one of our many instances (i-0023be12dc6bc88dd) in us-east-1a yesterday when the SSH session stopped responding. Attempts to reconnect timed out. This has happened occasionally on other instances after a large spike in network traffic, and an instance restart usually fixes it. That did not work this time, and nobody else can reach the instance either.

Tried so far:

  1. Instance restart
  2. Instance stop-start
  3. Removed and re-added the security groups
  4. Reset my local VPN connection (we use a VPN and a route table to reach VPC instances)
  5. Checked the flow logs of the ENI; they show no traffic from my internal VPN IP during new attempts
  6. iptables -F && systemctl restart sshd

What works:

  1. If I SSH into another instance in the VPC (same or different subnet), I can then SSH into the problem instance immediately, everything is running and it behaves normally.

Info:

~$ ssh -v -i mykey.pem ubuntu@172.31.128.87
OpenSSH_7.2p2 Ubuntu-4ubuntu2.10, OpenSSL 1.0.2g  1 Mar 2016
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: Connecting to 172.31.128.87 [172.31.128.87] port 22.
debug1: connect to address 172.31.128.87 port 22: Connection timed out
ssh: connect to host 172.31.128.87 port 22: Connection timed out

From the instance when connected through another instance:

ubuntu@ip-172-31-128-87:~$ sudo systemctl restart sshd
ubuntu@ip-172-31-128-87:~$ sudo ss -tpln | grep -E '22|ssh'
LISTEN   0         128                 0.0.0.0:22               0.0.0.0:*        users:(("sshd",pid=4467,fd=3))         
LISTEN   0         128                    [::]:22                  [::]:*        users:(("sshd",pid=4467,fd=4))

I'm at a loss for what's next.

asked 2 years ago, 1914 views
2 Answers
Accepted Answer

So I figured out what was happening. This was a dev-test box where we were testing various config stacks with docker-compose and a named networks: section. Each time docker-compose up -d was executed, compose recreated the network and incremented the CIDR block, starting from the default 172.17.0.0/16. After 10 restarts it reached 172.27.0.0/16 and created a bridge interface that sat on top of the CIDR for our VPN routes.

ubuntu@ip-172-31-128-87:~$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         172.31.128.1    0.0.0.0         UG        0 0          0 ens5
172.17.0.0      0.0.0.0         255.255.0.0     U         0 0          0 docker0
172.19.0.0      0.0.0.0         255.255.0.0     U         0 0          0 br-61dfa3cb04db
172.27.0.0      0.0.0.0         255.255.0.0     U         0 0          0 br-889068f61237
172.31.128.0    0.0.0.0         255.255.254.0   U         0 0          0 ens5
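
To make the routing table above concrete, here is a minimal Python sketch of the longest-prefix match used to pick an outgoing interface. The routes are copied from the netstat output, with genmasks rewritten as prefix lengths. The function name egress_interface and the VPN client address 172.27.10.5 are hypothetical, chosen only for illustration, and the sketch ignores metrics and policy routing.

```python
import ipaddress

# Routes from the `netstat -rn` output above, as (CIDR, interface).
# 255.255.0.0 is /16 and 255.255.254.0 is /23.
ROUTES = [
    ("0.0.0.0/0", "ens5"),
    ("172.17.0.0/16", "docker0"),
    ("172.19.0.0/16", "br-61dfa3cb04db"),
    ("172.27.0.0/16", "br-889068f61237"),
    ("172.31.128.0/23", "ens5"),
]


def egress_interface(destination):
    """Return the interface of the most specific route that contains destination."""
    addr = ipaddress.ip_address(destination)
    best_net, best_iface = None, None
    for cidr, iface in ROUTES:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_iface = net, iface
    return best_iface


if __name__ == "__main__":
    # Hypothetical VPN client address inside 172.27.0.0/16 (an assumption).
    print(egress_interface("172.27.10.5"))
    # Another instance in the same subnet as the problem instance.
    print(egress_interface("172.31.128.50"))
```

Under these assumptions, a reply to a client inside 172.27.0.0/16 matches the br-889068f61237 route rather than the default route through ens5, while addresses inside the VPC still resolve to ens5.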

This is why it was only accessible via another instance in 172.31.0.0/16: replies to addresses in the VPN range matched the bridge route instead of leaving through ens5. Even after an instance reboot, docker-compose held the network bridge config even though the containers had crashed. Another docker-compose down cleaned them up, and we were then able to pin the network creation to a non-conflicting CIDR in docker-compose.yml:

networks:
  mynet:
    ipam:
      driver: default
      config:
        - subnet: 172.23.0.0/16
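
Before pinning a subnet like this, it can help to check the candidate against the ranges that must stay reachable. The following is a rough sketch using Python's standard ipaddress module. The helper name conflicts and the RESERVED list are assumptions for illustration: 172.31.0.0/16 is the VPC range mentioned above, and 172.27.0.0/16 stands in for the VPN routes, whose exact CIDR is not given here.

```python
import ipaddress

# Ranges the Docker network must not overlap (assumed values, see lead-in).
RESERVED = ["172.31.0.0/16", "172.27.0.0/16"]


def conflicts(candidate, reserved=RESERVED):
    """Return the reserved CIDRs that overlap the candidate subnet."""
    cand = ipaddress.ip_network(candidate)
    return [r for r in reserved if cand.overlaps(ipaddress.ip_network(r))]


if __name__ == "__main__":
    # The subnet pinned in docker-compose.yml above.
    print(conflicts("172.23.0.0/16"))
    # The subnet compose had drifted to before the fix.
    print(conflicts("172.27.0.0/16"))
```

An empty list means the candidate does not collide with any listed range. The check is only as good as the RESERVED list, so it should be filled in with the real VPN and VPC CIDRs.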
answered 2 years ago

From what you describe, the right path is to open a support ticket, as the support engineers have the right tools to analyze the situation and help you diagnose what's happening.

Best,

AWS
answered 2 years ago
