SSM Connections failing

0

Hello all,

I have EC2 instances that I was able to connect to via the SSM Agent through the AWS Console interface. One day in the last couple weeks, they stopped working.

This is in GovCloud, and in that account I have two VPCs. In one VPC, I can connect to the instances, and in the other, I can't. I have created identical instances in both VPCs to minimize the variables. Same result: I can connect to one and not the other.

I have looked through the console UIs and compared the settings for the VPCs, route tables, NACLs, subnets, and security groups, and nothing appears to be blocking traffic from the EC2 instances outward through the NAT Gateway. All of these test instances use the same instance profile.

I've also checked everything on this Post: https://repost.aws/questions/QUmw-Dgnm0RuaCyCLsOVuo5Q/unable-to-connect-to-ec2-instance-via-ssm-session-manager

I am at a loss as to what to try next. Is it possible for a NAT Gateway to stop working? That seems like a possibility.

Thanks.

asked a year ago, 708 views
5 Answers
0
Accepted Answer

The issue was the DHCP option set on the VPC - pointing to a local domain managed by a server that was no longer running.

Thanks everyone. Problem Solved.
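
For anyone hitting the same thing, here is a rough Python sketch of how the DHCP option set could be checked. It assumes the input is one entry from the `DhcpOptions` list returned by the EC2 `describe_dhcp_options` call; the function name, the option set ID, and the IP address in the sample are made up for illustration. Any server it reports has to be running and reachable from the VPC, otherwise the SSM Agent cannot resolve the SSM endpoint.

```python
AMAZON_DNS = "AmazonProvidedDNS"


def custom_dns_servers(dhcp_options):
    """Return the DNS servers in a DHCP option set that are not AmazonProvidedDNS."""
    servers = []
    for config in dhcp_options.get("DhcpConfigurations", []):
        if config.get("Key") != "domain-name-servers":
            continue
        for value in config.get("Values", []):
            server = value.get("Value")
            if server and server != AMAZON_DNS:
                servers.append(server)
    return servers


# Hypothetical option set, shaped like one item of describe_dhcp_options()["DhcpOptions"].
sample = {
    "DhcpOptionsId": "dopt-EXAMPLE",
    "DhcpConfigurations": [
        {"Key": "domain-name", "Values": [{"Value": "corp.example.internal"}]},
        {"Key": "domain-name-servers", "Values": [{"Value": "10.0.0.10"}]},
    ],
}

for server in custom_dns_servers(sample):
    print(f"custom DNS server in use, check that it is still running: {server}")
```

An empty result means the option set relies on the Amazon-provided resolver, so the problem is probably somewhere else.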

answered a year ago
0

How are you using a NAT Gateway for SSM in the first VPC, where it works? SSM will require VPC endpoints in the same VPC where the instance is. Check the prerequisites: https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/connect-to-an-amazon-ec2-instance-by-using-session-manager.html#connect-to-an-amazon-ec2-instance-by-using-session-manager-prereqs

A KB that always works for me: https://repost.aws/knowledge-center/ec2-systems-manager-vpc-endpoints

AWS
EXPERT
answered a year ago
  • Thanks for this. I am not using endpoints at all. The NAT Gateway is in a public subnet, and traffic from the private subnet (where the instances are) to any IP is routed to the NAT Gateway. The architecture is the same in both VPCs. Looking at Fleet Manager, all of the instances in the failing VPC have lost their connections, which is why I suspect the NAT Gateway in the failing VPC.

    Thanks again.

0

Were there any Security Group or NACL rule changes? If VPC endpoints are not used, SSM needs an outbound connection to the public SSM endpoint on port 443 so that the agent can check in with the SSM service.
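
If you have some other way onto the instance, a quick outbound test on port 443 might look like the sketch below. It is only an illustration: `can_reach` is a made-up helper name, and the `connect` parameter exists just so the function can be exercised without a network.

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Try a TCP connection to host:port and report whether it succeeded."""
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"{type(exc).__name__}: {exc}"
    conn.close()
    return True, "connected"
```

Calling `can_reach("ssm.us-gov-west-1.amazonaws.com")` from an affected instance would show whether the outbound path works. A DNS error in the result points at name resolution rather than at the Security Group, NACL, or NAT Gateway.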

answered a year ago
0

No changes there. All of our resources are in their own subnet, and they aren't connecting to SSM. Both VPCs have a similar setup: public and private subnets, with routes to the NAT Gateway from the private subnets.

Additionally, a server in the public subnet can't connect either. But I should be able to connect to that one. More to come on that.

I feel like I am solving a simple issue in public, just can't find it. Ain't that always the case?

Thanks all.

answered a year ago
0

This is likely the issue:

2024-08-22 00:01:44 INFO [ssm-agent-worker] Entering SSM Agent hibernate - RequestError: send request failed caused by: Post "https://ssm.us-gov-west-1.amazonaws.com/": dial tcp: lookup ssm.us-gov-west-1.amazonaws.com on 127.0.0.53:53: read udp 127.0.0.1:54693->127.0.0.53:53: i/o timeout

DNS resolution is enabled in both VPCs.
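
The log line shows the lookup timing out at the local stub resolver (127.0.0.53), so the failure happens before any traffic reaches SSM. A minimal sketch for reproducing the lookup from Python on the instance follows; the helper name `resolve` and the injectable `lookup` parameter are my own choices for illustration.

```python
import socket


def resolve(hostname, port=443, lookup=socket.getaddrinfo):
    """Resolve hostname and return (ok, addresses), or (False, [error text])."""
    try:
        results = lookup(hostname, port, proto=socket.IPPROTO_TCP)
    except OSError as exc:
        # socket.gaierror is a subclass of OSError.
        return False, [str(exc)]
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return True, sorted({entry[4][0] for entry in results})
```

Running `resolve("ssm.us-gov-west-1.amazonaws.com")` in the working VPC and in the failing VPC should make the difference visible: a list of addresses in one, an error in the other.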

answered a year ago
