Configured VPC NAT instances stopped working yesterday (03.03.2022, eu-central-1)
Hi,
I'm currently facing a really annoying problem. My custom VPC (3 public subnets, 3 private subnets with internet access through NAT instances) broke out of the blue yesterday.
My infrastructure is deployed via CloudFormation. Yesterday I updated the stack that contains the three NAT instances for my VPC (one NAT instance per public subnet). They had worked flawlessly until then. Because a new Amazon Linux 2 version was released and I reference the AMI ID via /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-arm64-gp2, the instances were updated to use the newest AMI. Since then, traffic from the private subnets no longer reaches the internet.
The most visible failure is in CodePipeline, where a CodeBuild action fails. The temporary CodeBuild instance runs in one of the three private subnets and has to download a CodePipeline artifact from S3 over the internet. This step fails with the following error:
`CLIENT_ERROR: RequestError: send request failed caused by: Get "https://s3.eu-central-1.amazonaws.com/<S3-bucket-name>?location=": dial tcp 52.219.170.173:443: connect: no route to host for primary source and source version arn:aws:s3:::<S3-prefix>`
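To take CodeBuild out of the picture, the same failure should be reproducible with a plain TCP connect from any host in a private subnet. Below is a minimal sketch in Python using only the standard library. The function name `probe_tcp` and the timeout value are made up for illustration; the host name is the S3 endpoint from the error above.

```python
import socket
from typing import Tuple


def probe_tcp(host: str, port: int = 443, timeout: float = 5.0) -> Tuple[bool, str]:
    """Attempt a plain TCP connect and report the outcome.

    Returns (True, "connected") on success, otherwise (False, <error text>).
    """
    try:
        # create_connection resolves the name and opens the socket in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers unreachable hosts, refused connections,
        # timeouts and DNS resolution failures.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `probe_tcp("s3.eu-central-1.amazonaws.com")` on an instance in a private subnet and again on one of the NAT instances should show whether the failure is limited to traffic that has to pass through the NAT instance.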
The thing is: before yesterday's stack update, which altered the NAT instances, everything worked as expected and CodePipeline succeeded. CodeBuild was able to download the necessary artifacts from S3, and the VPC and NAT instances were set up correctly. Since the update, CodeBuild fails.
The only change that touched the NAT instances was the AMI ID. I also replaced the hard-coded "ProjectName" strings in the CodeBuild actions of my CodePipeline with !Ref to the AWS::CodeBuild::Project resources, but that should have nothing to do with the current problem. Once the updated NAT instances stopped working, I pinned their AMI IDs to explicit older versions, assuming that the newest Amazon Linux 2 version was the problem. However, even with the older AMIs I can't get the NAT instances working again. CodeBuild still fails, and I noticed that ECS services running on an EC2 instance in a private subnet lost their internet connection as well. I even redeployed the whole infrastructure to check whether the problem is on AWS's side, but it persists.
This is really frustrating because everything was working fine. A small update was applied, and now the NAT instances fail even though I haven't changed anything in the VPC or NAT configuration. Where could the problem be, if not on AWS's side? My currently deployed NAT instances are configured as described by AWS, exactly as they were when they worked. They are reachable via SSH and can access the internet through the VPC's internet gateway. Still, CodeBuild keeps failing with the error above, and the internet no longer seems to be reachable from the private subnets as it was before yesterday.
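One host-side setting that can be verified directly on each NAT instance is whether the kernel still forwards IPv4 packets. This is a small sketch, assuming the NAT setup relies on Linux IP forwarding (`net.ipv4.ip_forward`); the function name `ip_forwarding_enabled` is hypothetical, and the check says nothing about the iptables rules or the instance's source/destination check.

```python
from pathlib import Path


def ip_forwarding_enabled(proc_file: str = "/proc/sys/net/ipv4/ip_forward") -> bool:
    """Return True if the kernel flag for IPv4 forwarding reads 1."""
    # The file contains a single digit followed by a newline.
    return Path(proc_file).read_text().strip() == "1"
```

If this returns False on a freshly launched NAT instance, the forwarding setting did not survive the instance replacement.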
I would be more than glad for any suggestions on how to resolve this.
Thanks in advance!