Configured VPC NAT instances stopped working yesterday (03.03.2022, eu-central-1)

Hi,

I'm confronted with a really annoying problem currently. My custom VPC (3 public subnets, 3 private subnets -> internet access through NAT instances) broke out of the blue yesterday.

My infrastructure is deployed via CloudFormation and yesterday I updated a stack where three NAT instances for my VPC are located (for each public subnet there is one NAT instance deployed in it). They have worked flawlessly before yesterday and as a new Amazon Linux 2 version was released (I reference the AMI ID via /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-arm64-gp2), these instances got updated to use the newest AMI. Since then I have problems routing traffic from private subnets to the internet as things are not working as expected anymore.

The current primary point of failure is that my CodePipeline fails because a CodeBuild action fails. The temporary CodeBuild instance is deployed in one of the three private subnets and then has to download a CodePipeline artifact from S3 through the internet. This step fails with the following error:

CLIENT_ERROR: RequestError: send request failed caused by: Get "https://s3.eu-central-1.amazonaws.com/<S3-bucket-name>?location=": dial tcp 52.219.170.173:443: connect: no route to host for primary source and source version arn:aws:s3:::<S3-prefix>

The thing is: before yesterday's last stack update which altered the NAT instances, everything was working as expected and CodePipeline succeeded. CodeBuild was able to download the necessary artifacts from S3 and the VPC and NAT instances were set up correctly. Then the update came in and CodeBuild fails now.

The only thing that was changed was the AMI ID for the NAT instances (and I replaced absolute strings for "ProjectName" in my CodeBuild actions in CodePipeline with !Ref to the AWS::CodeBuild::Project resources which should have nothing to do with my current problem). After the updated NAT instances were not working anymore, I set their AMI IDs to explicit older versions as I assumed that there is a problem with the newest Amazon Linux 2 version. However, even with the older AMIs I'm not able to get the NAT instances working again (at least not for CodeBuild, but I noticed that ECS services running on an EC2 instance (which is also deployed in a private subnet) lost connection to the internet as well). I even redeployed the whole infrastructure to check if there is a problem on the side of AWS but the problem persists.

The problem has me really frustrated because everything was working fine. Then a small update was applied and now the NAT instances fail even though I haven't changed anything in the VPC and NAT configuration. Where could the problem be if not on AWS's side? My currently deployed NAT instances are configured as described by AWS and exactly as they were when they worked; they are reachable via SSH and can access the internet via the VPC's internet gateway. Still, CodeBuild continues to fail with the error above, and the internet does not seem to be accessible from the private subnets as it was before yesterday.

I would be more than glad if anyone has suggestions how this problem can be resolved now.

Thanks in advance!

3 Answers
Accepted Answer

Update: I redeployed the NAT instances AGAIN (complete deletion and new launch) without any user data and did the NAT configuration manually AGAIN, exactly as shown in the AWS guide for setting up NAT instances. The "Validate VPC settings" run in CodeBuild still produces errors, and Reachability Analyzer still reports that a test instance inside one of the three private subnets cannot reach the VPC internet gateway.

HOWEVER: I retried executing CodePipeline again and CodeBuild is able to access the necessary artifacts through the internet again. I have no idea why it is suddenly working again as I have no idea why it didn't work in the first place. You have to trust me here that I really changed no NAT configurations, they worked before yesterday and then they didn't work anymore after an AMI update which wasn't the cause of the problem. Now I configured the NAT instances manually once again and suddenly CodeBuild has access to the internet again.

Of course I am happy that it works again now, but it still leaves a bad taste as I have no idea what exactly caused these issues. Given that I pay a good amount of money for my infrastructure, it makes me wonder how stable this VPC environment really is. Sure, I use NAT instances instead of NAT gateways, but a NAT gateway would be like cracking a walnut with a sledgehammer, as I only need a fraction of the resources it provides. I have been reading comments for years that NAT gateways are too expensive for simple tasks, and this situation still hasn't changed, so I am forced to use NAT instances, which seem to work until they don't. Or there was something inherently off in AWS's VPC service itself (at least for eu-central-1), who knows...

Edit: I'm going to set the instance's user data back to the exact configuration provided in the AWS guide. The adjusted commands worked, too, and made the configuration persistent so that it would survive a reboot, but I'd rather stay on the safe side now.

Czyze
answered 2 years ago
I added some information in the comments. As far as I can see, this is not a problem related to the Amazon Linux 2 AMI, because (as mentioned in the question) I tested older Amazon Linux 2 AMIs as well and the problem remains. The VPC and NAT configurations shouldn't cause any issues as they have worked before.

As there was a VPC service incident in eu-central-1 yesterday, might it be possible that something is still off with the VPC service, causing routing issues even if everything is configured as it should be?

Czyze
answered 2 years ago
  • Are you able to reach the NAT instance from the EC2 instance in the private subnet?

  • As I redeployed the whole infrastructure I'm stuck in the CodePipeline building stage right now where I only test with CodeBuild's temporary instance if it is working. In CodeBuild's "Edit Environment" menu I can "Validate VPC settings" and it fails with the standard error which I had some time ago when setting up the VPC for the first time:

    "The VPC with ID vpc-xxxx might not have an internet connection because CodeBuild cannot find the 0.0.0.0/0 destination for the target NAT gateway with subnet ID subnet-xxxx. Verify your VPC has an internet connection through the device with ID i-xxxx."

  • I could try and set up an instance in one of the private subnets but then I also need to set up SSH access and probably ICMP for it (right now I only have TCP (HTTP and HTTPS) enabled as inbound traffic to the NAT instances from within the VPC (as recommended in the AWS guide)). Do you think it is helpful to set up a test instance to maybe gain additional info?

  • All in all it is a bit suspicious that "CodeBuild cannot find the 0.0.0.0/0 destination for the target NAT gateway" even though the routes should be configured correctly and have worked until yesterday.

  • Yes, I think it would be helpful to quickly spin up an EC2 instance in the same private subnet where CodeBuild's temporary instance runs. Then you can test connectivity to your NAT instance over ICMP, and then to the internet.

    Since your test instance will be in the private subnet, you will also need to set up a bastion host in the public subnet to get to it. See https://aws-quickstart.github.io/quickstart-linux-bastion/#_launch_the_quick_start

    OR set up SSM Session Manager, which doesn't use SSH. See https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started.html

    Another option is to use the Reachability Analyzer https://docs.aws.amazon.com/vpc/latest/reachability/what-is-reachability-analyzer.html
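    Since the NAT instances in this thread only accept HTTP and HTTPS from inside the VPC, an ICMP ping from the test instance may fail even when NAT works. A minimal sketch of an HTTPS-based check instead, assuming curl is installed on the test instance; the function name can_reach is made up for illustration:

```shell
# Sketch: outbound check from a test instance in a private subnet.
# can_reach is a hypothetical helper name, not part of any AWS tooling.

# Succeeds if curl can connect to the URL within 5 seconds.
# Without --fail, an HTTP error status (e.g. 403) still counts as success,
# because it proves the connection went out through the NAT instance.
can_reach() {
  curl --silent --show-error --output /dev/null --connect-timeout 5 "$1"
}

# Example usage on the test instance:
#   can_reach https://s3.eu-central-1.amazonaws.com && echo "internet reachable"
#   ip route get 52.219.170.173   # which gateway/interface the kernel would pick
```

    If can_reach fails, the curl error text (timeout versus "no route to host") can help narrow down whether packets are dropped silently or actively rejected along the path.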

When you change the AMI in CloudFormation, the instance is replaced.

If you are using a regular Amazon Linux AMI (not NAT specific), did you set up the instance after it was launched or do you have the commands embedded in CloudFormation somewhere?

    sudo sysctl -w net.ipv4.ip_forward=1
    sudo /sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
    sudo yum install iptables-services
    sudo service iptables save

Is it possible the source/destination check flag was reset on the NAT instance? https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html#EIP_Disable_SrcDestCheck

Also double-check the route tables for the private subnets and be sure the 0.0.0.0/0 route is pointing to the network interface of the NAT instance and is not shown with the status "Blackhole".
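Both checks can also be run from the command line. The following is a rough sketch using the AWS CLI, assuming it is installed and has EC2 read permissions; the function names are made up for illustration, and the instance and subnet IDs you pass in are your own. It only covers subnets with an explicitly associated route table, as in the setup described in the question.

```shell
# Sketch: VPC-side checks for a NAT instance via the AWS CLI.
# Function names are hypothetical helpers, not AWS commands.

# Prints the source/destination check flag in lowercase ("false" is what NAT needs).
src_dest_check() {
  aws ec2 describe-instance-attribute \
    --instance-id "$1" \
    --attribute sourceDestCheck \
    --query 'SourceDestCheck.Value' \
    --output text | tr '[:upper:]' '[:lower:]'
}

# Prints instance ID, network interface ID and state of the 0.0.0.0/0 route
# in the route table explicitly associated with the given subnet.
default_route() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query "RouteTables[].Routes[] | [?DestinationCidrBlock=='0.0.0.0/0'].[InstanceId,NetworkInterfaceId,State]" \
    --output text
}

# Succeeds only if a default route exists and every such route is "active"
# (a route whose target is gone is reported as "blackhole").
nat_route_ok() {
  default_route "$1" | awk '$NF != "active" { bad = 1 } END { exit (NR == 0 || bad) }'
}

# Example usage (IDs are placeholders):
#   src_dest_check i-xxxx
#   default_route subnet-xxxx
#   nat_route_ok subnet-xxxx || echo "default route missing or blackholed"
```

Because CloudFormation replaces the instance on an AMI change, it is worth confirming that the instance ID and network interface printed by default_route belong to the new NAT instance and not to the terminated one.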

As a side note, if you use an existing NAT AMI, AWS recommends that you migrate to a NAT gateway. NAT gateways provide better availability and higher bandwidth, and require less administrative effort. If NAT instances are a better match for your use case, you can create your own NAT AMI. For more information, see Compare NAT gateways and NAT instances.

Matt-B
answered 2 years ago
  • Hi mthwbarb, thanks for your quick answer!

    1. I disable source/destination check in the template, so it is always turned off for the instances.

    2. I have 3 custom route tables, one for each private subnet. They have the default entry plus a custom route:

           Destination    Target
           0.0.0.0/0      <NAT instance ID>

       (the target gets resolved to the corresponding network interface of the instance: eni-xxxxxxxxxxxxxxxxx)

    3. I understand that NAT gateways might be the easier and better solution but they are too expensive for my use case.

  • About the Amazon Linux 2 configuration, it is configured with user data in the template, e.g.:

        #!/bin/bash
        sudo bash -c 'echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf'
        sudo sysctl -p
        sudo yum -y update
        sudo yum install -y iptables-services
        sudo systemctl enable iptables
        sudo systemctl start iptables
        sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
        sudo service iptables save
        sudo yum install -y aws-cfn-bootstrap
        sudo /opt/aws/bin/cfn-signal -e 0 --region ${AWS::Region} --stack ${AWS::StackName} --resource NatInstance1

  • The configuration differs slightly from the one in the AWS guide so that the settings persist across reboots. Until a few days ago I was still using "sudo sysctl -w net.ipv4.ip_forward=1" as in the guide. I only replaced that one command with "sudo bash -c 'echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf'" followed by "sudo sysctl -p"; all other commands stayed the same. Everything worked fine, but now it is not working even when I try the exact configuration from the guide.
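    Whether the user data actually took effect can be checked on the NAT instance itself. Below is a hedged sketch with made-up helper names that parse the output of standard tools. One thing that may be worth ruling out, although the thread does not confirm it as the cause: as far as I know, the default rule set shipped with iptables-services contains a REJECT rule with icmp-host-prohibited in the FORWARD chain, and a rejected forwarded connection typically surfaces on the client as "no route to host".

```shell
# Sketch: on-instance checks for a NAT instance. Helper names are hypothetical.

# Reads `ip route show default` output on stdin and prints the interface name,
# e.g. "default via 10.0.0.1 dev eth0" -> eth0. Avoids assuming eth0.
default_iface() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Reads `iptables -t nat -S POSTROUTING` output on stdin; succeeds if there is
# a MASQUERADE rule for the outgoing interface given as $1.
has_masquerade() {
  grep -Eq -- "^-A POSTROUTING .*-o $1 .*-j MASQUERADE"
}

# Reads `iptables -S FORWARD` output on stdin; succeeds if the FORWARD chain
# contains a REJECT rule, which would block traffic from the private subnets.
has_forward_reject() {
  grep -Eq -- '^-A FORWARD .*-j REJECT'
}

# Example usage on the NAT instance:
#   iface=$(ip route show default | default_iface)
#   sysctl -n net.ipv4.ip_forward        # 1 means IP forwarding is enabled
#   sudo iptables -t nat -S POSTROUTING | has_masquerade "$iface" && echo "masquerade ok"
#   sudo iptables -S FORWARD | has_forward_reject && echo "FORWARD chain rejects traffic"
```

    If the last check reports a REJECT rule, flushing the chain with "sudo iptables -F FORWARD" before "sudo service iptables save" would be the thing to try; treat that as a suggestion to test, not as a confirmed fix for this case.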
