Update: I redeployed the NAT instances AGAIN (complete deletion and new launch) without any user data and did the NAT configuration manually AGAIN, exactly as shown in the AWS guide for setting up NAT instances. The "Validate VPC settings" run in CodeBuild still produces errors, and Reachability Analyzer still reports that I can't reach the VPC internet gateway from a test instance inside one of the three private subnets.
HOWEVER: I retried executing CodePipeline and CodeBuild is able to access the necessary artifacts through the internet again. I have no idea why it is suddenly working again, just as I have no idea why it didn't work in the first place. You have to trust me here that I really changed no NAT configurations. They worked until yesterday and then stopped working after an AMI update, which wasn't the cause of the problem. Now I configured the NAT instances manually once again and suddenly CodeBuild has access to the internet again.
Of course I am happy that it works again now, but it leaves a bad taste as I still have no idea what exactly caused these issues. Given that I pay a good amount of money for my infrastructure, it makes me wonder how stable this VPC environment really is. Sure, I use NAT instances instead of NAT gateways, but using a NAT gateway would be like trying to crack a walnut with a sledgehammer, as I only need a fraction of the resources that a NAT gateway provides. I have been reading comments for years that NAT gateways are too expensive for simple tasks, and this situation still hasn't changed, so I am forced to use NAT instances, which seem to work until they don't. Or there was something inherently off in AWS's VPC service itself (at least for eu-central-1), who knows...
Edit: I'm going to set the instance's user data back to the exact configuration provided in the AWS guide. The adjusted commands worked, too, and made the configuration persistent so that it would survive a reboot, but I'd rather stay on the safe side now.
I added some information in the comments. As far as I can see, the problem is not related to the Amazon Linux 2 AMI: as mentioned in the question, I tested older Amazon Linux 2 AMIs as well and the problem remains. The VPC and NAT configurations shouldn't cause any issues as they have worked before.
As there was a VPC service incident in eu-central-1 yesterday, might it be possible that something is still off with the VPC service, causing routing issues even if everything is configured as it should be?
When you change the AMI in CloudFormation, the instance is replaced.
If you are using a regular Amazon Linux AMI (not NAT specific), did you set up the instance after it was launched or do you have the commands embedded in CloudFormation somewhere?
sudo sysctl -w net.ipv4.ip_forward=1
sudo /sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
sudo yum install iptables-services
sudo service iptables save
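After running the commands above (or after an instance replacement), it may help to confirm on the NAT instance that both settings are actually in effect. The following is only a rough sketch: the function names are made up for illustration, and the interface name eth0 is taken from the commands above and may differ on your instance.

```shell
#!/usr/bin/env bash
# Sketch: verify NAT prerequisites on the instance itself.
# Function names are hypothetical helpers, not part of any AWS tooling.

# Returns 0 if IPv4 forwarding is enabled in the running kernel.
ip_forward_enabled() {
  [ "$(cat /proc/sys/net/ipv4/ip_forward 2>/dev/null)" = "1" ]
}

# Reads the output of `iptables -t nat -S POSTROUTING` on stdin and returns 0
# if it contains a MASQUERADE rule for the interface given as $1.
has_masquerade_rule() {
  grep -Eq -- "-A POSTROUTING .*-o $1 .*-j MASQUERADE"
}
```

Possible usage on the NAT instance:
ip_forward_enabled && echo "forwarding on"
sudo iptables -t nat -S POSTROUTING | has_masquerade_rule eth0 && echo "masquerade rule present"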
Is it possible the source/destination check flag was reset on the NAT instance? https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html#EIP_Disable_SrcDestCheck
Also double-check the route tables for the private subnets and be sure the 0.0.0.0/0 route is pointing to the NIC of the NAT instance and not labeled as "Blackholed"
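Both checks can also be done from the AWS CLI instead of the console. This is a minimal sketch, assuming the CLI is configured for the right account and region; the function names are hypothetical, and the instance and subnet IDs are whatever you pass in. The route lookup only finds route tables that are explicitly associated with the subnet.

```shell
#!/usr/bin/env bash
# Sketch: inspect the NAT instance's source/destination check and the
# default route of a private subnet. Function names are illustrative only.

# $1 = NAT instance ID. Prints the flag in lower case; a NAT instance
# needs this to be "false".
nat_src_dst_check() {
  aws ec2 describe-instance-attribute \
    --instance-id "$1" \
    --attribute sourceDestCheck \
    --query 'SourceDestCheck.Value' \
    --output text | tr '[:upper:]' '[:lower:]'
}

# $1 = private subnet ID. Prints the state of the 0.0.0.0/0 route
# ("active" or "blackhole") in the route table associated with the subnet.
default_route_state() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query "RouteTables[].Routes[] | [?DestinationCidrBlock=='0.0.0.0/0'].State" \
    --output text
}
```

Possible usage, with your own IDs in place of the placeholders:
nat_src_dst_check i-xxxx
default_route_state subnet-xxxx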
As a side note, if you use an existing NAT AMI, AWS recommends that you migrate to a NAT gateway. NAT gateways provide better availability and higher bandwidth, and require less administrative effort. If NAT instances are a better match for your use case, you can create your own NAT AMI. For more information, see Compare NAT gateways and NAT instances.
Hi mthwbarb, thanks for your quick answer!
- I disable source/destination check in the template, so it is always turned off for the instances.
- I have 3 custom route tables, one for each private subnet. They have the default entry plus a custom route: Destination 0.0.0.0/0, Target <NAT instance ID> (which gets resolved to the corresponding network interface of the instance: eni-xxxxxxxxxxxxxxxxx)
- I understand that NAT gateways might be the easier and better solution but they are too expensive for my use case.
- About the Amazon Linux 2 configuration, it is configured with user data in the template, e.g.:
#!/bin/bash
sudo bash -c 'echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf'
sudo sysctl -p
sudo yum -y update
sudo yum install -y iptables-services
sudo systemctl enable iptables
sudo systemctl start iptables
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
sudo service iptables save
sudo yum install -y aws-cfn-bootstrap
sudo /opt/aws/bin/cfn-signal -e 0 --region ${AWS::Region} --stack ${AWS::StackName} --resource NatInstance1
The configuration differs slightly from the one described in the AWS guide so that the settings persist across reboots. The only change is that "sudo sysctl -w net.ipv4.ip_forward=1" was replaced by "sudo bash -c 'echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf'" followed by "sudo sysctl -p"; all other commands are unchanged. Until a few days ago I was still using "sudo sysctl -w net.ipv4.ip_forward=1". Everything worked fine; now it is not working even when trying the exact configuration from the guide.
Are you able to reach the NAT instance from the EC2 instance in the private subnet?
As I redeployed the whole infrastructure, I'm stuck in the CodePipeline build stage right now, so the only way I can test whether it is working is with CodeBuild's temporary instance. In CodeBuild's "Edit Environment" menu I can "Validate VPC settings", and it fails with the standard error which I had some time ago when setting up the VPC for the first time:
"The VPC with ID vpc-xxxx might not have an internet connection because CodeBuild cannot find the 0.0.0.0/0 destination for the target NAT gateway with subnet ID subnet-xxxx. Verify your VPC has an internet connection through the device with ID i-xxxx."
I could try to set up an instance in one of the private subnets, but then I also need to set up SSH access and probably ICMP for it. Right now I only have TCP (HTTP and HTTPS) enabled as inbound traffic to the NAT instances from within the VPC, as recommended in the AWS guide. Do you think it is helpful to set up a test instance to maybe gain additional info?
All in all it is a bit suspicious that "CodeBuild cannot find the 0.0.0.0/0 destination for the target NAT gateway" even though the routes should be configured correctly and have worked until yesterday.
Yes, I think it would be helpful to quickly spin up an EC2 instance in the same private subnet where CodePipeline's temporary instance is. Then you can test connectivity to your NAT instance over ICMP, and then to the internet.
Since your test instance will be in the private subnet, you will also need to set up a bastion host in the public subnet to get to it. See https://aws-quickstart.github.io/quickstart-linux-bastion/#_launch_the_quick_start
OR set up SSM Session Manager, which doesn't use SSH. See https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started.html
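Once you have a shell on the test instance, the two checks could look roughly like the sketch below. The function name is made up, and the NAT instance's private IP and the URL are values you supply. Note that the ping step will fail if the NAT instance's security group only allows HTTP and HTTPS inbound, so a failed ping alone does not prove the NAT path is broken; the HTTPS request is the more meaningful test.

```shell
#!/usr/bin/env bash
# Sketch: run on the test instance in the private subnet.
# $1 = private IP of the NAT instance, $2 = an HTTPS URL to fetch.
# check_egress is a hypothetical helper name.
check_egress() {
  if ping -c 3 -W 2 "$1" >/dev/null 2>&1; then
    echo "nat-ping: ok"
  else
    echo "nat-ping: failed (ICMP may simply be blocked by the security group)"
  fi

  # Outbound HTTPS has to traverse the private route table and the NAT instance.
  if curl -sS --max-time 10 -o /dev/null "$2"; then
    echo "https-egress: ok"
    return 0
  else
    echo "https-egress: failed"
    return 1
  fi
}
```

Possible usage: check_egress <NAT instance private IP> https://aws.amazon.com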
Another option is to use the Reachability Analyzer https://docs.aws.amazon.com/vpc/latest/reachability/what-is-reachability-analyzer.html
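If you prefer the CLI over the console for Reachability Analyzer, an analysis from the test instance to the internet gateway could be started roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the instance and internet gateway IDs are your own, and the CLI must be configured for the right region.

```shell
#!/usr/bin/env bash
# Sketch: start a Reachability Analyzer run from the AWS CLI.
# Function names are illustrative only.

# $1 = source instance ID, $2 = internet gateway ID.
# Prints the ID of the started analysis.
run_reachability_check() {
  path_id=$(aws ec2 create-network-insights-path \
    --source "$1" --destination "$2" --protocol tcp \
    --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text) || return 1
  analysis_id=$(aws ec2 start-network-insights-analysis \
    --network-insights-path-id "$path_id" \
    --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' --output text) || return 1
  echo "$analysis_id"
}

# $1 = analysis ID. Prints the analysis status and whether a path was found.
analysis_status() {
  aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$1" \
    --query 'NetworkInsightsAnalyses[0].[Status,NetworkPathFound]' --output text
}
```

The analysis runs asynchronously, so analysis_status may need to be called a few times until the status is no longer "running".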