Update: I redeployed the NAT instances AGAIN (complete deletion and new launch) without any user data and performed the NAT configuration manually AGAIN, exactly as shown in the AWS guide for setting up NAT instances. The "Validate VPC settings" run in CodeBuild still produces errors, and I still can't reach the VPC internet gateway from a test instance inside one of the three private subnets via Reachability Analyzer.
HOWEVER: I retried executing CodePipeline, and CodeBuild is able to access the necessary artifacts over the internet again. I have no idea why it is suddenly working again, just as I have no idea why it stopped working in the first place. You have to trust me here that I really changed no NAT configuration: it worked until yesterday, then stopped working after an AMI update that turned out not to be the cause of the problem. Now I have configured the NAT instances manually once again, and suddenly CodeBuild has internet access again.
Of course I am happy that it works again, but it still leaves a bad taste, as I still have no idea what the actual reason for these issues was. Given that a good amount of money is paid for this infrastructure, it makes me wonder how stable this VPC environment really is. Sure, I use NAT instances instead of NAT gateways, but a NAT gateway would be like cracking a walnut with a sledgehammer, as I only need a fraction of the resources it provides. For years I have read comments that NAT gateways are too expensive for simple tasks, and that still hasn't changed, so I am forced to use NAT instances, which seem to work until they don't. Or there was something inherently off in AWS's VPC service itself (at least in eu-central-1), who knows...
Edit: I'm going to set the instance's user data back to the exact configuration provided in the AWS guide. My adjusted commands worked too, and made the configuration persistent so that it would survive a reboot, but I'd rather stay on the safe side now.
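For reference, a minimal user-data sketch along the lines of the AWS guide (assuming Amazon Linux 2 and eth0 as the primary interface; the sysctl file name is an arbitrary example, adjust to your setup):

#!/bin/bash
# Enable IP forwarding now, and persist it across reboots
sysctl -w net.ipv4.ip_forward=1
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-ip-forward.conf
# NAT all traffic leaving through the primary interface
/sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# Persist the iptables rules so they are restored at boot
yum install -y iptables-services
service iptables save
systemctl enable iptables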
I added some information in the comments. As far as I can see, it is not a problem related to the Amazon Linux 2 AMI, as I tested older Amazon Linux 2 AMIs as well (as mentioned in the question) and the problem remains. The VPC and NAT configurations shouldn't be generating any issues, as they have worked before.
As there was a VPC service incident in eu-central-1 yesterday, might it be possible that something is still off with the VPC service, causing routing issues even if everything is configured as it should be?
When you change the AMI in CloudFormation, the instance is replaced.
If you are using a regular Amazon Linux AMI (not a NAT-specific one), did you set up the instance manually after it launched, or do you have the commands embedded in CloudFormation somewhere? Something like:
# Enable IP forwarding (takes effect immediately, but does not survive a reboot on its own)
sudo sysctl -w net.ipv4.ip_forward=1
# Masquerade traffic leaving through the primary network interface
sudo /sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# Install the service that persists iptables rules, then save the current rules
sudo yum install -y iptables-services
sudo service iptables save
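To verify the configuration actually took effect on the instance, two quick checks (standard tooling, nothing NAT-guide-specific):

# Should print: net.ipv4.ip_forward = 1
sysctl net.ipv4.ip_forward
# Should list the MASQUERADE rule in the POSTROUTING chain
sudo iptables -t nat -L POSTROUTING -n -v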
Is it possible the source/destination check flag was reset on the NAT instance? https://docs.aws.amazon.com/vpc/latest/userguide/VPC_NAT_Instance.html#EIP_Disable_SrcDestCheck
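You can check and fix that from the CLI as well; the instance ID below is a placeholder:

# Should return "Value": false for a working NAT instance
aws ec2 describe-instance-attribute --instance-id i-0123456789abcdef0 --attribute sourceDestCheck
# Disable the check again if it was reset
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check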
Also double-check the route tables for the private subnets and be sure the 0.0.0.0/0 route is pointing to the NIC of the NAT instance and not labeled as "Blackholed"
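A quick way to inspect that route from the CLI (route table ID is a placeholder):

# State should be "active", not "blackhole", and the target should be the NAT instance's ENI
aws ec2 describe-route-tables --route-table-ids rtb-0123456789abcdef0 --query "RouteTables[].Routes[?DestinationCidrBlock=='0.0.0.0/0'].[NetworkInterfaceId,InstanceId,State]"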
As a side note, if you use an existing NAT AMI, AWS recommends that you migrate to a NAT gateway. NAT gateways provide better availability and higher bandwidth, and require less administrative effort. If NAT instances are a better match for your use case, you can create your own NAT AMI. For more information, see Compare NAT gateways and NAT instances.