Instances can't reach classic ELB in VPC after ENI change
Four or five times in the past 6-8 weeks, one of our EC2 instances (running CentOS) has been unable to reach the private IP address of a Classic ELB. I believe this is caused by scaling events on the ELB (or something else that replaces ELB components). From what I see in CloudTrail, the network interface is replaced with one that has the same IP address but a different MAC address. Sometimes, but not every time, the old MAC address gets stuck in the instance's ARP cache (in REACHABLE state). That prevents the instance from communicating with the ELB and causes drastic issues for our application. If I manually delete the entry from the ARP cache, things start working again. This is happening across different environments, so multiple subnets, multiple ELBs and multiple EC2 instances are affected. These environments and components have been running for years without this issue. The only network config change we've made recently was disabling jumbo frames earlier this year, but I don't see how that would have an impact here.
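For anyone who wants to check for the same symptom, here is a minimal Python sketch of the comparison described above: parse the output of `ip neigh show` and flag a cached entry whose MAC differs from the ENI's current MAC. The function names, the sample addresses and the sample MACs are all made up for illustration. The current MAC is assumed to come from the ENI details in the console or from CloudTrail.

```python
def parse_ip_neigh(output):
    """Parse `ip neigh show` text into a list of dicts."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The last field is the neighbour state (REACHABLE, STALE, FAILED, ...).
        entry = {"ip": fields[0], "dev": None, "mac": None, "state": fields[-1]}
        if "dev" in fields:
            entry["dev"] = fields[fields.index("dev") + 1]
        if "lladdr" in fields:  # absent for FAILED/INCOMPLETE entries
            entry["mac"] = fields[fields.index("lladdr") + 1]
        entries.append(entry)
    return entries


def stale_mac_entries(entries, ip, current_mac):
    """Return entries for `ip` whose cached MAC differs from the ENI's current MAC."""
    return [
        e for e in entries
        if e["ip"] == ip and e["mac"] and e["mac"].lower() != current_mac.lower()
    ]


# Made-up sample output. On a real host the text could be obtained with
# subprocess.run(["ip", "neigh", "show"], capture_output=True, text=True).stdout
SAMPLE = """\
10.0.1.23 dev eth0 lladdr 0a:11:22:33:44:55 REACHABLE
10.0.1.1 dev eth0 lladdr 0a:aa:bb:cc:dd:ee STALE
10.0.1.40 dev eth0  FAILED
"""

for e in stale_mac_entries(parse_ip_neigh(SAMPLE), "10.0.1.23", "0a:99:88:77:66:55"):
    # Only print the delete command; actually running it requires root.
    print(f"ip neigh del {e['ip']} dev {e['dev']}")
```

The sketch only prints the `ip neigh del` command for the mismatched entry rather than running it, so it is safe to try on a live host.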
Any ideas how to fix this?
Thanks
EDIT: This happened again today and I was able to examine things more closely. The new ENI is actually re-using an IP address that had been used over a month prior. The old entry for that IP address is still listed in the ARP cache with the prior MAC address, despite not having been used for about four weeks. This explains why it's starting to happen more frequently: the chance that an IP address gets re-used increases as new ENIs are created for the ELBs. It's a /26 subnet, so there are not a lot of addresses to choose from.
I think this one is resolved. As mentioned above, the subnets are /26, which gives about 60 usable addresses (64 minus the 5 that AWS reserves in each subnet). The default setting for net.ipv4.neigh.default.gc_thresh1 in CentOS was 128. The kernel does not garbage-collect the neighbor table while it holds fewer entries than gc_thresh1, and with a /26 it's not possible to reach that threshold, so gc never runs. As a result, old entries stayed in the ARP cache for months and caused problems when IP addresses were re-used. I lowered the value of gc_thresh1 and the old/STALE entries dropped out of the ARP cache almost immediately. I still need to roll this out across our hosts and let it run for a while, but initial results look promising.
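The arithmetic behind this can be sketched in a few lines of Python. This is only a rough check, assuming the instance has a single interface and only caches neighbors from its own subnet, so the neighbor table can never hold more entries than the subnet has addresses. The function names and the example CIDR are made up for illustration.

```python
import ipaddress

GC_THRESH1_PATH = "/proc/sys/net/ipv4/neigh/default/gc_thresh1"


def usable_vpc_addresses(cidr):
    """Addresses left in a VPC subnet after the 5 that AWS reserves."""
    net = ipaddress.ip_network(cidr)
    return max(net.num_addresses - 5, 0)


def gc_can_trigger(cidr, gc_thresh1):
    """True if the neighbor table could ever reach gc_thresh1 entries.

    Conservative upper bound: every address in the subnet is cached.
    """
    net = ipaddress.ip_network(cidr)
    return net.num_addresses >= gc_thresh1


def read_gc_thresh1(path=GC_THRESH1_PATH):
    """Read the current threshold on a Linux host; None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


print(usable_vpc_addresses("10.0.1.0/26"))  # 59
print(gc_can_trigger("10.0.1.0/26", 128))   # False: gc never kicks in
print(gc_can_trigger("10.0.1.0/24", 128))   # True: a /24 can exceed 128
```

The current value can also be checked directly with `sysctl net.ipv4.neigh.default.gc_thresh1`. What value to lower it to is a judgment call that depends on the subnet size and is not specified here.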