Instances can't reach classic ELB in VPC after ENI change


Four or five times in the past 6-8 weeks, one of our EC2 instances (running CentOS) has been unable to reach the private IP address of a Classic ELB. I believe this is caused by scaling events on the ELB (or something else that replaces ELB components). From what I see in CloudTrail, the network interface is replaced with one that has the same IP address but a different MAC address. Sometimes, but not always, the old MAC address gets stuck in the instance's ARP cache (in the REACHABLE state). This prevents the instance from communicating with the ELB and causes severe problems for our application. If I manually delete the entry from the ARP cache, things start working again.

This is happening across different environments, so it involves multiple subnets, multiple ELBs and multiple EC2 instances. These environments and components have been running for years without this issue. The only network config change we've made recently was disabling jumbo frames earlier this year, but I don't see how that would affect this.

Any ideas how to fix this?

Thanks

EDIT: This happened again today and I was able to examine things more closely. The new ENI is actually re-using an IP address that had been used over a month prior. The old entry for that IP address is still listed in the ARP cache with the prior MAC address, despite not having been used for about four weeks. This explains why it's happening more frequently: the chance that an IP address gets re-used increases as new ENIs are created for the ELBs. It's a /26 subnet, so there aren't a lot of addresses to choose from.
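For anyone who wants to check for this condition, here is a minimal sketch of how the mismatch could be detected. It assumes the neighbor table is captured as text from `ip -4 neigh show` and that the ENI's current MAC address is looked up separately (for example in the EC2 console). The function names and the sample addresses are hypothetical and only for illustration.

```python
def parse_neigh(text):
    """Parse `ip -4 neigh show` output into {ip: (mac, state)}.

    Entries without a link-layer address (e.g. FAILED) get mac=None.
    """
    table = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        ip = tokens[0]
        mac = None
        if "lladdr" in tokens:
            idx = tokens.index("lladdr")
            if idx + 1 < len(tokens):
                mac = tokens[idx + 1].lower()
        state = tokens[-1]  # the state is the last field on the line
        table[ip] = (mac, state)
    return table


def has_outdated_mac(table, ip, current_mac):
    """True if the cache holds a MAC for `ip` that differs from `current_mac`."""
    entry = table.get(ip)
    if entry is None or entry[0] is None:
        return False
    return entry[0] != current_mac.lower()


# Made-up sample output, not taken from the affected hosts.
SAMPLE = """\
10.0.1.25 dev eth0 lladdr 0a:1b:2c:3d:4e:5f REACHABLE
10.0.1.1 dev eth0 lladdr 0a:00:00:00:00:01 STALE
10.0.1.30 dev eth0  FAILED
"""

table = parse_neigh(SAMPLE)
```

With the sample above, `has_outdated_mac(table, "10.0.1.25", "0a:ff:ff:ff:ff:ff")` flags the entry, because the cached MAC no longer matches the one passed in as the ENI's current MAC.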

1 Answer

I think this one is resolved. As mentioned above, the subnets are /26, which gives about 60 usable addresses. The default setting for net.ipv4.neigh.default.gc_thresh1 on our CentOS hosts was 128. With a /26, the ARP cache can't reach that threshold, so garbage collection never runs. As a result, old entries stayed in the ARP cache for months and caused problems when IP addresses were re-used. I lowered the value of gc_thresh1 and the old/STALE entries dropped out of the ARP cache almost immediately. I still need to roll this out across our hosts and let it run for a while, but initial results look promising.
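The arithmetic behind this can be sketched as follows. It assumes a single interface on the subnet and uses the total size of the CIDR block as a generous upper bound on the number of IPv4 neighbor entries. The CIDR `10.0.0.0/26` and the function names are hypothetical; only the /26 size and the threshold of 128 come from the situation described here.

```python
import ipaddress


def max_neighbor_entries(cidr):
    """Upper bound on on-link IPv4 neighbors: every address in the subnet."""
    return ipaddress.ip_network(cidr).num_addresses


def gc_can_trigger(cidr, gc_thresh1):
    """Garbage collection is skipped while the cache holds fewer than
    gc_thresh1 entries, so it can only run if the subnet is large enough
    to produce at least that many entries."""
    return max_neighbor_entries(cidr) >= gc_thresh1


SUBNET = "10.0.0.0/26"    # hypothetical /26 subnet
DEFAULT_GC_THRESH1 = 128  # default value reported above

upper_bound = max_neighbor_entries(SUBNET)
reachable = gc_can_trigger(SUBNET, DEFAULT_GC_THRESH1)
```

Here `upper_bound` is 64 and `reachable` is `False`: even if every address in the /26 appeared in the cache, the count would stay below 128, which is consistent with the old entries never being collected.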

answered 2 years ago
