Four or five times in the past 6-8 weeks, we've had situations where one of our ec2 instances (running CentOS) cannot reach the private IP address of a classic ELB. I believe this is due to scaling events (or something else causing replacement of ELB components) happening on the ELB. From what I see in cloud trial, the network interface is replaced with one having the same ip address but a different mac address. Sometimes, but not all the time, the old mac address gets stuck in the instance's arp cache (in REACHABLE state), preventing the instance from communicating with the ELB causing drastic issues for our application. If I manually delete the entry from the arp cache, things start working again. This is happening across different environments, so multiple subnets, multiple ELBs and multiple ec2 instances. These environments and components have been running for years without seeing this issue before. The only network config change we've recently made is to disable jumbo frames earlier this year, but don't see how that would impact this.
Any ideas how to fix this?
Thanks
EDIT: this happened again today and I was able to more closely examine things. The new ENI is actually re-using an ip address that had been used over a month prior. The old entry for said ip address is still listed in the arp cache with the prior MAC address, despite not being used for about four weeks. This explains why it's starting to happen more frequently, as the chance that an ip address gets re-used increases as new ENIs are created for the ELBs. It's a /26 subnet so not a lot of addresses to choose from.