Catastrophic ALB failure


Over the weekend our business, which runs on two Auto Scaling groups behind an ALB (arn:aws:elasticloadbalancing:us-west-2:126443228413:loadbalancer/app/bigfooty-alb/394c2dba84267148), went offline for six hours. Service was restored only by building an entirely new ALB and migrating the rules and target groups over to it.

At 4:15am our time (GMT+10) the ALB ceased to receive inbound traffic and would not respond to web traffic. We used it for port 80 and port 443 (with SSL cert) termination. At the same time, all target group instances were marked as "Unhealthy" (they were actually healthy in the sense that they were operating correctly and responding) and no traffic was forwarded on to them either way.

Our other EC2 servers not behind the ALB continued to operate.

Initial thoughts were:

a) deliberate isolation by AWS? Bill not paid, some offence taken at an abuse report? Unlikely and AWS had not notified us of any transgression or reason to take action.

b) A mistake on our part in network configuration? No change had been made in days to NACL or security groups. Further we were sound asleep. When we built the replacement ALB we used the same NACL and security groups without problem.

c) Maintenance activity gone wrong? This seems most likely. But AWS appeared not to detect the failure. Nor did we, because we had considered a complete, inexplicable, and undetected failure of an ALB to be "unlikely". We will need to put in place some more health checks of our own. We already have some based on Nagios, so we can enable alarming.
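
A minimal sketch of the kind of external health check mentioned above, written in the style of a Nagios plugin (exit code 0 for OK, 2 for CRITICAL). It only tests whether the load balancer accepts TCP connections on ports 80 and 443, which is the symptom described in this post. The function names and the hostname in the usage note are hypothetical, and a real check would probably also want to issue an HTTP request.

```python
import socket

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection; return (exit_code, message)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepted a connection"
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"


def check_all(host, ports=(80, 443), timeout=5.0):
    """Check every listener port; report the worst status found."""
    results = [check_tcp(host, port, timeout) for port in ports]
    worst = max(code for code, _ in results)
    return worst, "; ".join(message for _, message in results)
```

A wrapper script could call `check_all("alb.example.com")` (a placeholder hostname), print the message, and exit with the returned code so that Nagios raises an alarm. The check needs to run from outside the VPC to see what visitors see.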

The biggest concern is that this happened suddenly and unexpectedly and that AWS did not detect this. Normally we are never worried about AWS network infrastructure as "it just works". Until now.

Would anyone from AWS please be able to consider this?

Has anyone else ever seen something like this? What can be done to get service back faster, or to avoid it in the first place?

Regards, Groatz

Edited by: Groatz on Mar 10, 2019 10:08 PM

Groatz
asked 5 years ago, 185 views
3 Answers

What exactly do you mean by ceased to receive inbound traffic?

answered 5 years ago

I mean it ceased to receive all inbound traffic. To elaborate:
a) It won't respond to connection attempts, as if it were either inert or a firewall were dropping the packets.
b) CloudWatch shows active connections, new connections, and processed bytes all dropping to zero.
c) No traffic at all was being passed through to the target groups. Their activity fell to zero, and yet they were healthy and fully operable, just lacking the feed of traffic.
d) Picture if you will a small stone on your desk, perhaps for the use of holding down papers. That would be equivalent to the ELBs that now seemingly fail every Sunday morning (local time).
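
The CloudWatch symptom in (b) can be turned into an alarm condition. The following is a minimal sketch, assuming the per-period sums of a metric such as ProcessedBytes or NewConnectionCount (namespace AWS/ApplicationELB) have already been fetched, oldest first, and that periods with no datapoint have been filled in as 0 by the caller. The function name, window size, and baseline threshold are hypothetical choices.

```python
def traffic_flatlined(samples, window=3, min_baseline=1.0):
    """Return True if traffic was flowing and then dropped to zero.

    samples      -- per-period metric sums, oldest first
    window       -- number of most recent periods that must all be zero
    min_baseline -- minimum average of the earlier periods for the drop
                    to count as a failure rather than a quiet site
    """
    if len(samples) <= window:
        return False  # not enough history to compare against
    recent = samples[-window:]
    earlier = samples[:-window]
    baseline = sum(earlier) / len(earlier)
    return baseline >= min_baseline and all(value == 0 for value in recent)
```

For example, `traffic_flatlined([120, 98, 110, 0, 0, 0])` reports a flatline, while a series that was already all zeros does not, because there is no earlier traffic to compare against.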

A review of VPC NACLs and security groups was performed. They hadn't been changed in days/weeks prior. They still haven't been changed but replacement ELBs worked fine. The old ELB started responding to network traffic (albeit minus listeners, rules, target groups) and showed signs of life some days later by itself. I suppose if we could tolerate a once/week outage of a few days then life would be fine - it would restore itself perhaps.

It's frustrating that AWS lock up ELB infrastructure such that it can't be bounced or prodded.

Groatz
answered 5 years ago

I'm going to close this off in this thread. AWS haven't responded in this forum.

It happened again the following Sunday, and again this evening. Exact same symptoms. Restoration was initially achieved by creating a new ALB and migrating rules and target groups over. Curiously, the previous ALB was observed to be operational again but wouldn't fire up.

We are up to our 5th LB and the workaround of starting a new one no longer works. Newly created ELBs fail immediately.

We have initiated a direct support request with AWS as we are now down completely without workarounds.

Edited by: Groatz on Mar 19, 2019 4:42 AM

Groatz
answered 5 years ago
