STH Down unplanned firewall upgrade opportunity

Patrick

Administrator
Staff member
Dec 21, 2010
11,908
4,871
113
Our suite at HE had a power failure last evening. It brought linode and a bunch of other sites offline.

When the power came back, the primary firewall got hit by AVR54, so it never booted. The backup did not fail over as expected in the process.

New Broadwell-DE firewall is in place.

Sorry about this STHers.

Reevaluating the DC we are using here. More to come soon.
 

Blinky 42

Active Member
Aug 6, 2015
561
200
43
44
PA, USA
Glad you were able to get back up without too much long term damage.
We went through "unexpected power events" in the past at other colos that took out lots of hardware, so I can commiserate :p
 

nthu9280

Well-Known Member
Feb 3, 2016
1,588
440
83
San Antonio, TX
Glad to hear the site is back up without too much damage. I first thought it was my mobile / internet and then realized it was an issue with the site.
 

Patrick

We have a lot of monitoring around the main site and forums. I was getting text messages and emails galore!

The person who won ended up being the one that tweeted the iEi Puzzle / AMD EPYC 3000 series, especially since the second issue was the firewall.
 

Patrick

Progress update:

We had a big node (dual E5-2699 V3) that did not restart. It was showing errors upon every reboot and cold cycle when loading the Proxmox rpool.

Fix: Boot an Ubuntu 18.04 LTS live CD, install the ZFS tools, and do a zpool import. Once that completed, everything went back to normal and survived multiple reboots.
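
For anyone hitting the same rpool problem, the steps above can be sketched roughly as follows. This is only an illustration, not the exact commands that were run: the post says just "do a zpool import", so the `-f` flag, the `-R /mnt` alternate root, and the final export are assumptions, as are the helper names `zfs_recovery_commands` and `render`. The script only prints the commands; it does not execute anything. They would need to be run as root from the live session.

```python
import shlex


def zfs_recovery_commands(pool="rpool", altroot="/mnt"):
    """Return the recovery commands as argument lists, without running them.

    Assumptions (not stated in the post): force-import with -f, mount under
    an alternate root with -R, and export the pool again before rebooting.
    """
    return [
        ["apt", "update"],
        ["apt", "install", "-y", "zfsutils-linux"],  # ZFS userland tools on Ubuntu
        ["zpool", "import"],                          # list pools available for import
        ["zpool", "import", "-f", "-R", altroot, pool],
        ["zpool", "status", pool],                    # check pool health after import
        ["zpool", "export", pool],                    # release the pool before reboot
    ]


def render(commands):
    """Turn argument lists into shell-safe command strings for review."""
    return [" ".join(shlex.quote(part) for part in cmd) for cmd in commands]


if __name__ == "__main__":
    for line in render(zfs_recovery_commands()):
        print(line)
```

Printing the commands first makes it easy to review and adjust the pool name before touching a production pool.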

Getting that node back online lowered STH loading times by ~1 second! Back to normal speed.
 
Reactions: eva2000 and William

PigLover

Moderator
Jan 26, 2011
2,964
1,271
113
Really unfortunate to have power cut issues in a modern data center. Lots of things have to go wrong at the same time, or the protection design needs to be flawed. Often it reflects lax operational discipline. I understand why you might reconsider HE as a hosting provider after that happens.

OTOH, if you had still been hosting in Vegas your response time might have been extended a bit :).

And in my experience an event like this one often results in incentives to "get sh$t fixed right". Depending on their response - going forward - HE might become the best place to be in the near future...
 

maze

Active Member
Apr 27, 2013
556
84
28
Do you have ANY idea how many times i checked “downforeveryoneorjustme” at work today? Shit :)

Damn Them HE
 
Reactions: Palvelinvirhe

BLinux

cat lover server enthusiast
Jul 7, 2016
2,519
964
113
artofserver.com
that's unfortunate. i suppose the system that got hit by AVR54 is replaceable under warranty?

when i noticed STH was down yesterday evening, at first i thought it was my firewall. it's been having some issues for a few months now and I never bothered to fix it since it was still doing its job and just running in RAM. (self-built Linux kernel based firewall) I was hoping to see it hit 365 day uptime without a reboot, but alas, i wanted to make sure it wasn't the reason I couldn't reach STH so I shut it down to fix it! (well, it wasn't just STH, had some weird dns caching issues and other things happening throughout the day) I was so close... 352 day uptime. strangely, youtube and netflix video streams didn't get interrupted much... i guess they must buffer enough to survive a firewall reboot.

was there no UPS or generator?
 

Evan

Well-Known Member
Jan 6, 2016
3,071
512
113
HE for sure have decent UPS and genset gear. Likely the issue was something on the LV side that was a major problem. Electrical is usually not that hard to get right, but you can never account for that maintenance guy dropping a spanner in the UPS or whatever.
 

Patrick

@PigLover if it was still hosted in Vegas, I would have failed over to Sunnyvale. Plan B is that there are several TB of RAM and thousands of cores sitting in Sunnyvale all with 25/40/100GbE.

That would violate my first rule: no hosting from lab racks, but it would work. The Sunnyvale site has been rock solid for 3 years now.

@Evan - last time they had an issue with their generator switch so it did not turn on. I still have not heard what is up this time.
 

_alex

Active Member
Jan 28, 2016
874
94
28
Bavaria / Germany
wow, glad you managed to get everything back online, even with some hosts struggling.
really wonder why no diesels jumped in after the batteries kept stuff running until they were in sync, as this is usually the plan when external power fails.

my first thought was that there is maybe a problem with power caused by you, i.e. by powering on a bunch of new deep learning boxes at the same time o_O
 

Patrick

No deep learning at that facility. My guess is something prevented generators from kicking in
 

_alex

Yes, read that you separate the site's hosting from labs, which is wise.
For generators it's not that easy to kick in under full load, as this is usually a lot of power that should stay stable in both voltage and frequency.
A big German DC had an outage some weeks ago, too.
In this context I learned that, even if generators kick in, it's still a problem to get them out of the circuit again, sometimes even more critical than when they come in.
Just curious what the dual E5 v3 exactly does to bring down page load times by this amount ...
 

Evan

I am certainly not an electrical engineer, but I work with plenty, and grid synchronisation, both cut-in and cut-out, is reasonably trivial. Sure, I have seen things like generators running out of fuel etc., but I still don't see why it should be a huge issue to have generators work OK.

If you're talking about a mechanical reason a generator didn't start, what about the other ones? Just one genset not starting doesn't mean much really.
For a small setup N+1 is normal, for a bigger one N+2; less common are 2N and even 2N+1 setups, just like with aircon chillers and CRACs.
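
To make those redundancy schemes concrete, here is a small sketch that counts how many generators each scheme calls for. N is the number of units needed to carry the full load; the load and unit sizes in the example are made-up figures for illustration and say nothing about HE's actual plant. The function name `units_required` is hypothetical.

```python
import math


def units_required(load_kw, unit_kw, scheme):
    """Number of generators installed under a given redundancy scheme.

    N is the minimum number of units that can carry the full load.
    """
    n = math.ceil(load_kw / unit_kw)
    schemes = {
        "N": n,             # no spare capacity
        "N+1": n + 1,       # one spare unit
        "N+2": n + 2,       # two spare units
        "2N": 2 * n,        # a fully duplicated set
        "2N+1": 2 * n + 1,  # duplicated set plus one spare
    }
    return schemes[scheme]


if __name__ == "__main__":
    # Hypothetical example: 1800 kW of load served by 500 kW gensets.
    for scheme in ("N", "N+1", "N+2", "2N", "2N+1"):
        print(scheme, units_required(1800, 500, scheme))
```

With N+1 or better, a single genset failing to start should not by itself drop the load, which is the point being made above.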

More of a pain is scheduled maintenance on the low-voltage parts of a DC, as you have to operate without half your circuits. At that end there is usually no static switching in place for redundancy, and only limited ways to manage redundancy and power routing.

Of course I know nothing of HE's infrastructure either, but they are a big, well-known DC provider, so I assume a certain level of facility. Maybe I am wrong there.