STH Down unplanned firewall upgrade opportunity

Discussion in 'STH Suggestions and Updates' started by Patrick, Jun 21, 2018.

  1. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,045
    Likes Received:
    3,996
    Our suite at HE had a power failure last evening. It brought Linode and a bunch of other sites offline.

    When the power came back, the primary firewall got hit by AVR54, so it never booted. The backup did not fail over as expected in the process.

    New Broadwell-DE firewall is in place.

    Sorry about this STHers.

    Reevaluating the DC we are using here. More to come soon.
     
    #1
    eva2000, _alex, Patriot and 6 others like this.
  2. marcoi

    marcoi Active Member

    Joined:
    Apr 6, 2013
    Messages:
    916
    Likes Received:
    147
    Glad it's back up - can't start my morning without looking at the forums haha.
     
    #2
    cactus likes this.
  3. Blinky 42

    Blinky 42 Active Member

    Joined:
    Aug 6, 2015
    Messages:
    456
    Likes Received:
    152
    Glad you were able to get back up without too much long term damage.
    We went through "unexpected power events" in the past at other colos that took out lots of hardware, so I can commiserate :p
     
    #3
  4. Myth

    Myth Member

    Joined:
    Feb 27, 2018
    Messages:
    132
    Likes Received:
    6
    I really like this site, please know that as you work to maintain it, you are appreciated!
     
    #4
  5. nthu9280

    nthu9280 Active Member

    Joined:
    Feb 3, 2016
    Messages:
    965
    Likes Received:
    229
    Glad to hear the site is back up without too much damage. I first thought it was my mobile / internet and then realized it was an issue with the site.
     
    #5
  6. marcoi

    marcoi Active Member

    Joined:
    Apr 6, 2013
    Messages:
    916
    Likes Received:
    147
    I signed up for a Twitter account this morning just to mention the site was down. haha :eek:
     
    #6
  7. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,045
    Likes Received:
    3,996
    We have a lot of monitoring around the main site and forums. I was getting text messages and emails galore!

    The person who won ended up being the one that tweeted the iEi Puzzle / AMD EPYC 3000 series, especially since the second issue was the firewall.
     
    #7
  8. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,045
    Likes Received:
    3,996
    Progress update:

    We had a big node (dual E5-2699 V3) that did not restart. It was showing errors upon every reboot and cold cycle when loading the Proxmox rpool.

    Fix: Boot the Ubuntu 18.04 LTS live CD, install the ZFS tools, and do a zpool import. Once that completed, everything went back to normal and survived multiple reboots.
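
A minimal sketch of that recovery sequence, for anyone hitting the same thing. The post only says "install zfs tools, do a zpool import", so everything more specific here is an assumption: the Ubuntu package name `zfsutils-linux`, the `-f` and `-R /mnt` flags on the import, the final `zpool export`, and the helper names `rpool_recovery_commands` and `run` (hypothetical, for illustration only). It defaults to a dry run that only prints the commands; a real run would need root on the live session.

```python
import shlex
import subprocess


def rpool_recovery_commands(pool="rpool", altroot="/mnt"):
    """Return the assumed recovery steps, in order, as argument lists."""
    return [
        ["apt", "update"],
        # Assumption: the ZFS userland tools come from zfsutils-linux on Ubuntu.
        ["apt", "install", "-y", "zfsutils-linux"],
        # With no pool name, this only lists pools that are available to import.
        ["zpool", "import"],
        # Assumption: -f forces the import of a pool last used by another
        # system, and -R mounts it under an alternate root instead of over
        # the live session's own filesystem.
        ["zpool", "import", "-f", "-R", altroot, pool],
        ["zpool", "status", pool],
        # Assumption: export cleanly before rebooting back into Proxmox.
        ["zpool", "export", pool],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    printed = []
    for argv in commands:
        line = "+ " + " ".join(shlex.quote(arg) for arg in argv)
        print(line)
        printed.append(line)
        if not dry_run:
            subprocess.run(argv, check=True)
    return printed


if __name__ == "__main__":
    # Dry run only: shows what would be executed without touching any pool.
    run(rpool_recovery_commands())
```

Keeping the steps as data and printing them first makes it easy to review the exact commands before running anything against a pool that is already misbehaving.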

    Getting that node back online lowered STH loading times by ~1 second! Back to normal speed.
     
    #8
    eva2000 and William like this.
  9. K D

    K D Well-Known Member

    Joined:
    Dec 24, 2016
    Messages:
    1,352
    Likes Received:
    284
    Noticed that the site was down around 4:00 am eastern.
     
    #9
  10. PigLover

    PigLover Moderator

    Joined:
    Jan 26, 2011
    Messages:
    2,659
    Likes Received:
    1,041
    Really unfortunate to have power cut issues in a modern data center. Lots of things have to go wrong at the same time - or the protection design needs to be flawed. Often it reflects lax operational discipline. I understand why you might reconsider HE as a hosting provider after that happens.

    OTOH, if you had still been hosting in Vegas your response time might have been extended a bit :).

    And in my experience an event like this one often results in incentives to "get sh$t fixed right". Depending on their response - going forward - HE might become the best place to be in the near future...
     
    #10
  11. maze

    maze Active Member

    Joined:
    Apr 27, 2013
    Messages:
    444
    Likes Received:
    61
    Do you have ANY idea how many times i checked “downforeveryoneorjustme” at work today? Shit :)

    Damn Them HE
     
    #11
    Palvelinvirhe likes this.
  12. Myth

    Myth Member

    Joined:
    Feb 27, 2018
    Messages:
    132
    Likes Received:
    6
    Like a graceful shutdown through PowerChute software on the UPS?
     
    #12
  13. BLinux

    BLinux Well-Known Member

    Joined:
    Jul 7, 2016
    Messages:
    1,686
    Likes Received:
    446
    that's unfortunate. i suppose the system that got hit by AVR54 is replaceable under warranty?

    when i noticed STH was down yesterday evening, at first i thought it was my firewall. it's been having some issues for a few months now and I never bothered to fix it since it was still doing its job and just running in RAM. (self-built Linux kernel based firewall) I was hoping to see it hit 365 day uptime without a reboot, but alas, i wanted to make sure it wasn't the cause of why I couldn't reach STH so I shut it down to fix it! (well, it wasn't just STH, had some weird dns caching issues and other things happening throughout the day) I was so close... 352 day uptime. strangely, youtube and netflix video streams didn't get interrupted much... i guess they must buffer enough to survive a firewall reboot.

    was there no UPS or generator?
     
    #13
  14. Evan

    Evan Well-Known Member

    Joined:
    Jan 6, 2016
    Messages:
    2,184
    Likes Received:
    301
    HE for sure has decent UPS and genset gear. Likely the issue was a major problem on the LV (low-voltage) side. Electrical is usually not that hard to get right, but you can never account for that maintenance guy dropping a spanner in the UPS or whatever.
     
    #14
  15. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,045
    Likes Received:
    3,996
    @PigLover if it was still hosted in Vegas, I would have failed over to Sunnyvale. Plan B is that there are several TB of RAM and thousands of cores sitting in Sunnyvale all with 25/40/100GbE.

    That would violate my first rule: no hosting from lab racks, but it would work. The Sunnyvale site has been rock solid for 3 years now.

    @Evan - last time they had an issue with their generator switch so it did not turn on. I still have not heard what is up this time.
     
    #15
  16. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    846
    Likes Received:
    88
    wow, glad you managed to get everything back online, even with some hosts struggling.
    i really wonder why no diesels jumped in after the batteries kept stuff running until they were in sync - as this is usually the plan when external power fails.

    my first thought was that there is maybe a problem with power caused by you, i.e. by powering on a bunch of new deep learning boxes at the same time o_O
     
    #16
  17. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,045
    Likes Received:
    3,996
    No deep learning at that facility. My guess is something prevented the generators from kicking in.
     
    #17
  18. _alex

    _alex Active Member

    Joined:
    Jan 28, 2016
    Messages:
    846
    Likes Received:
    88
    Yes, I read that you separate the site's hosting from the labs - which is wise.
    For generators it's not that easy to kick in under full load, as this is usually a lot of power that should stay stable in both voltage and frequency.
    A big German DC had an outage some weeks ago, too.
    In this context I learned that, even if generators kick in, it's still a problem to get them out of the circuit again - sometimes even more critical than when they come in.
    Just curious what exactly the dual E5 v3 does to bring down page load times by this amount ...
     
    #18
  19. Evan

    Evan Well-Known Member

    Joined:
    Jan 6, 2016
    Messages:
    2,184
    Likes Received:
    301
    I am certainly not an electrical engineer, but I work with plenty, and grid synchronisation, both cut-in and cut-out, is reasonably trivial. Sure, I have seen things like generators running out of fuel etc., but I still don’t see why it should be a huge issue to have generators work OK.

    If you’re talking about a mechanical reason a generator didn’t start, what about the other ones? Just one genset not starting doesn’t mean much really.
    For a small setup N+1 is normal, for a bigger one N+2; less common are 2N and even 2N+1 setups, just like with aircon chillers and CRACs.
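
To make the redundancy shorthand concrete, here is a rough sketch of the arithmetic, assuming N is the number of units (gensets, chillers, CRACs) needed to carry the full load. The function names and the example figure of four units are illustrative assumptions, not anything known about HE's facility. It also counts units only; a real 2N design is usually two independent systems, which this simplification ignores.

```python
def units_installed(n, scheme):
    """Units installed when n units are needed to carry the full load."""
    if n < 1:
        raise ValueError("n must be at least 1")
    schemes = {
        "N": n,            # no spare capacity
        "N+1": n + 1,      # one spare unit
        "N+2": n + 2,      # two spare units
        "2N": 2 * n,       # a full duplicate set
        "2N+1": 2 * n + 1, # a full duplicate set plus one spare
    }
    if scheme not in schemes:
        raise ValueError("unknown redundancy scheme: " + scheme)
    return schemes[scheme]


def survivable_failures(n, scheme):
    """Units that can fail while n units remain to carry the load."""
    return units_installed(n, scheme) - n


if __name__ == "__main__":
    # Assumed example: a site whose full load needs 4 units.
    for scheme in ("N", "N+1", "N+2", "2N", "2N+1"):
        print(scheme, units_installed(4, scheme), survivable_failures(4, scheme))
```

With the assumed load of four units, N+1 means five installed and one failure tolerated, which is why a single genset failing to start should not take a site down on its own.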

    More of a pain is scheduled maintenance on the low-voltage parts of a DC, as you have to operate without half your circuits. At that end there is usually no static switching in place for redundancy, and only limited ways to manage redundancy and power routing.

    Of course I know nothing of HE's infrastructure either, but they are a big, well-known DC provider, so I assume a certain level of facility. Maybe I am wrong there.
     
    #19