Not my BEST IT juju day


whitey

Moderator
Jun 30, 2014
2,766
868
113
41
Good hell, my rear-end hurts...just took a massive chewing-out from 'the Mrs.' for introducing performance degradation/instability on her VDI while she was mid-edit on a client's session.

Here's the backstory of my boneheaded mis-adventures over the last couple of hours.

FedEx arrives, delivering a new SFF-8643 to SFF-8087 cable to hook the free port on my 9300-8i up to four more slots on my SC216...ok, 'maintenance time'.

Begin to sVMotion roughly 700GB of VMs, about 40-50 VMs in total, across my 3-node cluster...all goes well for quite some time, I am watching 'zpool iostat -v poolname 2' and things are cruising along. Skip ahead 30-45 mins or so, I see the sVMotions start to mostly wrap up, except my wife's VDI session times out and a VDI RDS server stalls at 72% migrated; all others finish. Cannot pull up the console on wife's VDI VM or RDP in, attempt a reboot...go away, vSphere tells me...ok, reset...go away...shutdown...no dice. I start to panic and think hell, maybe a services.sh restart from an SSH session on the ESXi host that the two stalled VMs are on may fix it...no fix...look to log in directly to the host and attempt a shutdown...no luv. Walk upstairs to get back to some 'real work' and hear the horrific sound none of us wanna hear...

VVVVRRRRRROOOOOMMMMMMMMM as the server stack starts to spin/ramp up LOUD. FARK, I think to myself, how in the HELL did my stack just reboot...small power blip or just bad luck? Wife says nothing else looped in house and I didn't notice anything either. I walk away knowing vSphere autostart/HA will take care of me, come back 15 mins later and I am up. sVMotions failed back gracefully as they should have but left some cruft on the dest NFS stg, cleaned up, no biggie...look at hosts expecting to see all three with uptimes of 15-20 mins, and it's ONLY the two hosts running AIO FreeNAS VMs (and the corresponding ESXi hosts) doing HEAVY sVMotion I/O operations that had a heart attack.


WTF, I think to myself, that was freaking weird: the other host, which is another FreeNAS AIO, has been up for 49 days...and the EX4300 switch clearly rebooted as well.

Any ideas??? Just a bad IT juju day? CREEPY in my book! NEVER had that happen to me.

Good news is I am fully up and running. I was scrambling to look at snapshots/replications to see where I was gonna restore from if needed; had them from a week back locally (kicks self that I did not snap prior to the sVMotion activities today) and also on another remote system with the same date stamp (double CYA). Unnecessary, thank goodness!
 

marcoi

Well-Known Member
Apr 6, 2013
1,533
289
83
Gotha Florida
I would use this occasion to tell the wife you need to replace that hardware. Just tell her it's all bad and you need a 20k budget haha... :D

Do you have any power monitoring? If you do, check the logs, maybe you hit some current/amp limit with everything running due to the migration?
 

whitey

RIGHT RIGHT!!! :-D BIG SMILE. Unfortunately no budget, and I know those hosts will tick along for several more years. I HAVE been uber drooling over a 3-node setup using the X11SPH-nCTPF and Intel Xeon Scalable 4110s, but alas DDR4 memory costs shoot that right in the d|ck. :-D

I run that stack at a steady 5A. Was watching the PDU amp meter while the sVMotions were going hot, and it ticked up to 6A a time or two. Cannot really be a pwr surge though, as that third host never went down, but I assume at LEAST the two SC216s went down and the switch...maybe a kernel panic, but the switch going down boggles my mind.

Pssst, I DO need a UPS. Dunno if that would have helped in this scenario, but regardless, the ZILs flushed cleanly, obviously.
 

K D

Well-Known Member
Dec 24, 2016
1,439
320
83
30041
Not related to this issue but can you post some detailed specs for your AIOs? I see the CPU and other info in your signature.

Just curious. You have a proven stable setup (not counting this incident :)) and I am trying to get there.
 

Rand__

Well-Known Member
Mar 6, 2014
6,634
1,767
113
Indeed very weird that only 3 out of 4 components rebooted... good 4 u I assume, but weird nevertheless.
How are they hooked up power-wise (breaker, outlets etc)?
 

marcoi

Are the two servers and switch in the same outlet circuit maybe?

As for a UPS, I got this one for myself: OPTI-UPS Durable Series DS1500B UPS (Newegg.com)

I needed a double-conversion online UPS because the house has a lot of brownouts when heavy compressors kick in. On a normal UPS, the server detected the power dip while the UPS was switching over to supplement power and triggered a false power-failure event.

Like you said, maybe a kernel panic for ESXi, which if that is the case might be in a log somewhere. As for the switch, maybe the sudden flood of packets without a place to travel caused it to panic as well?
 

whitey

X9SRL, E5-2670v1s, 128GB DDR3 memory, vSphere 6.0U3 I believe (testing 6.5 in a nested env that is quite stable), FreeNAS 9.10.2-U3 AIO configured w/ 2 vCPU, 12GB memory, raidz w/ HUSMM pools and same SLOG device for VMs, NFS mounted to the vSphere env.

I routinely have months of uptime w/ this config only interrupted by pwr 'events', UPS otw soon I'd say.

EDIT: Ya know what, the ONLY thing kinda new/unique to this env is the SLOG off the src pool WAS a P3700 currently, and I think a buddy of mine had a hunch that they caused issues in that AIO/VT-d config in his experiments...now I am unsure myself, but that is the ONLY thing semi-new config-wise. Why that host would cause the toppling effect that I experienced, though, I may never know; if that was the culprit it should have just caused heartache on that host and its FreeNAS AIO, and certainly not a switch loop/freakout!
 

whitey

Power-wise I have a Tripp Lite PDUMH15 hooked up off of a 15A circuit, nothing fancy, all servers/gear hanging off of that. Had that EX4300 for gosh, 6-9 months at least now and it has always run rock solid, same w/ the hosts that I built, but I've probably had those for a year and 1/2.
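
For a rough sanity check on the overload theory, here is a minimal sketch of the headroom math using the numbers from this thread (15A circuit, 5A steady, 6A peak during the sVMotions). The 80% continuous-load factor is an assumption (a common rule of thumb for breakers), and the function name is made up for illustration; nothing here is measured data from the PDU.

```python
# Rough headroom check for a branch circuit.
# From the thread: 15A circuit, ~5A steady draw, ~6A peak during sVMotion.
# ASSUMPTION: 80% continuous-load derating, a common rule of thumb.

BREAKER_AMPS = 15.0
CONTINUOUS_FACTOR = 0.8  # assumed, not from the thread


def headroom_amps(load_amps, breaker_amps=BREAKER_AMPS, factor=CONTINUOUS_FACTOR):
    """Return amps left before the derated continuous limit is reached."""
    return breaker_amps * factor - load_amps


if __name__ == "__main__":
    for label, amps in [("steady", 5.0), ("peak during sVMotion", 6.0)]:
        print(f"{label}: {amps:.1f} A drawn, {headroom_amps(amps):.1f} A of headroom")
```

On those assumptions the stack sits well under the derated limit even at peak, so a plain overload trip looks unlikely; a short dip or spike on the line would not show up in this kind of math at all.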
 

whitey

I would be looking at the switch, maybe the power supply in it got hot enough to tip over?
I tend to think nothing is wrong w/ the switch, it's been a soldier for a LONG time now, burned in good for my lab env, and cranks bits/bytes just as fast as you can throw them at it.
 

whitey

They are on the same circuit, never had issues w/ that, w/ the stack typically consuming 5A steady and nothing else on that circuit. Yeah, I think something kernel-panic-wise plus a flood of packets caused this whole mess...now do I have time to prove it...probably not, but THOSE would be some fun logs/debugs/dumps/packets to dig through :-D
 

Rand__

I can't imagine anything from 2 hosts would flood the EX so much that it reboots to be honest.
Is the 3rd host on the same Tripp Lite?
 

whitey

Identical on the SC216s, 501 Platinums if I remember the PSs correctly. The Norco 2212 w/ same mobo/cpu/guts has a non-hot-swap ATX pwr supply, but a good one (Seasonic).
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,641
2,058
113
I was going to say power too.

Does your power distribution log any data you can read? Power spikes or heat?

Even though they're on the same power system are the outlets grouped differently?



I recall many stories from friends working in PC repair about how clients' systems would power off/reboot, get brought into the shop for stress testing and be fine, and they'd conclude it was dirty power / spikes / whatever issues with an outlet. Sometimes breakers partially trip, but that's rarer for low-power systems.
 

whitey

Which one stayed alive?
Norco 2212 w/ Seasonic pwr supply. If we think it's a commonality w/ the SC216/501 Plat PSs, then I still have to tickle why the EX4300 still went down and the Norco didn't, all on that same Tripp Lite PDU. I guess it's time to start crawling logs.
 

Rand__

Well, it might be a little more resistant to a power spike/drop/hiccup/whatever, or not, and it's something totally different, but at least it's interesting.
Can you see anything in the IPMI log?
 

T_Minus

Norco 2212 w/ Seasonic pwr supply. If we think it's a commonality w/ the SC216/501 Plat PSs, then I still have to tickle why the EX4300 still went down and the Norco didn't, all on that same Tripp Lite PDU. I guess it's time to start crawling logs.
Well, in that case! Maybe it actually is power then.

The Seasonic in my home AIO tower will not reboot when my SM PSUs reboot from my AC kicking on in my home office.

My older APC UPS will "click on/off" loudly as it switches to battery too, whereas my other APC UPS kicks on the fan and switches over to battery faster (neither is an online UPS). I mention this because my Seasonic won't power off on either UPS, but the SM PSUs do on the older/slow/worn-out APC.

Now, what caused the Tripp Lite PDU to drop power?

Note: This is also why I use Tripp Lite line conditioners with my APC UPSes too. When the outlet surges, the APC unit will kick on its fan, go through a 'check' cycle, run the fans for 10m, etc.; with the line conditioner evening out the voltage spike or dip (to a % obviously), that doesn't occur. I also use these for my TV and the range in the kitchen for brownout/blackout protection, since both are known to cause issues with sensitive electronics. They also allow me to use non-sine-wave generators to charge the APC UPS batteries by preventing voltage variation.

Sadly beyond voltage dips/spikes I don't know much more about the specifics of the issues in regard to damage or what they do to PSUs.