Good hell, my rear end hurts...just took a massive chewing from 'the Mrs.' for introducing performance degradation/instability to her VDI while she was mid-edit on a client's session.
Here's the backstory of my boneheaded misadventures over the last couple of hours.
FedEx arrives, delivering a new SFF-8643 to SFF-8087 cable to hook the free port on my 9300-8i up to four more slots on my SC216...ok, 'maintenance time'.
Begin to sVMotion roughly 700GB worth of VMs, about 40-50 of them, across my 3-node cluster...all goes well for quite some time; I am watching 'zpool iostat -v poolname 2' and things are cruising along. Skip ahead 30-45 mins, and the sVMotions mostly wrap up, except my wife's VDI session times out and a VDI RDS server stalls at 72% migrated; all the others finish. Cannot pull up a console on the wife's VDI VM, no RDP either. Attempt a reboot...go away, vSphere tells me...ok, reset...go away...shutdown...no dice. I start to panic and think, hell, maybe a services.sh restart from an SSH session into the ESXi host that the two stalled VMs are on may fix it...no fix...try to log in directly to the host and attempt a shutdown...no luck. Walk upstairs to get back to some 'real work' and hear the horrific sound none of us wanna hear...
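For anyone stuck in the same spot: before bouncing the whole management stack with services.sh, it's usually worth trying to kill just the hung VM's world from the ESXi shell. A rough sketch from an SSH session on the host (the Vmid/World ID values are placeholders, not from my setup):

```
vim-cmd vmsvc/getallvms                      # find the Vmid of the stuck VM
vim-cmd vmsvc/power.off <Vmid>               # orderly power-off attempt
esxcli vm process list                       # still hung? grab its World ID
esxcli vm process kill --type=soft --world-id=<WorldID>
esxcli vm process kill --type=force --world-id=<WorldID>   # last resort
```

Escalating soft -> force only on the one stuck VM would have left the other guests (and the wife's session) alone instead of restarting host services out from under everything.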
VVVVRRRRRROOOOOMMMMMMMMM as the server stack starts to spin up LOUD. FARK, I think to myself, how in the HELL did my stack just reboot...small power blip or just bad luck? Wife says nothing else blipped in the house, and I didn't notice anything either. I walk away knowing vSphere autostart/HA will take care of me; back 15 mins later and I am up. The sVMotions failed back gracefully as they should have, but left some cruft on the destination NFS storage; cleaned up, no biggie. Look at the hosts expecting to see all three with uptimes of 15-20 mins, and it's ONLY the two ESXi hosts running the AIO FreeNAS VMs and doing the HEAVY sVMotion I/O that had a heart attack.
WTF, I think to myself, that was freaking weird; the other host, also a FreeNAS AIO, has been up for 49 days...the EX4300 switch clearly rebooted as well.
Any ideas??? Just a bad IT juju day? CREEPY in my book! NEVER had that happen to me.
Good news is I am fully up and running. I was scrambling to look at snapshots/replications to see where I was gonna restore from if needed; I had them from a week back locally (kicks self that I did not snap prior to today's sVMotion activities) but also had them on another remote system with the same date stamp (double CYA). Unnecessary, thank goodness!
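Lesson learned for next time: a recursive snapshot right before the maintenance window is cheap insurance. A minimal sketch, assuming a pool named 'tank' and a remote box called 'backupbox' (both hypothetical names; the zfs commands are echoed here as a dry run rather than executed):

```shell
# Hypothetical pool name -- substitute your own.
SNAP="tank@pre-svmotion-$(date +%Y%m%d)"

# Recursive snapshot of the whole pool before shuffling VMs around:
echo "zfs snapshot -r ${SNAP}"

# Replicate the same snapshot to the remote box for the double-CYA copy:
echo "zfs send -R ${SNAP} | ssh backupbox zfs recv -Fu backup/tank"
```

Thirty seconds of typing up front would have turned "restore from a week-old snap" into "roll back to five minutes ago."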