S2600WT2 Fault Troubleshooting

AJXCR · May 14, 2017

Alright, so the good news is that the machine is up and running and knocks down a little over 1.2MM/1.1MM IOPS read/write 4KQ32 with no BIOS performance tweaking and the fans running in "acoustic" mode.

The bad news is that I've uncovered a rather significant error which I've not been able to sort out on my own. Ideally, someone who has more experience with Intel boards will be able to nail this down pretty quickly.

Symptoms:
-Front panel "status light" flashes green continuously on a steady interval 100% of the time.
-Intel's Active System Console software reports a critical voltage fault on one of the "discrete" sensors in the power subsection.
-Every time the server is rebooted I lose two to four drives, which are suddenly reported as 1GB capacity by the operating system. Initially I thought I had actually bricked the drives, but after several attempts to revive them via various methods (Windows, Parted Magic/secure erase, Samsung DC Toolkit, etc.) I discovered that running the following in a terminal from Parted Magic (PM) seems to temporarily correct the status of the drives (shown for drive #3, namespace 1):

root@PartedMagic:~# nvme format /dev/nvme3n1
root@PartedMagic:~# nvme subsystem-reset /dev/nvme3
root@PartedMagic:~# nvme reset /dev/nvme3

Prior to the steps above, both DC Toolkit and nvme-cli run from terminal report the drive capacity to be 1GB (I'm assuming this is the cache), the fw version to be FAILMOD, and some really odd SMART values.
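For anyone wanting to script a check for that state, here is a minimal sketch. It reads the human-readable output of `nvme id-ctrl` on stdin and succeeds if the firmware revision field (`fr`) reads FAILMOD. The function name `is_failmod` is made up for illustration, and the `fr : value` line layout is an assumption about what nvme-cli prints, so verify it against your version.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvme-cli): exit 0 if the `nvme id-ctrl`
# text on stdin reports the firmware revision FAILMOD, exit 1 otherwise.
is_failmod() {
    awk -F':' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Assumed layout of the firmware revision line: "fr        : CXV83WCT"
        $1 ~ /^fr[ \t]*$/ { found = (trim($2) == "FAILMOD") }
        END { exit found ? 0 : 1 }
    '
}
```

Typical use would be something like `nvme id-ctrl /dev/nvme3 | is_failmod && echo "nvme3 is in FAILMOD"`, run as root once per controller.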

Is there a way to determine which discrete sensor is reporting the under voltage?
Is it possible that I've underpowered the system by installing only a single 750W PSU? Right now any unnecessary items have been pulled, so we're talking about the motherboard, 128GB (8 sticks) DDR4 2400, 2x 2667v4's, 8x PM963's, 1x Intel P3700, 2x 8 drive 2.5" backplanes (only 4 drives installed in each), stock fans, and three riser cards. I'm not getting any system events originating from the PSU.
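One possible way to narrow down the sensor question from the OS, sketched under assumptions: `ipmitool sdr elist` prints one sensor per line as `name | id | status | entity | reading`, so filtering out rows whose status is `ok` or `ns` (no reading) leaves the sensors the BMC considers unhealthy. The helper name `failing_sensors` is hypothetical, and the column layout should be checked against your ipmitool version.

```shell
#!/bin/sh
# Hypothetical helper: read `ipmitool sdr elist` output on stdin and print
# only the sensors whose status column is neither "ok" nor "ns".
failing_sensors() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Assumed columns: name | sensor id | status | entity | reading
        NF >= 5 {
            status = trim($3)
            if (status != "ok" && status != "ns")
                printf "%s [%s]: %s\n", trim($1), status, trim($5)
        }
    '
}
```

Usage would be `ipmitool sdr elist | failing_sensors`. The BMC event log (`ipmitool sel elist`) should also name the sensor that asserted the fault, along with a timestamp.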

System summary:

Any commentary, advice, or possible solution will be much appreciated!

pricklypunter · May 14, 2017

You are up around the 500W mark under some load I reckon, possibly a smidge over when getting started, so it's close. If the 5V rail is being dragged down, that might skew matters somewhat. Take out half your memory and a few drives and see how it goes. If it behaves, then you know your supply is being taxed.

AJXCR · May 15, 2017

pricklypunter said:
You are up around the 500W mark under some load I reckon, possibly a smidge over when getting started, so it's close. If the 5V rail is being dragged down, that might skew matters somewhat. Take out half your memory and a few drives and see how it goes. If it behaves, then you know your supply is being taxed.

New developments:
-The voltage error is related to riser card 3. After removing the card my status light displays steady green; no more errors. This did not have any effect on the drive issue, however.
-Reducing the number of drives has no effect. A restart still puts some percentage of the drives into the FAILMOD state.
-The problem is isolated to the PM963's (which are supposed to be compatible with this system). I do not experience any issues with Intel P3700's, Intel 750's, or PM953's.
-The issue is not isolated to a specific set of PM963's. Removing the failed drives from the system and restarting causes some portion of the remaining (previously operable) drives to fail, all the way down to zero PM963's remaining.
-Reading the manual, it would appear that the lanes for each slot originate from multiple processors. With that in mind, I moved 4x drives into each bay (one bay at a time) and retested. No change.
-I noticed a PEM_SMB receptacle on the x16 card (which according to Intel's documentation can only operate in the top slot of riser card two). There is a PEM_SMB receptacle on the motherboard, but it's way over on the other side of riser card 1. The kit from Intel did not come with a PEM_SMB cable, and as far as I'm aware makes no mention of hooking it up.
-There is an HSBP_I2C receptacle under/by riser card 2, but again, not sure where it would plug into on the card, the wire wasn't included in the kit, and I see no specifications which suggest it should be hooked up.

This is really eating my lunch.

pricklypunter · May 15, 2017

Ok, so that rules out the power supply. If this issue is directly attributed to only the PM963's, then that might suggest a firmware problem of some kind with them, something that the Intel boards don't like. I don't think it's a cable problem; rather, it sounds more like a timing issue.

AJXCR · May 15, 2017

pricklypunter said:
Ok, so that rules out the power supply. If this issue is directly attributed to only the PM963's, then that might suggest a firmware problem of some kind with them, something that the Intel boards don't like. I don't think it's a cable problem; rather, it sounds more like a timing issue.

A "timing issue" ...you're a little over my head there. What exactly is a timing issue in the context of SSD's? I'm going to try to contact Samsung regarding firmware, but I doubt I'll get far with them considering that the drives weren't purchased new. My X10DRU-i+ should be here tomorrow so I should have a second platform to test on here pretty soon.

One thing I haven't tried to determine is whether the issue is operating system specific. I may try to load Linux or FreeNAS tonight and see how they react.

pricklypunter · May 15, 2017

I meant something like the drives reporting ready status after the Intel board has given up looking for them to begin with. Either that or misunderstanding the initial commands being sent because of a protocol timing issue. Dunno, something along those sort of lines. I'm probably way off the mark here anyway and it will turn out to be something simple.
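If you want to look for evidence of a link-level problem, one rough check is to compare what each PCIe device says it can do (`LnkCap`) with what it actually negotiated (`LnkSta`) in `lspci -vv` output. This is only a sketch: `link_mismatches` is a made-up helper name, the parsing assumes the usual `Speed ..., Width x..` wording on those lines, and a link running below its capability is a hint rather than proof of a fault.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print devices
# whose negotiated link (LnkSta) differs from their capability (LnkCap).
link_mismatches() {
    awk '
        # Pull "speed width" (e.g. "8GT/s x4") out of a LnkCap/LnkSta line.
        function link(s,    speed, width) {
            speed = width = "?"
            if (match(s, /Speed [^, ]+/))  speed = substr(s, RSTART + 6, RLENGTH - 6)
            if (match(s, /Width x[0-9]+/)) width = substr(s, RSTART + 6, RLENGTH - 6)
            return speed " " width
        }
        /^[0-9a-f]/ { dev = $1 }            # unindented line starts a new device
        /LnkCap:/   { cap = link($0) }
        /LnkSta:/   {
            sta = link($0)
            if (sta != cap) print dev ": capable " cap ", running " sta
        }
    '
}
```

Run it as root (capabilities are hidden otherwise) with `lspci -vv | link_mismatches`, ideally once on a good boot and once after the drives have dropped into FAILMOD.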

AJXCR · May 15, 2017

So I believe the second Intel AOC is a "retimer"

You may be on to something there...

Edit: The product page says PCIe switch, but I believe the manual mentions "retimer". I'll dig a little deeper.

AJXCR · May 15, 2017

@Patrick Doesn't STH run a couple of similar machines? Any chance a similar issue has ever surfaced?

AJXCR · May 15, 2017

Another interesting observation.. the HDD lights on the Intels and PM953 are off most of the time and blink to signal disk activity. Even when working correctly, the HDD lights on the PM963's stay on all of the time, and then blink to signal disk activity.

The PM963 is running ~10 degC warmer than the Intel drives as well (32/22C)

Edit: Not sure how I missed this... The one PM963 that seems to be immune to the issue has different firmware on it..

7x drives have FW Rev: CXV83WCT
1x drive has FW Rev: CXV80W1Q

There are four more that were supposed to show up today, but unfortunately they shipped signature required and I missed them. Should have them first thing in the morning.

Searching the internet, the VMware compatibility guide has several references to CXV80W1Q. I can't find anything at all on CXV83WCT.
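For anyone comparing revisions across a batch of drives on Linux, here is a small sketch that lists model and firmware per controller. It assumes the kernel exposes `model` and `firmware_rev` attributes under /sys/class/nvme/nvmeN; the function name `list_nvme_firmware` and the optional directory argument (handy for testing against a fake tree) are my own invention.

```shell
#!/bin/sh
# Hypothetical helper: print "model | firmware | controller" for every NVMe
# controller under the given sysfs directory (default: /sys/class/nvme).
list_nvme_firmware() {
    root=${1:-/sys/class/nvme}
    for ctrl in "$root"/nvme*; do
        # Skip entries without the attributes we need (or an unmatched glob).
        [ -r "$ctrl/model" ] && [ -r "$ctrl/firmware_rev" ] || continue
        # The attributes may be padded with trailing spaces; strip them.
        model=$(sed 's/[[:space:]]*$//' "$ctrl/model")
        fw=$(sed 's/[[:space:]]*$//' "$ctrl/firmware_rev")
        printf '%s | %s | %s\n' "$model" "$fw" "${ctrl##*/}"
    done
}
```

Piping the result through `sort` groups identical model/firmware pairs together, which makes an odd one out easy to spot.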

SO, my next question would be: Does anyone know of a way to extract firmware from a drive to be reloaded to a different but identical drive? Is this even possible?

pricklypunter · May 15, 2017

I suspect they are probably being constantly accessed in an attempt to establish proper communication. I would guess this causes the disk to rescan/reset etc. Either way, it's disk activity. Hence the running warm; it's a by-product of the symptom.

AJXCR · May 15, 2017

pricklypunter said:
I suspect they are probably being constantly accessed in an attempt to establish proper communication. I would guess this causes the disk to rescan/reset etc. Either way, it's disk activity. Hence the running warm; it's a by-product of the symptom.

Would this not show up in HWInfo/Performance Monitor as disk activity?

Any thoughts on extracting FW from a drive?

pricklypunter · May 15, 2017

Nah, it won't show up. Technically the disk isn't actually communicating at this point, well I don't think so anyway. I think it's stuck at the stage where only the various hardware layers are attempting to establish valid communications with each other. Think of it like an old IDE disk constantly resetting due to having an ATA100 cable attached as opposed to the required ATA133 cable. It senses an attempt at communication, spins up, resets the heads etc, then nothing meaningful is conveyed. It then times out, spins down and the cycle repeats. Not all disks with the wrong cable did that, it was hit and miss, but you get the idea.

On the firmware front, I think you're out of luck, at least not without putting extensive effort into extracting it. Even then, it's probable that you wouldn't be able to recreate something that Samsung's DC Toolkit would accept as valid. Doesn't Samsung place all sorts of CRC checks into their firmware packages etc to prevent exactly this from being done?

AJXCR · May 15, 2017

pricklypunter said:
Nah, it won't show up. Technically the disk isn't actually communicating at this point, well I don't think so anyway. I think it's stuck at the stage where only the various hardware layers are attempting to establish valid communications with each other. Think of it like an old IDE disk constantly resetting due to having an ATA100 cable attached as opposed to the required ATA133 cable. It senses an attempt at communication, spins up, resets the heads etc, then nothing meaningful is conveyed. It then times out, spins down and the cycle repeats. Not all disks with the wrong cable did that, it was hit and miss, but you get the idea.

On the firmware front, I think you're out of luck, at least not without putting extensive effort into extracting it. Even then, it's probable that you wouldn't be able to recreate something that Samsung's DC Toolkit would accept as valid. Doesn't Samsung place all sorts of CRC checks into their firmware packages etc to prevent exactly this from being done?

No idea.. I've really never looked into trying to do something like this.

Another really odd symptom is that I can boot into PM once, fix the disk states, restart the computer and boot into Windows/PM and everything is fine... Every single time.

Anything past the first boot and the disks start to go into error mode in groups.. every time.

AJXCR · May 15, 2017

I'm going to put them on a pcie adapter card and test them out on a different MB..

pricklypunter · May 15, 2017

Dunno, I'm out of ideas...I don't think it's faulty hardware per se, perhaps just incompatibilities in firmware/ protocols. I suspect the parts will all work fine when tested individually. It's definitely on the fringe, that's for sure

AJXCR · May 15, 2017

The PM963's work fine in a Supermicro motherboard, so I put the PM953's back into this machine. The 963's will just have to wait for the X10DRU-i+.
