Issue with SAS setup in a U-NAS NSC-800, not the HBA, backplane?


gargravarr

Member
Jul 1, 2021
38
2
8
Hey folks. A while back I asked for some advice on a SAS card for my ZFS NAS and had been running my setup happily for a few months. Back in September, it all went wrong. One of my Seagate Exos X12 drives was resilvered automatically, and investigating what was going on ultimately led to my entire zpool failing.
The rough sequence went like this:
  1. Resilver alert, RAID-Z2 otherwise stable
  2. SMART self-tests keep getting aborted; invoking them manually doesn't help
  3. Remove the faulty drive from the array with zpool offline
  4. Destructive zero-pass write test of the drive, no IO errors
  5. whdd read test reveals 69 bad sectors on the suspect HDD
  6. Try to reinstall the suspect HDD into the array to restore redundancy until I can get a spare
  7. Resilver fails with all 5 other disks throwing errors and a different disk being kicked out of the zpool
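For anyone wanting to track which disks are throwing errors at step 7, here is a minimal sketch of pulling the per-device READ/WRITE/CKSUM counters out of `zpool status` text. The sample text is typed by hand in the usual layout and is not output from the system in this thread; the pool name, device names and counts are made up, and `suspect_devices` is a hypothetical helper, not part of any ZFS tooling.

```python
import re

# Illustrative sample only: hand-written in the usual `zpool status` layout.
# Pool name, device names and error counts are invented for this sketch.
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

    NAME        STATE     READ WRITE CKSUM
    tank        DEGRADED     0     0     0
      raidz2-0  DEGRADED     0     0     0
        sda     ONLINE       0     0     3
        sdb     ONLINE      12     0     0
        sdc     FAULTED      0    45     0  too many errors
        sdd     ONLINE       0     0     0

errors: No known data errors
"""

# One config row: name, state, then the three error counters.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)")

# States treated as a problem on their own, even with zero counters.
BAD_STATES = {"FAULTED", "UNAVAIL", "OFFLINE", "REMOVED"}


def suspect_devices(status_text):
    """Return (name, state, read, write, cksum) for rows that look unhealthy."""
    suspects = []
    in_config = False
    for line in status_text.splitlines():
        if line.strip().startswith("NAME"):
            in_config = True  # rows after the header are the device tree
            continue
        if not in_config:
            continue
        match = ROW.match(line)
        if not match:
            continue  # blank lines and the trailing "errors:" line
        name, state, read, write, cksum = match.groups()
        # Counters are compared as strings because large counts may be abbreviated.
        if state in BAD_STATES or any(c != "0" for c in (read, write, cksum)):
            suspects.append((name, state, read, write, cksum))
    return suspects


if __name__ == "__main__":
    for row in suspect_devices(SAMPLE):
        print(row)
```

If several disks show non-zero counters at the same moment, that points towards something they share (HBA, cables, backplane, power) rather than one bad drive.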
So zero to catastrophic pretty quickly. I documented it in a couple of Reddit threads:

https://www.reddit.com/r/homelab/comments/pvs0k3
https://www.reddit.com/r/zfs/comments/pzsrnz
I have additional copies and backups of the data so I haven't lost anything majorly important. I don't need to recover the pool.

I ran a whdd test on the other drives sequentially and they all passed. Then I decided to try zeroing the disks to recreate the array. I kicked off zero-passes on 3 drives and also ran a whdd read test on a disk I'd missed. I came back an hour later and all SAS IO on the machine had locked up. I had to reboot it.
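The load that triggered the lockup (several zero-passes plus one read pass, all running at once) can be roughly reproduced with a script. This is only a sketch: it is not how whdd or any zeroing tool works internally, the function names and the 1 MiB block size are my own choices, and the demo uses scratch files rather than real disks. Pointing `zero_fill` at a real block device destroys whatever is on it.

```python
import os
import tempfile
import threading

BLOCK = 1024 * 1024  # 1 MiB per I/O; an arbitrary choice for this sketch


def zero_fill(path, size, results):
    """Overwrite the first `size` bytes of `path` with zeros (destructive)."""
    zeros = bytes(BLOCK)
    try:
        with open(path, "r+b", buffering=0) as f:
            written = 0
            while written < size:
                written += f.write(zeros[: min(BLOCK, size - written)])
            os.fsync(f.fileno())  # push the data to the device before reporting
        results[path] = ("ok", written)
    except OSError as exc:
        results[path] = ("error", str(exc))


def read_scan(path, size, results):
    """Read the first `size` bytes of `path` sequentially, recording any I/O error."""
    try:
        with open(path, "rb", buffering=0) as f:
            total = 0
            while total < size:
                chunk = f.read(min(BLOCK, size - total))
                if not chunk:
                    break  # hit end of file/device early
                total += len(chunk)
        results[path] = ("ok", total)
    except OSError as exc:
        results[path] = ("error", str(exc))


def run_load(write_targets, read_targets, size):
    """Run all zero-fills and read scans concurrently, one thread per target."""
    results = {}
    threads = [
        threading.Thread(target=zero_fill, args=(p, size, results))
        for p in write_targets
    ] + [
        threading.Thread(target=read_scan, args=(p, size, results))
        for p in read_targets
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Demo on scratch files: three writers and one reader, as in the post.
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(4):
            p = os.path.join(d, f"disk{i}.img")
            with open(p, "wb") as f:
                f.write(b"\xff" * (4 * BLOCK))
            paths.append(p)
        print(run_load(paths[:3], paths[3:], 4 * BLOCK))
```

The point of running everything in parallel is that each disk passes when tested on its own, so a sequential test would not reproduce the failure.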

Because multiple drives showed errors, and because of the ZFS alerts before, I quickly suspected a common point of failure - probably the HBA. I bought another HBA to test with. Now, here's the curious thing. When I installed the new card, I ran some more minor tests and they all passed. I then tried the same thing again: zeroing out the drives while running a read test. Sure enough, the same lockup occurred with the new HBA. So I can hopefully rule that out. I've also tested the power supply with an LCD tester and it shows good, stable voltages.

But that leaves the cables, the backplane or 6 separate HDDs as the cause of the failure. The cables are brand new this year. And 6 HDDs all failing at the same time is kinda unlikely, especially as the whdd tests pass if I run them one disk after another. So... could this be the backplane? I'm hesitant because the backplane of this chassis is incredibly simple. It's a SAS-1/SATA backplane with individual SATA connectors per drive and 2 Molex power connectors per 4-drive board (so 2 backplane boards for 8 total slots). There are 2 LEDs per drive, power and activity. Because it's SAS-1 and the drives are SAS-3, I have to use the pin-3 tape trick to get the drives to spin up, but they were working without noticeable error for about 6 months.

I found a single mention of backplane problems with the NSC-800 from a year ago: https://www.reddit.com/r/DataHoarder/comments/etvuiu but with no particular detail on what the problems were. Now I can see that U-NAS do sell the backplane boards separately, though I thought that would be for upgrade reasons. There's a SAS-3 board available for the NSC-810, which seems to be the same physical layout as the -800. However, they don't ship to the UK.

I'm not averse to replacing the board if it's failed, but it doesn't seem very likely - like I say, the board is very simple with just a couple of LEDs per drive, no expanders or anything else that seems complex. There is a regulated 3.3V supply to each drive (generated on the board, since Molex only supplies 5V and 12V), which explains the need for the tape trick.

Specs of the system:
  • Asus P11C-i motherboard, Intel Core i3 9100T CPU and 32GB (2x 16GB) DDR4 ECC memory
  • Adaptec ASR-78165 (original) or ASR-71605 (test) SAS-2 controller (in HBA mode)
  • Seasonic 350W PSU
  • 6x Seagate Exos X12 SAS-3 12TB HDDs
  • 1x Samsung SATA 120GB SSD for OS
  • Devuan Linux 3 (Debian 10 without systemd)
  • ZoL 2.0
Could someone double-check my logic and diagnosis please? Many thanks.
 

gargravarr

So I rigged up a test system by adding my SAS drives into a separate chassis (with cooling) and running breakout cables back to the HBA, then re-ran the same load test - 5 zeroing and 1 read test. And all 6 instances ran to completion. So it looks like the issue really is with the backplane. I contacted U-NAS directly and they were able to get me a shipping quote to the UK, so I'll probably buy the SAS-3 boards.