Hard drive errors, don't know where to start


nonyhaha

Member
Nov 18, 2018
50
12
8
Hello everybody,

I have been having issues with a napp-it install on an ESXi host and I do not know where to go from here.
Hardware config 1:
4x 12 TB SATA drives + 4x 1 TB SSDs on a Dell H310 in IT mode. Each set of four drives on one SFF-8087-to-4x-SAS breakout cable.
HP Z420 motherboard, Xeon E5-2680 v2, 128 GB DDR3-1866 ECC.
HP Z420 original 600 W PSU with added SATA power connectors.
ESXi 7, napp-it VM.
4x 12 TB HDD -> raidz pool
4x 1 TB SSD -> raidz pool.
Hardware config 2:
4x 14 TB SAS Toshiba brand-new drives + 4x 1 TB SSDs on an LSI 9207-8i 6 Gb/s SAS 2308 in IT mode. Again, each set of four drives on its own breakout cable.
Gigabyte Z590 Vision D motherboard, i5-10400T/i7-11700F, 64 GB DDR4 non-ECC RAM.
Antec EarthWatts 450 W 80+ Platinum PSU with added SATA power connectors.
ESXi 7.0U3, napp-it VM updated to the latest available version.
4x 14 TB HDD -> raidz pool
4x 1 TB SSD -> raidz pool.

On my first hardware config, one of the drives in one of my pools started to get errors while writing to that pool. It was always the last drive in the list. I suspected one of the 12 TB drives was beginning to die. (The errors appeared in zpool status, but I really do not remember whether they were in the WRITE or the CKSUM column.)
First I replaced the 12 TB drive with a new one and started rebuilding the pool, only to find that the new drive was getting the same errors. So I switched back to the original HDD, changed ports on the HBA card, and reseated all the cables on the drives, but I still got errors on the same drive. So the problem was not tied to the HBA's port or the connectors (I also swapped drives around so they would be connected to different ports).
Because of this I started replacing hardware, and I ended up with the second hardware config. I copied all the data to the new pool of new 14 TB drives and all was ok for a few weeks, until yesterday, when I started getting errors on the all-new configuration, again on the last drive of the pool (per zpool status). This baffles me. I do not know how or why this would happen.

In zpool status I now have:
Code:
raidz1-0                  DEGRADED     0     0     0
  c14t5000039B38809ECAd0  ONLINE       0     0     0
  c15t5000039B3880A85Ad0  ONLINE       0     0     0
  c16t5000039B3880FF4Ad0  ONLINE       0     0     0
  c17t5000039B3880FDFEd0  FAULTED      0     0     0  too many errors
and on the web interface, under Disks, I have:
Code:
c17t5000039B3880FDFEd0  single  ok  14 TB  S:0 H:252 T:3507  TOSHIBA MG07SCP14TE

What am I doing wrong?
 

gea

Well-Known Member
Dec 31, 2010
3,156
1,195
113
DE
Indeed a very unclear situation. Normally, non-disk problems are related to cabling, backplane, power, RAM or HBA problems, or a combination of reasons. As you have already changed the disk port, the problem is not port related, so the remaining options are power, RAM or HBA.

To be sure it is a non-disk problem, you should first run an intensive disk test. I usually boot a Hiren's USB stick with Windows PE and WD Data Lifeguard to run an intensive disk test (it runs for a few days with such large disks). For SATA disks, another method is a USB case that you can connect to your desktop or laptop; a test there would rule out all other hardware problems. Optionally, move the HBA to your desktop to test SAS disks, which would include the HBA in the tests.

One word about HBA firmware: this is normally uncritical, apart from v20 prior to 20.007, which was known to be buggy.

If an intensive disk test (best on other hardware) shows the disk is ok, replace the cabling (HBA breakout and power) and change ports again. If the problem remains on the last disk, replace the HBA (for SATA you can switch to onboard SATA and a bare-metal setup for testing).
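The re-test cycle above can be sketched as shell commands. This is only a sketch: the pool and disk names are taken from this thread, the smartctl device path and flags are assumptions and depend on your OS, and the zpool/smartctl steps are shown as comments because they act on real hardware.

```shell
#!/bin/sh
# Sketch of the re-test cycle, assuming the pool/disk names from this thread.
# The smartctl device path/flags are an assumption; adjust for your OS.
POOL=secunda
DISK=c17t5000039B3880FDFEd0

# 1. Intensive surface test of the suspect SAS disk (many hours on 14 TB):
#      smartctl -d scsi -t long /dev/rdsk/${DISK}s0
# 2. After recabling / changing ports, reset the counters and re-verify:
#      zpool clear $POOL
#      zpool scrub $POOL
# 3. Watch the per-disk READ/WRITE/CKSUM counters during the scrub:
#      zpool status -v $POOL
echo "re-test plan for $DISK in pool $POOL"
```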

If the situation remains unclear, do a RAM test (e.g. memtest86), or remove half of the RAM, check, and then try the other half. If you can reduce RAM speed in the BIOS, try that.

If the situation remains unclear, replace PSU.

Hope you can identify the bad or flaky piece of hardware.
 

nonyhaha

It is very frustrating, as I have already replaced the HBA card, breakout cables, motherboard, CPU, RAM, PSU and disks :)) so everything. I had this issue on two different versions of napp-it.
 

gea

Unlike SMART or iostat messages, "too many errors" is a real hardware problem reported by ZFS, so it does not depend on a napp-it release or OS version (unless there is an OS bug, but none is known on current OmniOS stable releases).
 

gea

These are the log entries corresponding to the iostat error warnings and to the final "too many errors" that marked the disk as bad/faulted, with no real indication of the reason. What can be helpful is the date of the first occurrence, especially if you modified something at that time.

About iostat:
Soft errors: a disk sector failed the CRC check and had to be re-read
Hard errors: the re-read failed the CRC check several times
Transport errors: errors reported by the I/O bus
Total errors: soft errors + hard errors + transport errors
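The "Total errors" definition above can be checked against the counters napp-it showed for the suspect disk in the first post (S:0 H:252 T:3507). A minimal sketch; the sample line mimics `iostat -En`-style output, and the exact field labels are an assumption:

```shell
# Sum Soft + Hard + Transport errors, per the "Total errors" definition above.
# The sample line imitates `iostat -En` output; field labels are assumed.
line="c17t5000039B3880FDFEd0 Soft Errors: 0 Hard Errors: 252 Transport Errors: 3507"
total=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) if ($i == "Errors:") sum += $(i+1); print sum }')
echo "Total errors: $total"   # prints "Total errors: 3759"
```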
 

nonyhaha

I reseated the cables again and waited for the next planned scrub to happen.
Just for tracking, the scrub completed with the following results:
Code:
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 16K in 0 days 10:52:50 with 0 errors on Mon May  1 13:52:54 2023
config:

        NAME                        STATE     READ WRITE CKSUM
        secunda                     ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            c14t5000039B38809ECAd0  ONLINE       0     0     0
            c15t5000039B3880A85Ad0  ONLINE       0     0     0
            c17t5000039B3880FF4Ad0  ONLINE       0     0     0
            c16t5000039B3880FDFEd0  ONLINE       0     0     4
and dmesg does not show any errors:
Code:
Apr 30 13:30:05 san last message repeated 121 times
May  1 03:43:38 san smbsrv: [ID 138215 kern.notice] NOTICE: smbd[NT Authority\Anonymous]: blacksea access denied: IPC only
May  1 03:43:38 san last message repeated 27 times
May  2 03:43:39 san smbsrv: [ID 138215 kern.notice] NOTICE: smbd[NT Authority\Anonymous]: blacksea access denied: IPC only
May  2 03:43:39 san last message repeated 27 times
Should those CKSUM errors be visible in dmesg?
The web interface disk info does not show any S/H/T errors. All disks are at 0.
 

gea

A detected and repaired checksum "error" is not an error but an info entry in zpool status, as it was automatically repaired during read. Only "too many errors", or a situation where a repair fails, e.g. due to insufficient redundancy, results in a log entry.

If you redo a scrub and the checksum errors remain, I would do an intensive disk test to verify whether the disk is ok or has problems. Some problems can be repaired during an intensive test, e.g. via WD Data Lifeguard. If the disk fails the test, you can replace it. Especially in a Z1, where only one failed disk is allowed before data loss, you must take care of every disk with troubles. The positive side of ZFS is that a disk problem due to bad blocks in a degraded pool does not result in the loss of the whole array, as with RAID-5, but in the loss of a single file.

Iostat counters are reset on reboot. Keep an eye on them; if they go up again, that is an early warning of possible problems.
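One way to keep an eye on the counters is to save a snapshot after each boot and compare later. A sketch, using inline samples in napp-it's S/H/T style (the sample numbers and the idea of inlining them are assumptions; in practice you would save `iostat -En` output to dated files):

```shell
#!/bin/sh
# Sketch: compare two error-counter snapshots and flag disks whose totals rose.
# Inline samples stand in for saved snapshots; values here are hypothetical.
before="c16t5000039B3880FF4Ad0 S:0 H:0 T:0
c17t5000039B3880FDFEd0 S:0 H:252 T:3507"
after="c16t5000039B3880FF4Ad0 S:0 H:0 T:0
c17t5000039B3880FDFEd0 S:0 H:260 T:3600"
printf '%s\n%s\n' "$before" "$after" | awk '
  {
    gsub(/[SHT]:/, "")              # strip the S:/H:/T: labels
    total = $2 + $3 + $4            # soft + hard + transport
    if ($1 in seen) {               # second snapshot: compare to the first
      if (total > seen[$1]) print $1, "errors rose to", total
    } else seen[$1] = total         # first snapshot: remember the baseline
  }'
```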
 

nonyhaha

Will the CKSUM errors go away on a new scrub, or should I run zpool clear before starting the new scrub job?
 

gea

It does not matter. The important thing is whether you get more checksum errors again.
As checksum tests only cover current files, this does not replace an intensive disk test, which also checks the empty areas of the disk.
 

nonyhaha

No more errors on a new scrub. I am fairly sure the disks are all ok, as they are all brand new, so I'll leave it at that for now.