SMART error on all Seagate disks (napp-it 20.01a6 Pro)


berchti

New Member
Jan 12, 2019
Hi everyone,

After googling for hours without finding help, I hope you can support me with my problem. I just installed the latest napp-it AIO and created a new pool. I now get a lot of SMART errors but don't know why. The same pool (same disks, same configuration) was running on FreeNAS without any errors. Here is some information:

Environment:
Mobo: Supermicro X11SCZ-F
RAM: 4x SAMSUNG M378A2K43CB1-CRC 16GB DDR4 2400MHz
CPU: i7-8700
SAS Controller: 2x Dell Perc H310 in IT mode
Disks: 7x Seagate IronWolf (8TB, 3.5")
6x WD Red (4TB, 3.5")
2x Samsung SSD (1TB)


Latest napp-it running on the latest ESXi with PCI passthrough (8 vCPUs, 42 GB RAM)

ZFS config:
Bash:
  pool: zpool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 3.68T in 0 days 04:29:26 with 0 errors on Thu Jun 25 07:03:50 2020
config:

    NAME                       STATE     READ WRITE CKSUM      CAP            Product /napp-it   SN/LUN           IOstat mess       SMART
    zpool1                     ONLINE       0     0     0
      mirror-0                 ONLINE       0     0     0
        c0t50014EE263ED0C59d0  ONLINE       0     0     0      4 TB           WDC WD40EFRX-68N   WDWCC7K2HRA47Z   S:0 H:0 T:0       ok
        c0t50014EE20E854B46d0  ONLINE       0     0     0      4 TB           WDC WD40EFRX-68N   WDWCC7K2HDP2D9   S:0 H:0 T:0       ok
      mirror-1                 ONLINE       0     0     0
        c0t50014EE2B942E1A8d0  ONLINE       0     0     0      4 TB           WDC WD40EFRX-68N   WDWCC7K4FEAV26   S:0 H:0 T:0       ok
        c0t50014EE263DA7C54d0  ONLINE       0     0     0      4 TB           WDC WD40EFRX-68N   WDWCC7K0JKL2TT   S:0 H:0 T:0       ok
      mirror-2                 ONLINE       0     0     0
        c0t50014EE2BA9973D7d0  ONLINE       0     0     0      4 TB           WDC WD40EFRX-68N   WDWCC7K6SFD5ES   S:0 H:0 T:0       ok
        c0t50014EE2B9CFA105d0  ONLINE       0     0     4      4 TB           WDC WD40EFRX-68N   WDWCC7K7CTX8H3   S:0 H:0 T:0       ok

errors: No known data errors

  pool: zpool2
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:03:28 with 0 errors on Tue Jun 23 20:56:30 2020
config:

    NAME                       STATE     READ WRITE CKSUM      CAP            Product /napp-it   SN/LUN           IOstat mess       SMART
    zpool2                     ONLINE       0     0     0
      mirror-0                 ONLINE       0     0     0
        c0t5000C500BEA0BB35d0  ONLINE       0     0     0      8 TB           ST8000DM004-2CX1   WCT141ZH         S:0 H:0 T:0       problem
        c0t5000C500BF2FDE89d0  ONLINE       0     0     0      8 TB           ST8000DM004-2CX1   WCT1CZ8P         S:0 H:0 T:0       problem
      mirror-1                 ONLINE       0     0     0
        c0t5000C500C3F19DE9d0  ONLINE       0     0     0      8 TB           ST8000VN004-2M21   WKD09LEX         S:0 H:0 T:0       problem
        c0t5000C500BF41876Ed0  ONLINE       0     0     0      8 TB           ST8000DM004-2CX1   WCT1CV3D         S:0 H:0 T:0       problem
      mirror-2                 ONLINE       0     0     0
        c0t5000C500CF6EE5C7d0  ONLINE       0     0     0      8 TB           ST8000VN004-2M21   WKD18CKX         S:0 H:0 T:0       problem
        c0t5000C500C3F17B70d0  ONLINE       0     0     0      8 TB           ST8000VN004-2M21   WKD09PP0         S:0 H:0 T:0       problem

errors: No known data errors

  pool: zpool3
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:00 with 0 errors on Tue Jun 23 20:53:54 2020
config:

    NAME                     STATE     READ WRITE CKSUM      CAP            Product /napp-it   SN/LUN           IOstat mess       SMART
    zpool3                   ONLINE       0     0     0
      c0t5002538D41B84DF4d0  ONLINE       0     0     0      1 TB           Samsung SSD 850    S2RFNX0J206063V  S:0 H:0 T:0       ok
      c0t5002538E49781F37d0  ONLINE       0     0     0      1 TB           Samsung SSD 860    S4CZNF0M728187A  S:0 H:0 T:0       ok

errors: No known data errors
As you can see, only zpool2 gives me SMART errors:

Screenshot 2020-06-25 at 17.55.55.png

Just ask if you need more information :) Can you give me a hint as to what my problem could be? Please note that this is not a production environment, just one to play around with!

I much appreciate your tips.

Cheers,
Patrick
 

gea

Well-Known Member
Dec 31, 2010
DE
The SMART "problem" notice is not an "error" but a message that some SMART values are higher than one would expect from good disks.

In this case it is SMART value 188, which is a timeout error. Mostly you get it after a bad sector on a disk. A few errors are OK; more than a dozen result in the "problem" message.
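
To make the check concrete, here is a minimal Python sketch of the rule described above. It is not napp-it's actual code: the function names, the threshold constant of 12 (taken from "more than a dozen"), and the sample report are all assumptions for illustration. The sample follows the attribute table layout printed by `smartctl -A`, and its values are made up.

```python
TIMEOUT_ATTR = 188        # SMART attribute ID discussed above
PROBLEM_THRESHOLD = 12    # assumption: "more than a dozen" triggers the notice


def parse_smart_attributes(report: str) -> dict[int, int]:
    """Map attribute ID -> raw value from `smartctl -A` style text."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[int(fields[0])] = int(fields[9])
    return attrs


def timeout_status(report: str) -> str:
    """Return 'ok', 'problem', or 'unknown' based on attribute 188."""
    raw = parse_smart_attributes(report).get(TIMEOUT_ATTR)
    if raw is None:
        return "unknown"
    return "problem" if raw > PROBLEM_THRESHOLD else "ok"


# Made-up sample data, not taken from any disk in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       27
"""

print(timeout_status(SAMPLE))
```

Feeding the real `smartctl -A` output of one of the affected disks into `timeout_status` would show whether attribute 188 alone explains the notice.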

I would at least check whether there is a reason for the high value. If there is no obvious reason, I would replace the disks in a production environment, or at least check them with a low-level tool from the disk manufacturer.

Background
There are indications that some high SMART values can predict a future disk failure, e.g. "These 5 SMART errors help you predict your hard drive's death".
 

Bob T Penguin

Member
Dec 16, 2015
Hello berchti,
Did you have to enable smartmontools in a napp-it menu, or was it already active?
I have napp-it 19.06h with OmniOS 151030bf, and smartmontools doesn't show up in the napp-it menu.
If I look at Disks > Smartinfo, it says smartmontools is not installed.

Smartmontools appears to be installed under /opt/ooce/smartmontools, but the smartd service is not running.


Many thanks
Bob
 

berchti

Hi Bob

I had the same issue, but I can't remember exactly what I did to solve it. I think I installed smartmontools from source and then started the service manually with svcadm...
 

Bob T Penguin

Thanks berchti,

I went through the napp-it install script; it installs smartmontools from the OmniOS extra repo.
I've manually installed smartmontools on another OmniOS box. The smartd service is in the disabled state after install, and it doesn't autostart on system bootup either.
It seems smartd must be started manually.

Regards
Bob
 

Bob T Penguin

...starting smartmontools manually was not a good idea for me!
My messages log started logging lots of these messages:

Jun 26 14:13:28 ServerA ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 0 has task file error
Jun 26 14:13:28 ServerA ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 0 is trying to do error recovery
Jun 26 14:13:28 ServerA ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 0 task_file_status = 0x451
Jun 26 14:13:28 ServerA ahci: [ID 332577 kern.warning] WARNING: ahci0: the below command (s) on port 0 are aborted
Jun 26 14:13:28 ServerA ahci: [ID 117845 kern.warning] WARNING: satapkt 0xfffffe2d120cfc78: cmd_reg = 0xb0 features_reg = 0x0 sec_count_msb = 0x0 lba_low_msb = 0x4f lba_mid_msb = 0x4f lba_high_msb = 0x0 sec_count_lsb = 0x0 lba_low_lsb = 0x1 lba_mid_lsb = 0x4f lba_high_lsb = 0xc2 device_reg = 0x0 addr_type = 0x4 cmd_flags = 0x11
Jun 26 14:13:29 ServerA ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 0 succeed


...and FMD disabled the service.

Now zpool status is showing iostat messages

1593178256114.png

I'll leave smartd disabled.

Any idea how to clear the messages?

Thanks
Bob
 

gea

On Solaris/OpenIndiana the napp-it installer compiles smartmontools and does not autostart it as a service. On OmniOS I use the default smartmontools from the extra repo. Napp-it can work with both options.

About the iostat counter:
The counter sums errors since bootup. It is not intended to be cleared manually. A reboot sets the counters back to 0.
 

berchti

Hi gea

Thanks for your tips. I just ran the Seagate bootable tool and tested all the Seagate disks with it, but none of them showed problems. Do you know another tool I could try?

Thanks,
Patrick
 

gea

I usually test my problem disks on a different machine (mostly Windows with WD Data Lifeguard and an intensive test). But it is highly unlikely that all disks in zpool2 are bad.

Is there something common/different in pool2 like cabling, power, HBA/firmware?

btw
I have just googled "ST8000VN004-2M21"
There are some problem reports related to this disk model
 

berchti

Thanks for your help so far. I also thought about a power and/or HBA problem. I just ordered a new 850W power supply (I think a 450W or 500W unit is installed now) and a new HBA (9207-8i from eBay).

The strange thing is that neither FreeNAS nor the Ubuntu ZFS storage server, which I had installed out of curiosity, showed me any warnings.

I'll keep you posted, guys.
 

gea

Have you checked SMART values 187 and 188 there too, to decide whether it is a firmware/driver problem for these Seagate disks related to smartmontools on OmniOS?
 

ARNiTECT

Member
Jan 14, 2020
I am also seeing SMART errors on my Seagate HDD. It is a new 2.5" 5TB drive extracted from a portable USB casing. My other drives are WD and are not showing SMART errors.
I also have another portable 2.5" Seagate that I will open up and try later.
 


zepanv

New Member
Oct 11, 2013
All of our Seagate drives show as "problem" since upgrading napp-it. I have about 35 ST4000NM0023 drives and 22 ST900MM0006 drives that all say "problem". In the same system there are about 70 HGST drives that report "ok". I'm guessing that something about the way Seagate reports its values differs from what is expected.

Speaking of smart tests, I see in Jobs > Reports > help there is a smart job but I am not sure exactly how to schedule it. I have created a daily job with that report but I don't get any emails and the job output is just "info:"
 

gea

The first disk seems bad; replace it.
The second disk has a SMART check error; I would replace it.
Disk no. 6 is OK, without SMART warnings.

Disks 3-5 report SMART warnings. This may mean nothing, and the disks can continue to work for years. The point of the warning is that they have an increased probability of a future failure. In a home environment you may ignore it until they finally fail; on critical storage you should replace them now.

I would at least run a short or long SMART check.
On check errors, replace. With only a warning on SMART value 5 or 197, you can decide.
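
As a rough illustration, the rule of thumb above could be written down like this in Python. The function name, parameters, and return strings are hypothetical and only encode the advice in this post; they are not part of napp-it or smartmontools.

```python
def recommend(self_test_passed: bool, warning_attrs: set[int], critical: bool) -> str:
    """Encode the advice above for a single disk.

    self_test_passed: result of a short or long SMART self-test
    warning_attrs:    SMART attribute IDs that raised a warning (e.g. {5, 197})
    critical:         True for critical storage, False for a home setup
    """
    if not self_test_passed:
        # A failed SMART check means replace, regardless of environment.
        return "replace"
    if warning_attrs:
        # Warnings only: raised failure probability, so it depends on the setup.
        return "replace" if critical else "keep and monitor"
    return "keep"


# Example: self-test passed, warning on attribute 197, home environment.
print(recommend(True, {197}, critical=False))
```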

see for ex

 

davros123

New Member
Feb 15, 2021
Thanks gea. I very much appreciate your help.
I am doing a fresh backup to my backup server now. Once that is done I'll run some more SMART tests, do a scrub and make a fresh rsync backup.
Then I'll power down, remove disk 2 and check it in my Windows PC.

I changed servers (to an AMD EPYC 7203P on an H12SSL) and have had some issues with these controllers in ESXi, so I wonder if my server swap has triggered any issues.
In any case, I want to move to ESXi 7 at some point, so I have ordered new HBAs (LSI3008) to replace the old ones, and we'll see how it goes.

Once again, thanks for the help.