PX04SVB320 - Every time SMART accessed, Non-medium error count increments

serverian · May 4, 2020

I recently acquired 14 used Toshiba PX04SVB320 SSDs off eBay.

For some strange reason, every time I read their SMART values using smartctl, their "Non-medium error count" values increment by 1. This happens on all of them, without exception.

First I installed them in a Dell R620 server with an H710P RAID controller (similar to the LSI 9271-8i). Then I installed them in a Supermicro server with an LSI 9266-8i. The outcome was the same.

I'll test them on an HBA tomorrow to see if the RAID card is somehow the cause.

I ran a fio write test on one of them to see if that drive's value would go higher than the others on the next check. But no, it made no difference. They were all just incremented by 1 again. So this does not seem to be related to I/O operations.

I'm actively using Toshiba PX02SMB160 series SSDs and I don't see this happening with them. They are SAS SSDs from the same maker, though an older generation.

I was wondering if anyone has seen something similar with any SAS SSDs.

From what I've read, "Non-medium error count" errors are just SCSI command errors, which might be caused by a bad cable, backplane or card. But that doesn't seem to be the case here. So it might be some other SCSI command returning an error. Any idea what it might be?

Here is how I read those values. Note that only about a second passes between these two commands, and all the drives are idle.

Code:

root@debian:~# for i in {0..11}; do echo -e "\nDisk $i"; smartctl -a -d megaraid,$i /dev/sdb | grep 'Serial number\|Non-medium error count'; done

Disk 0
Serial number:        56G0A00KT40E
Non-medium error count:      143

Disk 1
Serial number:        56G0A00LT40E
Non-medium error count:      132

Disk 2
Serial number:        56G0A01DT40E
Non-medium error count:      140

Disk 3
Serial number:        56G0A00TT40E
Non-medium error count:      126

Disk 4
Serial number:        56G0A01GT40E
Non-medium error count:      136

Disk 5
Serial number:        56G0A00XT40E
Non-medium error count:      132

Disk 6
Serial number:        56G0A00AT40E
Non-medium error count:      139

Disk 7
Serial number:        56G0A00QT40E
Non-medium error count:      142

Disk 8
Serial number:        56G0A00NT40E
Non-medium error count:      143

Disk 9
Serial number:        56G0A00MT40E
Non-medium error count:      135

Disk 10
Serial number:        56G0A006T40E
Non-medium error count:      136

Disk 11
Serial number:        56E0A00LT40E
Non-medium error count:      276
root@debian:~# for i in {0..11}; do echo -e "\nDisk $i"; smartctl -a -d megaraid,$i /dev/sdb | grep 'Serial number\|Non-medium error count'; done

Disk 0
Serial number:        56G0A00KT40E
Non-medium error count:      144

Disk 1
Serial number:        56G0A00LT40E
Non-medium error count:      133

Disk 2
Serial number:        56G0A01DT40E
Non-medium error count:      141

Disk 3
Serial number:        56G0A00TT40E
Non-medium error count:      127

Disk 4
Serial number:        56G0A01GT40E
Non-medium error count:      137

Disk 5
Serial number:        56G0A00XT40E
Non-medium error count:      133

Disk 6
Serial number:        56G0A00AT40E
Non-medium error count:      140

Disk 7
Serial number:        56G0A00QT40E
Non-medium error count:      143

Disk 8
Serial number:        56G0A00NT40E
Non-medium error count:      144

Disk 9
Serial number:        56G0A00MT40E
Non-medium error count:      136

Disk 10
Serial number:        56G0A006T40E
Non-medium error count:      137

Disk 11
Serial number:        56E0A00LT40E
Non-medium error count:      277
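
The two runs above can also be compared per serial number instead of by eye. What follows is only a minimal Python sketch, assuming each run's grep-filtered output was saved as text in the format shown above. The names `parse_counts` and `count_deltas` are made up for illustration and are not part of smartmontools.

```python
import re

# These patterns match the two lines kept by the grep filter in the loop above.
SERIAL_RE = re.compile(r"^Serial number:\s+(\S+)")
COUNT_RE = re.compile(r"^Non-medium error count:\s+(\d+)")


def parse_counts(text):
    """Map serial number -> Non-medium error count from filtered smartctl output."""
    counts = {}
    serial = None
    for line in text.splitlines():
        line = line.strip()
        m = SERIAL_RE.match(line)
        if m:
            serial = m.group(1)
            continue
        m = COUNT_RE.match(line)
        if m and serial is not None:
            counts[serial] = int(m.group(1))
            serial = None  # wait for the next "Serial number" line
    return counts


def count_deltas(before, after):
    """Return {serial: after - before} for drives that appear in both runs."""
    old = parse_counts(before)
    new = parse_counts(after)
    return {s: new[s] - old[s] for s in old if s in new}


if __name__ == "__main__":
    # Two drives copied from the runs above (Disk 0 and Disk 11).
    run1 = (
        "Disk 0\n"
        "Serial number:        56G0A00KT40E\n"
        "Non-medium error count:      143\n"
        "\n"
        "Disk 11\n"
        "Serial number:        56E0A00LT40E\n"
        "Non-medium error count:      276\n"
    )
    run2 = (
        "Disk 0\n"
        "Serial number:        56G0A00KT40E\n"
        "Non-medium error count:      144\n"
        "\n"
        "Disk 11\n"
        "Serial number:        56E0A00LT40E\n"
        "Non-medium error count:      277\n"
    )
    for serial, delta in count_deltas(run1, run2).items():
        print(f"{serial} {delta:+d}")
```

For the two sample drives this prints a delta of +1 each, matching the behaviour described above.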

And here's the complete SMART output of one:

Code:

root@debian:~# smartctl -a -d megaraid,0 /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              PX04SVB320
Revision:             0106
Compliance:           SPC-4
User Capacity:        3,200,631,791,616 bytes [3.20 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x500003970c88f3b9
Serial number:        56G0A00KT40E
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue May  5 04:37:02 2020 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 0%
Current Drive Temperature:     23 C
Drive Trip Temperature:        64 C

Manufactured in week 20 of year 2016
defect list format 6 unknown
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0       7842.635           0
write:         0        0         0         0          0       3781.659           0
verify:        0        0         0         0          0          0.108           0

Non-medium error count:      145

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -     171                 - [-   -    -]
# 2  Background short  Completed                   -     171                 - [-   -    -]

Long (extended) Self Test duration: 1800 seconds [30.0 minutes]

Terry Kennedy · May 5, 2020

serverian said:
I recently acquired 14 used Toshiba PX04SVB320 SSDs off eBay.

For some strange reason, every time I read their SMART values using smartctl, their "Non-medium error count" values increment by 1. This happens on all of them, without exception.

It is either a bug in the drive firmware or a fault in the specification. Unfortunately, smartmontools has no way to work around it, as its "quirks" mechanism doesn't extend to SAS drives (or to this problem on SATA drives). Try this patch:

Code:

*** scsicmds.cpp.bak    Sun Dec  2 11:07:26 2018
--- scsicmds.cpp        Tue Jun 25 20:21:13 2019
***************
*** 1129,1134 ****
--- 1129,1148 ----
   * command not supported, 3 if field in command not supported, 101 if
   * defect list not found (e.g. SSD may not have defect list) or returns
   * negated errno. SBC-3 section 5.18 (rev 35; vale Mark Evans) */
+ /*
+ /*
+  * SanDisk LT[nn]00MO/WM/RO SSD units with (at least) Dell Firmware D416
+  * reject this command and return "Defect list not found" (0/1c/00). But
+  * requesting the defect list logs errors like (wrapped for convenience):
+  * mfi0: 7210 (614822013s/0x0002/info) - Unexpected sense: PD 03(e0x20/s3)
+  * Path 5001e8200289f13a, CDB: b7 0c 00 00 00 00 00 00 00 08 00 00, Sense:
+  * 1/1c/00
+  * Worse, this increments the "Non-medium error count" on the drive. So
+  * this patch should be applied ONLY to systems including the above drive
+  * models (or other drives exhibiting the same misbehavior, such as the
+  * Pliant / SanDisk LB[n]06M/S/R.
+  */
+ 
  int
  scsiReadDefect12(scsi_device * device, int req_plist, int req_glist,
                   int dl_format, int addrDescIndex, uint8_t *pBuf, int bufLen)
***************
*** 1138,1143 ****
--- 1152,1158 ----
      uint8_t cdb[12];
      uint8_t sense[32];
  
+     return 101; /* just bail out without doing anything */
      memset(&io_hdr, 0, sizeof(io_hdr));
      memset(cdb, 0, sizeof(cdb));
      io_hdr.dxfer_dir = DXFER_FROM_DEVICE;

Here is another patch that fixes "Unexpected sense" on drives attached to PERC H700 (and possibly other) controllers:

Code:

*** scsiprint.cpp.orig  Thu Dec 27 12:07:44 2018
--- scsiprint.cpp       Tue Jun 25 19:45:44 2019
***************
*** 138,144 ****
          if (err)
              return;
          memcpy(sup_lpgs, gBuf, LOG_RESP_LEN);
!     } else if ((scsi_version >= SCSI_VERSION_SPC_4) &&
                 (scsi_version <= SCSI_VERSION_HIGHEST)) {
          /* unclear what code T10 will choose for SPC-6 */
          memcpy(sup_lpgs, gBuf, LOG_RESP_LEN);
--- 138,151 ----
          if (err)
              return;
          memcpy(sup_lpgs, gBuf, LOG_RESP_LEN);
! /*
!  * For FreeBSD we change this check to only trigger on SPC-5, as SPC-4
!  * drives would otherwise trigger a request that the Dell PERC H700 con-
!  * troller doesn't support, logging errors like (wrapped for convenience):
!  * mfi0: 7204 (614818162s/0x0002/info) - Unexpected sense: PD 03(e0x20/s3) 
!  * Path 5001e8200289f13a, CDB: 4d 00 40 ff 00 00 00 3e fc 00, Sense: 5/24/00
!  */
!     } else if ((scsi_version >= SCSI_VERSION_SPC_5) &&
                 (scsi_version <= SCSI_VERSION_HIGHEST)) {
          /* unclear what code T10 will choose for SPC-6 */
          memcpy(sup_lpgs, gBuf, LOG_RESP_LEN);

azev · May 5, 2020

I have similar issues with SanDisk Lightning II drives as well. I have many of these, and all of them throw a non-medium error every time you run a smartctl command. If you reboot, I notice the counter drops back to 0 and then climbs incrementally again. Four of my drives, the Ultra version, also have a firmware bug that causes the drive to stop working if the server loses power abruptly. Sometimes a graceful reboot can do it as well, although only in rare cases. You have to format the drive using camcontrol or scli to get it working again. All these firmware bugs are bugging me!

serverian · May 6, 2020

@Terry Kennedy Thank you. Is this only relevant to smartmontools? Are there any other issues with using these drives?

serverian · May 6, 2020

@azev My counts stay the same after reboots, though.

serverian · May 6, 2020

@Terry Kennedy So I've patched the files as you described, and now smartctl does not read those values at all and therefore does not trigger the error. I'm not sure how this is supposed to help, though.

Code:

root@debian:~/smartmontools-7.1# ./smartctl -a -d megaraid,0 /dev/sdb
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              PX04SVB320
Revision:             0106
Compliance:           SPC-4
User Capacity:        3,200,631,791,616 bytes [3.20 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x500003970c88f3b9
Serial number:        56G0A00KT40E
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed May  6 16:26:15 2020 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

Device does not support Self Test logging

Terry Kennedy · Jun 1, 2020

serverian said:
@Terry Kennedy So I've patched the files as you described, and now smartctl does not read those values at all and therefore does not trigger the error. I'm not sure how this is supposed to help, though.

I see you're on Linux with a different version of smartctl. FreeBSD also exposes the drives through the CAM passthru, while your Linux box uses "-d megaraid". I don't know if any of that is relevant. I only verified my patches against those specific drives on FreeBSD, but I wouldn't expect whole sections to go missing, as they seem to have done for you. Here's one of mine:

Code:

(0:2) gate:/sysprog/terry# smartctl -a /dev/pass0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 10.4-STABLE amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SanDisk
Product: LT0400MO
Revision: D417
Compliance: SPC-4
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5001e8200289f174
Serial number: 42594676
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Jun 1 03:35:54 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
Drive Trip Temperature:        60 C

Manufactured in week 17 of year 2016
Specified cycle count over device lifetime: 100000
Accumulated start-stop cycles: 38
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 0 5156.740 0
write:         0        0         0         0          0      13582.945           0

Non-medium error count:       16

SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed 96 7 - [- - -]
# 2 Background short Completed 96 2 - [- - -]
# 3  Background long   Completed                  96       2                 - [-   -    -]

Long (extended) Self-test duration: 1800 seconds [30.0 minutes]

I don't know why the column formatting got trashed, but with my broken arm in a cast, I'm not going to bother trying to fix it up - the meanings should still be clear.

Ocean7 · Sep 23, 2020

Guys, I have the same problem with FreeNAS (FreeBSD).
Interestingly enough, I noticed this issue right after an upgrade, and that was the reason I rolled back.
So in my case, smartctl 7.1 increments that error counter by 1, and smartctl version 6.6 works fine.
My problem is that we're using smartctl for monitoring. So every check increments "Non-medium error count" by one, which triggers a lot of alerts.

We're using MB6000FEDAU and similar drives of different sizes (4TB-6TB) with different cards and backplanes. ALL servers have this problem with smartctl.

I posted more info on the FreeNAS Jira, but they said that my issue is not reproducible: NAS-105721

Is there anything that can be done?
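
One possible approach on the monitoring side is to alert only on counter growth that the checks themselves cannot explain. This is only a rough Python sketch, and it assumes that each smartctl poll adds exactly 1 to the counter, as observed in this thread. `unexplained_errors` is a hypothetical helper and not part of smartmontools or FreeNAS.

```python
def unexplained_errors(previous, current, polls_between=1):
    """Return counter growth that is not explained by our own smartctl polls.

    Assumption (from the behaviour reported in this thread): every poll
    adds exactly 1 to "Non-medium error count". A counter that went
    backwards (for example after a reset) is treated as no new errors.
    """
    growth = current - previous
    if growth < 0:
        return 0
    return max(0, growth - polls_between)


if __name__ == "__main__":
    # One poll happened between the two readings, so +1 is expected noise.
    print(unexplained_errors(143, 144))   # growth fully explained by the poll
    # A jump of 7 with a single poll leaves 6 errors worth alerting on.
    print(unexplained_errors(143, 150))
```

A monitoring check built this way would need to store the previous reading and know how many polls happened in between. If something else on the host also queries the drive, the assumption of one increment per poll no longer holds.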
