HGST (WD) U.2 NVMe drive with strange (to me) failure state. Works... but doesn't?


Prophes0r

Active Member
Sep 23, 2023
East Coast, USA
I bought 4x drives from THIS deals post (which I posted).

3x work fine.
1x is...strange.

The 3x that work were DEFINITELY used for "read intensive" workloads.
Code:
Data Units Read:                    109,454,739,956 [56.0 PB]
Data Units Written:                 577,844,157 [295 TB]
Host Read Commands:                 88,021,380,192
Host Write Commands:                653,025,118
...
Power On Hours:                     42,956
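
For anyone wondering how the bracketed totals come out of the raw counters: NVMe reports "data units" in thousands of 512-byte blocks, so each unit is 512,000 bytes. A minimal sketch of the arithmetic, using the numbers from the log above (the variable and function names are just mine for illustration):

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes (SMART / Health log definition).
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_bytes(units: int) -> int:
    """Convert an NVMe data-unit counter to bytes."""
    return units * BYTES_PER_DATA_UNIT


# Counters copied from the smartctl output above.
units_read = 109_454_739_956
units_written = 577_844_157

bytes_read = data_units_to_bytes(units_read)
bytes_written = data_units_to_bytes(units_written)
read_write_ratio = units_read / units_written

print(f"read:    {bytes_read / 1e15:.1f} PB")
print(f"written: {bytes_written / 1e12:.1f} TB")
print(f"read/write ratio: about {read_write_ratio:.0f}:1")
```

A ratio in the high hundreds-to-one range is what "read intensive" looks like in practice: very little of the rated write endurance has been used.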

But as I said, one of them is acting up.
As boot starts, all 4x have the onboard amber LED lit.
Then the LEDs all go out.
Then they all have the green LED lit.
THEN the 1 drive lights the amber LED again after a few seconds.

All 4x drives show the same results with lspci -vv.
Same number of active lanes. Same power states. I can't find any differences.
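
In case it helps anyone doing the same comparison, here is a rough sketch of how the link capability and negotiated link status could be pulled out of saved lspci -vv text and compared. The sample text is a hypothetical excerpt for illustration, not output from these drives, and link_summary is my own helper name.

```python
import re

# Matches the LnkCap / LnkSta lines that lspci -vv prints for a PCIe device.
# LnkCap2 / LnkSta2 are skipped because the colon must follow the name directly.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+GT/s).*?Width\s+(x\d+)")


def link_summary(lspci_text: str) -> dict:
    """Return {'LnkCap': (speed, width), 'LnkSta': (speed, width)}."""
    return {m.group(1): (m.group(2), m.group(3)) for m in LINK_RE.finditer(lspci_text)}


# Hypothetical excerpt, for illustration only.
sample = """
        LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <4us, L1 <4us
        LnkSta: Speed 8GT/s (ok), Width x4 (ok)
"""

summary = link_summary(sample)
print(summary)
print("link below capability:", summary["LnkSta"] != summary["LnkCap"])
```

Running this over the saved output of each drive and comparing the dictionaries is a quick way to confirm that speed and lane count really are identical across all four.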

All 4x drives show up with lsblk.

All 4x drives respond to smartctl -a DRIVE.
Except... the one weird drive has a bunch of the log values zeroed out.
It also has entries in the error log.

One of the working drives:
Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       HUSPR3238ADP301
Serial Number:                      STM0001A9185
Firmware Version:                   KMGNP131
PCI Vendor/Subsystem ID:            0x1c58
IEEE OUI Identifier:                0x000cca
Controller ID:                      3
NVMe Version:                       <1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          3,820,752,101,376 [3.82 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            000cca 0060074b80
Local Time is:                      Thu Oct 24 22:33:41 2024 EDT
Firmware Updates (0x09):            4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x01):         S/H_per_NS

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0    15000   15000
 1 +    20.00W       -        -    1  1  1  1    15000   15000
 2 +    15.00W       -        -    2  2  2  2    15000   15000
 3 +    10.00W       -        -    3  3  3  3    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -     512       8         2
 2 -    4096       0         0
 3 -    4096       8         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    109,454,739,956 [56.0 PB]
Data Units Written:                 577,844,157 [295 TB]
Host Read Commands:                 88,021,380,192
Host Write Commands:                653,025,118
Controller Busy Time:               2,403,146
Power Cycles:                       78
Power On Hours:                     42,956
Unsafe Shutdowns:                   72
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged

The weird drive:
Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       HUSPR3238ADP301
Serial Number:                      CJH001000DB6
Firmware Version:                   KMGNP131
PCI Vendor/Subsystem ID:            0x1c58
IEEE OUI Identifier:                0x000cca
Controller ID:                      3
NVMe Version:                       <1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          3,820,752,101,376 [3.82 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            000cca 0061164801
Local Time is:                      Thu Oct 24 22:33:57 2024 EDT
Firmware Updates (0x09):            4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x01):         S/H_per_NS

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0    15000   15000
 1 +    20.00W       -        -    1  1  1  1    15000   15000
 2 +    15.00W       -        -    2  2  2  2    15000   15000
 3 +    10.00W       -        -    3  3  3  3    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -     512       8         2
 2 -    4096       0         0
 3 -    4096       8         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    0
Data Units Written:                 0
Host Read Commands:                 0
Host Write Commands:                0
Controller Busy Time:               0
Power Cycles:                       0
Power On Hours:                     42,290
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      2

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          2     -       -  0xdead      -            0     0     -
  1          1     -       -  0xdead      -            0     0     -
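
To make the difference between the two outputs easier to spot, here is a small sketch that parses the "Key: value" lines of smartctl -a text and flags counters that contradict each other. The rules in suspicious() are my own heuristic, not a vendor diagnostic, and the input lines are copied from the two outputs above.

```python
def parse_smart(text: str) -> dict:
    """Parse 'Key:   value' lines from smartctl -a text into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def to_int(value: str) -> int:
    """Turn '42,956' or '109,454,739,956 [56.0 PB]' into an int."""
    return int(value.split()[0].replace(",", ""))


def suspicious(fields: dict) -> list:
    """Flag counters that contradict each other (heuristic only)."""
    notes = []
    if to_int(fields["Power On Hours"]) > 0 and to_int(fields["Power Cycles"]) == 0:
        notes.append("power-on hours recorded but zero power cycles")
    if to_int(fields["Error Information Log Entries"]) > 0:
        notes.append("error log has entries")
    return notes


good = parse_smart("""
Power Cycles:                       78
Power On Hours:                     42,956
Error Information Log Entries:      0
""")
bad = parse_smart("""
Power Cycles:                       0
Power On Hours:                     42,290
Error Information Log Entries:      2
""")

print("working drive:", suspicious(good))
print("weird drive:  ", suspicious(bad))
```

A drive cannot accumulate 42,290 power-on hours with zero power cycles, so the zeroed counters look like lost or reset log data rather than a genuinely unused drive.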

The non-working drive just throws a bunch of errors if you try to create a partition on it.

I've tried the normal troubleshooting steps.
  • Switch the ports
  • Use a different cable/card
The working drives keep working.
The broken one stays the same.


Is this thing just dead?
I'm assuming so.

Any ideas?
 

CyklonDX

Well-Known Member
Nov 8, 2022
Error Information Log Entries:      2

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          2     -       -  0xdead      -            0     0     -
  1          1     -       -  0xdead      -            0     0     -


/\ dead
 

Prophes0r

Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          2     -       -  0xdead      -            0     0     -
0xdead is a pretty common error code used in a bunch of places to catch your eye.
It often doesn't literally mean "dead".
It's just something immediately recognizable that can be spelled with hex digits, like 0xb00b, 0xbad, or 0xb1ade.
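
A quick sketch of what I mean: these are ordinary hexadecimal numbers that happen to read as words. The check below is a toy helper of my own (is_hexspeak is not from any library), and the 16-bit test assumes the status value is stored in a two-byte field.

```python
HEX_DIGITS = set("0123456789abcdef")


def is_hexspeak(word: str) -> bool:
    """True if the word can be written using only hexadecimal digits."""
    return bool(word) and set(word.lower()) <= HEX_DIGITS


for word in ["dead", "b00b", "bad", "b1ade"]:
    value = int(word, 16)
    fits = value <= 0xFFFF  # assumption: a 16-bit status field
    print(f"0x{word}: hexspeak={is_hexspeak(word)}, decimal={value}, fits in 16 bits={fits}")
```

So 0xdead is just the number 57005 dressed up to stand out in a log. It tells you the firmware wrote a placeholder or sentinel value, not necessarily what went wrong.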

I was hoping someone would come back with "Oh yeah, the firmware is corrupted and you can manually reflash it" or something like that.
I dunno. These enterprise drives are new to me.

I could see how having out-of-band management over SPI could lead to more ways to fix stuff than on a regular consumer drive.