WD100EMAZ strange noise


msg7086

Active Member
May 2, 2017
Just opened my Black Friday WD Easystore. (One month late because I've been lazy.)

After shucking it and putting it to use, I noticed random noise from the drive when reading/writing. It sounds to me like the head loses the track and tries to relocate itself. I installed Debian onto this drive and ran a few random apt-get installs; I heard the "head reset" noise a few times during reads and writes, and apt lags for a fraction of a second every time I hear it.

So I thought, screw it, let me scan this drive.

I set the timeout threshold to 300ms per cylinder, since reads are normally 10-40ms. I let it run overnight (20 hours so far), and today I see a whopping 800+ timed-out cylinders scattered all over the disk, and the count is still increasing.

0-60ms: 1,000,000+
60-200ms: 4,500+
200-300ms: 0
300ms+ (timeout): 850+

For those who have the same drive: do you observe similar issues? Do you hear random, frequent "head reset" noises when reading or writing? I want to make sure it's a problematic drive before asking for an exchange.

Note that I unmounted the partitions before doing the disk scans, so nothing else on the system should be interfering with the results.
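For anyone who wants to reproduce this kind of scan on Linux without any special tool, here is a rough shell sketch of the same idea: read the disk sequentially in chunks and flag any read slower than the 300ms threshold used above. The /dev/sdX name is a placeholder; run it as root and only against an unmounted drive.

#!/bin/bash
# Rough read-latency scan (a sketch, not the tool used above).
# Assumption: the target drive is /dev/sdX and is not mounted.
DEV=/dev/sdX
SECTORS=$(blockdev --getsz "$DEV")   # drive size in 512-byte sectors
STEP=2048                            # read 1 MiB at a time
for ((s = 0; s < SECTORS; s += STEP)); do
    t0=$(date +%s%N)
    dd if="$DEV" of=/dev/null bs=512 skip="$s" count="$STEP" iflag=direct status=none
    t1=$(date +%s%N)
    ms=$(( (t1 - t0) / 1000000 ))
    if (( ms > 300 )); then
        echo "slow read at sector $s: ${ms}ms"
    fi
done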
 

BLinux

cat lover server enthusiast
Jul 7, 2016
artofserver.com
That's a problem. I have 12 of those fully burned in, and they never make such noises. Really quiet, actually.
 

madbrain

Active Member
Jan 5, 2019
Which software are you using to scan the drives? I have 5 of these in my NAS in a ZFS RAIDZ2. I have seen some disconnects happen randomly, which cause a very short resilver.

I used h2testw under Windows to test the surface of each of them for nearly 24 hours and found no errors, though.
 

msg7086

Active Member
May 2, 2017
It's DiskGenius, made by a Chinese company. They also have an English version called PartitionGuru, which does basically the same thing. The free version is enough for scanning drives and can even fix logical bad sectors.

However, random disconnects don't sound like a disk surface / bad sector issue. Do you observe the disconnections on a single drive or on random drives?
 

madbrain

Active Member
Jan 5, 2019
Seems to be random. It's mostly on bootup. Sometimes the ZFS volume won't import automatically, but it will import manually. I think some disks are taking longer than others to spin up; dmesg shows the drives being detected at different times across a roughly 10-second interval.
I don't think the surface is bad on any of them.
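For reference, the per-drive detection times can be pulled straight out of the kernel log; the timestamps in brackets are seconds since the kernel started:

# Show when each WD100EMAZ was detected during boot
dmesg | grep 'WDC WD100EMAZ'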
 

madbrain

Active Member
Jan 5, 2019
Pretty much every bootup there is an issue. SMART data looks OK on all drives, though the number of power-ups is wildly different between them, strangely.
I don't get any issues with suspend/resume. I just rebooted and now I have issues with 2 drives. Definitely different serial numbers than before, so it's not consistent.

madbrain@SERVER10G:~$ sudo tcsh
[sudo] password for madbrain:
SERVER10G:~# zpool status
no pools available
SERVER10G:~# zpool import array
SERVER10G:~# zpool status
pool: array
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: ZFS Message ID: ZFS-8000-9P
scan: resilvered 355M in 0h0m with 0 errors on Fri Jan 4 21:15:49 2019
config:

NAME          STATE     READ WRITE CKSUM
array         ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    sdc       ONLINE       0     0     0
    sde       ONLINE       0     0     2
    sda       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
    sdf       ONLINE       0     0     0

errors: No known data errors
SERVER10G:~# zpool status
pool: array
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Jan 8 18:40:51 2019
22.4G scanned out of 20.0T at 234M/s, 24h57m to go
8.75G resilvered, 0.11% done
config:

NAME          STATE     READ WRITE CKSUM
array         ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    sdc       ONLINE       0     0     0
    sde       ONLINE       0     0     2  (resilvering)
    sda       ONLINE       0     0     0
    sdd       ONLINE       0     0     0
    sdf       ONLINE       0     0     0  (resilvering)

errors: No known data errors
SERVER10G:~# zpool status

I don't think it will really take the 25 hours it says to resilver; it's already down to 18 hours after just a few minutes.
I have never had 2 drives resilver at a time before, though...
 

msg7086

Active Member
May 2, 2017
TBH, this sounds to me like a power supply issue. Either the PSU or a cable.
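A quick cross-check from the OS side, assuming smartmontools is installed and the drives are visible as /dev/sda ... /dev/sdg (placeholder names): UDMA_CRC_Error_Count tends to climb when a SATA data cable is marginal, while power problems usually show up as unexpected power-cycle or power-off-retract counts.

# Pull the SMART attributes most relevant to cable/power trouble from each drive
for d in /dev/sd[a-g]; do
    echo "== $d =="
    smartctl -A "$d" | grep -Ei 'Power_Cycle|Power-Off_Retract|UDMA_CRC|Reallocated|Pending'
done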
 

madbrain

Active Member
Jan 5, 2019
Thanks. The resilver of the two HDDs took 1h16m this time, the longest I have seen it take after these boot issues.

I noticed that one of the two LSI cards the disks are connected to had back-level firmware and BIOS, so I just updated them.
On the next (warm) boot after that, the ZFS pool still didn't automatically import, but there were no errors on manual import.
I then did a shutdown and a cold boot. Now two disks are missing and not even resilvering!

madbrain@SERVER10G:~$ zpool status
pool: array
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: ZFS Message ID: ZFS-8000-4J
scan: resilvered 673G in 1h16m with 0 errors on Tue Jan 8 19:57:30 2019
config:

NAME                     STATE     READ WRITE CKSUM
array                    DEGRADED     0     0     0
  raidz2-0               DEGRADED     0     0     0
    sdc                  ONLINE       0     0     0
    sdf                  ONLINE       0     0     0
    sda                  ONLINE       0     0     0
    8203626860017461470  FAULTED      0     0     0  was /dev/sde1
    167936804069715386   UNAVAIL      0     0     0  was /dev/sdg1

You may be right about the power issue. Time to open the beast and try running some of the drives off another PSU cable.

OTOH, fdisk -l shows that all five 10TB disks are powered up. Very weird.

dmesg isn't reporting any issues either.

[ 2.898946] mpt2sas_cm0: host_add: handle(0x0001), sas_addr(0x500605b006de0ba0), phys(8)
[ 3.029323] mpt2sas_cm1: host_add: handle(0x0001), sas_addr(0x500605b008f2c6b0), phys(8)
[ 3.051632] scsi 8:0:0:0: Direct-Access Generic- USB3.0 CRW -0 1.00 PQ: 0 ANSI: 0 CCS
[ 3.066013] scsi 8:0:0:1: Direct-Access Generic- USB3.0 CRW -1 1.00 PQ: 0 ANSI: 0 CCS
[ 3.080115] scsi 8:0:0:2: Direct-Access Generic- USB3.0 CRW -2 1.00 PQ: 0 ANSI: 0 CCS
[ 3.528055] scsi 7:0:0:0: Direct-Access ATA WDC WD100EMAZ-00 0A83 PQ: 0 ANSI: 6
[ 3.528057] scsi 7:0:0:0: SATA: handle(0x0009), sas_addr(0x4433221104000000), phy(4), device_name(0x0000000000000000)
[ 3.528058] scsi 7:0:0:0: enclosure logical id (0x500605b008f2c6b0), slot(7)
[ 3.528132] scsi 7:0:0:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[ 3.777924] scsi 7:0:1:0: Direct-Access ATA WDC WD100EMAZ-00 0A83 PQ: 0 ANSI: 6
[ 3.777926] scsi 7:0:1:0: SATA: handle(0x000a), sas_addr(0x4433221105000000), phy(5), device_name(0x0000000000000000)
[ 3.777927] scsi 7:0:1:0: enclosure logical id (0x500605b008f2c6b0), slot(6)
[ 3.778114] scsi 7:0:1:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[ 4.027902] scsi 7:0:2:0: Direct-Access ATA WDC WD100EMAZ-00 0A83 PQ: 0 ANSI: 6
[ 4.027904] scsi 7:0:2:0: SATA: handle(0x000b), sas_addr(0x4433221107000000), phy(7), device_name(0x0000000000000000)
[ 4.027905] scsi 7:0:2:0: enclosure logical id (0x500605b008f2c6b0), slot(4)
[ 4.027978] scsi 7:0:2:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[ 4.278127] scsi 7:0:3:0: Direct-Access ATA WDC WD100EMAZ-00 0A83 PQ: 0 ANSI: 6
[ 4.278129] scsi 7:0:3:0: SATA: handle(0x000c), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
[ 4.278130] scsi 7:0:3:0: enclosure logical id (0x500605b008f2c6b0), slot(5)
[ 4.278300] scsi 7:0:3:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[ 9.024095] mpt2sas_cm0: port enable: SUCCESS
[ 9.146894] scsi 0:0:0:0: Direct-Access ATA WDC WD100EMAZ-00 0A83 PQ: 0 ANSI: 6
[ 9.146897] scsi 0:0:0:0: SATA: handle(0x0009), sas_addr(0x4433221100000000), phy(0), device_name(0x0000000000000000)


 

madbrain

Active Member
Jan 5, 2019
I did another soft reboot and the disks showed up and resilvered automatically and instantly...

madbrain@SERVER10G:~$ zpool status
pool: array
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: ZFS Message ID: ZFS-8000-9P
scan: resilvered 172K in 0h0m with 0 errors on Tue Jan 8 23:28:30 2019
config:

NAME          STATE     READ WRITE CKSUM
array         ONLINE       0     0     0
  raidz2-0    ONLINE       0     0     0
    sdc       ONLINE       0     0     0
    sdf       ONLINE       0     0     0
    sda       ONLINE       0     0     0
    sde       ONLINE       0     0     3
    sdg       ONLINE       0     0     1

errors: No known data errors
madbrain@SERVER10G:~$

It seems there is a definite problem with the initial cold boot.

I had not seen the disks go completely missing and never show up until a reboot before, though.
 

madbrain

Active Member
Jan 5, 2019
I actually checked, and the disks spin up as soon as I press the power button from a power-off, so this isn't a problem with delayed spin-up. My Kill A Watt showed a peak of 189W during the drive spin-up; usage goes back down to 102W afterwards.
This time around, ZFS didn't automatically import on boot, but there were no problems on manual import. I can't seem to get the same results twice on boot...

I also noticed the device names for the disks aren't always the same, even when all 5 drives are present. Really weird.
 

msg7086

Active Member
May 2, 2017
With ZFS you are supposed to specify devices by a unique ID path, like /dev/disk/by-id/*. Could that be triggering the problem somehow?
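For reference, the stable names (and which sdX each one currently maps to) can be listed with a plain ls, nothing ZFS-specific:

# List stable device IDs and the kernel names they point at, skipping partition entries
ls -l /dev/disk/by-id/ | grep -v -- -part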
 

BLinux

cat lover server enthusiast
Jul 7, 2016
artofserver.com
Oh, I didn't know that. I followed the Ubuntu tutorial on ZFS, which instructs you to use the /dev names.

Setup a ZFS storage pool | Ubuntu tutorials

Is there any way to fix the pool to use different device names?
Export the pool:

# zpool export array

Now, import the pool while specifying the path:

# zpool import -d /dev/disk/by-id array

If that succeeds, the cache info should update, and the next time you can import without -d and it will use the cached block device paths.

I like to use /dev/disk/by-id for SATA drives, but for SAS, I prefer /dev/disk/by-path instead.
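If the pool still comes back under the old names after a reboot, one more knob that may help on ZFS-on-Linux systems (assuming your distro ships the usual /etc/default/zfs file; check your version) is telling the boot-time import scripts where to search for devices:

# /etc/default/zfs  (read by the ZoL import init/systemd scripts)
# Search by-id first when importing pools at boot:
ZPOOL_IMPORT_PATH="/dev/disk/by-id"

The next boot should then look under /dev/disk/by-id when importing the pool.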
 

madbrain

Active Member
Jan 5, 2019
Thanks. It succeeded.

root@SERVER10G:~# zpool status
pool: array
state: ONLINE
scan: resilvered 172K in 0h0m with 0 errors on Tue Jan 8 23:28:30 2019
config:

NAME                                      STATE     READ WRITE CKSUM
array                                     ONLINE       0     0     0
  raidz2-0                                ONLINE       0     0     0
    ata-WDC_WD100EMAZ-00WJTA0_JEGLJXAN    ONLINE       0     0     0
    wwn-0x5000cca267c7c89e                ONLINE       0     0     0
    wwn-0x5000cca267c8561f                ONLINE       0     0     0
    wwn-0x5000cca267c89be6                ONLINE       0     0     0
    wwn-0x5000cca267c78fd4                ONLINE       0     0     0

errors: No known data errors
root@SERVER10G:~#

It's a little odd that one device has a different name than the other 4. There are two LSI controllers and one Intel (on the motherboard). I thought I had attached all the HDDs to the LSIs, but they may not all be on the same LSI card. There is one 9207-8i and one 9207-4i4e.
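One way to check which controller each drive actually sits on is the by-path symlinks, which encode the PCI address of the HBA and the port behind it (again just ls, nothing ZFS-specific):

# Show which HBA and port each disk is attached to
ls -l /dev/disk/by-path/ | grep -v -- -part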
 

madbrain

Active Member
Jan 5, 2019
BLinux & msg7086, thank you. I think you solved my problems. No more inconsistencies on boot, warm or cold. The ZFS volume now always imports. And no more disconnects / bad checksums / resilvers so far.
 

BLinux

cat lover server enthusiast
Jul 7, 2016
artofserver.com
Yeah... in /dev/disk/by-id there are many symlinks that let you find the same block device in different ways. For SATA drives I much prefer the format:

ata-[BRAND]_[MODEL]-[SERIALNUMBER]

as that makes it easiest to confirm I'm pulling the correct drive when it's time to swap one. I don't know of a way to tell zpool to look only at ata-*, but one way I've used to force it onto the ata-* names is to delete all the wwn-0x* symlinks in /dev/disk/by-id/ and then do the "zpool import -d ...". That usually makes zpool find only the ata-* symlinks, and then everything is consistent. Once you reboot, the wwn-0x* symlinks get re-created anyway. (I'm sure there must be a command to re-create all the symlinks, but I don't know it.)

For SAS drives there are no symlinks in /dev/disk/by-id in the ata-* style, so I usually use /dev/disk/by-path instead; that tells me which HBA and port the faulty drive is on, which is enough to narrow it down to the physical HDD slot. I wish I knew how to have it make symlinks like:

sas-[BRAND]_[MODEL]-[SERIAL]

That would be nice... and I'm sure there is a way; I just don't know it.
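A sketch of that workflow with the pool name from this thread filled in; the udevadm call at the end is one way to get the deleted wwn-* symlinks back without waiting for a reboot (exact behaviour depends on your udev rules, so treat it as an assumption):

# Force the pool onto the ata-* names (run while nothing is using the pool)
zpool export array
rm /dev/disk/by-id/wwn-0x*            # symlinks only; udev will recreate them
zpool import -d /dev/disk/by-id array

# Recreate the deleted symlinks without rebooting
udevadm trigger --subsystem-match=block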
 

madbrain

Active Member
Jan 5, 2019
Thanks again. I followed your instructions, and now all of the drives in the pool show up with the ata-[BRAND]_[MODEL]-[SERIALNUMBER] format.
 

EffrafaxOfWug

Radioactive Member
Feb 12, 2015
As an aside... does ZFS not write its metadata to the drives, then...? Any number of things can transpire to change /dev/sdX to /dev/sdY, and I didn't think anything relied on persistent device names any more.

If that's the case, doesn't that also mean that if you use by-path IDs, the pool will fail to import if you move a drive to a different HBA or PCI slot?