Anyone experienced with storage messages on Dell servers? Unsure how to interpret.

frogtech · Mar 22, 2018

So I have a VRTX enclosure, 25 SFF version, with one Shared PERC8 controller and 1 expander backplane module.

All firmware is up to date; however, I am not sure what is going on here.

I'm using Hitachi HUC109045CS600 hard drives (450GB, SFF SAS). I put them in a RAID 10 and initialize; however, every time, either during initialization or after it completes, this message shows up under "Recently Logged Storage Events":

BAT1003 Wed Mar 21 22:00:22 2018 The battery on Shared PERC8 learn cycle has completed.
VDR93 Wed Mar 21 20:37:39 2018 Virtual Disk 0 bad block medium error is cleared.
VDR23 Wed Mar 21 20:37:37 2018 Initialization of Virtual Disk 0 has completed.
BAT1002 Wed Mar 21 19:49:17 2018 The battery on Shared PERC8 learn cycle has started.
BAT1027 Wed Mar 21 19:49:17 2018 The battery on Shared PERC8 completed a charge cycle.
VDR11 Wed Mar 21 19:45:48 2018 Virtual Disk 0 has started initializing.
VDR3 Wed Mar 21 19:43:41 2018 Redundancy normal on Virtual Disk 0.
VDR98 Wed Mar 21 19:43:39 2018 Virtual Disk 0 has switched active controllers. Its active path is now through Chassis Integrated RAID 1.
VDR4 Wed Mar 21 19:43:24 2018 Virtual Disk 0 was created.
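
The list above is newest-first, which makes it hard to see what VDR93 follows. Here is a minimal sketch, in Python, that puts the same lines in chronological order and measures the gap between the end of initialization (VDR23) and the VDR93 message. The names `parse_events` and `LINE_RE` are made up for illustration, and the line format is assumed from the entries pasted above, not from any Dell documentation.

```python
import re
from datetime import datetime

# Event lines copied from the "Recently Logged Storage Events" list in this post.
EVENTS = """\
BAT1003 Wed Mar 21 22:00:22 2018 The battery on Shared PERC8 learn cycle has completed.
VDR93 Wed Mar 21 20:37:39 2018 Virtual Disk 0 bad block medium error is cleared.
VDR23 Wed Mar 21 20:37:37 2018 Initialization of Virtual Disk 0 has completed.
BAT1002 Wed Mar 21 19:49:17 2018 The battery on Shared PERC8 learn cycle has started.
BAT1027 Wed Mar 21 19:49:17 2018 The battery on Shared PERC8 completed a charge cycle.
VDR11 Wed Mar 21 19:45:48 2018 Virtual Disk 0 has started initializing.
VDR3 Wed Mar 21 19:43:41 2018 Redundancy normal on Virtual Disk 0.
VDR98 Wed Mar 21 19:43:39 2018 Virtual Disk 0 has switched active controllers.
VDR4 Wed Mar 21 19:43:24 2018 Virtual Disk 0 was created.
"""

# Assumed layout: <code> <weekday> <month> <day> <HH:MM:SS> <year> <message>
LINE_RE = re.compile(
    r"^(?P<code>[A-Z]+\d+)\s+"
    r"(?P<when>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s*"
    r"(?P<text>.*)$"
)


def parse_events(raw):
    """Return (timestamp, code, message) tuples sorted oldest-first."""
    events = []
    for line in raw.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like an event line
        when = datetime.strptime(match.group("when"), "%a %b %d %H:%M:%S %Y")
        events.append((when, match.group("code"), match.group("text")))
    return sorted(events)


events = parse_events(EVENTS)
for when, code, text in events:
    print(when.strftime("%H:%M:%S"), code, text)

# How long after "initialization completed" did VDR93 get logged?
by_code = {code: when for when, code, _ in events}
gap_seconds = (by_code["VDR93"] - by_code["VDR23"]).total_seconds()
print("VDR93 logged", gap_seconds, "seconds after VDR23")
```

Sorted this way, VDR93 sits right after the initialization-complete event, not in the middle of the run. That ordering alone does not say which physical disk, if any, was involved.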

The message I am concerned with is VDR93, "Virtual Disk 0 bad block medium error is cleared." I've seen this come up on virtual disks where it was preceded by a message saying "A SCSI device experienced an error, but may have recovered", but in this instance I'm not seeing that message.

I'm not really sure what to make of the message. I've googled it, but it's still unclear to me whether it means that some underlying physical disk is experiencing errors.

An extended detail about the error: "This alert message occurs if there are medium errors in the physical drives."

Can someone clarify? The storage logs do not reference a particular physical disk. These are not Dell-branded drives; however, as far as I know, they're not special-purpose Fibre Channel drives or drives from an EMC, NetApp, or IBM disk array.

I am wondering if this error is just generically occurring on these drives, not because of a physical issue but because the drives most likely have generic or third-party firmware?

Thank you.

marcoi · Mar 22, 2018

I'm not sure about the error, but you can try a few troubleshooting steps.
Can you create two arrays, splitting the disks into two groups, and then perform the initialization on each? See if the same error appears on both groups or just one. If both, it might be a general message. If only one group, then you know one of the drives in that group might be having issues. The next step would be to find the drive with issues and remove it from the RAID set.

rootgremlin · Mar 22, 2018

This error message, "Virtual Disk 0 bad block medium error is cleared.", sounds a lot like a SMART sector remap or reallocation.

smartctl -a /dev/yourdiskdevice will tell you 2 important values:
reallocated sector count (already defective sectors that the drive remapped to its internal reserved area) and
current pending sectors (sectors the drive could not reliably read; usually fixed by writing to that sector)

SAS disks usually run SMART tests themselves (or you can start one with the command)
smartctl --test=long /dev/yourdiskdrive

When this completes without error, the drive surface (every sector) is OK.

The long test tries to read every sector and writes a message about the first sector it could not read to the drive's error log.
In this state the current pending sector counter increases by one.

Then you could try to "revive" the sector by writing to it. If that works, all is OK and the current pending sector counter goes back to 0.
If it does not work, that sector gets remapped to a spare area on the disk and the reallocated sector count increases by one.
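
To make the counter behaviour described above concrete, here is a toy model in Python. It only mirrors the description in this post; it is not how any particular drive firmware works, and `SectorCounters`, `read_failed` and `rewrite` are made-up names. One assumption beyond the text: when a rewrite fails and the sector is remapped, the pending counter is assumed to drop as well.

```python
from dataclasses import dataclass


@dataclass
class SectorCounters:
    """Toy model of the two counters discussed above (hypothetical names)."""

    pending: int = 0      # sectors the drive could not read reliably
    reallocated: int = 0  # sectors remapped to the spare area

    def read_failed(self):
        """A read (for example during the long test) hits an unreadable sector."""
        self.pending += 1

    def rewrite(self, write_ok):
        """Overwrite one pending sector: it either recovers or gets remapped."""
        if self.pending == 0:
            return  # nothing is pending, so nothing changes
        self.pending -= 1  # assumption: the sector leaves "pending" either way
        if not write_ok:
            self.reallocated += 1  # write failed, sector moved to spare area


# Case 1: the sector is revived by writing to it.
revived = SectorCounters()
revived.read_failed()
revived.rewrite(write_ok=True)

# Case 2: the write fails and the sector is remapped.
remapped = SectorCounters()
remapped.read_failed()
remapped.rewrite(write_ok=False)

print(revived)
print(remapped)
```

In the first case both counters end at zero; in the second the reallocated count is what grows over time on a drive that is wearing out.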

frogtech · Mar 22, 2018

For whatever reason I'm not able to get SMART info from these disks. I'm not sure if it has to do with the adapter or the backplane. The backplane itself is SATA on both the front and rear ports, and the adapter is an H310 flashed to IT mode; however, the disk itself is definitely SAS.

When I run the long test I get this error:

Long (extended) offline self test failed [unsupported field in scsi command]

It also looks like the built-in Windows 10 SMART cmdlets in PowerShell are reporting much less info than they should. It must be a controller or interfacing issue... According to the article "Determining disk health using Windows PowerShell on Windows Server 2012 and Windows 8", I should be getting a lot more output from Get-StorageReliabilityCounter.

These disks don't even show up in CrystalDiskInfo; only my SATA SSD and NVMe SSD show... weird.
