FreeNAS error


azev

Well-Known Member
Jan 18, 2013
Recently I have been battling a FreeNAS error that randomly shows up in the logs (pic attached).
The drives have a clean bill of SMART health, and both short & extended tests passed without any issue.
I went as far as secure-erasing all the drives and rebuilding the pool, which helped for a week or two before the error returned. Below are the specs of my build:

Supermicro 826 case with SAS3 backplane
X9DRD-7LN4F with 2x E5-2680v2
256GB RAM
12 x HUSMM1680ASS201 (IBM OEM 98Y6325)
Emulex 14102, 2x10Gb NIC

This FreeNAS box is the shared storage (iSCSI) for 4 ESXi nodes.
From what I can tell, the error does not cause any data corruption or any issues in the VMware environment, but nonetheless I'd like to get to the bottom of it.

Does anyone have any ideas where I should look? I am planning to upgrade to FreeNAS 11.2 to see if that fixes it, but I thought I'd ask around before I do.

Thanks.
 

Attachments

marcoi

Well-Known Member
Apr 6, 2013
Gotha Florida
Did you do the basic troubleshooting? Power off, check cables, check RAM, etc.?

Also, it seems like some kind of security issue. Does your iSCSI have any permissions set up?

When the pool degrades, is it the same drives?
 

azev

Well-Known Member
Jan 18, 2013
Before I put the system together I ran a bunch of tests on the hardware itself: memory, CPU, even the drives.
As far as cables, I bought original Supermicro cables for this build, which are known to be among the best you can get in this diluted market.
I also think a bad cable would show itself pretty early and throw errors constantly.

There is no security set up in my iSCSI implementation: no CHAP, no drive permissions, etc.

Here's the weird part: when the issue first materialized, it was mirror-0 that was having problems. Once I started troubleshooting, the problem would appear randomly against a different mirror set. The last troubleshooting step was to delete the pool, secure-erase all the drives, and rebuild the pool. After about 2 weeks the issue appeared again, but this time on the last two mirror pairs (mirror-4 & mirror-5), which have the least clocked usage (both reads and writes). I am pretty stumped on what to do next; maybe it is a bad backplane, but I am not sure that's the case.
I had a bad backplane channel before, and it would continuously throw errors whenever I put load on the array. After the last secure erase I ran many tests & benchmarks to make sure the issue was resolved, and I saw no errors until today.
 

marcoi

Well-Known Member
Apr 6, 2013
Gotha Florida
It might be a good idea to update FreeNAS to the latest version as a next step.

Also, did you run smartctl -a on the degraded drives to see if it provides any details?
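For reference, a quick way to pull the SMART error details for every pool member at once is a small loop. A minimal sketch, assuming device names da0, da1, da2 (match them to your actual pool members with zpool status or glabel status first):

```shell
# Device names below are assumptions; adjust to your pool members.
for d in da0 da1 da2; do
  echo "=== /dev/${d} ==="
  # Full SMART dump, filtered down to error/defect counters and log entries.
  smartctl -a "/dev/${d}" 2>/dev/null | grep -Ei 'error|defect|uncorrect' || true
done
```

The `|| true` keeps the loop going even when a drive returns nothing matching, so one quiet drive doesn't stop the scan.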
 

azev

Well-Known Member
Jan 18, 2013
So I resilvered the pool and then updated FreeNAS to 11.2 yesterday... It has been a day since then and the error has not returned, regardless of how much artificial load I put on the pool. Hopefully this upgrade is the silver bullet for the issue I've been experiencing.
 

azev

Well-Known Member
Jan 18, 2013
I spoke too soon; today when I checked the log I saw the same issue again, on a different mirror pair.
I did some more investigation, and it appears this may have something to do with the TCG-encryption-capable SSDs I am using.

Data Protect
A Data Protect sense key is returned when the device is working but locked: either a physical write lock or, for data-at-rest encryption, when the device or band has not yet been unlocked.

I ran sedutil-cli --scan, but the result was NO, which indicates these are not SED drives.
Then I ran sedhelper unlock:

root@freenas:~ # sedhelper unlock
All SED disks unlocked

Let's see if this command actually does anything.
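For what it's worth, --scan only prints a one-line summary per drive; querying each device individually shows the full locking state (Locked / LockingEnabled flags). A sketch, where the device paths are assumptions:

```shell
# Hypothetical device list; sedutil-cli --query reports the Opal feature
# set and locking flags for a single drive.
for d in /dev/da0 /dev/da1; do
  echo "--- ${d} ---"
  sedutil-cli --query "${d}" 2>/dev/null || echo "no Opal response from ${d}"
done
```

If every drive gives no Opal response, that would line up with the --scan result and point away from SED locking as the cause of the Data Protect errors.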
 

azev

Well-Known Member
Jan 18, 2013
I am still battling this issue. I've done everything I can think of, all the way to replacing the SAS controller and fitting new SAS cables.
The drives I am using are 12 x IBM 98Y6325 (HGST HUSMM1680ASS201), and I have no issue writing to each individual drive using dd. However, when I run dd against the pool (dd if=/dev/urandom of=/mnt/ssd/ddtest bs=1M), the errors pop up randomly.
There is no consistency as to when an error shows up: sometimes as early as within a few hundred gigabytes written, sometimes after 1-2TB, and other times almost all the way to the end. It shows up on random drives too, never the same one.
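To help separate drive problems from path/backplane problems, one approach is to run the same sequential fill against each target in isolation and note where errors appear. A minimal sketch; TARGETS is an assumption you would point at the raw devices (destructive!) or at files on the mounted pool, and the default below writes only a small scratch file so it is safe to run as-is:

```shell
# TARGETS is a placeholder: set it to "/dev/da0 /dev/da1 ..." for a raw
# per-drive test (destroys data!) or to files on the pool. The default
# writes a harmless 4 MiB scratch file.
TARGETS="${TARGETS:-/tmp/ddtest.bin}"
for t in $TARGETS; do
  echo "filling ${t}"
  # Small count for the sketch; raise or drop count= for a real burn-in.
  dd if=/dev/urandom of="${t}" bs=1M count=4 2>&1 | tail -1
done
```

If errors follow a particular slot rather than a particular drive when you shuffle disks between bays, that would implicate the backplane channel rather than the drives.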

SMART data on all drives shows healthy; half of them have about 500TB of writes on the odometer, and the other half only about 10-20TB.

@marcoi you mentioned checking RAM. I have not done that, but could RAM cause a SCSI write-protect issue? IPMI has never reported any ECC errors, as it did when I had bad RAM in the past.
I am reaching out to Supermicro support to see if they will help me upgrade the SAS3 backplane, but their initial response was that this will not solve the issue.

If I really want to spend more time troubleshooting this, I can try swapping the RAM, putting the drives in a completely different server with a different backplane, or, as a last resort, buying a different set of drives.

Does anyone have suggestions for what else I can try?
 

Rand__

Well-Known Member
Mar 6, 2014
Sounds to me like a (RAID controller) firmware issue, especially given the "Once I start troubleshooting the issue, the problem would appear randomly against a different mirror set."
Can you test an older/newer firmware release?
I assume you can't update the drive firmware? Did your pre-deployment tests of the drives involve the same or a similar amount of data?
 

azev

Well-Known Member
Jan 18, 2013
@Rand__ the HBA I used is an LSI 9341-8i, initially flashed to IT firmware, and I tried different firmware releases from P13 to P16 with the same result.
I also bought an IBM M1215, converted it to LSI IT firmware P16, and added a new Supermicro SAS cable, and the issue remains.

As for pre-deployment testing, before I build the pool I usually run a few rounds of dd if=/dev/urandom of=/dev/daX bs=1M; all drives completed successfully, and I validated the SMART data after the tests were done.
 

Rand__

Well-Known Member
Mar 6, 2014
urandom is rather slow, so it might not tax the drives fully.
But then it's not likely to be an HBA issue either, so if you are (more or less) the only one with this issue and you have replaced everything except the drives, then it must be the drives...
 

azev

Well-Known Member
Jan 18, 2013
768
251
63
One last thing to try is to move the drives to another server and run a similar test. I think I'll do that sometime this weekend. If the problem persists on a different server with different but similar hardware, then it is probably the drives. Maybe the firmware is locked to a specific controller.
 

azev

Well-Known Member
Jan 18, 2013
I decided to do one more thing right now: I installed Windows Server and used an Adaptec RAID card to build a RAID 10 array. The build process completed successfully, and I am now running a tool (disk-filltest) to fill the array with random data. I am curious whether the issue will persist in Windows with a RAID controller.