Freenas error

Discussion in 'FreeBSD and FreeNAS' started by azev, Feb 15, 2019.

  1. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    561
    Likes Received:
    137
    Recently I have been battling a FreeNAS error that randomly shows up in the logs (pic attached).
    The drives have a clean bill of SMART health, and both the short and extended tests passed without any issue.
    I went as far as secure-erasing all the drives and rebuilding the pool, which helps for a week or two before the error returns. Below is the spec of my build:

    Supermicro 826 case with SAS3 backplane
    X9DRD-7LN4F with 2x E5-2680v2
    256GB RAM
    12 x HUSMM1680ASS201 (IBM OEM 98Y6325)
    Emulex 14102, 2x10Gb NIC

    This FreeNAS box is the shared storage (iSCSI) for 4 ESXi nodes.
    From what I can tell, the error does not cause any data corruption or any issue in the VMware environment, but nonetheless I'd like to get to the bottom of it.

    Does anyone have any ideas where I should check? I am planning to upgrade to FreeNAS 11.2 to see if that fixes it, but I thought I would ask around before I do that.

    Thanks.
     

    #1
  2. marcoi

    marcoi Well-Known Member

    Joined:
    Apr 6, 2013
    Messages:
    1,136
    Likes Received:
    172
    Did you do the basic troubleshooting? Power off, check cables, check RAM, etc.?

    It also seems like some kind of security issue. Does your iSCSI setup have any permissions configured?

    When the pool degrades, is it always the same drives?
     
    #2
  3. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    561
    Likes Received:
    137
    Before I put the system together I ran a bunch of tests on the hardware itself: memory, CPU, even the drives.
    As for cables, I bought original Supermicro cables for this build, which are known to be among the best you can get in this diluted market.
    I also think a bad cable would show itself pretty early and throw errors constantly.

    There is no security set up in my iSCSI implementation: no CHAP, no drive permissions, etc.

    Here's the weird part: when the issue first appeared, it was mirror-0 that was having the problem. Once I started troubleshooting, the problem would appear randomly against a different mirror set. The last troubleshooting I did was to delete the pool, secure erase all the drives, and rebuild the pool. After about 2 weeks the issue appeared again, but this time on the last 2 mirrors (mirror-4 & mirror-5), whose drives have the least clocked usage (both reads and writes). I am pretty stumped on what to do. Maybe it is a bad backplane, but I am not too sure that is the case.
    I had a bad backplane channel before, and it would continuously throw errors whenever I put load on the array. After the last secure erase I ran many tests and benchmarks to make sure the issue was resolved, and I saw no errors until today.
     
    #3
  4. marcoi

    marcoi Well-Known Member

    Joined:
    Apr 6, 2013
    Messages:
    1,136
    Likes Received:
    172
    As a next step, it may be a good idea to update FreeNAS to make sure you are running the latest version.

    Also, did you run smartctl -a on the degraded drives to see if it provides any details?
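A minimal sketch of the smartctl check suggested above, for anyone following along. It trims `smartctl -a` output to a few lines that tend to matter on SAS drives. The function name `smart_summary` is hypothetical, the field labels are assumed to match smartctl's SCSI report, and the device names da0..da11 are only a guess at how the 12 drives enumerate.

```shell
#!/bin/sh
# Filter `smartctl -a` output (read from stdin) down to a few key lines.
# Assumption: the drive reports in smartctl's SCSI/SAS format, where these
# field labels appear; ATA drives use different labels.
smart_summary() {
    grep -E 'Serial number|SMART Health Status|Elements in grown defect list|Non-medium error count'
}

# Usage on the storage box (device names da0..da11 are assumed):
#   for n in $(seq 0 11); do
#       echo "== da$n =="
#       smartctl -a "/dev/da$n" | smart_summary
#   done
```

Comparing the summaries side by side makes it easier to spot one drive whose counters differ from the rest, though a clean summary does not rule the drive out.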
     
    #4
  5. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    561
    Likes Received:
    137
    So I resilvered the pool and then updated FreeNAS to 11.2 yesterday. It has been a day since then, and the error has not returned regardless of how much artificial load I put on the pool. Hopefully this upgrade is the silver bullet for the issue I experienced.
     
    #5
  6. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    561
    Likes Received:
    137
    I spoke too soon. Today when I checked the log I saw the same issue again on a different mirror pair.
    After some more investigation, it appears this may have something to do with the TCG encryption-capable SSDs I am using.

    Data Protect
    Data Protect is received when the device is working but locked, either a physical write lock or for Data-at-Rest encryption when the device was not yet unlocked or the band was not yet unlocked.

    I ran the command sedutil-cli --scan, but the result was NO, which indicates these are not SED drives.
    Then I ran sedhelper unlock:

    root@freenas:~ # sedhelper unlock
    All SED disks unlocked

    Let's see if this command actually does anything.
     
    #6
  7. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    561
    Likes Received:
    137
    I am still battling this issue. I've done everything I can think of, all the way to replacing the SAS controller and fitting new SAS cables.
    The drives I am using are 12 x IBM 98Y6325 Hitachi HGST SSD HUSMM1680ASS201, and I have no issue writing to each individual drive using dd. However, when I run dd against the pool (dd if=/dev/urandom of=/mnt/ssd/ddtest bs=1M), the errors pop up randomly.
    There is no consistency as to when the error shows up: sometimes within a few hundred GB of writes, sometimes after 1-2TB written, and other times almost all the way to the end before it starts popping up. It shows up on a random drive too, never the same one.
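To check whether the errors really land on random drives, the log can be tallied per device. This is a rough sketch, not a tested recipe for this box: the function name `tally_data_protect` is hypothetical, and the log line shape in the comment is an assumption based on how FreeBSD's CAM layer usually prints sense data, since the original screenshot is not reproduced in the thread.

```shell
#!/bin/sh
# Count DATA PROTECT sense errors per device from kernel log text on stdin.
# Assumed log line shape (not taken from the original screenshot):
#   (da3:mpr0:0:11:0): SCSI sense: DATA PROTECT asc:27,0 (Write protected)
tally_data_protect() {
    grep 'DATA PROTECT' \
        | sed -n 's/^.*(\(da[0-9][0-9]*\):.*$/\1/p' \
        | sort | uniq -c | sort -rn
}

# Usage (log path is the usual FreeBSD location, assumed here):
#   tally_data_protect < /var/log/messages
```

Each output line is a count followed by a device name, highest count first. An even spread across all devices would point away from any single drive, while a cluster on drives sharing a backplane lane or HBA port would be worth chasing.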

    SMART data on all drives shows healthy. Half of them have about 500TB of writes on the odometer, and the other half only about 10-20TB.

    @marcoi you mentioned checking RAM. I have not done that, but could RAM cause a SCSI write protect issue? IPMI never reported any ECC errors like it did when I had bad RAM in the past.
    I am reaching out to Supermicro support to see if they would help me upgrade the SAS3 backplane, but their initial response was that this will not solve the issue.

    I guess if I really want to spend more time troubleshooting this, I can try swapping the RAM, or put the drives in a completely different server with a different backplane, or as a last resort buy a different set of drives.

    Does anyone have any suggestions for what else I can do to troubleshoot this?
     
    #7
  8. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,048
    Likes Received:
    428
    Sounds to me like a (RAID controller) firmware issue, especially given "Once I start troubleshooting the issue, the problem would appear randomly against a different mirror set."
    Can you test an older/newer firmware release?
    I assume you can't update the drive firmware? Did your pre-deployment tests of the drives involve the same or a similar amount of data?
     
    #8
  9. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    561
    Likes Received:
    137
    @Rand__ the HBA I used initially was an LSI 9341-8I flashed to IT firmware, and I tried different firmware versions from P13 to P16 with the same result.
    I then bought an IBM M1215, converted it to LSI IT firmware P16, and fitted a new Supermicro SAS cable, and the issue remains.

    As for pre-deployment testing: before I build the pool, I usually run a few rounds of dd if=/dev/urandom of=/dev/daX bs=1M on each drive. All drives completed successfully, and I validated the SMART data after the tests were done.
     
    #9
  10. Rand__

    Rand__ Well-Known Member

    Joined:
    Mar 6, 2014
    Messages:
    3,048
    Likes Received:
    428
    urandom is rather slow, so it might not tax the drives fully.
    But then it's not likely to be an HBA issue, so if you are (more or less) the only one with this issue and you have replaced everything except the drives, then it must be the drives...
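One way around the /dev/urandom bottleneck is to generate a single random block once and then write it to the pool repeatedly, comparing each copy as it lands. The sketch below is only an illustration under stated assumptions: the function name `fill_and_verify`, the `copy.N` file names, and the 64 MiB default are made up for this example, and the target directory is assumed to be a dataset without ZFS dedup (identical copies would otherwise collapse into one).

```shell
#!/bin/sh
# Write COUNT copies of one pre-generated random block into DIR and compare
# each copy with the original. Returns non-zero on a failed write or mismatch.
# Hypothetical helper: names, file layout and default size are illustrative.
fill_and_verify() {
    dir=$1; count=$2; size_mb=${3:-64}
    # Pay the /dev/urandom cost only once, for a seed file kept off the pool.
    seed=$(mktemp /tmp/seed.XXXXXX) || return 1
    dd if=/dev/urandom of="$seed" bs=1048576 count="$size_mb" 2>/dev/null
    rc=0; i=0
    while [ "$i" -lt "$count" ]; do
        copy="$dir/copy.$i"
        if ! dd if="$seed" of="$copy" bs=1048576 2>/dev/null \
            || ! cmp -s "$seed" "$copy"; then
            echo "write or verify failed: $copy" >&2
            rc=1; break
        fi
        i=$((i + 1))
    done
    rm -f "$seed"
    [ "$rc" -eq 0 ] && echo "verified $count copies of $size_mb MiB"
    return "$rc"
}

# Usage (pool path taken from the thread; the subdirectory is assumed):
#   mkdir -p /mnt/ssd/filltest && fill_and_verify /mnt/ssd/filltest 1000 1024
```

The read-back may be served from cache rather than from the drives, so this mainly exercises the write path, which is where the errors in this thread appear.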
     
    #10
  11. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    561
    Likes Received:
    137
    One last thing to try is to move the drives to another server and run a similar test. I think I am going to do that sometime this weekend. If the problem persists on a different server with different but similar hardware, then it is probably the drives. Maybe the firmware is locked to a specific controller.
     
    #11
  12. azev

    azev Active Member

    Joined:
    Jan 18, 2013
    Messages:
    561
    Likes Received:
    137
    I decided to do one more thing right now: I installed Windows Server and used an Adaptec RAID card to build a RAID 10 array. The build completed successfully, and I am now running a tool (disk-filltest) to fill the drives with random data. I am curious whether the issue persists in Windows with a RAID controller.
     
    #12