Performance tuning three monster ZFS systems

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by the spyder, Oct 1, 2014.

  1. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Just a few weeks back I was asked to put together a large amount of storage for a last-minute project. I usually avoid projects like this, but it's a special case, and so far it has gone as planned (knock on wood). Since I'm waiting on a few external delays outside my control, I figured there's no better time than now to do some performance tuning. I'm working on a quick set of repeatable tests that best represent our usage. Until now I have relied on bonnie/iostat/dd bench/CrystalDiskMark. I've read pretty much everything I can find and wanted to get an outside opinion.

    Here are the three systems' specs:
    Processing
    (1) Supermicro 2U 24-bay chassis
    (2) Xeon E5-2620 v2s
    (16) 32GB DDR3-1866
    (3) Supermicro 3008-based internal HBAs flashed with IT-mode firmware
    (2) LSI 9300-8E HBAs
    (1) Mellanox ConnectX-3 dual-port QDR IB
    (22) 1TB Samsung 850 Pros
    (4) 256GB Samsung 850 Pros
    (2) Supermicro 45-bay JBODs (single expander)
    (90) WD RE4 4TB 7200rpm enterprise SATA

    Rpool:
    (2) 256GB, mirrored

    SSD pool (9.4TB formatted):
    (10) mirrored 1TB vdevs
    (1) 256GB ZIL drive
    (1) 1TB spare


    Spindle pool: 157TB
    (44) mirrored 4TB vdevs
    (2) 4TB spares
    (1) 256GB ZIL
    (1) 1TB L2ARC
    Capacity limited to 90%
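
    A mirrored layout like this maps onto zpool syntax roughly as follows (a sketch with placeholder device names, not the actual build commands):

```shell
# Hypothetical sketch of the processing spindle pool (placeholder
# cXtYd0 device names; the real pool has 44 mirror pairs, not 2).
zpool create tank \
  mirror c1t0d0 c2t0d0 \
  mirror c1t1d0 c2t1d0 \
  log c3t0d0 \
  cache c3t1d0 \
  spare c1t22d0 c2t22d0
# Splitting each mirror across two controllers/backplanes means the
# pool can survive a controller failure as well as a disk failure.
```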

    Archival (2x)
    (1) Supermicro 2U 24-bay chassis
    (2) Xeon E5-2620 v2s
    (16) 32GB DDR3-1866
    (1) Supermicro 3008-based internal HBA flashed with IT-mode firmware
    (4) LSI 9300-8E HBAs
    (1) Mellanox ConnectX-3 dual-port QDR IB
    (2) 1TB Samsung 850 Pros
    (4) 256GB Samsung 850 Pros
    (4) Supermicro 45-bay JBODs (single expander)
    (180) WD RE4 4TB 7200rpm enterprise SATA

    Rpool:
    (2) 256GB, mirrored

    Spindle pool: 475TB formatted
    (22) 8x4TB RAID-Z2 vdevs
    (4) 4TB spares
    (2) 256GB mirrored ZIL
    (2) 1TB L2ARC
    Capacity limited to 90%
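
    The archival layout maps onto zpool syntax roughly like this (a sketch with placeholder device names; the real pool repeats the raidz2 vdev 22 times):

```shell
# Hypothetical sketch of one archival pool (placeholder device names).
zpool create archive \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
  log mirror c5t0d0 c5t1d0 \
  cache c5t2d0 c5t3d0 \
  spare c1t8d0 c1t9d0
# Each 8-disk raidz2 vdev contributes 6 data disks of capacity:
# 22 vdevs x 6 x 4TB = 528TB of raw data space, ~475TB formatted.
```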

    OS: Solaris 11.2 (or possibly OmniOS, I'm going to play with it tomorrow.)

    There are a few important notes: (1) We were limited by what we could order due to the time frame. This caused major issues: the Intel SSDs and SAS hard drives I originally specced ended up being weeks out from our deadline. The WD RE4s and Samsung 850s were the only drives available. (2) The original project requirement was 1PB of storage. After speaking with the teams using this storage and doing a quick analysis, I decided to split it into the three systems above, mainly because of how they move data as it is processed. The first archival machine is really an input data server, where everything from the field is uploaded and organized. The processing data server is where larger groupings are copied, broken into smaller chunks, and processed off the SSD array by an attached 240-core/1.5TB cluster. The last machine is an output directory, where everything is QA'd and copied off for delivery. It's a complicated process with several data moves, but if you could see how they are doing it now, this is 100x better.

    I'm building the arrays as I write this, let me know what you think or what benchmarks you would like to see.
     
    #1
    Last edited: Oct 1, 2014
  2. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,185
    Likes Received:
    708
    Impressive project!

    What you may consider:
    - if sync is disabled, a ZIL is not used/necessary
    - I would add more RAM than 32 GB (up to 128 GB)
    - limiting capacity to 90% (= 10% initial pool reservation) is OK
    - with SSDs, you may add an extra 5-10% overprovisioning to keep write performance high under load (create a host protected area on new SSDs), so you do not need to care about it during usage
    - use RAID-Zn vdevs built from 4 or 8 data disks (6 or 10 disks per Z2 vdev)
    - in the case of SSDs you may use Z2 instead of mirrors, as the iops of SSDs do not really require mirrors
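
    The host-protected-area trick can be sketched like this; it has to be done before the SSD ever holds data, and hdparm is a Linux-side tool (the sector count below is a placeholder for a typical 256GB drive):

```shell
# Placeholder native sector count for a 256GB SSD; read the real value
# with: hdparm -N /dev/sdX
total_sectors=500118192
# Keep 90% visible; the hidden ~10% becomes extra overprovisioning
# the controller can use to keep write performance up under load.
visible=$(( total_sectors * 90 / 100 ))
echo "$visible"    # -> 450106372
# On a fresh drive, clip the visible size persistently ('p' prefix):
#   hdparm -N p$visible /dev/sdX
```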

    What services are you using (e.g. iSCSI, SMB, NFS)?
    Service performance from a client (for example Windows, CrystalDiskMark via iSCSI) or via SMB or NFS (sequential and iops), or box-to-box performance over IB, would be of interest. (Local raw pool performance should be "more than enough".)
     
    #2
  3. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Hi Gea!

    You will be happy to know we purchased Pro licensing for this system.
    1) I set sync to standard. I had initially disabled it for a quick comparison and forgot to re-enable it.
    2) Each system has 512GB (16x32GB).
    3) I left the default 10% for now.
    4) I'm planning on only allowing 75% on the SSD pools due to the concerns you mentioned. Thanks for the tip, I was unaware I could create a host protected area.
    5) I used 8-disk RAID-Z2 groups for the archival storage systems and 2-disk mirrors for the processing storage system's spindle drives.
    6) I will benchmark both Mirror and RZ2 on the SSD pool and report back.

    The system will be accessed mainly via NFS and some SMB. I'm doing my initial testing based on your tuning guide and will post the results.
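
    For reference, the settings being tested map onto zfs properties along these lines (pool name is a placeholder):

```shell
# Hypothetical property setup for an NFS/SMB-served pool named "tank".
zfs set compression=on tank     # compression on
zfs set sync=standard tank      # honor NFS clients' sync requests
zfs set atime=off tank          # common tuning: skip access-time writes
# Verify what is in effect:
zfs get compression,sync,atime tank
```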

    Here's a quick shot of it being burnt in.
    [burn-in photo]
     
    #3
    T_Minus, Patrick and Chuntzu like this.
  4. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,185
    Likes Received:
    708
    In the past it was not recommended to use more than 128GB of RAM.
    I am not sure if this is still a problem with current OmniOS/Solaris 11.2, as I have not used that much RAM and have not heard of anyone using more than 128GB. You may need to ask at Oracle/OmniTI.

    Nex7's Blog: ZFS: Read Me 1st
     
    #4
  5. legen

    legen Active Member

    Joined:
    Mar 6, 2013
    Messages:
    195
    Likes Received:
    34
    Nice one. When accessing this through NFS, why don't you go with something like a ZeusRAM to speed up sync writes (or are all your workloads async)?
     
    #5
  6. rubylaser

    rubylaser Active Member

    Joined:
    Jan 4, 2013
    Messages:
    842
    Likes Received:
    229
    This is a ridiculous build. I can't wait to see some benchmarks :)
     
    #6
    Patrick likes this.
  7. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Well,

    I was hoping to spend today doing some initial testing, but instead I ended up troubleshooting a very odd bug. On all three systems, mirroring the rpool caused the original drive to degrade with hundreds of checksum errors. Eventually both drives would report degraded and report checksum errors. fmadm reports the drives are not faulty, and scrubs return zero errors. I swapped drives, updated controller firmware, and reinstalled; the issue persisted. I'm not sure if this is a controller/drive bug or a Solaris 11.2 issue. I'm going to try OmniOS in the morning. I was able to get the processing array controller working after removing the degraded drive and re-adding it, but that system uses the built-in SATA controller on the motherboard to drive the two additional rear-mounted drives. The other two systems have AOC-3008-8i controllers flashed to the latest IT firmware.

    I dislike days like these, where you feel like you are chasing your tail.
     
    #7
  8. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Gea,

    Our existing systems use 192GB of RAM and show no errors/issues. I'm assuming that's because we treat them as giant NAS boxes. If it becomes an issue during testing, I will limit the ARC and hopefully change it back when it's patched.
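
    If it does become an issue, capping the ARC on Solaris/illumos is a one-line /etc/system setting; the 384GB figure below is just an illustrative cap for a 512GB box:

```shell
# Hypothetical ARC cap: hold the ARC to 384GB on a 512GB system.
# zfs_arc_max takes bytes, so compute the value first.
arc_max=$(( 384 * 1024 * 1024 * 1024 ))
echo "$arc_max"    # -> 412316860416
# Then add to /etc/system and reboot:
#   set zfs:zfs_arc_max = 412316860416
```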
     
    #8
  9. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,185
    Likes Received:
    708
    Thanks.
    There are not that many setups with more than 128GB of RAM around, so reports of remaining problems (or success) on Solaris 11.2 and current OmniOS are important, as the existing problem reports are more than a year old.
     
    #9
  10. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Sadly, I was unable to perform any testing today due to the OS drive issue. I tried every combination I could think of, but no matter what, mirroring the rpool caused checksum errors. In the end, it's a bug between the Supermicro AOC-3008 and the Samsung 850 Pro SSDs; IR or IT mode did not change a thing. The one system that uses onboard SATA ports works fine. I had high hopes for the 850 Pros, but for now they are not usable as OS drives with the SAS3 controller. I swapped the 850s for Intel S3500s and the issue went away. I'm actually concerned enough that I'm going to replace the ZIL drives with S3700s while I'm at it. I'll keep the 1TB 850s for L2ARC/SSD pool and monitor them closely. Hopefully next week I can do some testing before we install the systems at the customer's site.
     
    #10
    spazoid likes this.
  11. kroem

    kroem Active Member

    Joined:
    Aug 16, 2014
    Messages:
    238
    Likes Received:
    35
    (only here for the pics/benchmarks :p~~)
     
    #11
  12. lmk

    lmk Member

    Joined:
    Dec 11, 2013
    Messages:
    128
    Likes Received:
    20
    @the spyder thanks for all these detailed updates and information - invaluable!
     
    #12
  13. legen

    legen Active Member

    Joined:
    Mar 6, 2013
    Messages:
    195
    Likes Received:
    34
    I have not tested the Samsung 850 Pro, but we have tested the 840 Pro as a ZIL with very, very poor results. The 840 Pro simply has too high latency to work well as a ZIL. We actually got worse results with the 840 Pro as ZIL than without (on an SSD array).

    The S3500 or S3700 is a much better choice for a ZIL (look at their write latency in the datasheet).
     
    #13
  14. spazoid

    spazoid Member

    Joined:
    Apr 26, 2011
    Messages:
    91
    Likes Received:
    10
    Why do you want a SLOG for an SSD-based pool? Unless your SLOG device is considerably faster than the actual pool, there should be no noticeable difference in performance.
     
    #14
  15. wlee

    wlee New Member

    Joined:
    Aug 8, 2014
    Messages:
    20
    Likes Received:
    2
    I found it interesting that the Intel 730 has better write latency than the S3700, on paper at least.

    Is Intel the only vendor that publishes latency figures?
     
    #15
  16. PigLover

    PigLover Moderator

    Joined:
    Jan 26, 2011
    Messages:
    2,767
    Likes Received:
    1,110
    I would agree that the log device (ZIL/SLOG) adds little value to an SSD pool running striped or mirrored. But if the SSD pool is running in parity mode (RAID-Z/RAID-5, etc.), then the log device is actually very valuable to limit write events on the SSDs and improve their longevity. Writing to a parity RAID is a very sloppy affair, and the log device allows the writes to be safely cached, scheduled, and completed in rational units to limit the total number of writes that occur.
     
    #16
  17. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    So I spent the better part of last week fighting a combination of problems with one of the systems. Twenty-one drives appeared to have dropped over the course of the weekend, all on the same backplane and the same controller. Destroying the pool, moving the controllers around, and bam: still dropping. The drives continued to rack up errors under S: H: T: (soft/hard/transport) and would eventually drop offline. I moved the drives to a separate controller and JBOD, one that had no previous issues, and they still dropped. Removing the drives and testing them with the manufacturer's software showed them as healthy. As a last resort, I replaced the original controller and rebuilt the pool. No drives have gone offline in five days. I am not satisfied this is resolved, as the disks are still generating errors under S: H: T:, but they clear after a reboot and thus far have not failed during performance testing.
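
    For anyone following along, the S: H: T: counters come from the standard Solaris/illumos tooling; the checks described above amount to:

```shell
# Per-device soft/hard/transport error counters (the S: H: T: columns):
iostat -En
# Faults the fault manager has actually diagnosed (none, in this case):
fmadm faulty
# Raw error telemetry; useful for spotting a pattern per controller:
fmdump -eV | tail -40
```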


    On to performance testing. Since starting this thread, I have changed the access/pool setup.

    Archival servers:

    (22) 8-disk RAID-Z2 vdevs + 4 hot spares
    (2) Intel DC S3700 ZIL mirror
    (2) 1TB Samsung 850 Pro L2ARC
    Compression on
    Sync = standard

    Processing server:

    Spindles: (44) 2-disk mirrors + 2 hot spares
    (2) Intel DC S3700 ZIL mirror
    (2) 1TB Samsung 850 Pro L2ARC
    Compression on
    Sync = standard

    SSD: (10) 2-disk mirrors
    (no ZIL or L2ARC)
    Compression on
    Sync = disabled

    Out-of-the-box performance is, well, good, but not great. I'm working on more tuning tomorrow, but so far it's not quite as amazing as I had hoped.

    [benchmark screenshots]

    Client side, I'm still working on benchmarks, but here are the initial iperf results.
    Windows barely hit 1GB/s after tweaking the jumbo frame size and send/receive buffers. (DBA, I would love to know your settings.)
    [Windows iperf screenshot]

    Solaris hits nearly 3.12GB/s out of the box! (ConnectX-3 + IS5030)
    [Solaris iperf screenshot]
     
    #17
    Last edited: Oct 15, 2014
    T_Minus and Patrick like this.
  18. J-san

    J-san Member

    Joined:
    Nov 27, 2014
    Messages:
    66
    Likes Received:
    42
    I'm not sure if this could be your problem with the transport errors, but when I built a server recently I ran into many transport errors in OmniOS with my three LSI 9211-8i cards flashed to P20 firmware (always flash to the latest firmware, right?).

    I downgraded to P19 firmware and haven't seen any transport/hard errors since, so that might be worth a try. I think your HBA controller is different, but there might be a bug in shared firmware code.

    My Intel S3500 SSDs caused more transport/hard errors for me than the SATA RE4s I had, but both would consistently rack up errors during/after benchmarking the disks locally via napp-it. As soon as I downgraded to P19 firmware in IT mode, that went away.
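
    For anyone checking their own cards, the flashed firmware level can be read back with LSI's flash utility (sas2flash for SAS2 cards like the 9211-8i; the SAS3 3008-based cards use sas3flash instead):

```shell
# List all installed LSI SAS2 HBAs with firmware and BIOS versions:
sas2flash -listall
# Full details for one adapter (controller index 0):
sas2flash -list -c 0
```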
     
    #18
  19. PigLover

    PigLover Moderator

    Joined:
    Jan 26, 2011
    Messages:
    2,767
    Likes Received:
    1,110
    This likely is the problem. P20 firmware is widely reported as unstable. I had trouble after upgrading some cards to P20; everything cleared up when they were reflashed to P19.
     
    #19
  20. Stanza

    Stanza Active Member

    Joined:
    Jan 11, 2014
    Messages:
    205
    Likes Received:
    40
    What do you get with

    iperf
    update interval every 2 seconds
    running for 20 seconds
    TCP window size 1024k
    running 6 concurrent threads

    iperf -c 192.168.0.1 -i 2 -t 20 -w 1024k -P 6
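
    The matching server end (iperf2 syntax, with the window size mirroring the client's) would be something like:

```shell
# On the receiving box, before starting the client:
iperf -s -w 1024k
```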

    .
     
    #20