My ZFS NAS build/benchmark history so far (4+ builds) - long time lurker now a member

Discussion in 'DIY Server and Workstation Builds' started by sfbayzfs, May 23, 2015.

  1. sfbayzfs

    sfbayzfs Active Member

    May 6, 2015
    I have been reading the site/forums for a few years now, but am finally posting!

    Linux sysadmin by trade, my first NAS was a Thecus N7700. After outgrowing it, I decided to do a 15-drive ZFS build (#1), which a friend copied with a different motherboard. My wife said it would be full by the time I finished it, so I built a 20-disk unit (#2). When that filled up too, I started using the N7700 again, and I am finally deploying a 36-drive Supermicro build (#3) now! Between builds 2 and 3, I almost deployed an inexpensive 8-11 drive unit (#4), but instead convinced a friend with 2 Drobos to consolidate onto it.

    All of these have the same usage profile, so I'm listing it first:

    Usage Profile:
    General Samba share for media storage and streaming, plus random PC backups. Squeezebox Server eventually (that is still on the Thecus, but I need to docker-ize it or something)

    Build’s Name: #1 - 15-drive custom trayless hotswap tower
    Operating System/ Storage Platform: NAS4Free 9, since FreeNAS was only 8.x at the time, and my friend didn't want to go with ZoL back then
    CPU: Xeon E3-1220L 20W 2C/4T
    Motherboard: Supermicro X9SCM-F for me, an Asus workstation board for my friend
    Chassis: AZZA Solano 1000 ATX Full Tower Computer Case CSAZ-1000R, 10x 5.25in Bays fitted with 5x iStarUSA 2x5.25in to 3x3.5in SAS / SATA Trayless Hot-Swap Cage, Model: BPN-DE230SS-RED
    Drives: 15x 5400RPM Hitachi 4TB plus an SSD for the OS
    RAM: 32GB ECC unbuffered DDR3 1333
    Add-in Cards: 2x IBM M1015 IT mode
    Power Supply: Seasonic S12-330 originally, but I switched to a Kingwin LZP-550 modular when I was considering adding this chassis to the current 20-drive unit as a JBOD for a 35-drive backup array - the amazing Kingwin shaves about 6W off the base AC draw of the already-great Seasonic!
    Other Bits:
    Great quiet case, and it looks great with the red interior and red trayless hotswap aluminum fronts! I routed the power cables under the motherboard tray. I use the case cover fan to blow on the LSI cards, since they don't come with fans. I set the drive cage fans to low, and this thing runs cool with just those fans, the big top fan, the big side fan, and 1 of the exhaust fans running. On my friend's unit the top bay or 2 pop open easily (but don't disconnect the drives in them); mine are fine.

    Build’s Name: #2 - 20-drive custom trayless hotswap tower
    Operating System/ Storage Platform: ZOL prerelease on Centos 6
    CPU: Xeon E3-1260L 45W 4C/8T
    Motherboard: Supermicro X9SCM-F
    Chassis: Antec 1200 (since it has 12x 5.25" bays) and 4x iStarUSA BPN-DE350SS-RED trayless hotswap 3x 5.25 -> 5x 3.5 cages
    Drives: 20x 5400RPM Hitachi 4TB plus an SSD for the OS
    RAM: 32GB ECC unbuffered DDR3 1333
    Add-in Cards: LSI 9201-8i single linked to Intel RES2SV240 SAS expander
    Power Supply: Kingwin AP-550
    Other Bits:
    I don't like the Antec case quite as much as the AZZA, but I needed 12x 5.25" bays for the trayless cages. I had to smash down 2 of the 3 sets of 5.25" drive-support notches so the cages would fit - a large C-clamp and 2 blocks of wood came in very handy for this (I suspect no 5-in-3 cage can fit with the notches in place.) Roughly 160W total idle draw with disks spinning.

    Build’s Name: #3 - 36-drive Supermicro SC847 expander
    Operating System/ Storage Platform: ZOL current on Centos 7
    CPU: Xeon E3-1265L v2 45W 4C/8T
    Motherboard: Supermicro X9SCM-F
    Chassis: Supermicro SC847 36-bay 4U with 2x sas expander backplanes
    Drives: 36x 5400RPM Hitachi 4TB plus an SSD for the OS
    RAM: 32GB ECC unbuffered DDR3 1600
    Add-in Cards: LSI 9207-8i see bandwidth notes, may change card
    Power Supply: Supermicro PWS-920P-1R, also tested with PWS-920P-SQ
    Other Bits: generic Renesas USB3 card
    Great case; the expander backplanes make cabling super easy. I am only using 4 of the 7 possible fans in the middle of the case; they are connected to the motherboard and run quietly for a server after the initial whoosh, though nowhere near as quiet as the towers above. Trays are a PITA compared to trayless, but cheaper and higher density. See below for bandwidth notes on these builds. Roughly 250W idle draw with disks spinning.

    Build’s Name:
    #4 - 8-11 drive cheap NAS build
    Operating System/ Storage Platform: FreeNAS 9.3
    CPU: Pentium G620T 35W 2C/2T
    Motherboard: Supermicro X9SCL-F
    Chassis: NZXT Source 210 ELITE Midtower - 8x tool free 3.5 bays inside, 3x 5.25, under $50 shipped after tax!
    Drives: 10x HGST 5400RPM 4TB (started with old Seagate 3TB disasters)
    RAM: 8GB ECC DDR3 1066
    Add-in Cards: IBM M1015 IT mode
    Power Supply: Seasonic S12-330 or an Antec Earthwatts 380, I forget which
    Other Bits:
    Moved an exhaust fan to the case side for air to the LSI card. Drives are a tight slide out past the CPU cooler, etc. - great value overall.

    General notes:

    Looks: Builds #1 and #2 look great, and are very quiet (*not* silent though.) Build #4 is also very quiet, and super-cheap to put together.

    LSI Firmware versions: Build #4 was throwing read CRC errors on many disks and reading incredibly slowly with the now-infamous P20 AVAGO/LSI firmware - at first I thought it was the junk Seagate 3TB drives, but it turned out to be the controller firmware. If you hunt around, FreeNAS suggests the P16 firmware, since it includes the P16 drivers, and some have reported issues in FreeBSD with P19 firmware under the P16 drivers. The P16 drivers are also in CentOS 6 and 7, so I have flashed P16 on all of my cards with no issues yet.

    Bandwidth: All of these get close to 10GbE speeds even with Z3! Build #2 gets almost 1.7GB/s writing a single file to a ZFS stripe, 1.4GB/s to a Z1, 1.2GB/s to a Z2, and just under 1GB/s to a Z3. For #3, the 9207-8i card is connected to the 24-drive backplane, which daisy-chains to the 12-drive backplane. Single linked, it gets very similar numbers to #2, although as much as 5% lower in some cases, probably due to the inefficiency of daisy-chaining backplanes. Dual linked, it gets 2.78GB/s writing onto a pure ZFS stripe, but that drops to numbers only 5-10% above #2's bandwidth at Z1/Z2/Z3, because of my next topic.
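    For context on those link speeds, here's the back-of-the-envelope ceiling arithmetic for a SAS2 x4 wide port (a sketch - it assumes 8b/10b encoding overhead and ignores protocol, expander, and drive overhead, which is why real-world numbers land well below it):

```shell
# SAS2 wide-port raw ceiling (assumption: 8b/10b encoding, no protocol overhead).
# 4 lanes * 6 Gb/s per lane * 0.8 (8b/10b) / 8 bits-per-byte = GB/s per x4 link
single=$(awk 'BEGIN { printf "%.1f", 4 * 6 * 0.8 / 8 }')
double=$(awk 'BEGIN { printf "%.1f", 2 * 4 * 6 * 0.8 / 8 }')
echo "single x4 link ceiling: ${single} GB/s"   # vs the ~1.6-1.7 GB/s observed
echo "dual   x4 link ceiling: ${double} GB/s"   # vs the ~2.7-2.8 GB/s observed
```

    So the dual-link stripe numbers are plausibly link-limited, while the single-link results sit noticeably under the raw ceiling.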

    CPU: A single 36-drive Z3 pool in #3 is CPU bound on writes at ~1.15GB/s with a 4C/8T low-power Intel Xeon. A stripe of 2x 18-drive Z2 vdevs brought the bandwidth back up to 2.2-ish GB/s write, and 4x 9-drive Z1 vdevs was maybe 0.1GB/s faster with only about half the overall CPU use. When I built #2 years ago, I tested it with a split 2x 10-drive Z2, but saw no performance increase vs a single-Z2-vdev zpool. Other than backups onto a DAS JBOD, this is still ridiculous bandwidth for home use, and although I have not thoroughly tested #4, it definitely gets at least 400MB/s write on a 10-drive Z2 pool with its 2C/2T CPU. CPU use seems most dependent on the number of disks per vdev, with a small penalty for double or triple parity. Maybe hyperthreading should be off.

    RAM use: In my typical basic usage on #2 (at most crawling through the array double-checksumming files), ZFS eats up about 18-19GB of the 32GB available. A full scrub increased that to about 21GB used. So far in my limited testing, #3 does not use much more than 22GB under light use; I will report back later as I use it more.

    AMD: I need to benchmark these with large arrays when I have time. I have a Fujitsu microserver, a couple of HP MicroServers, and an Asus desktop board with DDR3 unbuffered ECC RAM support. Even though I strongly dislike Intel, I did not end up going with any of the AMD options, for several reasons:
    - The Fujitsu microserver has a great AM3 board and is a well-designed server, but the PSU is proprietary and the board seems to run on just a lot of 12V lines from it, so it is not easy to transplant into another case. I have a bunch of low-power AM3 CPUs I want to benchmark in it, connected to an external JBOD, for comparison purposes.
    - The HP MicroServer's board is great, but it also has proprietary mounting - if it just had standard mounting holes I would have used it in builds #1 and #2 above. If I only needed 6 drives max, I would just use an HP MicroServer as-is.
    - My Asus micro ATX desktop AM3 board is only supposed to support up to 16GB of RAM (I must test it with 4x 8GB) and would be hard to replace if it fails.
    - I really wanted an AMD Supermicro board, but the only socket AM3 one is impossible to find and larger than micro ATX. AMD went mid-to-high power, massively multicore for their main server line and has no competitor to the low-power, small-board Xeon E3 and Atom C2xxx series. A 16-core+ Opteron (or 4 of them) would be great for ZFS, but I leave this on 24/7...

    More musings / overall conclusions:
    I have also tried some higher power Intel dual proc boards which can take a lot more RAM, but it looks like more than 32GB of RAM is not needed, so I'll stick with the X9SC* series.

    ZFS on Linux does all I need it to do, but FreeNAS has gotten very good recently, so I am recommending it to less technical friends who just need a reliable NAS. If you install the current FreeNAS, be SURE to dd /dev/zero over your target flash drive before installing onto it - we had lots of mysterious GPT-related bootloader failures until we figured that out!
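    For anyone following that advice, a minimal sketch of the wipe (the device path is a placeholder - triple-check it first, since dd will happily destroy the wrong disk; this version writes to a scratch file so it's safe to run as-is):

```shell
# Zero the start of the target flash drive to clear stale GPT/bootloader data
# before a FreeNAS install. DEVICE is a placeholder -- on FreeBSD it might be
# something like /dev/da1; a scratch file is used here so this sketch is harmless.
DEVICE=/tmp/fake-usb.img
dd if=/dev/zero of="$DEVICE" bs=1M count=16 2>/dev/null
# Zeroing the entire stick, as in the post, also wipes the backup GPT header
# stored at the very end of the device.
```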
    altqpeeg, Boris, Patrick and 2 others like this.
  2. T_Minus

    T_Minus Moderator

    Feb 15, 2015
    So you're liking ZoL with CentOS... good to know.

    I also appreciate you posting and sharing your benchmarks, and the comparison of the various raidz# levels is interesting too.

    Have you compared w/out the L2ARC, RAM and SLOG? Or was that how you did it?
  3. sfbayzfs

    sfbayzfs Active Member

    May 6, 2015
    No - for fun I just set sync=disabled on the Z3 version of the testpool, but it didn't go any faster; it still peaked at 1.19GB/s write, 1.16GB/s write typical.

    I am not doing super-tight measuring here. Here's what I am doing - suggestions/improvements are welcome!

    I test with a 5GB file dd'ed from /dev/urandom onto the system SSD. I cat it to a file or /dev/null to get it into the disk cache. For each test, I cat the file 20 times in a row, appended into a ~100GB test file on the zpool, and time the run (just with time), repeating a few times to get an idea of the variance. While that is going, I also run zpool iostat 10 and htop, and sometimes iostat -xm 10 if I care about individual drive saturation.
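    A sketch of that procedure as a script (paths and the 8MB source size here are stand-ins - the real runs used a 5GB source file on the SSD and a destination under the zpool's mountpoint):

```shell
SRC=/tmp/zfs-testdata.bin        # lives on the system SSD in the real runs
DST=/tmp/zfs-testpool/bigfile    # lives on the zpool (e.g. /testpool/...) in the real runs
mkdir -p "$(dirname "$DST")"

# 1) Incompressible source data (5GB in the real runs; 8MB here for a quick demo)
dd if=/dev/urandom of="$SRC" bs=1M count=8 2>/dev/null

# 2) Warm the page cache so the timed run reads from RAM rather than the SSD
cat "$SRC" > /dev/null

# 3) Append 20 copies into one large file and time the whole thing;
#    write throughput = 20 * size(SRC) / elapsed seconds
time ( for i in $(seq 1 20); do cat "$SRC"; done > "$DST" )
```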

    I have mostly stock settings on my test zpools, other than atime=off and compression=lz4, since I will be running with those (hence /dev/urandom instead of /dev/zero for test data).

    I have also noticed what may be large discrepancies in the percentage of reserved space for different zpool configurations. I am running some tests and writing up a report now; more info within a half hour or so.
  4. T_Minus

    T_Minus Moderator

    Feb 15, 2015
    I'd be curious to see the test results with the RAID configuration details nearby/organized. You wrote a lot, and it's a ton of great info, but I just tried to glance at it again to compare raidz# setup to performance, drive count, and vdev setup, and found it challenging.

    Any chance you want to make a couple headings, and bulleted list of the configurations, and output for your test so it's easier to compare them all?

    Yes. I'm asking a lot.
    Sorry :)
  5. sfbayzfs

    sfbayzfs Active Member

    May 6, 2015
    How's this for a start:

    Disks used:
    36x Hitachi / HGST 4TB 5400RPM, model HDS5C4040ALE630

    Zpool base creation options:
    zpool create -f -o ashift=12 -o autoexpand=on -O atime=off -O compression=lz4 testpool

    Zpool test configurations
    1. no raid (36 drives, no parity) reported as 129T by df -h, as expected (assuming 1 marketing TB = 0.9 real TB)
    2. raidz1 (36 drives, 1 is parity) reported as 121T, I would expect 126T minus any extra reservation
    3. raidz2 (36 drives, 2 are parity) reported as 114T, 122 expected
    4. raidz3 (36 drives, 3 are parity) reported as 114T, 118.8 expected
    5. raidz2 (18 drives) + raidz2 (18 drives) (4 total drives parity) reported as 114T, 115T expected
    6. raidz1 (9 drives) + raidz1 (9 drives) + raidz1 (9 drives) + raidz1 (9 drives) (4 total drives parity) reported as 114T, 115T expected
    7. raidz2 (9 drives) + raidz2 (9 drives) + raidz2 (9 drives) + raidz2 (9 drives) (8 total drives parity) reported as 98T, 111T expected
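    My "expected" numbers above use roughly 0.9 real TB per marketing TB; the exact conversion works out like this (a back-of-the-envelope sketch that ignores metadata and any extra reservation, so it comes out slightly above my rounded expectations):

```shell
# usable TiB = (total drives - parity drives) * 4*10^12 bytes / 2^40 bytes-per-TiB
# (4*10^12 / 2^40 ~= 3.64 TiB per "4TB" drive, i.e. the ~0.9 factor)
per_drive_tib=$(awk 'BEGIN { printf "%.3f", 4e12 / 2^40 }')
echo "TiB per 4TB drive: $per_drive_tib"

# config number : total drives : parity drives, matching test configurations 1-7
for cfg in 1:36:0 2:36:1 3:36:2 4:36:3 5:36:4 6:36:4 7:36:8; do
  echo "$cfg" | awk -F: '{ printf "config %s: %.1f TiB expected\n", $1, ($2 - $3) * 4e12 / 2^40 }'
done
```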
    I will use those test numbers to refer to the configurations below. Yes, I could do 3 sets of 12 drives as Z1 and Z2, but the main questions I want to answer are:
    1. Is a sas2308 based card worth using over a sas2008 card?
    2. Is dual linking worth it in a raidz2 or z3 configuration, or will things be CPU bound anyway?
    3. Should I single-link to each of the backplanes instead of daisy-chaining them?
    4. To hook up an external JBOD for backups, should I use single or dual linking, or will I be even more CPU bound?
    Available Space issues:
    For the space issue, it's almost as if something is rounding the reported size to the nearest 2-3 full drives' worth of space, or else an entire drive per vdev is being reserved in addition to parity.

    I know these numbers are expected to be inaccurate, but at some point, like on my old system, I have been as low as 18GB "free" with over 1TB "unallocated" according to zpool status - is there any way to reduce the reserved space percentage on ZFS, like the tune2fs reserved-block reduction on ext*? ZFS reservations and refreservations seem to not be what I'm looking for, although I haven't tested them yet, and the examples people give look wrong.

    I am running a new test suite now and will report in a bit.
    lmk, altqpeeg, Boris and 2 others like this.
  6. T_Minus

    T_Minus Moderator

    Feb 15, 2015
    Above and beyond what I was expecting, very well organized and appreciated :)

    Those are some large single arrays :) and even the others are 9 disks, awesome :)

    I look forward to the results.
  7. sfbayzfs

    sfbayzfs Active Member

    May 6, 2015
    I just re-ran my now more automated test suite in single link mode, and here are the results. It may be that single link can't quite deliver enough data to fully peg the CPU; the highest I saw was just under 90% on all 8 threads.

    Serial write throughput single linked to daisy-chained sas2 expander backplanes from a 9207-8i card:
    1. 1.62GB/s
    2. 1.46GB/s
    3. 1.37GB/s
    4. 1.19GB/s
    5. 1.42GB/s
    6. 1.44GB/s
    7. 1.24GB/s
    Now I need to re-run them dual linked, but I need to wrap up for the night.
    Boris, Patrick and T_Minus like this.
  8. Patrick

    Patrick Administrator
    Staff Member

    Dec 21, 2010
    Super posts and thread!
  9. sfbayzfs

    sfbayzfs Active Member

    May 6, 2015
    Thanks - maybe I should start a separate ZFS benchmarking thread; there are so many permutations I would like to test (a 2008-based controller, 2 controllers dual linked to each backplane, different CPUs, less RAM, etc.)

    Until then, here are the stats from the dual linked 9207-8i:

    Serial write throughput dual linked to daisy-chained sas2 expander backplanes from a 9207-8i card:
    1. 2.73 GB/s
    2. 1.47 GB/s
    3. 1.28 GB/s
    4. 1.07 GB/s
    5. 1.69 GB/s
    6. 2.17 GB/s
    7. 1.85 GB/s
    It looks like my memory/notes were off before with the dual link numbers, or something was different with my test setup then, but you can see the trend. Interestingly, in some of the higher-parity cases the single link tests were faster (perhaps overhead from dual link multipath?). For all of these tests, the figure is the peak speed, which looks like it would be maintained for a long large-file write; there have been occasional glitches, and often a much higher burst at the very beginning, probably due to buffering.

    I ran vmstat 10 this time, and noted that ZFS does free all of the RAM when you destroy a pool; this system seems to have a base usage of about 1.4GB of RAM with no zpools.

    I am going to shut it down and re-run with only 8GB available, but still dual linked, to see what happens.

    Oh, also worth noting: as these tests chew up CPU, the fans spin up and down a lot - I think each ramp increment is enough to cool things back down relatively quickly, but then they get hot again. I need to trade out the stock Intel cooler for a 2U heatsink to minimize turbulence.

    ## Update ##

    The 8GB dual linked test is on its way, but it's going to take a while - the first couple of test cases are jumping around in the 580-650MB/s range. From vmstat, ZFS is trying to leave RAM free for the system, the test file is in the system disk cache, and there has been a small amount of swapping.

    ## Update 2 ##

    It finally finished:

    Serial write throughput dual linked to daisy-chained sas2 expander backplanes from a 9207-8i card but limited to 8GB RAM:
    1. 0.6 GB/s
    2. 0.6 GB/s
    3. 0.6 GB/s
    4. 0.6 GB/s
    5. 0.6 GB/s * as high as 0.7 but as low as 0.2 for a while, probably an anomaly
    6. 0.5 GB/s
    7. 0.5 GB/s
    [free output removed since it was taken right after destroying the test pool]

    I am now going to retry with 16GB RAM. It should also be noted that I went down to 1 stick, so the 8GB test was single channel as well as short on RAM; I don't have any 1600-speed compatible sticks smaller than 8GB, although I do have sticks as small as 2GB in 1333 and 1066.

    ## Update 3 ##

    Serial write throughput dual linked to daisy-chained sas2 expander backplanes from a 9207-8i card but limited to 16GB RAM:

    1. 2.46 GB/s
    2. 1.34 GB/s
    3. 1.16 GB/s
    4. 1.00 GB/s
    5. 1.53 GB/s
    6. 2.01 GB/s
    7. 1.64 GB/s
    ...So not as good as with 32GB, but not a lot worse either. Hmmm, would more RAM help? 32GB is the max for the X9SCM / X9SCL boards. I might be able to get a dual proc board up to almost 64GB; if I have time I will set that up, disable NUMA in the BIOS, and run the test on that as well, but that may be more involved than I have time for...

    For 16GB, free after the final test but before destroying the test pool is:

                  total     used    free  shared  buff/cache  available
    Mem:       16239944  9555004  699272    9588     5985668    6008284
    Swap:      32767996        0 32767996

    ## Update 4 ##
    Serial write throughput separately linked to each sas2 expander backplane from a 9207-8i card:

    1. 1.93 GB/s
    2. 1.35 GB/s
    3. 1.17 GB/s
    4. 1.00 GB/s
    5. 1.57 GB/s
    6. 1.75 GB/s
    7. 1.53 GB/s
    This is probably limited by the worst-case bandwidth to the 24-drive backplane. Now I am going to trade the card for a 9201-8i and see what happens...

    ## Update 5 ##

    Serial write throughput dual linked to daisy-chained sas2 expander backplanes from a 9201-8i card:

    1. 2.69 GB/s
    2. 1.51 GB/s
    3. 1.31 GB/s
    4. 1.10 GB/s
    5. 1.73 GB/s
    6. 2.19 GB/s
    7. 1.87 GB/s
    So basically the same as the 9207-8i in the same situation, hmmm.

    Are there any other tests anyone would like me to run before I start putting real data on this?
    Last edited: May 24, 2015
    altqpeeg and T_Minus like this.
  10. sfbayzfs

    sfbayzfs Active Member

    May 6, 2015
    Today, time permitting, I am going to run more tests with the 9201-8i card dual linked to the daisy-chained expanders:
    1. Hyperthreading disabled - running now
    2. Hyperthreading disabled and only 2 cores
    3. SAS1 expander backplanes instead of SAS2 - I have the same chassis with SAS1 backplanes
    Also, I am going to test what happens when there is a bad SATA drive behind the expanders, as suggested in the LSI firmware thread.

    As a general observation throughout this process, it appears that ZFS limits its CPU usage if any cores are over ~85% utilized - even when things are getting CPU bound, total core utilization in htop bounces around between 80 and 90% on all available cores. The exception was the dual core experiment, where 1 of the cores did basically peg during most of the tests while the other stayed around 80-90%.

    Serial write throughput dual linked to daisy-chained sas2 expander backplanes from a 9201-8i card with hyperthreading disabled (4 real cores):
    1. 2.67 GB/s
    2. 1.09 GB/s
    3. 0.98 GB/s
    4. 0.81 GB/s
    5. 1.27 GB/s
    6. 1.77 GB/s
    7. 1.43 GB/s

    Serial write throughput dual linked to daisy-chained sas2 expander backplanes from a 9201-8i card with hyperthreading disabled and only 2 cores:
    1. 1.62 GB/s (this jumped around between 1.52 and 1.72 more than other tests did)
    2. 0.64 GB/s
    3. 0.56 GB/s
    4. 0.41 GB/s
    5. 0.72 GB/s
    6. 1.10 GB/s
    7. 0.76 GB/s

    ## Update ##
    I was trying to do the SAS1 backplane test I planned above, but after wasting time troubleshooting and replacing a bad cable in the chassis, and upgrading the firmware on the 9207-8i to P19 with BIOS, one drive was not showing up, even after reseating it a couple of times (I think there was a chunk of dust or something in that bay). I will troubleshoot that later. In the meantime, I ran test case 7 with 1 of the 4 raidz2 vdevs degraded and got about 450MB/s read and 390MB/s write, and the CPU was not stressed at all. This was single cabled into the 12-bay backplane, which daisy-chained into the 24-bay backplane - not terrible bandwidth considering that and the missing drive.

    I then moved the motherboard tray into a different SAS2 expander chassis and ran the test suite with the P19 firmware; compare to "Update 4" in my previous post above:

    Serial write throughput separately linked to each sas2 expander backplane from a 9207-8i card updated to P19 firmware instead of P16:

    1. 2.12 GB/s
    2. 1.47 GB/s
    3. 1.28 GB/s
    4. 1.06 GB/s
    5. 1.76 GB/s
    6. 1.90 GB/s
    7. 1.65 GB/s
    Noticeably faster with the newer firmware! I may leave it cabled like this, since it is plenty fast and people have warned of performance drops and strange behavior when daisy-chaining SAS expanders.
    Last edited: May 25, 2015
