Performance tuning three monster ZFS systems

Discussion in 'Solaris, Nexenta, OpenIndiana, and napp-it' started by the spyder, Oct 1, 2014.

  1. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Just a few weeks back I was asked to put together a large amount of storage for a last-minute project. I usually avoid projects like this, but it's a special case, and so far it has gone as planned (knock on wood). Since I'm waiting on a few external delays outside my control, I figured there's no better time than now to do some performance tuning. I'm working on a quick set of repeatable tests that best represent our usage. Until now I have relied on bonnie/iostat/dd bench/CrystalDiskMark. I've read pretty much everything I can find and wanted to get an outside opinion.

    Here are the three systems' specs:
    Processing
    (1) Supermicro 2U 24-bay chassis
    (2) Xeon E5-2620 v2s
    (16) 32GB DDR3-1866
    (3) Supermicro 3008-based internal HBAs flashed with IT-mode firmware
    (2) LSI 9300-8E HBAs
    (1) Mellanox ConnectX-3 dual-port QDR IB
    (22) 1TB Samsung 850 Pros
    (4) 256GB Samsung 850 Pros
    (2) Supermicro 45-bay JBODs (single expander)
    (90) WD RE4 4TB 7200rpm enterprise SATA

    Rpool:
    (2) 256GB, mirrored

    SSD pool (9.4TB formatted):
    (10) mirrored 1TB vdevs
    (1) 256GB ZIL drive
    (1) 1TB spare


    Spindle pool: 157TB
    (44) mirrored 4TB vdevs
    (2) 4TB spares
    (1) 256GB ZIL
    (1) 1TB L2ARC
    Capacity limited to 90%
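
    A mirrored layout like this maps onto zpool syntax roughly as follows (a sketch with placeholder device names, not the actual build commands):

```shell
# Hypothetical sketch of the processing spindle pool (placeholder
# cXtYd0 device names; the real pool has 44 mirror pairs, not 2).
zpool create tank \
  mirror c1t0d0 c2t0d0 \
  mirror c1t1d0 c2t1d0 \
  log c3t0d0 \
  cache c3t1d0 \
  spare c1t22d0 c2t22d0
# Splitting each mirror across two controllers/backplanes means the
# pool can survive a controller failure as well as a disk failure.
```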

    Archival (2x)
    (1) Supermicro 2U 24-bay chassis
    (2) Xeon E5-2620 v2s
    (16) 32GB DDR3-1866
    (1) Supermicro 3008-based internal HBA flashed with IT-mode firmware
    (4) LSI 9300-8E HBAs
    (1) Mellanox ConnectX-3 dual-port QDR IB
    (2) 1TB Samsung 850 Pros
    (4) 256GB Samsung 850 Pros
    (4) Supermicro 45-bay JBODs (single expander)
    (180) WD RE4 4TB 7200rpm enterprise SATA

    Rpool:
    (2) 256GB, mirrored

    Spindle pool: 475TB formatted
    (22) 8x4TB RAID-Z2 vdevs
    (4) 4TB spares
    (2) 256GB mirrored ZIL
    (2) 1TB L2ARC
    Capacity limited to 90%
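
    The archival layout maps onto zpool syntax roughly like this (a sketch with placeholder device names; the real pool repeats the raidz2 vdev 22 times):

```shell
# Hypothetical sketch of one archival pool (placeholder device names).
zpool create archive \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
  log mirror c5t0d0 c5t1d0 \
  cache c5t2d0 c5t3d0 \
  spare c1t8d0 c1t9d0
# Each 8-disk raidz2 vdev contributes 6 data disks of capacity:
# 22 vdevs x 6 x 4TB = 528TB of raw data space, ~475TB formatted.
```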

    OS: Solaris 11.2 (or possibly OmniOS, I'm going to play with it tomorrow.)

    There are a few important notes: (1) We were limited by what we could order due to the time frame. This caused major issues: the Intel SSDs and SAS hard drives I originally specced ended up being weeks out from our deadline. The WD RE4s and Samsung 850s were the only drives available. (2) The original project requirement was 1PB of storage. After speaking with the teams using this storage and doing a quick analysis, I decided to split it into the three systems above, mainly because of how they move data as it is processed. The first archival machine is really an input data server, where everything from the field is uploaded and organized. The processing data server is where larger groupings are copied, broken into smaller chunks, and processed off the SSD array by an attached 240-core/1.5TB cluster. The last machine is an output directory, where everything is QA'd and copied off for delivery. It's a complicated process with several data moves, but if you could see how they are doing it now, this is 100x better.

    I'm building the arrays as I write this, let me know what you think or what benchmarks you would like to see.
     
    #1
    Last edited: Oct 1, 2014
  2. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,185
    Likes Received:
    708
    Impressive project!

    What you may consider:
    - if sync is disabled, a ZIL is not used/necessary
    - I would add more RAM than 32 GB (up to 128 GB)
    - limiting capacity to 90% (= 10% initial pool reservation) is OK
    - with SSDs, you may add an extra 5-10% overprovisioning to keep write performance high under load (create a host protected area on new SSDs), so you do not need to care about it during usage
    - use RAID-Zn vdevs built from 4 or 8 data disks (6 or 10 disks per Z2 vdev)
    - in the case of SSDs you may use Z2 instead of mirrors, as the iops of SSDs do not really require mirrors
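
    The host-protected-area trick can be sketched like this; it has to be done before the SSD ever holds data, and hdparm is a Linux-side tool (the sector count below is a placeholder for a typical 256GB drive):

```shell
# Placeholder native sector count for a 256GB SSD; read the real value
# with: hdparm -N /dev/sdX
total_sectors=500118192
# Keep 90% visible; the hidden ~10% becomes extra overprovisioning
# the controller can use to keep write performance up under load.
visible=$(( total_sectors * 90 / 100 ))
echo "$visible"    # -> 450106372
# On a fresh drive, clip the visible size persistently ('p' prefix):
#   hdparm -N p$visible /dev/sdX
```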

    What services are you using (e.g. iSCSI, SMB, NFS)?
    Service performance from a client (for example Windows, CrystalDiskMark via iSCSI) or via SMB or NFS (sequential and iops), or box-to-box performance over IB, would be of interest. (Local raw pool performance should be "more than enough".)
     
    #2
  3. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Hi Gea!

    You will be happy to know we purchased Pro licensing for this system.
    1) I set sync to standard. I had initially disabled it for a quick comparison and forgot to re-enable it.
    2) Each system has 512GB (16x32GB).
    3) I left the default 10% for now.
    4) I'm planning on only allowing 75% on the SSD pools due to the concerns you mentioned. Thanks for the tip, I was unaware I could create a host protected area.
    5) I used 8-disk RAID-Z2 groups for the archival storage systems and 2-disk mirrors for the processing storage system's spindle drives.
    6) I will benchmark both Mirror and RZ2 on the SSD pool and report back.

    The system will be accessed mainly via NFS and some SMB. I'm doing my initial testing based on your tuning guide and will post the results.
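
    For reference, the settings being tested map onto zfs properties along these lines (pool name is a placeholder):

```shell
# Hypothetical property setup for an NFS/SMB-served pool named "tank".
zfs set compression=on tank     # compression on
zfs set sync=standard tank      # honor NFS clients' sync requests
zfs set atime=off tank          # common tuning: skip access-time writes
# Verify what is in effect:
zfs get compression,sync,atime tank
```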

    Here's a quick shot of it being burnt in.
    [burn-in photo]
     
    #3
    T_Minus, Patrick and Chuntzu like this.
  4. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,185
    Likes Received:
    708
    In the past it was not recommended to use more than 128GB of RAM.
    I am not sure if this is still a problem with current OmniOS/Solaris 11.2, as I have not used that much RAM and have not heard of anyone using more than 128GB. You may need to ask at Oracle/OmniTI.

    Nex7's Blog: ZFS: Read Me 1st
     
    #4
  5. legen

    legen Active Member

    Joined:
    Mar 6, 2013
    Messages:
    195
    Likes Received:
    34
    Nice one. When accessing this through NFS, why don't you go with something like a ZeusRAM to speed up sync writes (or are all your workloads async)?
     
    #5
  6. rubylaser

    rubylaser Active Member

    Joined:
    Jan 4, 2013
    Messages:
    842
    Likes Received:
    229
    This is a ridiculous build. I can't wait to see some benchmarks :)
     
    #6
    Patrick likes this.
  7. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Well,

    I was hoping to spend today doing some initial testing, but instead I ended up troubleshooting a very odd bug. On all three systems, mirroring the rpool caused the original drive to degrade with hundreds of checksum errors. Eventually both drives would report degraded and report checksum errors. fmadm reports the drives are not faulty, and scrubs return zero errors. I swapped drives, updated controller firmware, and reinstalled; the issue persisted. I'm not sure if this is a controller/drive bug or a Solaris 11.2 issue. I'm going to try OmniOS in the morning. I was able to get the processing array controller working after removing the degraded drive and re-adding it, but that system uses the built-in SATA controller on the motherboard to drive the two additional rear-mounted drives. The other two systems have AOC-3008-8i controllers flashed to the latest IT firmware.

    I dislike days like these, where you feel like you are chasing your tail.
     
    #7
  8. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Gea,

    Our existing systems use 192GB of RAM and show no errors/issues. I'm assuming that's because we treat them as giant NAS boxes. If it becomes an issue during testing, I will limit the ARC and hopefully change it back when it's patched.
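
    If it does become an issue, capping the ARC on Solaris/illumos is a one-line /etc/system setting; the 384GB figure below is just an illustrative cap for a 512GB box:

```shell
# Hypothetical ARC cap: hold the ARC to 384GB on a 512GB system.
# zfs_arc_max takes bytes, so compute the value first.
arc_max=$(( 384 * 1024 * 1024 * 1024 ))
echo "$arc_max"    # -> 412316860416
# Then add to /etc/system and reboot:
#   set zfs:zfs_arc_max = 412316860416
```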
     
    #8
  9. gea

    gea Well-Known Member

    Joined:
    Dec 31, 2010
    Messages:
    2,185
    Likes Received:
    708
    Thanks.
    There are not that many setups with more than 128GB of RAM around, so reports of remaining problems (or success) on Solaris 11.2 and current OmniOS are important, as the existing problem reports are more than a year old.
     
    #9
  10. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    Sadly, I was unable to perform any testing today due to the OS drive issue. I tried every combination I could think of, but no matter what, mirroring the rpool caused checksum errors. In the end, it's a bug between the Supermicro AOC-3008 and the Samsung 850 Pro SSDs; IR or IT mode did not change a thing. The one system that uses onboard SATA ports works fine. I had high hopes for the 850 Pros, but for now they are not usable as OS drives with the SAS3 controller. I swapped the 850s for Intel S3500s and the issue went away. I'm actually concerned enough that I'm going to replace the ZIL drives with S3700s while I'm at it. I'll keep the 1TB 850s for L2ARC/SSD pool and monitor them closely. Hopefully next week I can do some testing before we install the systems at the customer's site.
     
    #10
    spazoid likes this.
  11. kroem

    kroem Active Member

    Joined:
    Aug 16, 2014
    Messages:
    238
    Likes Received:
    35
    (only here for the pics/benchmarks :p~~)
     
    #11
  12. lmk

    lmk Member

    Joined:
    Dec 11, 2013
    Messages:
    128
    Likes Received:
    20
    @the spyder thanks for all these detailed updates and information - invaluable!
     
    #12
  13. legen

    legen Active Member

    Joined:
    Mar 6, 2013
    Messages:
    195
    Likes Received:
    34
    I have not tested the Samsung 850 Pro, but we have tested the 840 Pro as a ZIL with very, very poor results. The 840 Pro simply has too high latency to work well as a ZIL. We actually got worse results with the 840 Pro as ZIL than without (on an SSD array).

    The S3500 or S3700 is a much better choice for a ZIL (look at their write latency in the datasheet).
     
    #13
  14. spazoid

    spazoid Member

    Joined:
    Apr 26, 2011
    Messages:
    91
    Likes Received:
    10
    Why do you want a SLOG for an SSD-based pool? Unless your SLOG device is considerably faster than the actual pool, there should be no noticeable difference in performance.
     
    #14
  15. wlee

    wlee New Member

    Joined:
    Aug 8, 2014
    Messages:
    20
    Likes Received:
    2
    I found it interesting that the Intel 730 has better write latency than the S3700, on paper at least.

    Is Intel the only vendor that publishes latency figures?
     
    #15
  16. PigLover

    PigLover Moderator

    Joined:
    Jan 26, 2011
    Messages:
    2,767
    Likes Received:
    1,110
    I would agree that the log device (ZIL/SLOG) adds little value to an SSD pool running striped or mirrored. But if the SSD pool is running in parity mode (RAID-Z/RAID-5, etc.), then the log device is actually very valuable to limit write events on the SSDs and improve their longevity. Writing to a parity RAID is a very sloppy affair, and the log device allows the writes to be safely cached, scheduled, and completed in rational units to limit the total number of writes that occur.
     
    #16
  17. the spyder

    the spyder Member

    Joined:
    Apr 25, 2013
    Messages:
    79
    Likes Received:
    8
    So I spent the better part of last week fighting a combination of problems with one of the systems. Twenty-one drives appeared to have dropped over the course of the weekend, all on the same backplane and the same controller. Destroying the pool, moving the controllers around, and bam: still dropping. The drives continued to rack up errors under S: H: T: (soft/hard/transport) and would eventually drop offline. I moved the drives to a separate controller and JBOD, one that had no previous issues, and they still dropped. Removing the drives and testing them with the manufacturer's software showed them as healthy. As a last resort, I replaced the original controller and rebuilt the pool. No drives have gone offline in five days. I am not satisfied this is resolved, as the disks are still generating errors under S: H: T:, but they clear after a reboot and thus far have not failed during performance testing.
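
    For anyone following along, the S: H: T: counters come from the standard Solaris/illumos tooling; the checks described above amount to:

```shell
# Per-device soft/hard/transport error counters (the S: H: T: columns):
iostat -En
# Faults the fault manager has actually diagnosed (none, in this case):
fmadm faulty
# Raw error telemetry; useful for spotting a pattern per controller:
fmdump -eV | tail -40
```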


    On to performance testing. Since starting this thread, I have changed the access/pool setup.

    Archival servers:

    (22) 8-disk RAID-Z2 vdevs + 4 hot spares
    (2) Intel DC S3700 ZIL mirror
    (2) 1TB Samsung 850 Pro L2ARC
    Compression on
    Sync = standard

    Processing server:

    Spindles: (44) 2-disk mirrors + 2 hot spares
    (2) Intel DC S3700 ZIL mirror
    (2) 1TB Samsung 850 Pro L2ARC
    Compression on
    Sync = standard

    SSD: (10) 2-disk mirrors
    (no ZIL or L2ARC)
    Compression on
    Sync = disabled

    Out-of-the-box performance is, well, good, but not great. I'm working on more tuning tomorrow, but so far it's not quite as amazing as I had hoped.

    [benchmark screenshots]

    Client side, I'm still working on benchmarks, but here are the initial iperf results.
    Windows barely hit 1GB/s after tweaking the jumbo frame size and send/receive buffers. (DBA, I would love to know your settings.)
    [Windows iperf screenshot]

    Solaris hits nearly 3.12GB/s out of the box! (ConnectX-3 + IS5030)
    [Solaris iperf screenshot]
     
    #17
    Last edited: Oct 15, 2014
    T_Minus and Patrick like this.
  18. J-san

    J-san Member

    Joined:
    Nov 27, 2014
    Messages:
    66
    Likes Received:
    42
    I'm not sure if this could be your problem with the transport errors, but when I built a server recently I ran into many transport errors in OmniOS with my three LSI 9211-8i cards flashed to P20 firmware (always flash to the latest firmware, right?).

    I downgraded to P19 firmware and haven't seen any transport/hard errors since, so that might be worth a try. I think your HBA controller is different, but there might be a bug in shared firmware code.

    My Intel S3500 SSDs caused more transport/hard errors for me than the SATA RE4s I had, but both would consistently rack up errors during/after benchmarking the disks locally via napp-it. As soon as I downgraded to P19 firmware in IT mode, that went away.
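
    For anyone checking their own cards, the flashed firmware level can be read back with LSI's flash utility (sas2flash for SAS2 cards like the 9211-8i; the SAS3 3008-based cards use sas3flash instead):

```shell
# List all installed LSI SAS2 HBAs with firmware and BIOS versions:
sas2flash -listall
# Full details for one adapter (controller index 0):
sas2flash -list -c 0
```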
     
    #18
  19. PigLover

    PigLover Moderator

    Joined:
    Jan 26, 2011
    Messages:
    2,767
    Likes Received:
    1,110
    This likely is the problem. P20 firmware is widely reported as unstable. I had trouble after upgrading some cards to P20; everything cleared up when they were reflashed to P19.
     
    #19
  20. Stanza

    Stanza Active Member

    Joined:
    Jan 11, 2014
    Messages:
    205
    Likes Received:
    40
    What do you get with

    iperf
    update interval every 2 seconds
    running for 20 seconds
    TCP window size 1024k
    running 6 concurrent threads

    iperf -c 192.168.0.1 -i 2 -t 20 -w 1024k -P 6
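
    The matching server end (iperf2 syntax, with the window size mirroring the client's) would be something like:

```shell
# On the receiving box, before starting the client:
iperf -s -w 1024k
```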

    .
     
    #20