40TB build, ZFS Raid-50 = 50% perf of Raid-0

gigatexal · Sep 22, 2013

It's been a while but here we go, I moved what's in this from a Norco 4020.

inside before the asus board...but you get the idea, also the panel that seperates the chambers with the two 140mm fans is not shown, and it's got the old heatsinks, but the 1Us are boss

Raw space: 60TB (20x3TB drives, 19 ST3000DM001's and one hitachi 3TB)
Usable space: 40TB (drives set up in striped raidz akin to raid-50)

Other hardware:

Asus Z8PE-18
3 IBM 1015s in IT mode
72GB DDR3 1333 ECC
2x Intel L5639 (2.13 Ghz, 12MB, 6-cores)
Supermicro 500W psu - main components
Corsair CX650W psu - drives, fans
ESXi with a Ubuntu 12.04.3 ZoL .61

Cooling:

Cooljag 1U 1366 HSF all copper fans
6 - 120mm cougar black fans for the drives
2 - 140mm fans blowing air onto the 1015s and ram
1 - 120mm cougar fan exhaust

I'm currently performing baseline testing on the drives to get the sweetspot. I'll move this to ESXi and virtualize Windows Server 2012 and SQL 2012.

I did some testing:

Raid-0 on 20
dd if=/dev/zero of=/share/zeros.out bs=1M count=10240

= 1.9GB/s which isn't bad, it's about the 100 MB/s I was looking for per disk

the same command

dd if=/dev/zero of=/share/zeros.out bs=1M count=10240

= 950 or so MB/s with 5 vdevs of 4 disks in nested raid-z's

I'm not mad as I know my setup can do what it should do, close to 2GB/s, but just surprised at the write performance hit

Any thoughts?

Jeggs101 · Sep 22, 2013

Wait... did I miss this. what case is that?

I think around 100MB/s/disk is good.

ColdCanuck · Sep 22, 2013

Not a optimal configuration in terms of layout ( non 2^n ), although it might not make much of a difference. Perhaps you could retry the test with a 7 x 3 config. Also I assume you set the pool up with an ashift=12.

So what you are getting is 950 MB/s over 5 x 3 data disks, or about 63 MB/s each. You are also calculating a GB/s of parity. What was the CPU load during this test, were any of the 12 cores pegged ?

You are using ZFS on Linux on the Ubuntu VM with the HBAs passed through ESXI ? If so make sure the disk I/O scheduler is set to noop. It would be on a physical setup but I don't know about a VM.

Finally the zfs-discuss mailing list is another good place for advice.

gigatexal · Sep 22, 2013

@jegs: lian li pc-d8000, and i was getting 100MB/s a disk with raid-0. You didn't miss this, I have been MIA for a bit with a new job and the like. It's good to be back on the forums.

@coldcanuck, thanks for the zfs-discuss mailing list but I trust you guys, there are data wizards on these forums, much more knowledgable than myself. But I'll start hitting them up.

noop scheduler, I'll have to look into that.

Here's the layout:

5 raidz vdev's each containing 4 3TB disks for a total of 20, don't have 21. For the whole array, 20 disks, 5 vdevs, software raid-50 etc, 950 MB/s.

I'll look into the cores being pegged.

update:

currently with a 1.1TB write using dd I'm seeing all 4 virtual cores being pegged at only 1.2Ghz out of the possible 2.19 Ghz each thread should have.

Here's the output of zpool iostat 1

share 438G 53.9T 0 9.36K 0 1.14G
share 440G 53.9T 0 5.96K 0 752M
share 442G 53.9T 31 5.67K 67.4K 701M
share 442G 53.9T 1 7.94K 1022 987M
share 443G 53.9T 29 3.87K 66.4K 488M
share 443G 53.9T 1 5.74K 1022 705M
share 445G 53.9T 29 6.02K 66.4K 760M
share 446G 53.9T 1 6.56K 1022 813M
share 446G 53.9T 1 6.35K 1022 787M
share 448G 53.9T 29 7.36K 66.4K 927M
share 450G 53.9T 31 5.71K 67.4K 701M
share 451G 53.9T 31 6.40K 67.4K 792M
share 451G 53.9T 1 8.72K 1022 1.05G
share 453G 53.9T 0 7.02K 0 885M
share 454G 53.9T 1 7.09K 1022 879M
share 456G 53.9T 1 6.56K 1022 813M
share 456G 53.9T 1 7.42K 1022 917M
share 457G 53.9T 0 9.06K 0 1.12G
share 459G 53.9T 31 6.16K 67.4K 762M
share 461G 53.9T 31 6.01K 67.4K 743M
share 461G 53.9T 1 7.95K 1022 988M
share 462G 53.9T 0 9.08K 0 1.12G
share 464G 53.9T 1 6.72K 1022 833M
share 466G 53.9T 31 5.66K 67.4K 699M
share 466G 53.9T 3 8.57K 7.49K 1.03G
share 467G 53.9T 29 4.92K 66.4K 620M
share 469G 53.9T 31 6.71K 67.4K 827M
share 469G 53.9T 3 8.58K 18.0K 1.04G
share 471G 53.9T 29 7.16K 66.4K 900M
share 472G 53.9T 31 6.08K 67.4K 752M
share 472G 53.9T 1 5.97K 1022 738M
share 474G 53.9T 29 9.40K 66.4K 1.16G
share 476G 53.9T 33 6.11K 73.9K 758M
share 477G 53.9T 1 5.34K 1022 659M
share 477G 53.9T 1 7.62K 1022 945M
share 479G 53.9T 29 8.17K 66.4K 1.01G
share 480G 53.9T 31 4.79K 67.4K 590M
share 480G 53.9T 3 5.85K 7.49K 723M
share 482G 53.9T 29 9.30K 66.4K 1.14G

The disks were not set up with ashift=12, I'll have to recreate that.

apnar · Sep 22, 2013

When you recreate to fix ashift you may want to try 4 five disk vdevs instead of 5 four disk vdevs. That way your data stripes are power of two (4 disks plus 1 for parity).

gigatexal · Sep 22, 2013

ok, let me try that, I'll net more usable space anyway.

gigatexal · Sep 22, 2013

now recreated with ashift=12 and 4 vdevs of 5-disks seeing still seeing speeds bursting from 1.2GB/s to lows of about 665 MB/s - 800MB/s.

update:
noop scheduler makes the lows much better. I'm seeing a tighter range from 800 - 1.1GB/s. I'm thinking it might be a esxi issue because the vm doesn't peg the chips that much, only using 1.2ghz

apnar · Sep 23, 2013

Since it seems like you're still doing lots of testing I'd test on bare metal just to confirm if it's ESXi or not. I'd also test out some other ZFS distros like Solaris and/or OmniOS both on bare metal and through ESXi.

gigatexal · Sep 23, 2013

I can't really test bare metal. I did Do some testing on bare metal freebsd and got the same speeds with nested raids. I mean I'll try different zfs distributions and it'll be cumbersome but I'll see about doing some bare metal testing.

gigatexal · Sep 23, 2013

What are others seeing with raid 50 and zfs?

gigatexal · Sep 23, 2013

Are my speeds even close to normal for nested raidz?

dba · Sep 23, 2013

gigatexal said:
now recreated with ashift=12 and 4 vdevs of 5-disks seeing still seeing speeds bursting from 1.2GB/s to lows of about 665 MB/s - 800MB/s.

update:
noop scheduler makes the lows much better. I'm seeing a tighter range from 800 - 1.1GB/s. I'm thinking it might be a esxi issue because the vm doesn't peg the chips that much, only using 1.2ghz

Based on my limited experience with ZFS, you are already running at or a bit above what I'd expect for your setup. Nice work - I would be happy with your current large-file read and write results.

Now did you mention VMWare? Unless you add some ZIL and probably L2ARC devices, you are going to be really disappointed with your speed when it comes to running VMs. Have you tried benchmarking your 4kb reads and writes? How many IOPS are you seeing?

gigatexal · Sep 23, 2013

I'm just going to mirror some SSDs and give them to esxi for the other VMs

gigatexal · Sep 23, 2013

The biggest culprit is samba being so awfully slow (likely also zfs) with writing lots of small files

gigatexal · Sep 23, 2013

Arc and logs don't Do anything for random writes and reads I thought

dba · Sep 23, 2013

gigatexal said:
Arc and logs don't Do anything for random writes and reads I thought

A ZIL will give you a nice improvement in random write performance. An SSD ZIL will give you an enormous improvement. With an SSD ZIL, I found 10x better IOPS for single-threaded small block random writes and much higher than 10x when thesting multiple write threads.

From my notes:
http://constantin.glez.de/blog/2010/06/closer-look-zfs-vdevs-and-performance#raidz

Here is a user who unpacked a .tar file containing 40K small files, generating a low queue depth random write workload. He tested once with no ZIL and a second time with an SSD ZIL and the screenshots show the improvement:
http://dtrace.org/blogs/brendan/2009/06/26/slog-screenshots/

gigatexal · Sep 23, 2013

Hmm I'll Def look into adding a zil then

ColdCanuck · Sep 23, 2013

A ZIL ( actually an SLOG ) will ***only*** help with synchronous writes, it will do nothing for async reads or writes. Is your workload full of synchronous writes (think nfs and rsync)

The 2^n thing is much overrated in my testing, your results may vary. The difference between a 5 x 4 and 4 x 5 vdev will manifest itself in the the number of independent IOPs which can be done. 5 vdevs have a 25% edge over 4 vdevs **** in certain workloads *** ; if all you are doing is single stream sequential IO you not see much of a difference.. Test your use case..

I would give the VM more cores, at least temporarily, to make sure you are not CPU bound.

L2ARC only helps if your working set can fit in it, or if you are doing lots of file opening, closing and lookup, in which case make the L2ARC metadata only.

Confused yet

If you could describe the primary use of the box I might be able to give some less vague guidance. PM me if you don't wish to put details in a public forum..

dba · Sep 23, 2013

ColdCanuck said:
A ZIL ( actually an SLOG ) will ***only*** help with synchronous writes, it will do nothing for async reads or writes. Is your workload full of synchronous writes (think nfs and rsync)...

That is my understanding as well. The explanation that I have in my head is that ZFS can treat a portion of RAM like a ZIL for asynch writes, which makes them pretty fast. Of course synchronous writes were the performance problem in the first place, so the separate SSD Zil is accelerating the writes that we really need to be faster.

gigatexal · Sep 23, 2013

A folder containing large isos and some a bunch of smaller files of various sizes being sent to a samba share on the zfs box is a synchronous write?

40TB build, ZFS Raid-50 = 50% perf of Raid-0

I'm here to learn

Well-Known Member

Member

I'm here to learn

Member

I'm here to learn

I'm here to learn

Member

I'm here to learn

I'm here to learn

I'm here to learn

Moderator

I'm here to learn

I'm here to learn

I'm here to learn

Moderator

I'm here to learn

Member

Moderator

I'm here to learn