Crazy space loss when using 4k block sizes in ZFS for non-power-of-2 vdevs


apnar

Member
Mar 5, 2011
Note: see my post farther down, I no longer think this is accurate.

As with many other folks here, I've generally understood that best practice when setting up ZFS vdevs is to keep the number of data disks a power of 2. It's a pretty consistent recommendation, and people mention slight performance boosts and better space efficiency. I saw a great post over at Hard Forum today describing the math, so I decided to throw the numbers into a spreadsheet for various configs. I was pretty shocked at the results, so I figured I'd post here to see if I'm off base and, if not, to raise awareness of how important the power-of-2 rule is when using 4k blocks (ashift=12).

When talking disk count I'll refer to data disks only, so 8 data disks would mean a 9-disk vdev in raidz1, a 10-disk vdev in raidz2, or an 11-disk vdev in raidz3.

Under 8 data disks there is some loss of space efficiency when you're not aligned to a power of two, but you at least gain usable space as you add disks. Above 8 data disks things get really interesting: a vdev with 8 data disks has exactly the same usable space as one with 9 or 10 data disks. The same goes for 11-15 data disks, and again for 16-31.

So from a space perspective there is no difference between a 10-disk raidz2 and a 12-disk raidz2; both will yield the same free space once created. Or, if you wanted to go with huge vdevs, a 32-disk raidz1 would have exactly the same free space as a 17-disk raidz1. So there really should never be a reason to create a 4k vdev with 9, 10, 12-15, or 17-31 data disks unless you're going for sequential speed.
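
To put rough numbers on that (using the model from the Hard Forum post, which only counts stripe padding): a 128K record is 32 4K sectors, and each record gets padded out to a whole number of stripes across the N data disks, so the usable capacity works out to about 32 / ceil(32 / N) disks' worth of data. With 8, 9, or 10 data disks that's 32 / 4 = 8, which is why they all land on the same plateau; with 11-15 it's 32 / 3 ≈ 10.7, and with 16-31 it's 32 / 2 = 16.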

I haven't tested this out for real yet, but that's how the math worked out. I'm curious if this is in line with what others have seen.
 

Aluminum

Active Member
Sep 7, 2012
I'm not 100% sure, but I think I'm alright in that I love to make Raid 10 equivalents, which end up as a stripe across N mirrored pairs so the base vdev is always 2 data drives.

If not, then I really should split my 12 drive into an 8 and 4 :(
 

apnar

Member
Mar 5, 2011
Aluminum said:
I'm not 100% sure, but I think I'm alright in that I love to make Raid 10 equivalents, which end up as a stripe across N mirrored pairs so the base vdev is always 2 data drives.

If not, then I really should split my 12 drive into an 8 and 4 :(
Assuming you're using 4k drives and have a stripe across 6 mirrors, that'd be equivalent to 6 data disks. With the numbers I ran that'd be 88.89% efficient for space, so you're giving up about 11% of your space vs. setting them up as an 8 and a 4.
 

apnar

Member
Mar 5, 2011
Note: see my post farther down, I no longer think this is accurate.

Here are the efficiency numbers I came up with; again, disk count is data disks only, so add parity on top. Rows with an 'x' in the last column denote no gain in usable space at all from adding that drive with 4k block sizes.

Disks  512b      4k        no gain
 1     100.00%   100.00%
 2     100.00%   100.00%
 3      99.22%    96.97%
 4     100.00%   100.00%
 5      98.46%    91.43%
 6      99.22%    88.89%
 7      98.84%    91.43%
 8     100.00%   100.00%
 9      98.08%    88.89%   x
10      98.46%    80.00%   x
11      96.97%    96.97%
12      96.97%    88.89%   x
13      98.46%    82.05%   x
14      96.24%    76.19%   x
15      94.81%    71.11%   x
16     100.00%   100.00%
17      94.12%    94.12%   x
18      94.81%    88.89%   x
19      96.24%    84.21%   x
20      98.46%    80.00%   x
21      93.77%    76.19%   x
22      96.97%    72.73%   x
23      92.75%    69.57%   x
24      96.97%    66.67%   x
25      93.09%    64.00%   x
26      98.46%    61.54%   x
27      94.81%    59.26%   x
28      91.43%    57.14%   x
29      98.08%    55.17%   x
30      94.81%    53.33%   x
31      91.76%    51.61%   x
32     100.00%   100.00%
33      96.97%    96.97%   x
34      94.12%    94.12%   x



As you can see, with 512b blocks the worst case is only around 91%, but with 4k blocks it can drop to around 52% (not that anyone should be using vdevs that large, but you get the idea).
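
If anyone wants to play with the numbers, here's a rough sketch of the arithmetic behind the table. It's based on my reading of the Hard Forum post: a full 128K record is split into sectors, striped across the data disks, and padded up to a whole number of stripes, and the padding is the only loss counted (parity, metadata, and compression are ignored). The function name is just mine.

Code:
import math

def efficiency(data_disks, sector_bytes, record_bytes=128 * 1024):
    # Sectors per full record: 256 for 512b sectors, 32 for 4k sectors
    sectors = record_bytes // sector_bytes
    # Records are assumed to be padded out to whole stripes across the data disks
    stripes = math.ceil(sectors / data_disks)
    # Efficiency = data sectors / sectors actually consumed
    return sectors / (stripes * data_disks)

for n in range(1, 35):
    print(f"{n:2d}  {efficiency(n, 512):8.2%}  {efficiency(n, 4096):8.2%}")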
 

apnar

Member
Mar 5, 2011
Note: see my post farther down, I no longer think this is accurate.


I whipped up a chart showing how the usable space scales when adding disks in a single vdev:

[chart: usable space vs. number of data disks, 512b vs. 4k blocks]

You can clearly see the sizing plateaus that you hit with 4k.
 

apnar

Member
Mar 5, 2011
Well, I ran some tests using some files as disks and my numbers weren't anywhere near this bad. So either I'm no good at math, or the seemingly logical approach omniscence went over in that thread isn't exactly what's happening. Using twenty 1 GB files I created a bunch of pools of differing sizes with both an ashift of 9 and of 12 (using a hacked binary). Here were the results per total data disk count with ashift of 12:

1 98%
2 100%
3 97%
4 100%
5 96%
6 98%
7 96%
8 100%
9 99%
10 98%
11 97%
12 96%
13 96%
14 95%
15 95%
16 100%
17 100%
18 99%
19 99%

Oddly enough, if I just striped and didn't use at least raidz1, there was no loss at all (I think it makes each one its own vdev in that case). The worst case here is only 95%, but I was expecting to see something closer to 71% in the 15-drive setup.
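
In case anyone wants to repeat the test, this is roughly the procedure I scripted, shown here as a sketch: the pool name and file paths are placeholders, it needs root, and it assumes an OpenZFS build that accepts ashift at pool creation (I had to use a hacked binary instead).

Code:
import subprocess

POOL = "ashifttest"            # placeholder pool name
IMG = "/tmp/zt{}.img"          # placeholder backing-file path

def usable_bytes(data_disks, ashift=12):
    # One extra file for raidz1 parity
    files = [IMG.format(i) for i in range(data_disks + 1)]
    for f in files:
        subprocess.run(["truncate", "-s", "1G", f], check=True)
    # -f because the backing files get reused (and relabeled) between runs
    subprocess.run(["zpool", "create", "-f", "-o", f"ashift={ashift}",
                    POOL, "raidz1"] + files, check=True)
    avail = subprocess.check_output(["zfs", "list", "-Hp", "-o", "available", POOL])
    subprocess.run(["zpool", "destroy", POOL], check=True)
    return int(avail.strip())

for n in range(1, 20):
    print(n, usable_bytes(n))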

In summary, please ignore the thread as I obviously have no idea what's actually going on here :)
 

cactus

Moderator
Jan 25, 2011
apnar said:
Oddly enough, if I just striped and didn't use at least raidz1, there was no loss at all (I think it makes each one its own vdev in that case).
Maybe I am reading this wrong, but this is the expected outcome in this config. This is how one gets RAID0 with ZFS.
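
For what it's worth, that matches plain zpool create behavior: something like "zpool create tank sda sdb sdc" (disk names just examples) adds each disk as its own single-disk top-level vdev and stripes writes across them, so there's no raidz padding involved.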

Link to Hardforum thread?