ZFS: zero reads on 2 disks of 6-drive RAIDZ2: because of power-of-two data disks?


TheBloke

Active Member
Feb 23, 2017
Brighton, UK
Hey all

I'm in the process of building a new pool on my home Solaris 11.3 server. The final pool will be 16 x 2TB drives, configured as 2 x 8-drive RAIDZ2 for a total of 12 data drives and 4 parity.

One complication is that I have a mixture of 4k and 512-byte sector drives, so I've been experimenting with different ashift settings, benchmarking and monitoring various configs.

When doing so, I noticed something I thought was really weird: when I created a 12-drive pool as 2 x 6-drive RAIDZ2, then did a bonnie++ rewrite test - or just copied data within the pool - I always got four drives (two per VDEV) with 0 reads.

I did further tests using a single VDEV to simplify things, and here's a quick example. First, a 7-drive RAIDZ2 during bonnie's "Rewrite" test:
Code:
                 extended device statistics
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
sd22    409.6   38.6 46271.7 34548.0  0.0  0.7    1.5   0  32
sd24    359.2   37.0 46738.0 34547.9  0.0  2.3    5.7   0  98
sd25    361.2   38.4 46531.6 34549.5  0.0  2.1    5.2   0  89
sd26    368.4   38.0 46377.2 34550.3  0.0  1.9    4.6   0  85
sd28    360.2   38.2 46188.4 34549.5  0.0  1.9    4.7   0  88
sd30    355.6   38.4 46422.0 34550.3  0.0  1.9    4.7   0  88
sd35    376.0   38.0 46241.2 34548.7  0.0  2.1    5.0   0  83

That all looks normal to me. Then, a 6 drive RAIDZ2 in the same test:
Code:
                 extended device statistics
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
sd22    109.8   95.4 52777.6 92636.0  0.0  1.3    6.1   0  63
sd24      0.0   62.0    0.0 59208.8  0.0  1.1   17.0   0  53
sd25      0.2   64.0    1.6 61256.8  0.0  1.1   16.6   0  53
sd26     70.6   62.4 52985.6 61436.0  0.0  1.9   14.6   0  94
sd28     64.8   68.8 54267.2 64917.6  0.0  2.0   14.7   0  97
sd35     88.2   65.4 53001.6 62812.0  0.0  1.7   11.1   0  84

The difference is clear - with 6 drives, there are always two drives reading 0 or close to 0. I see the same during read-only tests: reads on only four disks, not six.

At first this really confused me, and I thought it might be some kind of problem. But in my other thread about ashift values, @gea mentioned that there is a benefit to VDEVs containing a power-of-two number of data drives - which is what I have here with my 6-drive RAIDZ2s and their 4 data disks.

So, is that the difference? If so - wow, this seems really significant. At first I thought it meant a big performance hit, because I am only using 4 spindles for reading instead of 6. But maybe it does this simply because it doesn't need to read from those spindles, thereby presumably increasing overall performance - or at least reducing wear on the drives?

But it's hard to compare performance, because a 7-drive RAIDZ2 has an extra spindle over a 6-drive one, which makes it faster overall. I did do that bonnie++ comparison, and the 7-drive test is certainly faster for reading and re-reading because of the extra spindle.

Although if I work out the per-data-disk performance (overall read speed divided by number of data disks), it does seem like I get more performance per disk from the 6-drive than from the 7-drive - it makes better use of its drives, just not so much better that it can outperform having an extra drive.
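
To show what I mean by that per-data-disk comparison, here's a rough sketch - the throughput figures below are made up purely for illustration, not my actual bonnie++ numbers:
Code:
# Hypothetical per-data-disk comparison (made-up MB/s figures,
# not my real bonnie++ results).
def per_data_disk(read_mb_s, total_disks, parity_disks=2):
    data_disks = total_disks - parity_disks
    return read_mb_s / data_disks

# e.g. a 7-drive RAIDZ2 reading at 500 MB/s vs a 6-drive at 440 MB/s:
print(per_data_disk(500, 7))   # 100.0 MB/s per data disk (5 data disks)
print(per_data_disk(440, 6))   # 110.0 MB/s per data disk (4 data disks)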

So overall I don't know - it doesn't seem worth deliberately using fewer disks, because performance will definitely suffer with fewer spindles. But if I had planned to use one less than a power of two of data drives, maybe that should encourage me to add an extra disk where possible, because I'd get more than the usual amount of extra perf/value out of it?

In my case I had planned to have 16 disks total in 2 x 8 drive RAIDZ2, so I'd have to add four more drives total and make 2 x 10 drive = 20 drives; not sure I want to spend that much extra. But it is good to think about!

Anyway, I'd be grateful if anyone could confirm that I'm understanding this correctly!
 

gea

Well-Known Member
Dec 31, 2010
DE
ZFS writes data blocks in powers of 2 (32k, 64k, 128k etc). If the number of data disks is also a power of 2, such a data block can be distributed over all disks with no remainder after the division. If there is a remainder, it must still be written across the disks in whole 4k physical disk blocks. This can end in reduced overall capacity (up to 10%). You can reduce the effect with a higher ZFS blocksize (ex. 512k or 1M instead of the default 128k).

On a RAIDZ2, two of the disks hold only redundancy. While on writes you must write data + redundancy across all disks, on reads you only need to read the data, not the redundancy - that is only needed in case of a disk failure or a checksum error.
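
As a rough sketch of that division (a simplified model only, not exactly what the RAIDZ allocator does internally; it assumes 4k physical sectors and the default 128k recordsize):
Code:
# Simplified model: splitting one ZFS record across the data disks of a
# RAIDZ2 vdev, assuming 4k physical sectors and a 128k recordsize.
# This is just the arithmetic, not the real RAIDZ allocator.
RECORD_KIB = 128
SECTOR_KIB = 4                               # ashift=12

def split(record_kib, data_disks):
    sectors = record_kib // SECTOR_KIB       # 128k record = 32 x 4k sectors
    per_disk, rest = divmod(sectors, data_disks)
    return per_disk, rest

print(split(RECORD_KIB, 4))  # (8, 0): 32k per disk, no rest  (6-drive RAIDZ2)
print(split(RECORD_KIB, 5))  # (6, 2): 2 leftover 4k sectors  (7-drive RAIDZ2)

# On reads only the data needs to be read, not the redundancy, so per
# record the two parity columns can be skipped entirely.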
 

TheBloke

Active Member
Feb 23, 2017
Brighton, UK
Thanks a lot @gea, that's great to know.

As I just posted in my other thread, I have decided to upgrade to a 20-disk pool, 2 x 10-drive RAIDZ2. I wanted a bit more capacity to compensate for going ashift=12, and the extra perf/reduced wear associated with power-of-two data disks is appealing.

Plus then I'll know I have so much space I don't even have to think about upgrading for many, many years :)

Thanks again for all your help.
 

TheBloke

Active Member
Feb 23, 2017
Brighton, UK
I've now created my 20-drive pool, 2 x 10-drive RAIDZ2 (ashift=12).

It's working nicely, and performance is pretty good. But I am no longer seeing the same read pattern I originally described in this thread. Now when I copy data from one part of the pool to another, or run a bonnie++ test, I see reads across all disks, as seen in this iostat -xMen output:
Code:
                            extended device statistics       ---- errors ---
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
   97.0   83.8   21.0    7.3  0.2  0.7    1.2    3.9  13  38   0   0   0   0 c3t0d0
   92.0   81.6   21.0    6.1  0.2  0.7    1.2    4.1  11  38   0   0   0   0 c3t1d0
   95.2  219.8   20.7    6.1  0.2  0.7    0.7    2.1  10  38   0   0   0   0 c3t2d0
   93.8  206.4   21.0    5.4  0.5  0.4    1.6    1.2  30  36   0   0   0   0 c3t3d0
   96.8  253.4   21.6    7.9  0.5  0.4    1.3    1.0  29  36   0   0   0   0 c3t4d0
   93.6  281.0   21.6    7.9  0.2  0.7    0.5    1.7   9  36   0   0   0   0 c3t5d0
  118.2  378.0   21.4    7.8  0.0  0.6    0.0    1.3   0  29   0   0   0   0 c7t0d0
  117.8  213.8   21.3    7.9  0.2  0.4    0.5    1.2   8  22   0   0   0   0 c7t1d0
  110.8  210.6   21.3    8.0  0.0  0.6    0.0    1.9   0  25   0   0   0   0 c7t2d0
  112.6  209.4   21.4    7.9  0.4  0.3    1.3    0.8  23  27   0   0   0   0 c8t0d0
  108.0  192.2   21.3    8.0  0.0  0.6    0.0    2.1   0  26   0   0   0   0 c8t2d0
  115.0  229.2   21.2    7.9  0.0  0.6    0.0    1.6   0  22   0   0   0   0 c8t3d0
  155.4  486.6   21.0    7.7  0.0  0.1    0.0    0.2   0   9   0   0   0   0 c0t5000039FF3C36DCBd0
  121.8  187.8   21.3    7.9  0.0  0.5    0.0    1.6   0  19   0   0   0   0 c0t5000CCA222C2F61Bd0
  121.2  200.8   21.2    7.9  0.0  0.5    0.0    1.6   0  19   0   0   0   0 c0t5000CCA222C16AC0d0
   93.0  195.6   21.2    7.3  0.0  0.8    0.0    2.8   0  36   0   0   0   0 c0t50024E92041C99A5d0
  164.2  483.8   21.1    7.7  0.0  0.2    0.0    0.2   0   9   0   0   0   0 c0t5000039FE5C6D545d0
  156.4  491.0   21.0    7.7  0.0  0.2    0.0    0.2   0  10   0   0   0   0 c0t5000039FE5C6D33Dd0
  166.6  480.8   21.5    7.7  0.0  0.2    0.0    0.3   0  10   0   0   0   0 c0t5000039FE5C6D60Dd0
  151.8  490.2   21.0    7.7  0.0  0.1    0.0    0.2   0   9   0   0   0   0 c0t5000039FE5C6D530d0

I thought that having two 10-drive RAIDZ2s would exhibit the same read pattern as 2 x 6-drive, because they have 8 data drives each, which is still a power of two. Have I misunderstood something, or is something else different here?

The only difference I can think of, besides the larger number of disks, is that I am now also using my Marvell 88SE9215 controllers, which I wasn't in my 2 x 6-drive tests. They're not great controllers (and I've already hit one bug with them), but I can't imagine they could affect this? Either way, I hope to stop using them soon, once I've got a second LSI SAS2008.

It's no big deal, just interested to know what's different compared to the 2 x 6 drive RAIDZ2 tests I did.
 