Hey all
I am in the process of upgrading the storage on my home file server, which runs Solaris 11.3 on an Ivy Bridge i5-3550 (H77 chipset) with 32GB of RAM. It uses a mixture of storage controllers: 8 ports from an LSI 9211-8i, 6 ports from the onboard H77 controller, and up to another 8 ports from two 4-port Marvell 88SE9215 controllers in PCIe x1 slots. (I would like to get a second LSI in the future, but I'd need to change my mobo first to make use of it - the H77 chipset only provides 4 lanes to the second PCIe x16 slot.)
I have 17 x 2TB + 1 x 3TB drives available, but my final array will be 16 x 2TB drives configured as 2 x 8-drive RAIDZ2, providing a total of 12 data drives and 4 parity.
My question relates to ashift: specifically, which ashift value I'm best off using for my two VDEVs, given that I unfortunately have a mix of sector sizes among my drives. I haven't been a professional sysadmin for several years and I'm rather out of touch with hardware, so I only really learnt about ashift, and the impact of Advanced Format drives on ZFS, in the last week - after I had already bought a bunch of refurbished drives to expand my original 9 x 2TB RAIDZ2 into the planned 16-drive config. (Ironically, it turned out the drives I already had - purchased new in 2011 - were 4k, but most of the extra drives I bought this week were 512b!)
Of the 18 drives I have available, eight are native 512b-sector drives, and the other ten are 4k drives that present 512b logical sectors ('512e').
However - and this part confuses me - of the ten 4k drives I have, eight result in an ashift of 9 when used in any VDEV, and two result in an ashift of 12. I tested this by creating a single-drive pool for each drive model and checking the ashift with zdb -C.
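The check was along these lines for each model (the device name below is just a placeholder, and the grep simply pulls the relevant line out of the config dump):
Code:
# Scratch pool on a single drive, just to see which ashift Solaris picks for it.
zpool create -f testpool c0t50014EE2B1234567d0
zdb -C testpool | grep ashift
zpool destroy testpool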
The eight giving ashift=9 make sense, as they're 512e drives reporting 512b logical sectors, but I don't quite follow why the other two give ashift=12 given they're also 512e. I'd have expected either all of them to be 12 (if Solaris can work out they're really 4k) or all of them to be 9 (if it can't). But maybe these two drives provide some extra information that the others don't? Or maybe Solaris contains a configuration list of specific drive models to pair with particular ashift values - I have since read that Illumos does something like this with its configurable sd driver?
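For what it's worth, the Illumos mechanism I read about is an sd-config-list entry in /etc/driver/drv/sd.conf that tells the sd driver to report a 4k physical block size for matching drives. The snippet below is only a sketch of that Illumos-style syntax - the vendor/product string is a guess for my Samsungs, and I have no idea whether Solaris 11.3 honours the same tuning:
Code:
# Illumos-style sd.conf override (unverified on Solaris 11.3).
# The vendor field is padded to 8 characters; "ATA     " is typical for SATA disks.
sd-config-list =
    "ATA     SAMSUNG HD204UI", "physical-block-size:4096";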
Of the 16 drives I actually plan to use, only one results in ashift=12. Therefore, if I just went ahead and created my zpool without further thought, I would end up with one VDEV at ashift=9 and the other at ashift=12. I've noticed that this config results in slightly unbalanced data allocation across the two VDEVs (as monitored with zpool iostat -v).
However, I have found that I can manipulate the ashift. Solaris 11 doesn't provide a direct way to set it (unlike most, if not all, of the newer ZFS implementations, from what I've read), but I can force an ashift of 9 by building the pool with a file standing in for the ashift=12 drive, then swapping the real drive in with zpool replace afterwards. Similarly, I can force both VDEVs to ashift=12 by building the pool with my spare drive that reports ashift=12 in place, then zpool replace-ing in the drive I actually want. I've tested both methods and they seem to work OK.
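To illustrate the file trick (device names below are placeholders, and I've made the stub file slightly smaller than a real 2TB drive so that zpool replace will accept the swap):
Code:
# Sparse ~1.8TB file to stand in for the ashift=12 drive during creation.
mkfile -n 1800g /var/tmp/stub0
zpool create -f tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 /var/tmp/stub0 \
    raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0
# Swap the real drive in once the pool exists; the VDEV keeps its ashift of 9.
zpool replace tank /var/tmp/stub0 c1t7d0
zdb -C tank | grep ashift
rm /var/tmp/stub0   # once the resilver completes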
Here's a summary of all the drives I have, including their sector size and the ashift value they result in when used in a pool:
Code:
Qty | Drive               | Sect (int - ext) | Ashift
  7 | Samsung HD204UI     | 4096 - 512e      | 9
  6 | Hitachi HUA72202    | 512  - 512       | 9
  1 | WDC WD2003FYYS      | 512  - 512       | 9
  1 | Toshiba DT01ACA200  | 4096 - 512e      | 12
  1 | Hitachi HDS72202    | 512  - 512       | 9
  1 | WDC WD20EARS-00MV   | 4096 - 512e      | 9
  1 | Seagate ST3000DM003 | 4096 - 512e      | 12
This totals 18 drives; the 16 I want in my final pool are everything except the last two rows. Of those last two, the Seagate ST3000 is a 3TB drive that I will remove when I'm done testing, but it may be useful in the meantime as the second ashift=12 drive I mentioned. The WD20EARS also won't be in the final pool, but it will likely stay attached, configured as a hot spare.
So, after all that blurb, here's my real question: given this less-than-homogeneous mix of drives, what ashift value should I use? Should I go with the defaults Solaris gives me - one VDEV at ashift=9 and one at ashift=12? Or should I manipulate them to be both ashift=12, or both ashift=9?
My strong preference, assuming there's no big problem I'm not aware of, would be ashift=9 for both, because it gives me 1.1TB of extra capacity (21.4TB of usable space versus 20.3TB) and also wastes less space on each allocation. I have found that this allocation difference can occasionally be quite noticeable - I have a few datasets containing hundreds of thousands of files, one of which consumes 40GB on an ashift=9 VDEV but 65GB on an ashift=12 VDEV.
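For anyone curious where that difference comes from, here's my back-of-the-envelope understanding of the RAIDZ allocation maths, based on how the open-source ZFS code sizes RAIDZ allocations (I'm assuming Solaris 11.3 behaves similarly). The raidz_asize helper is just my own illustration, not anything ZFS ships:
Code:
#!/bin/sh
# Estimate how many bytes a RAIDZ VDEV allocates for one logical block.
#   usage: raidz_asize <block_bytes> <ashift> <ndisks> <nparity>
raidz_asize() {
    bytes=$1; ashift=$2; ndisks=$3; nparity=$4
    sect=$((1 << ashift))
    data=$(( (bytes + sect - 1) / sect ))                  # data sectors
    par=$(( nparity * ((data + ndisks - nparity - 1) / (ndisks - nparity)) ))
    tot=$(( data + par ))
    r=$(( nparity + 1 ))
    tot=$(( ((tot + r - 1) / r) * r ))                     # round up to a multiple of nparity+1
    echo $(( tot * sect ))
}
raidz_asize 4096 9  8 2    # 6144  bytes (6 KiB)  for a 4k block at ashift=9
raidz_asize 4096 12 8 2    # 12288 bytes (12 KiB) for a 4k block at ashift=12
If that logic holds, a small 4k block costs roughly twice the space at ashift=12 on my 8-drive RAIDZ2 VDEVs, which lines up with the 40GB versus 65GB figures above.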
That said, my file server is primarily used for large media files, so datasets with huge numbers of small files will be the exception. And while extra space is always nice, 20.3TB should be more than enough for the next few years. In other words, if there's some big gain to be had from ashift=12, the loss of space isn't likely to hurt me much, or any time soon.
I did do some bonnie++ benchmarking, creating a 2 x 6-drive RAIDZ2 pool (12 drives in total) at both ashift=9 and ashift=12. The ashift=9 pool came out 4% faster on sequential writes than the ashift=12 pool, but 1% slower on sequential reads. My feeling is that these small differences are within the margin of error and probably not significant (I have only benchmarked each pool once so far). There was certainly no major difference.
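The runs were along these lines (illustrative parameters rather than an exact record of my invocation; -s at twice RAM keeps the ARC from hiding the disks, and -n 0 skips the small-file tests):
Code:
bonnie++ -d /testpool/bench -s 65536 -n 0 -u root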
Anyway, sorry for the length of all that! Any advice and further info would be much appreciated. My hope is that I can just use ashift=9, but I'd definitely like to hear if there's anything else I should consider (besides spending yet more money to standardise on all-4k drives).
TB