Sync on = secure write.
I would not disable it with databases or VMs, except for home, lab or development use with a UPS and regular backups.
Background:
If sync is disabled, all small writes are collected in RAM for 5s and written as one large sequential write.
This is one of the reasons ZFS can be so fast.
If you enable sync, every write is logged to a ZIL (on-pool, or better, a dedicated ZIL device).
This gives data security, but without a very good ZIL device, write performance on small writes can drop to 10-20% of the non-sync values.
see
http://napp-it.org/doc/manuals/benchmarks.pdf
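To illustrate (the pool/filesystem names are only placeholders, use your own): sync is a normal per-filesystem ZFS property, so you can check and switch it at any time:

  zfs get sync tank/vm            # current value: standard, always or disabled
  zfs set sync=always tank/vm     # log every write to the ZIL (safest, slowest)
  zfs set sync=disabled tank/vm   # never log, rely only on the RAM write collection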
Yes, that's exactly how I had it in my head - thanks for confirming!
SSD-only pools and a dedicated ZIL
Non-enterprise SSDs are quite good compared to spindles, but:
- write cycles are limited (but not a problem within 3-5 years on modern MLC SSDs)
- on small writes, quite large blocks must be erased and rewritten, with the effect that they become slow on small writes
(immediately, or after some time of use)
To overcome these weaknesses, a RAM-based ZIL (like a ZeusRAM) or an SSD ZIL that is optimized for writes, like
an Intel S3700, can improve overall performance and reliability.
You may also use the Intel S3500. They are cheaper and perfect for a more read-oriented workload,
combined with an S3700 ZIL device.
The SSD DC S3500 Review: Intel's 6 Gb/s Controller And 20 nm NAND - Intel SSD DC S3500: Focusing On Read Performance
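As a sketch (the device names are only examples, yours will differ), adding such an SSD as a dedicated ZIL (slog) is a one-liner, and you can mirror it for safety:

  zpool add tank log c4t1d0                   # single S3700 as dedicated ZIL
  zpool add tank log mirror c4t1d0 c4t2d0     # or a mirrored ZIL pair instead
  zpool status tank                           # the device appears under 'logs'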
Write cycles on non-enterprise SSDs are indeed a serious problem. We manage several hundred physical servers and nearly half contain an SSD. Older SSDs (talking Intel 320 and such) last forever. Newer SSDs? Let's see, I've seen a 256GB OCZ Vertex that was short-stroked by 40% die at 50TB of host writes. A 256GB Samsung 840 Pro short-stroked by 25% die at 70TB host writes. An Intel 520 240GB short-stroked to 200GB showing an MWI of 0 yet still running at 80TB host writes... which is cool but scary.
These figures are pretty low (I mean, the number of TB written is low), and in a SAN environment it really is easy to blow through a few TB written per day. All it takes is one VM doing something... dumb.
We have some thoughts on how to mitigate this, such as running a dedicated swap-only SAN, fully managed VMs so we can properly tweak system variables... maybe even a dedicated MySQL/logging SAN with higher-endurance drives...
But the endurance thing is no joke in a server or SAN environment. I really don't want to be replacing every SSD in the SAN every 3 months... Every 3 years I'd be happy with.
This is actually half the problem though: the consumer-grade stuff is so low endurance it's scary, but the enterprise stuff is so freaking high endurance it's too expensive. While Intel claims the S3500 is an enterprise drive, its endurance is still a little too low: 450TB written for the 800GB, $1150 drive? The Seagate 600 Pro 400GB drive has 1080TB written for $650. Then you have the S3700 800GB drive, which has something like 15PB?
I don't understand why there is such a giant disparity; drive makers have to hear this complaint often. Wait, I understand the difference in the NAND, I just mean I don't understand why drive makers haven't worked out a way to offer some middle ground.
Anyways, it is good to know that my thought that a RAM-based ZIL helps with the write endurance of the SSDs in the pool is actually somewhat true.
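For what it's worth, we keep an eye on the actual wear per drive with smartmontools; the attribute names differ per vendor (the Intel ones below are from memory), but a host-writes counter and a wearout indicator are usually there, and the math on rated endurance is sobering:

  smartctl -A /dev/sda | egrep -i 'wear|writes|lbas'
  # Intel drives report e.g. Media_Wearout_Indicator and Total_LBAs_Written
  # rough lifetime: 450TB rated endurance / 3TB written per day ~ 150 days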
Checksums are always written as part of the data. It does not matter if it is a mirror or RAID-Z.
Hmm, what would cause more writes to the SSDs in the pool, let's say 10 SSDs: a bunch of mirrored vdevs or a pair of raidz vdevs?
My thought was (and this is where my understanding is really weak) that the raidz vdevs would have more overall writes to the underlying SSDs for the same data written than the mirrored vdevs would.
Let's say a single 4kB write comes in; on the mirrored-vdev setup, the write hits the two drives in whichever mirrored vdev ZFS decides to write to?
In a 5-disk raidz, the same 4kB write has to hit every drive in the raidz? So 5 disks get hit with it?
Seems like the raidz would thus cause more write wear than mirrored vdevs?
I'd love to be completely incorrect here, so please let me know if I am!
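I suppose I could also just measure it rather than theorize: build both layouts, run the same workload, and compare the per-disk write columns (the pool name below is just a placeholder):

  zpool iostat -v tank 5    # per-vdev and per-disk read/write ops and bandwidth, refreshed every 5s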
my personal preferences
- I would not build one big server that is not allowed to fail,
but rather a couple of smaller boxes in the range of 32-64GB RAM and up to 10 VMs
This is actually the goal to a degree, though more like 100 VMs per SAN. While I wish I could simply build out a dozen E3-1200 boxes with 32GB RAM and just 3 or 4 SSDs in the pool, the networking gear (10Gbps) starts to be the pain point due to cost per port.
I suppose one could perhaps use something like the Dell 6248s, pop a bunch of 10Gbps ports in the back, and use 10Gbps for the SANs and 1Gbps for the HVs. That would be much more affordable, though the cabling of the HVs would be a bit messy.
- I use All-In-Ones (ESXi with an integrated multiprotocol Solarish SAN/NAS) where all storage traffic is internal at high speed and independent of external hardware. I have prepared a ready-to-use ZFS appliance for this.
To move VMs, you can either use ESXi (commercial) with vMotion/Storage vMotion, or with the free ESXi
simply copy VMs via SMB or NFS, or move a pool physically to another box (needs enough empty disk bays).
If you use 10GbE, a storage move is fast.
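A physical pool move is just an export and import (the pool name is only an example):

  zpool export tank    # on the old box, after the VMs are shut down
  zpool import         # on the new box: list the pools available for import
  zpool import tank    # bring the pool online on the new box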
You can virtualize any OS on ESXi with optimized drivers.
You can also try the AMPO stack on OmniOS (newest Apache, MySQL, phpMyAdmin, OwnCloud)
http://forums.servethehome.com/solaris-nexenta-openindiana-napp/2357-owncloud-omnios.html
For this I am currently writing some menus to manage Apache vhosts with include and module management,
as well as rsync and iSCSI as a ZFS filesystem property (share definitions are stored on ZFS), just like with SMB and NFS.
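This is the same mechanism Solarish already uses for SMB and NFS, where the share is simply a property of the filesystem (the filesystem name is only a placeholder):

  zfs set sharesmb=name=vmstore tank/vmstore   # SMB share, stored as a ZFS property
  zfs set sharenfs=on tank/vmstore             # same idea for NFS
  zfs get sharesmb,sharenfs tank/vmstore       # the share definition travels with the filesystem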
Well, the VMware stuff is a problem for us. We have little experience using it (playing around with the free version for fun, that's it). When we try to figure out pricing for VMware, it quickly starts getting into the hundreds of thousands of dollars for our needs. I mean, a little project like this would seemingly cost over $100k upfront and $25k/yr with just 10 HVs, as we apparently need Cloud Suite Enterprise deluxe super turbo or something. So it really doesn't seem feasible for a small biz like us to deploy VMware in a commercial/production setting.
We'd also have to redevelop all our tools and rebuild everything. Granted, VMware has a great API that's seemingly well documented, but it's a ton of work with a large learning curve.
And while I love the idea of storage local to the HV, if the HV dies then your VMs are offline. Copying the VMs to another HV when the original HV is offline isn't going to work well? If you had the HVs all syncing all the time or something like that, I suppose it could work...
Love the feedback, really appreciate the discussion!