A few questions


lostmind

Member
Jan 5, 2013
36
9
8
Hey All,

I've been doing a ton of research and am trying to ensure I have things straight...

Is it safe to disable sync? My understanding is that with sync off, writes aren't always committed to disk, and in the case of a power loss or something, the data could be lost or corrupted. With redundant PSUs spread across two circuits, each backed by different generators, this may not be a real concern... but I'm curious.

If sync is enabled and one has a RAM-based SSD device for the ZIL log (DDRdrive, STEC ZeusRAM, etc.), with an all-SSD pool, would the ZIL device help prolong the lifespan/reduce write wear on the pool SSDs? My understanding is that sync writes are written to the ZIL log device and then eventually flushed to the underlying pool. I'm assuming this would be a nice big sequential write that would probably be easy for the SSDs in the pool to handle without any really ugly write amplification. Does that make sense? Am I off base here?

If true, I'm still rather worried about SSD longevity in the pool - I suppose one could monitor this via some script on OpenIndiana/OmniOS that would regularly read SMART data and shoot off an email/log something... I do this on Linux boxes; I'm not very familiar with Solaris, but I figure I could find a way to hack it together.
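Something like this rough sketch is what I have in mind - assuming smartctl (smartmontools) and a working mailx exist on the box; the device path, attribute names and threshold are just placeholders I'd have to adjust per drive model:

Code:
#!/bin/sh
# Rough SMART wear check - would run from cron, e.g. daily.
# DISK and the attribute names are placeholders; adjust per SSD model/controller.
DISK=/dev/rdsk/c1t0d0
WEAR=$(smartctl -A "$DISK" | awk '/Media_Wearout_Indicator|Wear_Leveling_Count|Percent_Lifetime/ {print $4; exit}')

# Alert once the normalized value drops below an arbitrary threshold.
if [ -n "$WEAR" ] && [ "$WEAR" -lt 20 ]; then
    echo "SSD $DISK wear indicator down to $WEAR" | mailx -s "SSD wear warning" admin@example.com
fi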

With an all-SSD pool, would choosing raidz-whatever cause more write workload than simply mirrored vdevs? Due to writing the checksum data to all devices in the raidz?

Trying to figure out if I need to spend $36k on Intel S3700 SSDs or if I can get away with spending less on a new SAN... :)

Thanks!
 

PigLover

Moderator
Jan 26, 2011
3,186
1,545
113
ZFS uses a copy-on-write model that effectively guarantees the integrity of the file system itself, whether you use sync mode or not.

However, with sync disabled you remain at risk that any writes still sitting in memory or disk caches - or not yet stable on the ZIL - might not survive an unplanned failure.

Together, this means your file system itself isn't at risk if you turn off sync (the way ext3 or NTFS might be). But you could lose your last few moments of written data.

If you are using ZFS for discrete file-based transactions this might not be a big deal. But if you are back-ending a SQL database or running VM images you might not be able to recover cleanly - losing the last writes to your active image may leave the VM or your DB unstable on restart, etc.

So... the answers to your questions really depend a lot on the nature of your application. And since you didn't describe that application, it's hard to answer your question intelligently.
 

lostmind

Member
Jan 5, 2013
36
9
8
Hey PigLover!

Thanks for the fast feedback.

Usage is for a small cloud hosting project, mostly your average web hosting stuff: WordPress, Drupal, Magento, etc. The cloud will be CentOS with a mix of Xen/KVM, likely.
 

Jeggs101

Well-Known Member
Dec 29, 2010
1,529
241
63
Hey PigLover!

Thanks for the fast feedback.

Usage is for a small cloud hosting project, mostly your average web hosting stuff: WordPress, Drupal, Magento, etc. The cloud will be CentOS with a mix of Xen/KVM, likely.
$36K of S3700s is not too "small", especially since I am guessing there is a lot more disk behind that cache layer.

Is CentOS going to be the bare metal OS?
 

lostmind

Member
Jan 5, 2013
36
9
8
Hey Jeggs,

Well, small in terms of SAN size. The $36k of S3700s is really, what, 7TB of usable storage? Not really planning on L2ARC or even a ZIL log if we go the S3700 route... 7TB of usable storage is not going to power a lot of VMs. 200-300? But that's the plan - 200 or so really high-powered VMs, fully managed setups for our more demanding clients.

We also definitely do not need all the performance that the S3700s bring either. But I don't want to have 80 spinners to power this small cloud.

It's all still very much in the planning stages. Likely all 10Gbps Ethernet, smaller HVs - just 32GB RAM and an E3-12xx v3... possibly MicroCloud chassis. Far fewer VMs per box - 4-6?
 

gea

Well-Known Member
Dec 31, 2010
3,161
1,195
113
DE
Hey All,

I've been doing a ton of research and am trying to ensure I have things straight...

Is it safe to disable sync? My understanding is that with sync off, writes aren't always committed to disk, and in the case of a power loss or something, the data could be lost or corrupted. With redundant PSUs spread across two circuits, each backed by different generators, this may not be a real concern... but I'm curious.

If sync is enabled and one has a RAM-based SSD device for the ZIL log (DDRdrive, STEC ZeusRAM, etc.), with an all-SSD pool, would the ZIL device help prolong the lifespan/reduce write wear on the pool SSDs? My understanding is that sync writes are written to the ZIL log device and then eventually flushed to the underlying pool. I'm assuming this would be a nice big sequential write that would probably be easy for the SSDs in the pool to handle without any really ugly write amplification. Does that make sense? Am I off base here?

If true, I'm still rather worried about SSD longevity in the pool - I suppose one could monitor this via some script on OpenIndiana/OmniOS that would regularly read SMART data and shoot off an email/log something... I do this on Linux boxes; I'm not very familiar with Solaris, but I figure I could find a way to hack it together.

With an all-SSD pool, would choosing raidz-whatever cause more write workload than simply mirrored vdevs? Due to writing the checksum data to all devices in the raidz?


Thanks!
Sync on = secure write.
I would not disable it with databases or VMs, except for home, lab or development use with a UPS and regular backups.

Background:
If sync is disabled, all small writes are collected in RAM for 5s and written as one large sequential write.
This is one of the reasons ZFS can be so fast.

If you enable sync, every write is logged to a ZIL (on-pool, or better, a dedicated ZIL device).
This gives data security, but write performance on small writes can go down to 10 or 20% of non-sync values
without a very good ZIL device.

see http://napp-it.org/doc/manuals/benchmarks.pdf
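For reference, the relevant knobs look like this (only a sketch - pool, filesystem and device names are placeholders for your own setup):

Code:
# sync behaviour is a per-filesystem ZFS property
zfs get sync tank/vmstore            # standard (default), always or disabled
zfs set sync=always tank/vmstore     # force sync for every write
zfs set sync=disabled tank/vmstore   # async only - fast, but the last ~5s can be lost

# add a dedicated ZIL (slog) device to the pool
zpool add tank log c4t2d0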


SSD only pools and a dedicated ZIL
Non-enterprise SSDs are quite good compared to spindles, but:
- write cycles are limited (but not a problem within 3-5 years on modern MLC SSDs)
- on small writes, quite large blocks must be erased/rewritten, with the effect that they are slow with small writes
(immediately or after some time of usage)

To overcome these weaknesses, a RAM-based ZIL (like a ZeusRAM) or an SSD ZIL that is optimized for writes, like
an Intel S3700, can improve overall performance and reliability.

You may also use the Intel S3500. They are cheaper and perfect for a more read-oriented workload,
combined with an S3700 ZIL device. http://www.tomshardware.com/reviews/ssd-dc-s3500-review-6gbps,3529.html


Checksums are always written as part of the data. It does not matter if it is a mirror or RAID-Z.


My personal preferences:
- I would not build one big server that is not allowed to fail,
but a couple of smaller boxes in the range of 32-64GB RAM and up to 10 VMs

- I use All-In-Ones (ESXi with an integrated multiprotocol Solarish SAN/NAS) where all storage traffic is internal at high speed and independent from external hardware. I have prepared a ready-to-use ZFS appliance for this.

To move VMs, you can either use ESXi (commercial) with vMotion/Storage vMotion, or with ESXi free
simply copy VMs via SMB or NFS, or move a pool physically to another box (needs enough empty disk bays).
If you use 10GbE, a storage move is fast.

You can virtualize any OS on ESXi with optimized drivers.
You can also try the AMPO stack on OmniOS (newest Apache, MySQL, phpMyAdmin, ownCloud)
http://forums.servethehome.com/solaris-nexenta-openindiana-napp/2357-owncloud-omnios.html

For this I am currently writing some menus to manage Apache vhosts with include and module management,
as well as Rsync and iSCSI as ZFS filesystem properties (share definitions are stored on ZFS), just like with SMB and NFS.
 

yu130960

Member
Sep 4, 2013
127
10
18
Canada
I am using the all-in-one ESXi and an OmniOS VM with napp-it. I have 3 M1015s passed through to the OmniOS VM.

I am trying out the Crucial M500 240GB drive as a ZIL (passed through to the VM). I couldn't get the S3700 as it was backordered in my area, and I got the M500 for a good price. The M500 does have power-loss protection with caps, and historically Crucial has far exceeded their endurance specs.

What is the best way to keep track of the endurance with the ZIL on OmniOS?

Don't mean to hijack the thread, just thought I would add some alternatives to the S3700 that may work.
 

lostmind

Member
Jan 5, 2013
36
9
8
Sync on = secure write.
I would not disable it with databases or VMs, except for home, lab or development use with a UPS and regular backups.

Background:
If sync is disabled, all small writes are collected in RAM for 5s and written as one large sequential write.
This is one of the reasons ZFS can be so fast.

If you enable sync, every write is logged to a ZIL (on-pool, or better, a dedicated ZIL device).
This gives data security, but write performance on small writes can go down to 10 or 20% of non-sync values
without a very good ZIL device.

see http://napp-it.org/doc/manuals/benchmarks.pdf
Yes, that's exactly how I had it in my head - thanks for confirming!

SSD only pools and a dedicated ZIL
Non-enterprise SSDs are quite good compared to spindles, but:
- write cycles are limited (but not a problem within 3-5 years on modern MLC SSDs)
- on small writes, quite large blocks must be erased/rewritten, with the effect that they are slow with small writes
(immediately or after some time of usage)

To overcome these weaknesses, a RAM-based ZIL (like a ZeusRAM) or an SSD ZIL that is optimized for writes, like
an Intel S3700, can improve overall performance and reliability.

You may also use the Intel S3500. They are cheaper and perfect for a more read-oriented workload,
combined with an S3700 ZIL device. http://www.tomshardware.com/reviews/ssd-dc-s3500-review-6gbps,3529.html
Write cycles on non-enterprise SSDs are indeed a serious problem. We manage several hundred physical servers and nearly half contain an SSD. Older SSDs (talking Intel 320 and such) last forever. Newer SSDs? Let's see: I've seen a 256GB OCZ Vertex that was short-stroked by 40% die with 50TB of host writes. A 256GB Samsung 840 Pro short-stroked by 25% died at 70TB host writes. An Intel 520 240GB short-stroked to 200GB is showing an MWI of 0 yet still running at 80TB host writes... which is cool but scary.

These figures are pretty low (I mean, the number of TB written is low), and in a SAN environment it really is easy to blow through a few TB written per day. All it takes is one VM doing something... dumb.

We have some thoughts on how to mitigate this, such as running a dedicated swap-only SAN, fully managed VMs so we can properly tweak system variables... maybe even a dedicated MySQL/logging SAN with higher-endurance drives...

But the endurance thing is no joke in a server or SAN environment. I really don't want to be replacing every SSD in the SAN every 3 months... Every 3 years I'd be happy with.

This is actually half the problem though: the consumer-grade stuff is so low endurance it's scary, but the enterprise stuff is so freaking high endurance it's too expensive. While Intel claims the S3500 is an enterprise drive, its endurance is still a little too low - 450TB written for the 800GB, $1150 drive? The Seagate 600 Pro 400GB drive has 1080TB written for $650. Then you have the S3700 800GB drive, which has something like 15PB?

I don't understand why there is such a giant disparity; drive makers have to hear this complaint often. Wait, I do understand the difference in the NAND - I just mean I don't understand why drive makers haven't worked out a way to offer some middle ground.

Anyways, it is good to know that my thought about a RAM-based ZIL helping with the write endurance of the SSDs in the pool actually is somewhat true.

Checksums are always written as part of the data. It does not matter if it is a mirror or RAID-Z.
Hmm, what would cause more writes to the SSDs in the pool - let's say 10 SSDs - a bunch of mirrored vdevs or a pair of raidz vdevs?

My thoughts were (and this is where my understanding is really weak) that the raidz vdevs would have more overall writes to the underlying SSDs for the same data written than the mirrored vdevs would.

Let's say a single 4kb write comes in; on the mirrored vdev setup, the write hits the two drives in whichever mirrored vdev ZFS decides to write to?

In a 5-disk raidz, the same 4kb write has to hit every drive in the raidz? So 5 disks get hit with it?

Seems like the raidz would thus cause more write wear than mirrored vdevs?

I'd love to be completely incorrect here, so please let me know if I am!

My personal preferences:
- I would not build one big server that is not allowed to fail,
but a couple of smaller boxes in the range of 32-64GB RAM and up to 10 VMs
This is actually the goal to a degree, though more like 100 VMs per SAN. While I wish I could simply build out a dozen E3-1200 boxes with 32GB RAM and just 3 or 4 SSDs in the pool, the networking gear (10Gbps) starts to be the pain point due to cost per port.

I suppose one could perhaps use something like the Dell 6248s and pop a bunch of 10Gbps ports in the back, then use 10Gbps for the SANs and 1Gbps for the HVs. Would be much more affordable, though the cabling of the HVs would be a bit messy.

- I use All-In-Ones (ESXi with an integrated multiprotocol Solarish SAN/NAS) where all storage traffic is internal at high speed and independent from external hardware. I have prepared a ready-to-use ZFS appliance for this.

To move VMs, you can either use ESXi (commercial) with vMotion/Storage vMotion, or with ESXi free
simply copy VMs via SMB or NFS, or move a pool physically to another box (needs enough empty disk bays).
If you use 10GbE, a storage move is fast.

You can virtualize any OS on ESXi with optimized drivers.
You can also try the AMPO stack on OmniOS (newest Apache, MySQL, phpMyAdmin, ownCloud)
http://forums.servethehome.com/solaris-nexenta-openindiana-napp/2357-owncloud-omnios.html

For this I am currently writing some menus to manage Apache vhosts with include and module management,
as well as Rsync and iSCSI as ZFS filesystem properties (share definitions are stored on ZFS), just like with SMB and NFS.
Well, VMware is a problem for us. We have little experience using it (playing around with the free version for fun, that's it). When we try to figure out pricing for VMware, it quickly starts getting into the hundreds of thousands of dollars for our needs. I mean, a little project like this would seemingly cost over $100k upfront and $25k/yr with just 10 HVs, as we apparently need cloudsuite enterprise deluxe super turbo or something. So it really doesn't seem feasible for a small biz like us to deploy VMware in a commercial/production setting.

We'd also have to redevelop all our tools and rebuild everything. Granted, VMware has a great API that's seemingly well documented, but it's a ton of work with a large learning curve.

And while I love the idea of storage local to the HV, if the HV dies then your VMs are offline. Copying the VMs to another HV when the original HV is offline isn't going to work well? If you had the HVs all syncing all the time or something like that, I suppose it could work...

Love the feedback, really appreciate the discussion!
 

lostmind

Member
Jan 5, 2013
36
9
8
I am using the all-in-one ESXi and an OmniOS VM with napp-it. I have 3 M1015s passed through to the OmniOS VM.

I am trying out the Crucial M500 240GB drive as a ZIL (passed through to the VM). I couldn't get the S3700 as it was backordered in my area, and I got the M500 for a good price. The M500 does have power-loss protection with caps, and historically Crucial has far exceeded their endurance specs.

What is the best way to keep track of the endurance with the ZIL on OmniOS?

Don't mean to hijack the thread, just thought I would add some alternatives to the S3700 that may work.
Tracking endurance is done via SMART info. Most SSD manufacturers put a field in there with a media wear indicator.

The Crucial drives are great, but at 70TB written for a 960GB drive... they won't last long on a busy SAN. For a personal or small office SAN it's probably fine.
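On OmniOS you can just dump the attribute table with smartctl and look for the wear-related entries - a quick sketch, with the device path as a placeholder and attribute names varying by vendor:

Code:
# List the vendor SMART attributes and pick out anything wear-related.
# Device path is a placeholder; some controllers also need a -d option.
smartctl -A /dev/rdsk/c1t0d0 | egrep -i 'wear|lifetime|media'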
 

yu130960

Member
Sep 4, 2013
127
10
18
Canada
I looked at the SMART data but couldn't find a wear indicator. Am I just missing it? Do I have to run a short or long test on it?

Thanks for the great info, and I guess I should have grabbed the Intel 320 160GB for $149 rather than the M500 240GB for around the same price.

Code:
=== START OF INFORMATION SECTION ===
Device Model:     Crucial_CT240M500SSD1
Serial Number:    132409XXXXXXX
LU WWN Device Id: 5 00a075 109402c65
Firmware Version: MU03
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 18 14:14:05 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		( 1115) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  18) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x0035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       1
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       4065
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   076   073   000    Old_age   Always       -       24 (Min/Max 23/27)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       17
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0031   100   100   000    Pre-fail  Offline      -       0
206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
246 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       686107052
247 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       3443997
248 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       47290316

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

lostmind

Member
Jan 5, 2013
36
9
8
Crucial SMART info is a pain to decode. I *think* the M500 uses attribute 202 to track "rated life used". When that counter reaches 0, your drive is out of warranty / has gone over the 72TB limit. But that's just what a quick Google search dug up; I may be wrong.
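If that's right, something like this would pull it out on OmniOS - the device path is a placeholder, and I'm assuming the normalized VALUE column is the counter that counts down:

Code:
# Grab the normalized value of attribute 202 from the M500.
# Device path is a placeholder; adjust for your passthrough setup.
smartctl -A /dev/rdsk/c1t0d0 | awk '$1 == 202 {print "attr 202 (rated life used?):", $4}'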
 

gea

Well-Known Member
Dec 31, 2010
3,161
1,195
113
DE

Hmm, what would cause more writes to the SSDs in the pool - let's say 10 SSDs - a bunch of mirrored vdevs or a pair of raidz vdevs?


I suppose one could perhaps use something like the Dell 6248s and pop a bunch of 10Gbps ports in the back, then use 10Gbps for the SANs and 1Gbps for the HVs. Would be much more affordable, though the cabling of the HVs would be a bit messy.


Well, VMware is a problem for us. We have little experience using it (playing around with the free version for fun, that's it). When we try to figure out pricing for VMware, it quickly starts getting into the hundreds of thousands of dollars for our needs. I mean, a little project like this would seemingly cost over $100k upfront and $25k/yr with just 10 HVs, as we apparently need cloudsuite enterprise deluxe super turbo or something. So it really doesn't seem feasible for a small biz like us to deploy VMware in a commercial/production setting.

We'd also have to redevelop all our tools and rebuild everything. Granted, VMware has a great API that's seemingly well documented, but it's a ton of work with a large learning curve.

And while I love the idea of storage local to the HV, if the HV dies then your VMs are offline. Copying the VMs to another HV when the original HV is offline isn't going to work well? If you had the HVs all syncing all the time or something like that, I suppose it could work...
If you compare a 4-disk RAID-10 with a 4-disk RAID-Z1 (same number of disks) and you write 1 GB of data:
- every disk in the mirrors must write 500 MB
- every disk in the RAID-Z must write about 333 MB

So there are fewer writes on RAID-Z, paired with better sequential performance, but with the I/O of one disk, whereas
the RAID-10 has the read I/O performance of 4 disks and the write I/O of two disks.

As SSDs are very good at I/O, RAID-Z seems the better way to go.
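As a quick back-of-the-envelope check (it ignores metadata, padding and compression), the per-disk write load works out like this:

Code:
# Per-disk writes for 1 GB of data on 4 disks (rough model only).
awk 'BEGIN {
  D = 1000; N = 4                                # data in MB, number of disks
  printf "striped mirrors: %.0f MB per disk\n", 2 * D / N              # everything written twice
  printf "raid-z1:         %.0f MB per disk\n", (D * N / (N - 1)) / N  # data plus one parity column
}'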


About cabling and local storage
With local storage, you do not need redundant high-speed networks (although 10GbE is becoming affordable),
because all VM-SAN traffic is internal, in software and at high speed, without cables - no external dependencies.
If a server goes offline, you can do a lot of HA things (ESXi with a lot of $$).

If you can accept some downtime, you can start the VM from a backup on another box.
Mostly your ZFS pool is intact. I usually move the pool physically to another box, import the pool and the VM,
and restart. No problem when machines and VLAN/NIC settings are identical. This is where you can use ESXi free;
with the newest 5.5, even with unlimited RAM.
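The pool move itself is only an export/import (pool name is a placeholder):

Code:
# on the old box, if it is still reachable:
zpool export tank

# on the new box, after moving the disks over:
zpool import tank      # add -f if the pool was not cleanly exported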
 