Take a look and help me verify my OpenIndiana ZFS box is ready for "production"?


bp_968

New Member
Dec 23, 2012
I just finished setting up my OpenIndiana VM inside ESXi 4.1U3 (I'll explain that) and wanted some of the ZFS guys to look at the ZFS output (I'll post whatever is needed) and make sure that I'm not missing anything obvious or doing anything stupid before I move data onto it and make it "live".

Spec:
Supermicro X8DTE-F dual-CPU motherboard with 7 PCI-E x8 slots and 12 RAM slots
2x Xeon L5520 2.4GHz quad-core / Hyper-Threaded, 16 threads total
24GB RDIMM (another 24GB on the way; it will be full up at 48GB next week)
1x LSI 3081E-R 8-port 3Gb/s SAS
1x M1015 8-port 6Gb/s SAS (in IT mode)
8x 300GB 10K RPM Raptor 2.5" drives in RAID10
8x 2TB drives (mixed brands) in RAIDZ2 (not hung up on this, will do RAID10 if needed)
2-port DDR InfiniBand for 10Gb/s networking

I'm seeing 170-200MB/s on VMs (NFS) on the RAID10 pool, and 250-300MB/s on copies from RAM/SSD to the same pool using SMB. I'm seeing roughly similar speeds on file copies to the RAIDZ2 pool using SMB. The DD bench is showing 400-450MB/s writes and 600-650MB/s reads on both pools, but I could be doing something wrong with it. I just want to make sure I set it up correctly.

Here are my ZFS outputs from napp-it. Let me know if you need to see anything else:

https://dl.dropboxusercontent.com/u/10627065/ZFS_server_stats.txt

ZFSinfos hash zfs:
RAIDz2/10TB_RAIDZ2_avail 9.13T
RAIDz2/10TB_RAIDZ2_compression off
RAIDz2/10TB_RAIDZ2_dedup off
RAIDz2/10TB_RAIDZ2_mountpoint /RAIDz2/10TB_RAIDZ2
RAIDz2/10TB_RAIDZ2_nbmand on
RAIDz2/10TB_RAIDZ2_quota none
RAIDz2/10TB_RAIDZ2_readonly off
RAIDz2/10TB_RAIDZ2_refquota none
RAIDz2/10TB_RAIDZ2_refreservation none
RAIDz2/10TB_RAIDZ2_reservation none
RAIDz2/10TB_RAIDZ2_sharenfs rw
RAIDz2/10TB_RAIDZ2_sharesmb name=10TB_RAIDZ2,guestok=true
RAIDz2/10TB_RAIDZ2_sync standard
RAIDz2/10TB_RAIDZ2_used 7.73G
RAIDz2/Business_avail 9.13T
RAIDz2/Business_compression on
RAIDz2/Business_dedup off
RAIDz2/Business_mountpoint /RAIDz2/Business
RAIDz2/Business_nbmand on
RAIDz2/Business_quota none
RAIDz2/Business_readonly off
RAIDz2/Business_refquota none
RAIDz2/Business_refreservation none
RAIDz2/Business_reservation none
RAIDz2/Business_sharenfs off
RAIDz2/Business_sharesmb name=Business,guestok=true
RAIDz2/Business_sync standard
RAIDz2/Business_used 341K
RAIDz2/Public_avail 9.13T
RAIDz2/Public_compression on
RAIDz2/Public_dedup off
RAIDz2/Public_mountpoint /RAIDz2/Public
RAIDz2/Public_nbmand on
RAIDz2/Public_quota none
RAIDz2/Public_readonly off
RAIDz2/Public_refquota none
RAIDz2/Public_refreservation none
RAIDz2/Public_reservation none
RAIDz2/Public_sharenfs off
RAIDz2/Public_sharesmb name=Public,guestok=true
RAIDz2/Public_sync standard
RAIDz2/Public_used 341K
RAIDz2/images_avail 9.13T
RAIDz2/images_compression off
RAIDz2/images_dedup off
RAIDz2/images_mountpoint /RAIDz2/images
RAIDz2/images_nbmand on
RAIDz2/images_quota none
RAIDz2/images_readonly off
RAIDz2/images_refquota none
RAIDz2/images_refreservation none
RAIDz2/images_reservation none
RAIDz2/images_sharenfs off
RAIDz2/images_sharesmb name=images,guestok=true
RAIDz2/images_sync standard
RAIDz2/images_used 341K
RAIDz2_avail 10.1T
RAIDz2_compression off
RAIDz2_dedup off
RAIDz2_mountpoint /RAIDz2
RAIDz2_nbmand off
RAIDz2_quota none
RAIDz2_readonly off
RAIDz2_refquota none
RAIDz2_refreservation 1.01T
RAIDz2_reservation none
RAIDz2_sharenfs off
RAIDz2_sharesmb off
RAIDz2_sync standard
RAIDz2_used 1.02T
allfs RAIDz2 RAIDz2/10TB_RAIDZ2 RAIDz2/Business RAIDz2/Public RAIDz2/images raptor10k-RAID10 raptor10k-RAID10/1TB_speed raptor10k-RAID10/VMware
allpools RAIDz2 raptor10k-RAID10 rpool
buffertime 1368820854
datapools RAIDz2 raptor10k-RAID10
properties type creation used available referenced compressratio mounted quota reservation recordsize mountpoint sharenfs checksum compression atime devices exec setuid readonly zoned snapdir aclmode aclinherit canmount xattr copies version utf8only normalization casesensitivity vscan nbmand sharesmb refquota refreservation primarycache secondarycache usedbysnapshots usedbydataset usedbychildren usedbyrefreservation logbias dedup mlslabel sync refcompressratio written
raptor10k-RAID10/1TB_speed_avail 1.02T
raptor10k-RAID10/1TB_speed_compression off
raptor10k-RAID10/1TB_speed_dedup off
raptor10k-RAID10/1TB_speed_mountpoint /raptor10k-RAID10/1TB_speed
raptor10k-RAID10/1TB_speed_nbmand on
raptor10k-RAID10/1TB_speed_quota none
raptor10k-RAID10/1TB_speed_readonly off
raptor10k-RAID10/1TB_speed_refquota none
raptor10k-RAID10/1TB_speed_refreservation none
raptor10k-RAID10/1TB_speed_reservation none
raptor10k-RAID10/1TB_speed_sharenfs rw
raptor10k-RAID10/1TB_speed_sharesmb name=1TB_speed,guestok=true
raptor10k-RAID10/1TB_speed_sync standard
raptor10k-RAID10/1TB_speed_used 3.25G
raptor10k-RAID10/VMware_avail 1.02T
raptor10k-RAID10/VMware_compression on
raptor10k-RAID10/VMware_dedup off
raptor10k-RAID10/VMware_mountpoint /raptor10k-RAID10/VMware
raptor10k-RAID10/VMware_nbmand on
raptor10k-RAID10/VMware_quota none
raptor10k-RAID10/VMware_readonly off
raptor10k-RAID10/VMware_refquota none
raptor10k-RAID10/VMware_refreservation none
raptor10k-RAID10/VMware_reservation none
raptor10k-RAID10/VMware_sharenfs rw=root=@192.168.3.0/24:@192.168.2.0/24
raptor10k-RAID10/VMware_sharesmb name=VMware,guestok=true
raptor10k-RAID10/VMware_sync standard
raptor10k-RAID10/VMware_used 15.6G
raptor10k-RAID10_avail 1.05T
raptor10k-RAID10_compression off
raptor10k-RAID10_dedup off
raptor10k-RAID10_mountpoint /raptor10k-RAID10
raptor10k-RAID10_nbmand off
raptor10k-RAID10_quota none
raptor10k-RAID10_readonly off
raptor10k-RAID10_refquota none
raptor10k-RAID10_refreservation 27.4G
raptor10k-RAID10_reservation none
raptor10k-RAID10_sharenfs off
raptor10k-RAID10_sharesmb off
raptor10k-RAID10_sync standard
raptor10k-RAID10_used 46.3G
syspool rpool

------------------------------------------------------------------------------------
smart info
------------------------------------------------------------------------------------

- with Smartinfo if available (works mostly only with SAS Controller)
id diskcap pool vdev state error smart_model smart_type smart_health temp smart_sn smart_check
c4t5000C50046F1A085d0 2000 GB RAIDz2 raidz ONLINE S:3 H:0 T:0 ST2000DL003-9VT166 sat,12 PASSED 27 °C 6YD234N1 short long abort log
c4t5000C50046F2D8C1d0 2000 GB RAIDz2 raidz ONLINE S:3 H:0 T:0 ST2000DL003-9VT166 sat,12 PASSED 29 °C 6YD2434J short long abort log
c4t5000CCA221D3C439d0 2000 GB RAIDz2 raidz ONLINE S:3 H:0 T:0 Hitachi HDS722020ALA330 sat,12 PASSED 34 °C JK1130YAHDGZNT short long abort log
c4t5000CCA221D3CAAEd0 2000 GB RAIDz2 raidz ONLINE S:3 H:0 T:0 Hitachi HDS722020ALA330 sat,12 PASSED 34 °C JK1130YAHDJPZT short long abort log
c4t50014EE6007472CCd0 2000 GB RAIDz2 raidz ONLINE S:3 H:0 T:0 WDC WD20EARS-00MVWB0 sat,12 PASSED 32 °C WDWMAZA1193578 short long abort log
c4t50014EE655C9C2DCd0 2000 GB RAIDz2 raidz ONLINE S:3 H:0 T:0 WDC WD20EARS-00MVWB0 sat,12 PASSED 27 °C WDWMAZA0979429 short long abort log
c4t50024E90044F3EDCd0 2000 GB RAIDz2 raidz ONLINE S:3 H:0 T:0 SAMSUNG HD204UI sat,12 PASSED 28 °C S2H7JD6ZB00622 short long abort log
c4t50024E90044F3FAEd0 2000 GB RAIDz2 raidz ONLINE S:3 H:0 T:0 SAMSUNG HD204UI sat,12 PASSED 26 °C S2H7JD6ZB00624 short long abort log
c6t0d0 32.2 GB rpool basic ONLINE S:0 H:0 T:0 - disk n.a. - n.a. short long abort log
c7t14d0 300 GB raptor10k-RAID10 mirror ONLINE S:4 H:0 T:0 WL300GLSA16100 sat,12 PASSED 35 °C LP007105 short long abort log
c7t15d0 300 GB raptor10k-RAID10 mirror ONLINE S:3 H:0 T:0 WL300GLSA16100 sat,12 PASSED 34 °C LP007072 short long abort log
c7t16d0 300 GB raptor10k-RAID10 mirror ONLINE S:3 H:0 T:0 WL300GLSA16100 sat,12 PASSED 34 °C LP018886 short long abort log
c7t17d0 300 GB raptor10k-RAID10 mirror ONLINE S:3 H:0 T:0 WL300GLSA16100 sat,12 PASSED 34 °C LP018961 short long abort log
c7t18d0 300 GB raptor10k-RAID10 mirror ONLINE S:3 H:0 T:0 WL300GLSA16100 sat,12 PASSED 35 °C LP009085 short long abort log
c7t19d0 300 GB raptor10k-RAID10 mirror ONLINE S:3 H:0 T:0 WL300GLSA16100 sat,12 PASSED 33 °C LP018960 short long abort log
c7t20d0 300 GB raptor10k-RAID10 mirror ONLINE S:3 H:0 T:0 WL300GLSA16100 sat,12 PASSED 34 °C LP004010 short long abort log
c7t21d0 300 GB raptor10k-RAID10 mirror ONLINE S:3 H:0 T:0 WL300GLSA16100 sat,12 PASSED 33 °C LP009633 short long abort log




-------------------------------------------------------------------------------------
pools
-------------------------------------------------------------------------------------

Pool Version Pool GUID Vdev Ashift Asize Vdev GUID Disk Disk-GUID Cap Product/ Phys_Path/ Dev_Id/ Sn

RAIDz2 5000 2820520080993930436 vdevs: 1
vdev 1: raidz2 12 16.00 TB 5251948501379730193
c4t5000C50046F1A085d0 12333019810253413747 ST2000DL003-9VT1
/scsi_vhci/disk@g5000c50046f1a085:a
id1,sd@n5000c50046f1a085/a
6YD234N1
c4t5000C50046F2D8C1d0 17912313812557223752 ST2000DL003-9VT1
/scsi_vhci/disk@g5000c50046f2d8c1:a
id1,sd@n5000c50046f2d8c1/a
6YD2434J
c4t5000CCA221D3C439d0 3473539919622518624 Hitachi HDS72202
/scsi_vhci/disk@g5000cca221d3c439:a
id1,sd@n5000cca221d3c439/a
JK1130YAHDGZNT
c4t5000CCA221D3CAAEd0 4809642699822734449 Hitachi HDS72202
/scsi_vhci/disk@g5000cca221d3caae:a
id1,sd@n5000cca221d3caae/a
JK1130YAHDJPZT
c4t50014EE6007472CCd0 8577672279742323929 WDC WD20EARS-00M
/scsi_vhci/disk@g50014ee6007472cc:a
id1,sd@n50014ee6007472cc/a
WDWMAZA1193578
c4t50014EE655C9C2DCd0 16491910523771943534 WDC WD20EARS-00M
/scsi_vhci/disk@g50014ee655c9c2dc:a
id1,sd@n50014ee655c9c2dc/a
WDWMAZA0979429
c4t50024E90044F3EDCd0 11274540183612232788 SAMSUNG HD204UI
/scsi_vhci/disk@g50024e90044f3edc:a
id1,sd@n50024e90044f3edc/a
S2H7JD6ZB00622
c4t50024E90044F3FAEd0 1758903232926313746 SAMSUNG HD204UI
/scsi_vhci/disk@g50024e90044f3fae:a
id1,sd@n50024e90044f3fae/a
S2H7JD6ZB00624

raptor10k-RAID10 5000 7473019991373255881 vdevs: 4
vdev 1: mirror 9 300.06 GB 17862662818173334383
c7t14d0 8276855192894438065 WL300GLSA16100
/pci@0,0/pci15ad,7a0@16/pci1000,3140@0/sd@e,0:a
id1,sd@n50014eef00566ba7/a
LP007105
c7t15d0 4862475159679432941 WL300GLSA16100
/pci@0,0/pci15ad,7a0@16/pci1000,3140@0/sd@f,0:a
id1,sd@n50014ee0009abc00/a
LP007072
vdev 2: mirror 9 300.06 GB 15603671657220422076
c7t16d0 5131963312324972714 WL300GLSA16100
/pci@0,0/pci15ad,7a0@16/pci1000,3140@0/sd@10,0:a
id1,sd@n50014ee0ab454531/a
LP018886
c7t17d0 11290850630156887335 WL300GLSA16100
/pci@0,0/pci15ad,7a0@16/pci1000,3140@0/sd@11,0:a
id1,sd@n50014ee055efdd97/a
LP018961
vdev 3: mirror 9 300.06 GB 17779323401607039181
c7t18d0 13837708582704390866 WL300GLSA16100
/pci@0,0/pci15ad,7a0@16/pci1000,3140@0/sd@12,0:a
id1,sd@f008a434b5168cd0300028a560001/a
LP009085
c7t19d0 1366491286675847134 WL300GLSA16100
/pci@0,0/pci15ad,7a0@16/pci1000,3140@0/sd@13,0:a
id1,sd@n50014eef00aaed43/a
LP018960
vdev 4: mirror 9 300.06 GB 11482444555854192767
c7t20d0 5359622763800542043 WL300GLSA16100
/pci@0,0/pci15ad,7a0@16/pci1000,3140@0/sd@14,0:a
id1,sd@n50014eef00566e28/a
LP004010
c7t21d0 14346063480124713428 WL300GLSA16100
/pci@0,0/pci15ad,7a0@16/pci1000,3140@0/sd@15,0:a
id1,sd@n50014ee00098c5c5/a
LP009633

rpool 5000 14014268098405055842 vdevs: 1
vdev 1: disk 9 32.17 GB 1904727143742154576
c6t0d0 Virtual disk
/pci@0,0/pci15ad,1976@10/sd@0,0:a
id1,sd@n6000c290cff5bd35738d6f5b52a53dc0/a
6000C290CFF5BD3

------------------------------------------------------------------------
DD Bench on RAIDz2
------------------------------------------------------------------------
Test pool

Blocksize (default 2M)

Count (default: 8192)

You should wait about 40s for correct values
Wait in s between write and read test

Size of Testfile: 16.777216 GB

Memory size: 8192 Megabytes

write 16.777216 GB via dd, please wait...
time dd if=/dev/zero of=/RAIDz2/dd.tst bs=2048000 count=8192

8192+0 records in
8192+0 records out

real 34.9
user 0.0
sys 9.4

16.777216 GB in 34.9s = 480.72 MB/s Write

wait 40 s
read 16.777216 GB via dd, please wait...
time dd if=/RAIDz2/dd.tst of=/dev/null bs=2048000

8192+0 records in
8192+0 records out

real 25.9
user 0.0
sys 8.4

16.777216 GB in 25.9s = 647.77 MB/s Read


-----------------------------------------------------------------------
DD Bench on raptor10k-RAID10
-----------------------------------------------------------------------
Test Pool

Blocksize (default 2M)

Count (default: 8192)

You should wait about 40s for correct values
Wait in s between write and read test

Size of Testfile: 16.777216 GB

Memory size: 8192 Megabytes

write 16.777216 GB via dd, please wait...
time dd if=/dev/zero of=/raptor10k-RAID10/dd.tst bs=2048000 count=8192

8192+0 records in
8192+0 records out

real 39.1
user 0.0
sys 8.5

16.777216 GB in 39.1s = 429.08 MB/s Write

wait 40 s
read 16.777216 GB via dd, please wait...
time dd if=/raptor10k-RAID10/dd.tst of=/dev/null bs=2048000

8192+0 records in
8192+0 records out

real 26.0
user 0.0
sys 7.5

16.777216 GB in 26s = 645.28 MB/s Read
 

gea

Well-Known Member
Dec 31, 2010
This is a quite typical power config for an ESXi or Solaris whitebox, so nothing unexpected. I have used the quite similar X8DTH-6F for years in my All-In-One.

Your settings, whether you prefer performance (like mirrors for better I/O; I would always use mirrors for spindle-based datastores) or RAID-Z2 for higher capacity, depend on your use case. In particular, the question of whether you disable sync write for NFS+ESXi for better performance, or enable it for better data security for the VMs, is a setting you should think about.

While ZFS is always consistent, even after a power outage, your virtual disks from the ESXi view may not be (it depends on the VMs and how crash-resistant they are). If you are concerned about this and enable sync write, where every single write must be committed to disk before the next one can occur, you get massively reduced write performance (sometimes 1/10) unless you use a dedicated high-speed ZIL disk (a ZeusRAM, considering your other hardware). I would enable/disable sync, compare performance, and decide based on the results and your needs.
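For comparison, you can flip the sync property from the command line as well (napp-it does the same thing from its menu); something like this against the VM filesystem from your listing:

zfs get sync raptor10k-RAID10/VMware            # shows the current setting (standard in your listing)
zfs set sync=disabled raptor10k-RAID10/VMware   # fast, but ESXi's sync requests are ignored
zfs set sync=standard raptor10k-RAID10/VMware   # honor ESXi's sync requests again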
 

mrkrad

Well-Known Member
Oct 13, 2012
Why not use iSCSI? How does ESXi handle multiple NICs with NFS? Round-robin?

I.e., my LeftHand iSCSI has a virtual IP which all LeftHand nodes can answer to. After connecting to that for discovery, the initiator is then directed to each LeftHand node in the cluster, and each LUN establishes a login for each NIC: 4 NICs and 4 LUNs equals 16 connections. The default is 1000 packets before switching to the next NIC. You can set this to 10 packets or 10K or both, however it tends to give up linear bandwidth for random I/O. Setting it to switch every packet is great for random I/O but reduces linear bandwidth greatly. I suspect with offload NICs you could move more packets. At 900 megabits, MTU 1500, I could have sworn it was around 1.5 million packets per second using many gigabit NICs.

What's a good, simple setup for ZFS?

48GB RAM, 200GB SLC, 1500GB MLC, 4x 600GB 15K SAS, 8x 4TB 7200 RPM RE4 SAS?

Can you suggest a good layout? I've got a carton of older 50GB SLC drives (slower Samsung units), but they are new and cost $1/GB new. It's cheaper to use these than to buy those $70 SLC USB sticks to boot off. I suspect 8 chips of SLC might be more reliable than one 2GB SLC NAND chip on a USB port, even if you use a USB-to-SATA converter to plug it in and power it.

Should I use Nexenta? What's the most stable OS? I figure I'd run 4 ports of 10GbE (2 NICs) and go bare-metal, or hypervise with VT-d?
 

gea

Well-Known Member
Dec 31, 2010
There is no good or best config without knowing the details:

- Is this a single-box config like an All-In-One ESXi server with an integrated ZFS NAS/SAN?
- What is your external data transfer used for?
- How many and what types of VMs do you plan to run?
- And other info (medium class, high performance, HA, support needs, etc.)
 

bp_968

New Member
Dec 23, 2012
This is a quite typical power config for an ESXi or Solaris whitebox, so nothing unexpected. I have used the quite similar X8DTH-6F for years in my All-In-One.

Your settings, whether you prefer performance (like mirrors for better I/O; I would always use mirrors for spindle-based datastores) or RAID-Z2 for higher capacity, depend on your use case. In particular, the question of whether you disable sync write for NFS+ESXi for better performance, or enable it for better data security for the VMs, is a setting you should think about.

While ZFS is always consistent, even after a power outage, your virtual disks from the ESXi view may not be (it depends on the VMs and how crash-resistant they are). If you are concerned about this and enable sync write, where every single write must be committed to disk before the next one can occur, you get massively reduced write performance (sometimes 1/10) unless you use a dedicated high-speed ZIL disk (a ZeusRAM, considering your other hardware). I would enable/disable sync, compare performance, and decide based on the results and your needs.
Are you saying the RAID10 array/pool has a setting that risks the VM disks? Right now performance on disk is stellar and I don't feel I need to improve it, so I'm more interested in making sure the data stays safe. Right now the RAID10 array is set up to run all the VMs (except the OpenIndiana one, of course). VMware connects to it through NFS and it seems quite snappy and fine for what I'm doing, so I think I'm good on performance there, if you believe I'm not doing something obviously dumb somewhere in there.

The RAIDZ2 pool/array is for data storage. Performance is fine at almost any level (and right now I can write to it at nearly 200-300MB/s, so it's plenty fast).

The ZFS machine is on a large APC rackmount UPS, so it should shut down before the UPS runs out of juice; power outages should be well guarded against.

Right now I'm using a sync program to copy data from the wife's PC down to the SAN and over to my PC (so it's in 3 locations). I have considered putting a 4TB USB drive on the ZFS SAN and having it rsync most of the stuff off the array onto that as a second backup on the ZFS box, to take advantage of the checksumming. Is that just being crazy overkill? Honestly, the way I have it now it's going to 3 different locations in the house, which I think I prefer :)
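If I do go that route, I figure the USB-drive copy would just be a cron job with something like this (assuming the drive ends up mounted at /backup, which is just a placeholder path):

rsync -a --delete /RAIDz2/10TB_RAIDZ2/ /backup/10TB_RAIDZ2/   # mirror the main data filesystem onto the USB drive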

mrkrad: Please consider taking your questions to a new post (though I'll shoot a few answers). I feel like a jerk saying this, but your questions are super open-ended and broad, and this is a very specific request about my specific system. If you start getting answers it's going to clutter up my thread. Not trying to offend; it would just help us both (yours will get more traffic, mine won't get derailed).

As to why no iSCSI? No real need for block storage at the moment. I prefer the ease of handling with NFS, and it's less crap to deal with. It's easier to share directly from the server over SMB/NFS than it is using iSCSI out to all the clients (and the VMs stored on there are more for fun than anything; the primary reason this box exists is for backing up those photos, and SMB/NFS is perfect for that in my case).

No idea on the round-robin thing. Right now I'm using InfiniBand and a 1GbE connection. Because of how my DNS is set up I can always get to the box using its name; if it's available, the first path choice is IB, the second is GbE. I believe I'm seeing some poor performance from the IPoIB setup, and that's why I can't break 300MB/s. I'll be trying a point-to-point 10GbE connection at some point to see if that's the case. Though I guess I could probably copy data through a Windows VM from one array to the other using the internal VM network and see what throughput I get. The DD bench says I should see 400-500MB/s writes, and the best I've seen is about 300MB/s. Honestly, it's fast enough that I haven't really been all that concerned yet ;)

I tried Nexenta and actually prefer OpenIndiana and napp-it. Nexenta just didn't do it for me (and I didn't like the 18TB raw capacity limit).

My fee for the answers is one of those 50GB SLCs... ;)

bp
 

gea

Well-Known Member
Dec 31, 2010

This is not related to RAID-10 but is a general problem.
A very simple example: you have a large Word document where you want to change a single word, like house -> building.
When you edit this on a ZFS SMB share and there is a power failure or crash during the save, the modification is either saved or not, but in either case your ZFS filesystem is not corrupted and the Word document stays intact, thanks to copy-on-write.

If your Word document is within a VM on a VMware datastore with some ESXi or guest-OS filesystem speed optimizations, it can happen that the VMware filesystem or the guest filesystem gets corrupted (the ZFS filesystem is valid without errors but contains garbage). This is the reason why ESXi requests sync writes on your NFS share: to have full control over whether a data block is committed to disk or not.

Without sync write it may happen that a write is acknowledged (from RAM) and the ESXi/guest OS starts the next filesystem transaction assuming everything is on disk. A failure at this point can result in wrong data or a damaged guest filesystem. So sync writes offer much better data security. The price is poor write performance without a very fast dedicated ZIL log device.

You can only check this behaviour with regular benchmarks or SMB if you set the ZFS sync property (napp-it menu ZFS filesystems) to always.
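The shell equivalent would be something like this, using one of your test filesystems (set it back to standard afterwards):

zfs set sync=always raptor10k-RAID10/1TB_speed    # force every write to be a sync write for the benchmark
zfs get sync raptor10k-RAID10/1TB_speed           # verify the setting
zfs set sync=standard raptor10k-RAID10/1TB_speed  # back to the default when done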
 

bp_968

New Member
Dec 23, 2012
Excellent example and explanation, thank you. Currently the only VMs running are non-essential, so their deaths wouldn't be all that big a deal. I'm setting up a backup solution right now (I'm trying Veeam out since I have a full license on ESXi). I have a filesystem set up just for the VMs, so I could easily make the change to the sync write setting, but I guess I'm not terribly worried at the moment. This did lead me down the rabbit hole of Google searches into ZIL-capable SSDs and speed and all kinds of other nonsense. I'm at about 4-5 hours of looking now (lol).

From what I have found, it seems adding an SSD ZIL actually increases my data risk unless I very carefully select the right SSD. Sadly, the "right" SSD tends to have one of two issues: it's fast (and expensive) or it's cheap and slow. Since I can copy data from one machine to another across the InfiniBand network at 250-300MB/s, it seems like I'd actually "hurt" speeds if the SSD couldn't handle throughput higher than 300MB/s, right? It would increase IOPS, no doubt, but right now I just don't think that's a top priority. The 8-disk 10K Raptor RAID10 was already a goofy level of nerdiness on my part.

This did lead me to want to do some testing. It seems the SSD's firmware is pretty smart and does its thing even without OS help. If that's true, then the only thing holding back fast/cheap consumer-grade SSDs from being used as a ZIL is that pesky "kill your pool if you lose power/reboot/crash" issue. That got me looking at some simple and cheap ideas beyond what I already have (a UPS). Now, I have two UPSes, so I could chain them together, or I could buy a redundant power supply for the server, put one cable on each UPS, and be "mostly" safe. The problem is that you're still at risk if you kick a power cable or otherwise have the system go belly up. So why not use an external UPS (it could be downright tiny) and provide power to the ZIL SSD through a PCI bracket with a DC power plug on it (they make them with a DC power plug and an eSATA port; just ignore the data port and only use the power socket)? If you did this, your SSD would get its power externally and would stay powered up if the server went down. Theoretically this would let it write out its cache and be happy. There is a Dutch guy who was making an inline adapter that plugged into the SATA power cable and had a small Li-ion battery on it, and it would keep the drive alive for 30 seconds or so if it lost power. Pretty elegant *if* the SSD is willing to finish its job when it suddenly loses data connectivity (which I'm not 100% sure it would do, honestly).

Anyway, I'm rambling. Just some interesting ZIL-related things I found.

Right now I'm just going to toss another 24GB of RAM into the server and start doing nightly backups of the VMs I'm running. Any suggestions on an easy backup solution, since I'm running the VMs on hybrid SMB/NFS shares? I did some digging and am still not sure if I can just pause/stop the VM and copy over the whole VM directory from the SMB/NFS share, or if I have to do something else fancy to get it to work. Obviously being able to copy it while it's running would be even better, but I'm not totally opposed to just pausing or rebooting them now and again.

Thanks!
 

gea

Well-Known Member
Dec 31, 2010
Keep it simple!

If you have a UPS and can accept the minimal risk of losing the last 5 seconds of writes, just disable sync on your ZFS filesystem;
otherwise use a ZIL with a supercap (cheapest but quite slow is an Intel 320 SSD, best is an expensive RAM-based ZeusRAM).
(Your ZFS pool is always safe; only the last writes may be lost.)

If you want to copy a VM via SMB:
- Shut down the VM and copy; this is a perfect crash-safe backup/clone (import via the ESXi file browser, click on the .vmx file -> add to inventory).

If you want to clone/back up a running VM:
- Do an ESXi hot snap, do a ZFS snap, and copy the VM snapshot via Windows "previous versions".
(You can restore the hot state by restoring the ZFS snap first, then the ESXi snap.)

You can skip the ESXi snap if you do not need hot snaps with memory state; the state of the VM is then similar to a sudden power-off.
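The ZFS part of that is just a snapshot per backup run; a minimal sketch (the snapshot name is arbitrary):

zfs snapshot raptor10k-RAID10/VMware@backup1      # point-in-time snap of the VM filesystem
zfs list -t snapshot -r raptor10k-RAID10/VMware   # list the snaps; over SMB they show up as "previous versions"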
 

bp_968

New Member
Dec 23, 2012
So here are two questions:

First: Can I remove a mirror vdev from the pool and "shrink" it, or do I need to pull the data off it, destroy the pool, and recreate it with 6 drives instead of 8? What kind of performance hit would I likely see dropping those two disks? My thought process is that I could drop 2 drives and end up with a 900GB RAID10 array, then put one of those two pulled drives back in as a hot spare, and that would leave a port open on my 6Gb/s SAS controller for an SSD as L2ARC. I have a 32GB/64GB OCZ Synapse I could use pretty easily. I was going to push it through the motherboard SATA ports as a raw device mapping, or just use a VMDK image on it to give to the ZFS virtual machine. The problem is those ports are only 3Gb/s ports. Is that likely to hurt me?
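For reference (correct me if this is wrong), I believe adding the SSD as L2ARC would just be something like this, where c7t22d0 is a made-up placeholder for whatever the SSD shows up as:

zpool add raptor10k-RAID10 cache c7t22d0   # c7t22d0 is a placeholder device id for the SSD
zpool status raptor10k-RAID10              # the SSD should then appear under a "cache" section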

Second: Hot spares. I'm expanding the server into a second box I'll be using as a DAS. I have 5 1TB drives lying around that I'd like to put inside and make a pool out of. Let's say I make a 4TB RAIDZ1 (we'll call it "media_pool") for movies out of those 5 drives, and I already have the 12TB RAIDZ2 built from 8 2TB drives (called "RAIDz2"). Could I then put in a *single* 2TB drive and mark it as a hot spare for both "media_pool" *and* "RAIDz2"? Would it just pull that larger drive into the pool of 1TB drives, resilver, and be all happy? I don't mind the "wasted" space, since I'd just grab a replacement for the broken drive, resilver onto the new drive, and return the hot spare to its hot-spare duties.

Not having to "waste" extra ports on the SAS controllers is why I am asking. If I could just toss in a "global" hot spare that is as big as or bigger than the disks in any other pool, yet use it for all the pools, that would be amazingly cool!

I'm going to do some Google-fu and see if I can find out about the hot-spare bit.
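From my initial searching, I think a shared spare would just mean adding the same disk to each pool, something like this (the device id here is a placeholder for whatever the 2TB spare shows up as):

zpool add RAIDz2 spare c4t5000AAAAAAAAAAAAd0       # placeholder id for the spare 2TB disk
zpool add media_pool spare c4t5000AAAAAAAAAAAAd0   # the same disk, added as a spare to the second pool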