Poor man's DIY ZeusRAM

ant

New Member
Jul 16, 2013
Was not sure whether to title this "Poor man's DIY ZeusRAM", "What to do with that spare $20 secondhand RAID card you have lying around", or "One valid reason for using a ramdisk as ZIL (calm down, it's only for temporary testing)".

Anyway, this will be a big post, so let's get started. I'll assume you are fairly well versed in ZFS, but I hope to introduce a new and innovative idea to you. Please consider the whole post before getting angry at any one point (e.g. using a ramdisk as ZIL) without waiting for the explanation.

I've been working on putting together an all-in-one with OmniOS for ZFS and an LSI 2308 passed through to the OmniOS VM, with 6 x 2TB SATA disks in RAIDZ2. That is working great, as are all my VMs. It's pretty quick for most operations; it's only very slow for some operations involving copying lots of small files over NFS, which does not happen often for my usage. Still, it is something I would like to be quicker. The typical fix for this, as you all know, is an SSD for the ZIL.

I had been doing heaps of research on which SSD to get as a ZIL and eventually narrowed it down to a 100GB Intel S3700, taking into account suitability to the task, performance, cost/value, and availability in my location. By all those measures it is a good choice. But for home use on a limited budget it is still quite expensive, and for its high cost it is still nothing like a ZeusRAM. I just could not justify spending so much on something of which you only ever use, say, 8GB; it seemed wasteful and I could not get past it. I also didn't like the idea of a cheaper SSD that was even less suited to the task. Strangely, I didn't like the idea of the ZeusRAM either: it is very expensive (to me), and I consider it still not perfect for the task, since it is limited by its interface speed and IOPS. That probably does not matter, but it seems like it should, in my mind anyway.

Then I started thinking: how much ZIL disk space do you actually need? Could you put your ZIL on a hardware RAID controller with, say, 512MB of BBWC (battery-backed write cache), make the ZIL volume only 512MB, and set the cache ratio to 100% write so that all writes are cached? To rephrase for clarity: a cheap secondhand hardware RAID controller, e.g. an HP P400 with 512MB BBWC (or the Dell equivalent), with the whole 512MB cache dedicated to writes, and only a single disk or SSD on the whole controller (two drives mirrored if you like), carrying a single 512MB volume used for the ZIL. The rest of the disks in your zpool stay on a completely different HBA. If this works, it has the potential to give you the full IOPS and bandwidth of the RAID controller for your ZIL, which is potentially more than a ZeusRAM, for less than a couple of hundred dollars. Maybe even free if you have the parts lying around.
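As a rough sketch of the controller-side setup with HP's hpacucli tool (the slot number and drive ID below are examples only; check yours first with `hpacucli ctrl all show config`):

```shell
# Create a single-disk RAID 0 logical drive of only 512MB for the ZIL
# (slot and drive ID are hypothetical - adjust for your system)
hpacucli ctrl slot=1 create type=ld drives=1I:1:1 raid=0 size=512

# Dedicate the whole controller cache to writes: 0% read / 100% write
hpacucli ctrl slot=1 modify cacheratio=0/100

# Verify the battery-backed write cache is present and enabled
hpacucli ctrl slot=1 show config detail
```

This is only illustrative; the exact syntax varies between hpacucli versions, so treat it as a starting point rather than a recipe.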

I don't know the IOPS of an HP P400, but even though it is an old card I assume its cache can handle more than 50,000 IOPS, and its bandwidth over PCIe to the cache is up to 2 gigabytes per second. That is potentially better than a ZeusRAM. A more modern RAID card should have even better specs and could have a bigger cache.

I have heard various numbers and calculations, and the common rule of thumb of an 8GB ZIL for a 10 gigabit network interface sounded like a good estimate. But does it really need to be that much for my usage patterns? I tried the zilstat.ksh script to help determine this, but I couldn't really work it out from that alone. I did not have a P400 to test with, so I figured a good way to find out what ZIL size I need would be to use ramdisk ZILs of various sizes and run real-world activities to see what difference they made. Everyone knows it is pointless and unsafe to use a ramdisk as a ZIL, so obviously don't use this in production. But it is a useful tool for temporary testing, to work out how big your ZIL needs to be for your workloads.
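For anyone wanting to repeat the ramdisk test, the illumos commands are roughly these (the pool name `tank` and the ramdisk name are examples; and again, temporary testing only):

```shell
# Create a 512MB ramdisk (illumos/OmniOS)
ramdiskadm -a ziltest 512m

# Add it to the pool as a dedicated log device
zpool add tank log /dev/ramdisk/ziltest

# ...run your workloads, watching ZIL usage with: zpool list -v tank

# Remove it again as soon as testing is done
zpool remove tank /dev/ramdisk/ziltest
ramdiskadm -d ziltest
```

Don't leave the ramdisk log device in place any longer than the test itself.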

You can see how much disk space is used on your independent zil while you are testing with this command:

zpool list -v <poolname>

You will see these fields; just find the ones relating to your ZIL device:

SIZE ALLOC FREE


For my tests I found that I never used more than about 3 gigabytes of ZIL, and I really had to push things to get it anywhere near that. I also found that even with a small 512-megabyte ramdisk ZIL that got completely filled, I still had significantly better performance than with the standard ZIL (spread over the pool's disks). Typical operations often used much less than 512 megabytes.

With those tests successful and out of the way (and the ramdisk zil gone), I had the confidence to go and buy a P400 and try it out for real.

I got an HP P400 with 512MB BBWC and a cable to go from the RAID card to SATA, and used a spare SATA hard disk I had lying around. All up, much less than $100. It will probably be cheaper for those of you in the USA.

I only had a four-lane PCIe slot available, so that is 1 gigabyte per second: half the bandwidth the card is capable of, but still more than a SATA SSD.

The card and disk are NCQ capable (some older P400s are not). I think this is important: I suspect the RAID card and the controller on the disk are smart enough that if something written to the cache is invalidated before it has been flushed, it never actually needs to be written to disk. Similarly, if the disk is told to write to some sectors and then, before it has done so, is told to write something different to those same sectors, it does not need to waste time on the first write. If that is the case, you could maybe get away with a spinning disk instead of an SSD.

I initially tried PCI passthrough of the P400 to my OmniOS VM. It sort of passed through and the OS saw it but could not make use of it; there were errors in dmesg. I did not try too hard to fix it, and I disabled the passthrough. I installed drivers and config tools for it in ESXi (a workaround was needed to install the config tool on a whitebox). I am currently using a 7200rpm SATA disk. I created an array/logical volume and considered passing the volume through to the OmniOS VM with RDM (Raw Device Mapping), but instead I just set the volume up as a VMware datastore and created a 512MB virtual disk for OmniOS to use as the ZIL. I am hoping VMware handles writes to this safely; I don't know the internals of VMware, so for now I am trusting them.
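For reference, creating that small virtual disk from the ESXi shell looks something like this (the datastore and path names here are made up; eager-zeroed thick avoids allocation pauses on first write):

```shell
# 512MB eager-zeroed virtual disk on the P400-backed datastore
# (datastore and path are hypothetical - adjust to your environment)
vmkfstools -c 512m -d eagerzeroedthick /vmfs/volumes/p400-zil/omnios/zil.vmdk
```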

Results: for my initial tests, speeds are pretty much as good as the equivalent sized ram disk. An example:

before - with the standard distributed ZIL. Clone a 3.88GB VM from a template on an NFS datastore to the same NFS datastore:

487 seconds
3973.12 / 487 = 8.16 megabytes per second


after - with a 512MB ZIL on the HP P400 (512MB BBWC, 100% write cache), VMDK on a 7200rpm SATA disk. Clone the same 3.88GB VM from a template on an NFS datastore to the same NFS datastore:

162 seconds
3973.12 / 162 = 24.53 megabytes per second
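Those averages are just the clone size divided by elapsed time; a quick sanity check:

```shell
# 3.88GB = 3973.12MB; divide by elapsed seconds for average throughput
awk 'BEGIN { printf "before: %.2f MB/s\n", 3973.12/487 }'
awk 'BEGIN { printf "after:  %.2f MB/s\n", 3973.12/162 }'
```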


To me that is a massive improvement, and I am much happier spending this smaller amount on this solution than spending much more on an S3700 SSD.

Sync writes are actually quicker than that: I was often seeing up to about 100 megabytes per second with the P400. The numbers above are averages.

I've broken two of the golden rules of ZFS:

- It is completely pointless to use a ramdisk as a ZIL: I have found one valid reason - testing how small a ZIL you can get away with.
- ZFS likes direct access to disks rather than RAID controllers: I think that for an independent ZIL it can be good to use a RAID controller. This specific case only.

There are potential problems, but they are probably all fixable. Here is a list of pros and cons as I see them:

pros:
- cheap - you might even have all the parts spare already
- good, safe performance, well suited to a ZIL. Sync writes are acknowledged almost instantly and are stored safely almost instantly (in battery-backed RAM). The RAID controller lets you turn off the write cache on the disk and use only the battery-backed write cache on the controller. You can also turn the disk's write cache on if you know it is protected by a capacitor, e.g. an Intel S3500 SSD.
- if you want more performance, possibly more than a ZeusRAM: you could spend a heap on a modern RAID controller that handles >300,000 IOPS and has 2GB of flash-backed write cache and eight lanes of PCIe 2.0 or 3.0 for lots of bandwidth, and put your ZeusRAM behind it. Set the ZIL volume to the full size of the ZeusRAM in this case. You would get the benefit of the higher bandwidth and IOPS whenever operations fit within the 2GB.
- if you are using an SSD with this solution, you could use a volume larger than your RAID cache size and in most cases still get the benefit of the higher bandwidth and IOPS of the RAID card cache.
- if you get a power outage or hardware failure, with BBWC you get a couple of days to fix things, and when power returns the contents of the RAID cache are written to the attached disk. With flash-backed write cache, the contents of the cache are written to flash on the controller as soon as power is lost, then written to the attached disk when power is restored, so you have much longer to fix things.
- this gives much better performance than using just a consumer SSD (without power-loss data protection) as a ZIL with its write cache turned off. For example, it is not safe to use a Samsung 840 Pro with its internal write cache turned on as a ZIL, so to be safe you would need to disable the internal write cache, which makes it quite slow. If you disable the SSD's internal write cache, put the SSD on a RAID card with BBWC, and use a ZIL volume about as big as the cache, you get both performance and safety.
- people who bought $10 P400s on eBay just because they were cheap and thought they might one day have a use for them now have a use for them.
- unconfirmed, but I expect there is no performance drop over time such as you get with most SSDs once the whole drive has been written to once. I expect you get the full performance of the RAID card cache regardless of whether a spinning disk or an SSD backs it.

cons:
- complicated to set up and adds complexity overall
- for an all-in-one, it is potentially not easy (or possible) to migrate your zpool in the event of a hardware failure. E.g. if you can't PCI passthrough the controller to the VM, you might not be able to get your ZFS pool working on another server after a motherboard failure. It is still possible, but complicated: you would need to set up an equivalent virtual environment. I expect this is not a problem if you are _not_ using an all-in-one and are running bare-metal OmniOS or similar.
- in the case of some hardware failures, the two or so days of battery backup for the RAID cache may not be enough for a home user without a vendor support contract. In that case, if the RAID controller and disk have not themselves failed, you could move them to a spare PC/server temporarily and power it up without booting an operating system, just so the controller can flush its cache to disk. Then you have plenty of time to get repairs done.
- if you use a spinning disk for the ZIL, you are writing to a small portion of the disk over and over, which can't be good. You might need to move the volume to a different part of the disk from time to time, or use a modern SSD instead, which wear-levels sectors to stop them wearing out.
- you are writing to the attached drive all the time, whereas a ZeusRAM (I think) writes to RAM during operation and only to flash on power loss. So a ZeusRAM should outlast a drive in this configuration.
- the HP P400 only has SATA 1 speed to the drive (SAS is faster). This should not matter, as we get the speed of the cache rather than the drive in this setup.
- the batteries on old cards are aging and might fail. If they do, the write cache is automatically disabled and your ZIL performance will drop until you replace the battery. Probably not a big deal for home servers, but probably unacceptable for some business purposes. Businesses can splash out on a modern RAID card with flash-backed write cache instead.
- uses more electricity
- if you test a ramdisk ZIL on pools with important data, you risk data loss and corruption. Preferably do your tests on an unimportant pool where it doesn't matter if data is lost or corrupted. It is up to you if you want to test on important data - it's your data, and you are a grown person, right? A UPS is recommended if you intend to try this. I also wouldn't try it on a server that is not rock solid or is prone to crashing. Keep your tests brief. Even with all that, there are still risks.
- for ramdisk ZIL testing you need a modern ZFS implementation, so that you can remove the unmirrored log device afterwards. You don't want to be stuck with a ramdisk ZIL.
- you need spare RAM to test a ZIL on a ramdisk.
- people will tell you that you are doing it wrong. They may or may not be correct. It depends on the case.

If there is much interest in this topic I might supply more detail on the commands I used to set up and remove the ramdisk zil and how I got the raid card driver and tools working on esxi on a non-hp whitebox server. No promises though.

Ant
 

mrkrad

Well-Known Member
Oct 13, 2012
A P420/1GB FBWC is like $269, and with a key :) you get HP SmartCache (CacheCade 1.0 equivalent). PCIe 3.0 instead of 1.0, DDR3 instead of DDR.

It works well in well-cooled servers. The chip was plagued by heat issues, so HP replaced it with a newer one - hence the P430, which uses a newer driver.
 

kroem

Active Member
Aug 16, 2014
Anyone else tried this? I have some raid cards I'm not using, could be fun to try :)
 

kroem

Active Member
Aug 16, 2014
I tried setting something up with a 512MB logical drive on my Adaptec 52445 - but is there any way to benchmark it properly?
 

gea

Well-Known Member
Dec 31, 2010
DE
Create a Windows VM on the NFS-shared datastore and do a CrystalDiskMark benchmark with sync=disabled vs sync=always, with and without the dedicated ZIL.
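The sync property is set per dataset, so the comparison boils down to this (the dataset name `tank/nfs` is an example):

```shell
# Force all writes through the ZIL - this is where a slog shows its value
zfs set sync=always tank/nfs
# ...run CrystalDiskMark in the Windows VM...

# Bypass the ZIL entirely - an upper bound, not safe for production NFS
zfs set sync=disabled tank/nfs
# ...re-run the benchmark, then restore the default:
zfs set sync=standard tank/nfs
```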
 

Entz

Active Member
Apr 25, 2013
Canada Eh?
I use this on my file server to great effect. In my case it was an older 3ware 9750-4i, 512MB, with an M4 128GB SSD (10GB partition) as the back-end store. Write speeds went up ~7x over just the SSD (which in itself was pretty terrible at 4K writes). Very similar to sync=disabled on the array (4x2TB 7K3000s).
 

kroem

Active Member
Aug 16, 2014
Create a Windows VM on the NFS-shared datastore and do a CrystalDiskMark benchmark with sync=disabled vs sync=always, with and without the dedicated ZIL.
I would, but now it seems like I can't start NFS... hmm:

Sep 4 06:07:19 napp-it-opteron nfs4cbd[480]: [ID 867284 daemon.notice] nfsv4 cannot determine local hostname binding for transport tcp6 - delegations will not be available on this transport
Sep 4 06:08:09 napp-it-opteron /usr/lib/nfs/lockd[484]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
Sep 4 06:08:09 napp-it-opteron svc.startd[10]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.
Sep 4 06:08:59 napp-it-opteron /usr/lib/nfs/lockd[539]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
Sep 4 06:08:59 napp-it-opteron svc.startd[10]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.
Sep 4 06:09:49 napp-it-opteron /usr/lib/nfs/lockd[543]: [ID 491006 daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : No such file or directory. Exiting
Sep 4 06:09:49 napp-it-opteron svc.startd[10]: [ID 652011 daemon.warning] svc:/network/nfs/nlockmgr:default: Method "/lib/svc/method/nlockmgr" failed with exit status 1.
Sep 4 06:09:49 napp-it-opteron svc.startd[10]: [ID 748625 daemon.error] network/nfs/nlockmgr:default failed: transitioned to maintenance (see 'svcs -xv' for details)


root@napp-it-opteron:~# svcs -xv network/nfs/nlockmgr:default
svc:/network/nfs/nlockmgr:default (NFS lock manager)
State: maintenance since Thu Sep 4 06:09:49 2014
Reason: Start method failed repeatedly, last exited with status 1.
See: SMF-8000-KS
See: man -M /usr/share/man -s 1M lockd
See: /var/svc/log/network-nfs-nlockmgr:default.log
Impact: 2 dependent services are not running:
svc:/network/nfs/client:default
svc:/network/nfs/server:default

;\

this is however a test box, nfs is running fine on the other one. Found this illumos gate - Bug #4518: /usr/lib/nfs/lockd: [daemon.error] Cannot establish NLM service over <file desc. 9, protocol udp> : I/O error. Exiting - illumos.org

EDIT: NVM, I just cleared the maintenance state and now it's up. :)
 

kroem

Active Member
Aug 16, 2014
Well... there's some difference all right, but the baseline performance of this setup is so low to begin with that I'm not sure what to make of it. I might move it to the other box and try there.
zil OFF / sync=always: test-zilOFF-syncALWAYS.png
zil OFF / sync=disabled: test-zilOFF-syncDISABLED.png
zil ON / sync=always: test-zilON-syncALWAYS.png
zil ON / sync=disabled: test-zilON-syncDISABLED.png
 

J-san

Member
Nov 27, 2014
Vancouver, BC
Just purchase a poor man's ZeusRAM:

100GB Intel S3700.
$225 CDN

100GB Intel S3700 slog attached to
4 Sata WD RE4 drives in Raid 10 (striped mirrors):



That's with sync=standard over NFS (= always).
ZFS compression was off. Thin-provisioned VMDK on NFS over a 10Gb virtual vswitch.

I ran into an issue with the 1000MB test in that the 7200rpm SATA vdevs couldn't absorb the 4K QD32 writes from the S3700 slog fast enough.
 

vikingboy

New Member
Jun 17, 2014
Thanks... I'm going to have to give that a go. I've got a 100GB S3700 lying on my desk and was looking for something to use it for.
 

J-san

Member
Nov 27, 2014
Vancouver, BC
I thought I would answer my own question:

100GB S3700 slog - 4x 2TB Re SATA drive pool.
Sync=standard over NFS

no modification for sd.conf:


Adding settings to sd.conf for the S3700 (non-volatile cache, etc.):
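For anyone curious, the sd.conf tuning being referred to is an entry along these lines in /kernel/drv/sd.conf (the vendor/product string below is my guess for a 100GB S3700 and must match your drive's inquiry string exactly, so check it before copying):

```
sd-config-list = "ATA     INTEL SSDSC2BA10",
                 "cache-nonvolatile:true, physical-block-size:4096";
```

Marking the cache non-volatile tells the sd driver it can skip cache flushes, which is only safe on drives with power-loss protection like the S3700.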


As a side note, either my CPU usage went down, or patching ESXi 5.5 to the latest patch release (Dec 1) fixed the reporting with hyperthreading enabled.
 

J-san

Member
Nov 27, 2014
Vancouver, BC
Snagged a couple of Dell-branded 400GB Intel S3700 SSDs for $300 USD each off eBay - they were used for half a year before the server was upgraded.

I'll post a benchmark for them used as a slog once they arrive.
They are spec'd to do 460 MB/s sequential writes vs 200MB/s for the 100GB.

The 4k@32QD iops are better for the larger drive:
100GB S3700 Random 4k write ops: 19,000
400GB S3700 Random 4k write ops: 36,000
(reads are the same, 75,000 iops @ 4k-32QD)

Beats my original plan of buying one 200GB S3700 for more than those two combined!
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
CA
I use this on my file server to great effect. In my case it was an older 3ware 9750-4i, 512MB, with an M4 128GB SSD (10GB partition) as the back-end store. Write speeds went up ~7x over just the SSD (which in itself was pretty terrible at 4K writes). Very similar to sync=disabled on the array (4x2TB 7K3000s).
I've seen some write-ups on using the onboard LSI cache for this too. For those of us like me who've been converted to ZFS for some things, I have an onboard 1GB just sitting there :)

Awesome to hear writes went up ~7x.

Any more testing done?
 

Entz

Active Member
Apr 25, 2013
269
62
28
Canada Eh?
Haven't done any more testing on that in a long time. 7x sounds like a lot, but the numbers were still pretty bad (the M4 was a crap drive). Sequential writes went from 53 MB/s to 135 MB/s, and 4K writes from 0.988 MB/s to 6.86 MB/s. I don't have the same base disks to try, but an S3700 can hit 224 / 11 on 6x2TB Reds vs the 4x2TB 7K3000s.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
CA
Hmmm, I'm going to have to try using my 1GB onboard cache + 1GB on a cheap controller to create my SLOG pool. I wish RAID cards with 4GB cache/RAM weren't so much $$$ - that would be ideal: throw two 4GB RAID cards in there for an 8GB RAM/cache SLOG :D