High performance local server build - starting advice

danwood82

Member
Feb 23, 2013
58
0
6
I'm looking to rethink my existing server build to best suit my requirements, but there are a few too many things I'm unclear on, and I hope some advice from you experienced bunch might just set me on the right track.

I previously built a combined dual-E5-2687W workstation/8x3TB RAID-10 server, with another two workstations attached to it by dedicated 10GBE connections, and everything running Windows 7 Pro.
This is for heavy fluid-simulation and rendering work, which is typically reading and writing files of 2-3GB per frame. That setup works pretty well, but it's far from ideal. It gets bogged down as soon as two workstations access at once, and I have to be very careful to reserve 1-2 of the 16 cores on the server-workstation to manage disk access and network, or it will bottleneck everything.
It also has the downside that I have to run a rather power-hungry workstation all day and all night even when I'm not actually using the processing power, and only need the file server running.


What I'm planning to do is shift the hard drive array into a 12-bay 2U enclosure, and get something like a Xeon E3 or low-end E5 uniprocessor motherboard/CPU - whatever has enough PCIe lanes to run two 10GBE cards and whatever disk controller cards I might need. I'd either expand from 8 to 12 drives now, or at least plan to expand to that later - it turns out even a 12TB RAID-10 gets eaten up in no time with this workload!

I'm also wondering whether to get my hands dirty and try to set up CentOS rather than relying on Windows.


The parts I'm unclear on are:
- If I'm running 12x3TB SATA drives in RAID 10 - is there any real performance benefit to having a dedicated hardware RAID card or other host adapter, or would I be just as well served using software RAID handled through CentOS?
- From what I've read, I get the feeling that setting up "NFS Over RDMA" could give a significant boost to performance working with large files over a 10GBE connection, but information seems pretty limited on this. Would this be the case, or am I getting the wrong end of the stick? I gather I would need to be running Linux at both ends to take advantage of RDMA, or is there some way I could do it with Windows based clients?
(I noticed Windows Server 2012 supports RDMA for SMB shares now, but I'd rather avoid having to pay out for Windows Server licenses, and my workflow should be transferable to Linux based workstations.)
- If I did run software RAID with these specs, would there be a minimum CPU capability I should be looking at, or would even something like the lowest-end dual-core Haswell Xeon be enough to keep up?

If anyone has any other suggestions for the best way to wring mostly-sequential-read-and-write performance out of a 10GBE-networked server, I'd love to hear them.
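For what it's worth, the client/server plumbing for NFS over RDMA on Linux is fairly minimal; this is a sketch of the stock kernel mechanism (the export path and hostname are made up), assuming RDMA-capable 10GbE NICs at both ends:

```shell
# --- server (e.g. CentOS), already exporting /tank over plain NFS ---
modprobe svcrdma                          # load the NFS/RDMA server transport
echo rdma 20049 > /proc/fs/nfsd/portlist  # tell nfsd to also listen on RDMA

# --- client ---
modprobe xprtrdma                         # load the NFS/RDMA client transport
mount -o rdma,port=20049 server:/tank /mnt/tank
```

As suspected above, this is Linux-to-Linux only; Windows clients would need SMB Direct (Server 2012's RDMA) instead.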
 

nitrobass24

Moderator
Dec 26, 2010
1,083
127
63
TX
Hey Dan - a couple of things you said made my ears perk up and I think we need to get a better understanding of your workload(s).

That setup works pretty well, but it's far from ideal. It gets bogged down as soon as two workstations access at once
Even if you are accessing the same file from two computers your IO is no longer sequential, because you are sending multiple requests and they are probably not for data that is in memory.

it also has the downside that I have to run a rather power-hungry workstation all day and all night even when I'm not actually using the processing power, and only need the file server running.
If you are not processing why do you need the file server? What else are you using it for?

My initial gut feeling is that you probably need to split out your file server config/arrays based on workload. You probably aren't even realizing the benefits of 10G with that setup because you are limited by random IO.

I'll wait for your answers, but maybe having two arrays on the same file server is a better idea. Use your existing array for bulk storage and get a couple of 1TB SSDs for a smaller array for your rendering jobs.

Using SSDs will greatly reduce the load on your file server because it can handle requests quicker. If you are not up for that, you could try migrating to a RAID 5 and see what kind of performance you gain with that, but I would suspect it's marginal at best.
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
What's your actual throughput with the current setup? Maybe 350MB/s writes with both workstations writing and 275MB/s when writing from a single workstation? A bit less than double that for reads? For your workload, what is the ratio of reads to writes? Share any data you have and then we can give you some suggestions.
 

TallGraham

Member
Apr 28, 2013
143
23
18
Hastings, England
Great answers here so far.

I would check what the throughput is on your 10Gb network connections as people have suggested. 1Gb connections theoretically max out at 125MB/s. I've never managed to hit that though, only 100MB/s ish on very large files being transferred between systems with solid state drives.

I would always go with a hardware RAID controller, as that then handles all the disk stuff for you, freeing up CPU cores. For just running a fileserver I would go with a hardware RAID card and a lower-spec CPU. I have just set up a box using the Xeon E3-1220L V2 - it tops out at 17W - and I am really pleased with it. I am running 3 x Adaptec hardware RAID cards in there, 16 x 1TB 2.5" drives and 4 x SSDs, and the whole thing runs under 100W.

The types of work you are doing are very bespoke and intensive though - what do the other 2 workstations do, if you don't mind me asking?

I can't help wondering if you could run a hardware RAID card in the workstation for your heavy-duty work, possibly with SSDs, and then copy your files over there to work on, rather than across the network. Copy them back to the fileserver once you are done. Would this be feasible? I don't know how big the files get that you are working on.

Might seem like a lot of questions from everybody but let me assure you that the people on this forum are A1 diamond people. They have all helped me massively so far with my build and saved me a fortune from buying the wrong stuff.
 

clonardo

New Member
Sep 17, 2013
6
0
1
You're in a bit of a pickle, storage-wise. I have similar needs, and have settled on fast PCIe SSDs (currently running an LSI WarpDrive in one box and OCZ Z-Drive R4 in another). Magnetic media just won't cut it for this sort of thing.

While the I/O performance of these drives is just ridiculous, so is the price - you're looking at about $2k for 300GB. If you're in a position where you can do this for active files and archive everything else to magnetic media, it's quite doable. I was actually able to score the WarpDrive slightly used for $700 recently, and have seen lots of Fusion-io cards in the same range.
 

danwood82

Member
Feb 23, 2013
58
0
6
Thanks everyone. Okay, I'll try to give an idea of my usage.

The first of the two main reasons I want to split the server/workstation out is that when I'm not in crunch time, I quite often want access to my files while I sit and work on my laptop, with the workstation just idling. It's also my FTP server, so I quite often want to leave it on 24/7 so that clients/colleagues can download/upload files directly.
The other reason is that when things are busy, I have to be careful to always reserve cores on the server/workstation so that whatever jobs I'm running on it don't steal away 100% CPU from the storage and network systems, which grinds the rest of the workstations to a halt... so I've got a couple of useful cores sitting idle no matter what I'm doing. It would be a lot more straightforward if they were a couple of cores purpose-allocated to the task.

Including the workstation that's currently acting as a server, the three machines are typically running either:
- fluid simulations, turning out a ~2-3GB cache file per frame, every few minutes.
- secondary simulations, loading those 2-3GB files back in, as source to run a driven simulation, which will usually kick out their own ~1GB cache files every couple of minutes.
- meshing, mostly just the time to load the main sim files, as the resulting meshes tend to be no more than a couple-hundred MB per frame.
- rendering, at its lightest may only be loading a single mesh per frame, but at its heaviest may be loading 10-20GB of data over the course of rendering a single frame.
(edit: also, due to the quirks of NUMA, I actually find it's far more efficient to launch two separate jobs on each workstation, affinity-bound to each CPU, so in full-crunch-time-mode, I can be kicking out up to 6 of the above simultaneously)

...so I guess aside from the initial sim, the workflow is significantly more read-heavy than write-heavy, although writes seem to bottleneck a lot easier, and I often feel like I'm waiting around for file writes more than reads.

The heaviest shots I work on can eat up anything up to 6-8TB for all passes of a single iteration of a shot, over the span of 4-5 days. I'd love to kit everything out with SSDs, but as I mentioned, even with 12TB (24TB RAID-10), I'm still finding it's not enough. SSDs are pretty much a complete no-go.
That's pretty much the issue with all this - it's not a case of having multiple TB of long-term storage and a smaller fast working drive. Those 12TB *are* my scratch disk! :p
I really couldn't resort to a manual process of copying data to a local SSD, processing it, and copying it back. I'd need to do it almost on a pass-by-pass basis, and the time it would take to wrangle all that data around by hand wouldn't be worth the time saved - I used to work that way, and my current workflow is a vast improvement already.

I'll have to test exact figures, as I'm really not sure at the moment, but I'm definitely getting decent use out of the 10GBE connections... At its best I recall getting over 300MB/s reads and writes on the array locally, and a single workstation can easily manage the same speeds across the ethernet cards.

Write speeds seem to be hugely dependent on the program doing the writing - straight file-copy operations are as fast as if they were local, but writing out certain types of file seems to crawl along at 20-30MB/s over Ethernet. This is where I was hoping RDMA might help, as it sounds like it sidesteps at least some of the network packet negotiation overhead.

I'm running RAID 10 as my experiences with RAID 5 in the past have all taught me that it's best avoided unless you're either more concerned with cost than performance, or are willing to spend a vast sum on a very very high-end host adapter.
I know RAID 5 is effectively crippled without a powerful storage controller, but I've read a few times that RAID 10 can actually benefit more from being run in software and being allowed to make use of the CPU. Is there any truth in that?
 

gea

Well-Known Member
Dec 31, 2010
2,485
837
113
DE
A Raid-5/ZFS Z1 can deliver better sequential read/write values than a Raid-10. Your experiences are a clear indication that your workload is more I/O related. In such a case a Raid-5 has the I/O of a single disk, whereas a Raid-10 has the write I/O of two disks. A multiple-mirror Raid-10 pool, which you can create with ZFS, has the I/O of the number of mirrors in the pool.

I would always prefer software Raid like ZFS over hardware Raid: in nearly all cases you get better performance, because you use a fast CPU and system RAM instead of the limited CPU and RAM capabilities of your controller - if you can use a modern and fast CPU. There is also no write-hole problem like with Raid 5/6, where you need controller cache and BBUs. Beside that, software Raid is controller independent and less costly. Data security with ZFS is also much better than with NTFS or ext filesystems.

The best solution for good I/O is an SSD-only pool. But with Raid you do not have trim, so you need very good SSDs like the Intel DC S3500/S3700 series. The only problem is mainly the price of these SSDs. With SSD-only pools, you can use Raid-6/Z2 based Raid levels, where you do not need that many disks.

Traditionally you use pools with as many fast spindles as possible in a multiple-mirror Raid-10 pool. For example, if you use a datapool built from 10 mirrors (20 disks), you have the write I/O of 10 disks and the read I/O of 20 disks. This can be quite ok but is not better than a single enterprise SSD like an Intel S3700.
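As a concrete sketch of the multiple-mirror pool described above (pool and device names are placeholders - substitute your own):

```shell
# A 12-disk pool of six 2-way mirrors: write I/O of ~6 disks, read I/O of ~12.
zpool create tank \
  mirror c1t0d0 c1t1d0  mirror c1t2d0 c1t3d0  mirror c1t4d0 c1t5d0 \
  mirror c1t6d0 c1t7d0  mirror c1t8d0 c1t9d0  mirror c1t10d0 c1t11d0

zpool status tank   # should show six mirror vdevs striped together
```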

What you can do is to reduce the concurrent I/O load of several clients by creating a dedicated pool for every workstation. You should also consider that performance lowers with fillrate of disks or pools. If possible, stay below a 70% fillrate.

You can also reduce the I/O needs if you use a lot of RAM for read caching on a filesystem that supports this, like ZFS, optionally paired with additional SSD read caches. If your clients are creating concurrent small writes, you also benefit from the ZFS write cache, which collects 5s of writes and then writes them out as a single sequential write.

What I would recommend is a ZFS appliance built on something like a Supermicro X9SRH-7TF, which gives you a high-end SAS controller and a 10GbE interface onboard. Add any Xeon with as much RAM as needed for caching. Use a Netgear 8 x 10GbE switch to connect all machines with 10GbE.

Use a case that can hold 16 or 24 disks with a backplane (Norco, SuperMicro etc), possibly a 2.5" case like the SuperMicro SC216A-R900LPB if you intend to use SSD pools now or later, and add one or two SAS controllers when needed, like the LSI 9207 or IBM M1015. Without SSDs, use as many disks as possible for performance.

You can use SMB to connect them. Such appliances can deliver up to 300 MB/s over 10GbE. Values with iSCSI or NFS are better, so those may be an option as well.

The fastest and easiest options with SMB, iSCSI and NFS are ready-to-use, web-managed Solaris-based appliances (NexentaStor, or OmniOS/Oracle Solaris 11.1 with my napp-it). With NFS you can use BSD web-based appliances like FreeNAS as well (SMB and iSCSI are mostly slower compared to Solaris). Linux with ZoL is a newer option, but I would not (yet) declare it an alternative for a storage-only box unless Linux is your first demand.
 

TallGraham

Member
Apr 28, 2013
143
23
18
Hastings, England
Hi Dan

I still have loads more questions, though I understand better what you are doing now. My only experience with anything remotely similar would be SQL database servers with very large databases and log files etc. Sorry if this sounds like a huge list, or a bit rude - it isn't meant to be. Unless I bullet-point it I will waffle on for pages and likely blow up the STH forum ;)

Questions about your 12TB RAID10 that you use on the workstation/server.
- Is it just 1 giant RAID10 using all of the disks or do you have multiple arrays?
- Does it have multiple partitions?
- Are you running the operating system on this array as well?
- What is your RAID stripe size and what NTFS cluster size are you using?
- Can you do a benchmark of the workstation/server RAID10 while it is doing nothing, using CrystalDiskMark or something like that and post results please?

When creating SQL servers in the past, Microsoft always said you should use 2 x disks in a RAID1 for the operating system, then another 2 x disks in a RAID1 for the database logs, and finally a multiple-disk RAID5 array for the database files to live on. This was so the OS is on totally different disks and you are not slowing down the database disks while waiting for the OS to do something. In some cases people even have another 2 x disk RAID1 for the pagefile.

As you are writing huge files, and I'm guessing the app you are using writes to them in huge lumps of data too, you really want the largest RAID stripe size and the largest NTFS cluster size you can get. I am basing this again on my experience from SQL servers with huge databases. Check out my build thread below

http://forums.servethehome.com/diy-server-builds/1766-new-home-nas-hypervisor-setup-6.html#post24937

See the difference in disk throughput between Array 2 and Array 3. Totally identical hardware in both arrays, just different RAID stripes and NTFS cluster sizes.
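For reference, on the Windows side the NTFS cluster size is fixed at format time, so it has to be chosen up front; this is just an illustrative command (drive letter and label are examples):

```shell
rem Format the array volume with 64 KB NTFS clusters instead of the 4 KB default
format E: /FS:NTFS /A:64K /V:Scratch /Q
```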

Questions about your 10Gbe network
- What switch are you running for this?
- Do you have Jumbo Frames or anything like that enabled?

I only run 1Gbe network so I can't help too much here. I have seen lots of people on the forum talk about using Jumbo Frames on 10Gbe to get a benefit though.

Finally, there was a really good thread on here about simply changing the Windows power management plan in Control Panel to High Performance. A forum member was running tests on SQL Server, and just by doing that alone increased their IO throughput massively. So these are a few simple things that may help you out without having to spend lots of money on SSDs and such.
 

MiniKnight

Well-Known Member
Mar 30, 2012
2,987
890
113
NYC
I can see why 300GB would not cut it.

Looking at this I would try to have, say, 3-10TB of SSDs in front of the disk storage array. The price is hovering at $450 +/- $20: http://www.amazon.com/gp/product/B0...eASIN=B00BQ8RGL6&linkCode=as2&tag=servecom-20

EVOs are $100 more. Expensive solution, but if it increases your speed significantly that would be a great thing.

Another idea: I wonder if you need to build out to something like 48TB, if 12TB is getting eaten up quickly. Disk IOPS are higher the less full the array is - for performance you do not want to be using the full capacity of the disks for read/write. Maybe a 4U would make more sense, since you could get more drive bays.
 

Salami

New Member
Oct 12, 2012
31
1
0
" It gets bogged down as soon as two workstations access at once"

Is the second workstation accessing the same file on disk, or different files?
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
Hi Danwood82,

Summarizing from your writeup, you have eight spindles and 12TB of SATA storage (growing to 12 spindles and 18TB) with six writers (three workstations with two processes each) occasionally spitting out 1-3GB files and then reading them back as needed. You want to move it all to a home-grown NAS, connected via 10GbE, and you need better performance.

I’ll hypothesize that your processes are generating logically sequential or mostly sequential writes (the 1-3GB files are spit out all at once, not slowly over time) but the fact that you have multiple readers and writers, combined with the RAID implementation, is causing the IO to look more random than sequential. This would explain why some tasks speed along at ~300MB/s while others slow to 20-30MB/s.

Your current eight disk setup of SATA drives in RAID10 should be good for ~800MB/s read throughput and ~400MB/s write throughput under ideal conditions - which means sequential access from a single source. Those disks should also be good for ~800 read IOPS and ~400 write IOPS, exclusive of caching and not dragged down by other bottlenecks. By comparison, your dual 10GbE links are good for a combined ~2,000 MB/s and at least 100K IOPS. Without RDMA, we can de-rate the network to say 1,200 MB/s and 20K IOPS, which is much lower than with RDMA, but still far greater than your disks will deliver. And so, given the above, my strong guess is that you are disk limited. Adding four more disks will help by 50%, but you’ll still be very IOPS limited.
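The ballpark figures above can be reproduced with some quick shell arithmetic; the per-disk numbers below are generic assumptions for 7200rpm SATA drives, not measurements of this particular array:

```shell
#!/bin/sh
# Back-of-envelope estimates for an 8-disk RAID-10 of SATA spinners.
# Assumed per-disk figures: ~100 MB/s sequential, ~100 random IOPS.
DISKS=8
PER_DISK_MBS=100
PER_DISK_IOPS=100

# RAID-10 reads can hit every spindle; writes are mirrored, so half count.
echo "seq read  MB/s: $((DISKS * PER_DISK_MBS))"        # 800
echo "seq write MB/s: $((DISKS / 2 * PER_DISK_MBS))"    # 400
echo "read IOPS:      $((DISKS * PER_DISK_IOPS))"       # 800
echo "write IOPS:     $((DISKS / 2 * PER_DISK_IOPS))"   # 400
```

Going to 12 disks scales each line by 1.5x, which is still well short of what dual 10GbE can carry.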

The solution? First, I like your idea of moving storage to a separate box. It'll be more useful for you, as you pointed out, and will also give you a chance to improve performance as well. Adding four more disks will bring your peak throughput closer to the capabilities of your network, but you’d still benefit from even more; your dual 10GbE could take advantage of around 24 spindles worth of sequential read throughput, if not more.

Your biggest need, however, is cache for reads and writes. Your lovely sequential disk access will go random when you have more than one process reading or writing, which will destroy performance. A non-volatile write cache larger than the biggest simultaneous writes (say four writers writing 2GB each = 8GB cache) would dramatically improve performance. The small battery-backed cache on a RAID card would certainly help, but not as much as a larger one. A very large and fast read cache (hundreds of gigabytes, >1GB/s) would help to reduce the load on the disks.

So here are some ideas for you:

1) Deploy software on your new storage box that gives you plenty of read and write caching. One example is ZFS with plenty of RAM plus ~4 SSD drives as read caches (L2ARC) and a few fast, high endurance SSD ZIL devices to cache writes. Lots of people will recommend this solution, and it’s a good one. Another example is Windows 2012R2 storage spaces, which can use SSD devices to cache reads and writes.

2) Use caches local to the workstations. As an alternative to adding huge caches to the storage server, you can add smaller "transparent" caches to the workstations. I’d try Velobit, which is designed for server applications. They even have a free version that will give you up to 32GB of cache per workstation. If you can stand to lose data if something crashes, you could even use RAM disks (via the free version of StarWind) as your Velobit cache devices instead of SSDs. Add an extra 32GB of RAM to each workstation, configure as RAM disks, and present to Velobit. Done, and good for 4-10GB/s.

Personally, I’d do both: Use ZFS or Win2012R2 cached storage spaces on my new storage box, adding some cache devices, and also try Velobit on the workstations. If Velobit didn’t work out, I’d then just add bigger caches to the storage server. After that, I’d add RDMA to the mix to kick it up a notch further. With Windows, SMB3 with RDMA would be my choice - it’s extremely simple to use and massively fast. On Linux, you have a number of protocols to choose from, only some of which work with Velobit. Since RDMA is of secondary importance compared to caching, I’d defer that particular decision until after you test Velobit.
 

bds1904

Active Member
Aug 30, 2013
271
76
28
Have you considered a fileserver using ZFS with an SSD ZIL and l2arc?

The zil will be your write cache and the l2arc will be your read cache.

So sync writes will go machine > ZIL > spinning drives.
Recently written or read data is held in RAM (the ARC), spilling over to the L2ARC SSDs as the ARC fills, so repeat reads go ARC/L2ARC > machine.

Accessing old(er) files will go spinning drives > ARC/L2ARC > machine.
A 2nd read will be ARC/L2ARC > machine.

Basically you would have a really big write/read cache. In combination with a good array of drives - like 4x RAIDZ1 (3-drive) vdevs, or even mirroring (4 groups of 2-drive mirrors) - it would be a killer setup. Lots and lots of IOPS.

With this setup everything that gets presented to the spinning disks pretty much is a sequential read/write.

It'll be expensive up front, but if done right you won't outgrow it anytime soon.
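If the pool already exists, the ZIL and L2ARC devices described above bolt on afterwards; pool and device names here are hypothetical:

```shell
# Dedicated log (SLOG) devices: mirrored, since losing an unmirrored SLOG
# at the same time as a crash can lose the last few seconds of sync writes.
zpool add tank log mirror ssd0 ssd1

# L2ARC read cache: striped, no redundancy needed (it only holds copies).
zpool add tank cache ssd2 ssd3
```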

Just to show you the idea of random IO/sec here is a test run from a VM of mine. My ZFS server only has a 1gbit link to the esxi box, the ZFS server is 10+yr old hardware (just regular old dual opteron 270 with DDR1 memory), 4GB ZIL on memory and no l2arc. Drive array is 4x 3 Drive RaidZ1's (10k RPM 146GB drives) via dual loop 2gb fiber channel.

Even with this old crap and 17 VM's running from the array:


If I can get that IOPS out of that old crap $300 hardware (server, memory, disks and everything) with 17 VM's running, imagine what you can do with a new setup.
 

danwood82

Member
Feb 23, 2013
58
0
6
Fantastic advice, massive thanks everyone!

So, from what dba and bds1904 are saying, it sounds like my best bet is cache, cache, and more cache. I've always gotten the feeling that what I really needed was some way to hold back magnetic-disk-writes until they could be done quietly and sequentially in the background... I just never seemed to find a solution that sounded like it would automate the process.
It looks like ZFS could be that solution... it's entirely new territory for me though - I've been googling away to find out what the hell ZIL and L2ARC even are :p

I get the feeling my best course of action at this point would be to rig myself up a test-bed server out of spare parts/old drives/SSDs, and play about with the software until I'm familiar with how to set it up. Hopefully I can avoid making too many stupid purchasing decisions that way :)

So, a couple more questions:
- What's my best OS bet for setting up ZFS? I'd always assumed it was just another Linux-native FS, but I see it's actually only native to Solaris, and Linux support seems a tad confusing. Would you advise going for Solaris (and what are the licensing implications there?), or would it be better to go for a Linux distribution, and if so, which one?
- What would be the implications of mixing a Solaris or Linux server with Windows clients? Should I avoid that?
- If I wanted to go down the Windows Server 2012 R2 route, what would be the minimum required version for this application? (Essentials/Standard/Datacenter?)
- Would I need Windows Server or Windows 8 based clients to get the most out of this, or would Windows 7 clients be fine?
- What hardware is required/recommended for ZIL and L2ARC? Some stuff I've read seems to suggest you'd need an SLC-based drive for the ZIL cache, but MLC would be fine for L2ARC. Would I need drives tailored to the task, SAS drives, or would ordinary SATA-based SSDs work fine?

Alas, I may have to park this thread soon and resurrect at a later date, as I have some work coming in that will probably tie me and my hard drives up for a month or two. Hopefully I can do some studying-up on the stuff you guys have suggested in the meantime though.

Thanks again for all the help!
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
I'm a Solaris 11.1 fan, but you can also use OmniOS, which is fully open source.

If you decide to go with ZFS, you may wish to make administration easier by adding a GUI like napp-it (a web-based ZFS NAS/SAN appliance for OmniOS, OpenIndiana and Solaris).

Your ZIL will see a write-heavy workload, so endurance can be an issue. Calculate your writes per year, decide how many years you want to keep your SSD drives, and buy drives that are able to handle at least that much write load. SLC drives of course have the most endurance, but eMLC is good enough for most workloads, and many people can get away with plain old MLC, especially if they format to leave quite a bit of free space, e.g. buy 256GB drives and format them to just 128 to 200GB. Whatever drive you use, you'll probably need multiple ZIL devices to meet your performance goals.
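The endurance math is simple enough to do in a shell; the daily write volume and drive lifetime below are made-up example figures, not estimates of this workload:

```shell
#!/bin/sh
# Rough SSD endurance budget: total writes over the drive's planned life.
DAILY_GB=500   # assumed GB written per day (example figure)
YEARS=3        # how long you plan to keep the drives

TOTAL_TB=$((DAILY_GB * 365 * YEARS / 1000))
echo "lifetime writes: ${TOTAL_TB} TB"   # 547 TB for these example numbers

# Compare that against the rated endurance (TBW) in the drive's datasheet.
```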

ZFS handles Windows clients just fine, but you won't be able to use SMB Direct (RDMA and multi-channel). In fact, you'll need Windows 2012 on your server and Windows 2012 or Windows 8/8.1 on your clients to use SMB3 RDMA at all.

 

gea

Well-Known Member
Dec 31, 2010
2,485
837
113
DE
If you want to try a ZFS filer, you do not need a ZIL. This is a log device to improve performance with secure sync writes. Usually a filer with SMB, NFS or iSCSI does not request sync writes, so you do not need one.
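Whether a dataset is honoring sync writes at all is easy to check and control per dataset (pool/dataset name below is made up):

```shell
zfs get sync tank/renders      # 'standard' = honor client sync requests

# For testing you can rule the ZIL in or out entirely:
# zfs set sync=disabled tank/renders   # never sync (fast, unsafe on crash)
# zfs set sync=always tank/renders     # force every write through the ZIL
```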

An L2ARC is good if you are low on RAM, as it caches recent read requests (block-based) on SSDs and extends the RAM/ARC caching. If you have the option, buy more RAM (much faster for caching) and skip the slower L2ARC SSD. (RAM is 100x faster than SSDs, but SSDs are faster than disks.) 64GB RAM without a cache SSD is always faster than 32GB with a cache SSD at the same price.

You only need to take care about hardware compatibility with Solaris/OmniOS. Best are Intel-based mainboards with server chipsets, Intel NICs and LSI HBA controllers. I would prefer SuperMicro, as noted above.
 

bds1904

Active Member
Aug 30, 2013
271
76
28
Forgot to mention that ZFS does end-to-end error checking too, so bonus there.

You are sure to come across the "to ECC or not to ECC debate" also.

Personally, if you are using this in a production environment for business then you should use server-grade hardware, which includes ECC memory.

For ZFS you want to use a JBOD controller - no hardware RAID here. Personally, I really like the LSI SAS 9205-8e. Pick up a Rackable SE3016 for $200. Add in something like an HP 2U G5 or G6 dual quad-core with 16-32GB memory and you'll have the hardware you need. Add in your SSDs and you are good to go.

The reason why I recommend the 2U HP's is because I find their 2.5" drive bays built into the chassis very convenient for use with SSD's.
 

dba

Moderator
Feb 20, 2012
1,478
181
63
San Francisco Bay Area, California, USA
If you want to try a ZFS filer, you do not need a ZIL. This is a log device to improve performance with secure sync writes. Usually a filer with SMB, NFS or iSCSI does not request sync writes, so you do not need...
I don't use VMware for IO-intensive workloads, so I've never done any detailed testing, but my understanding is that VMware defaults to all sync writes with remote storage, which is why a low-latency network and a fast SLOG (a dedicated ZIL device) are a good idea... not that it's a panacea.
 

33_viper_33

Member
Aug 3, 2013
200
2
18
ZFS handles Windows clients just fine, but you won't be able to use SMB Direct (RDMA and multi-channel). In fact, you'll need Windows 2012 on your server and Windows 2012 or Windows 8/8.1 on your clients to use SMB3 RDMA at all.
You are referring to just the SMB3 protocol, correct? It will still use iSER when setting up iSCSI, correct?
 

Chuntzu

Active Member
Jun 30, 2013
383
98
28
You are referring to just the SMB3 protocol, correct? It will still use iSER when setting up iSCSI, correct?
Correct - ZFS with either Solaris or Linux will not be able to use SMB3 RDMA. It will be able to use iSER or SRP RDMA, but Windows does not have iSER or SRP baked into Server 2012 or Windows 8/8.1. I have only read one post on OFED indicating Server 2012 running SRP with some older drivers loaded in safe mode, but it sounded very hacky and a terrible pain to get up and running. So the only RDMA access for Windows is via SMB3 - no block-level RDMA protocols - whereas other OSs (Linux and Solaris) have the RDMA block protocols and NFS over RDMA. Hopefully this is what you were looking for?
 

danwood82

Member
Feb 23, 2013
58
0
6
Hey all, cheers again for the suggestions. I've been attempting to swot up on setting up ZFS servers by temporarily turning one of my workstations into a testbed.

I've tried out OpenIndiana with napp-it, FreeNAS and NAS4Free so far. I'm certainly seeing the benefits - in all cases the caching massively improves 512k and 4k read and write benchmarks. It's definitely looking like the way to go.

I could do with some pointers though. For the moment, I'm trying to get my head around things and work out how to get the best performance out of just the 32GB RAM caching, so I'm testing with a single mirror of a couple of old 500GB drives, with no SSDs for ZIL or L2ARC.
To that end, FreeNAS seems to outperform the other two considerably "out of the box". I'm testing by copying directories of sim files over CIFS/SMB, and by running CrystalDiskMark. With FreeNAS I get over 300MB/s sequential read and write speeds, which seem to effectively ignore magnetic storage limitations until the RAM gets filled up (it seems the same whether I try a mirror or a three-drive stripe). The only problem is, FreeNAS stalls completely every 15-20 seconds or so when I'm writing large quantities of data, and can take anything up to a full minute to kick back into action. Occasionally it goes so long that the file transfer gives up and a file fails to write, leaving occasional zero-byte files. This happens regardless of how full the RAM is, but it happens slightly less frequently if I use the 1GbE NIC instead of 10GbE.

NAS4Free and OpenIndiana both avoid that issue entirely, but while they both help with caching reads and writes, their average transfer rates seem bound to the speed of the magnetic storage pool from the moment a transfer begins, long before the RAM is even half-full.

Is that expected behaviour? Should the caching be 'pacing' to match the long-term throughput of the drive pool, or should it be able to max the ethernet connection so long as there is free space in the cache?
I've found FreeNAS to be far and away the simplest and most user friendly of the three so far, but I worry that this odd glitch may be indicative of things being unreliable down the line. Would a Solaris-derived OS with true native up-to-date ZFS be a safer bet?

I'd welcome any top-tips on how to go about understanding/optimizing the ZFS caching performance on any of these (or others if you think they might be better).
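One avenue worth exploring on the FreeBSD-based options (FreeNAS/NAS4Free of this vintage): the stall-then-burst pattern described above is consistent with ZFS transaction-group writeback, where several seconds of buffered writes are flushed in one go, and the old write throttle exposed a couple of sysctl tunables for it. The values below are examples to experiment with on a testbed, not recommendations:

```shell
sysctl vfs.zfs.txg.timeout                        # seconds between forced txg flushes (default 5)
sysctl vfs.zfs.txg.timeout=2                      # smaller, more frequent flushes
sysctl vfs.zfs.write_limit_override=1073741824    # cap buffered dirty data per txg at ~1 GiB
```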