That certainly looks like it would work. I would think the main reason not to use the secondary port on the existing card would be if you're overrunning the PCIe capability of that single card/slot.
How is SAS intended to be expanded when using external enclosures?
The JBOD chassis is the Supermicro SC837E26-RJBOD1 (3U).
Now, if we want to scale this with another enclosure, do I add yet another LSI 9207-8e in each machine (if I don't want to use the free SFF-8088 port on the first LSI card), like this:
Certainly on paper, you're getting 50% less performance. But in the real world, I suspect you're not going to see that as a 'limit'. Your original post suggested you're currently on 1GbE NFS; even 2x10GbE NFS would still be less than 4x6Gbit.
The other way seems to be using SAS chaining, but that would seriously limit performance since one SFF-8088 port only has 4x6Gbit/s?
Aren't most 7.2K SATA disks 3Gbit/sec, not 6Gbit/sec like SAS? So you'll be at 4x3Gbit? And right now you're running 8x M500 SSDs but are considering going to 24x 7.2K HDDs?
Using the JBOD chassis above I can only connect to it using one SFF-8088 port, so from each server node I am limited to 4x6Gbit/s (or 3GByte/s) to the array. Using 24x 7200rpm spinners (each with 180MB/s) in a 12-vdev setup I have a theoretical max of 2160MB/s.
So in my case the array's theoretical max of 2160MB/s is below the SFF-8088 max of 3GByte/s. Sounds correct?
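As a rough sanity check of my own numbers, here is a back-of-the-envelope sketch (assuming ~180MB/s sustained per spinner as above, and noting that 6Gbit SAS lanes use 8b/10b encoding, so usable bandwidth is closer to 600MB/s per lane than the raw 750MB/s):

```python
# Back-of-the-envelope check of the SFF-8088 vs. pool streaming numbers.
# Assumptions (not measured): ~180 MB/s sustained per 7.2K drive,
# 6 Gbit/s SAS lanes with 8b/10b encoding -> ~600 MB/s usable per lane.

LANES_PER_SFF_8088 = 4
USABLE_MB_PER_LANE = 600            # raw 750 MB/s minus 8b/10b overhead
port_bw = LANES_PER_SFF_8088 * USABLE_MB_PER_LANE    # ~2400 MB/s per port

MIRROR_VDEVS = 12
MB_PER_DRIVE = 180
# Writes land on both halves of each mirror, so the streaming write
# ceiling is one drive's bandwidth per vdev; reads can use both halves.
pool_write_bw = MIRROR_VDEVS * MB_PER_DRIVE           # ~2160 MB/s
pool_read_bw = 2 * pool_write_bw                      # ~4320 MB/s, best case

print(f"SFF-8088 usable bandwidth : ~{port_bw} MB/s (raw 4x6Gbit = 3000 MB/s)")
print(f"Pool streaming write max  : ~{pool_write_bw} MB/s")
print(f"Pool streaming read max   : ~{pool_read_bw} MB/s")
```

So on the write side a single SFF-8088 uplink still has a little headroom over the pool even after encoding overhead, although best-case mirrored streaming reads could in theory out-run one port.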
Hi, thank you for your detailed response. We will be going with 10GbE for now but may move to 40GbE using switches that have 24x 10GbE ports and 2x 40GbE ports. In that scenario the bottleneck would occur in the SAS chain.
That certainly looks like it would work. I would think the main reason not to use the secondary port on the existing card would be if you're overrunning the PCIe capability of that single card/slot.
Certainly on paper, you're getting 50% less performance. But in the real world, I suspect you're not going to see that as a 'limit'. Your original post suggested you're currently on 1GbE NFS; even 2x10GbE NFS would still be less than 4x6Gbit.
We are looking at 7200rpm SAS drives since we require HA capabilities for this build. Sadly, SAS SSDs are too expensive, so the only route forward is HDDs in a shared JBOD enclosure (for OmniOS+RSF-1, Solaris+RSF-1, Nexentastor, Quantastor, Microsoft Storage Spaces).
Aren't most 7.2K SATA disks 3Gbit/sec, not 6Gbit/sec like SAS? So you'll be at 4x3Gbit? And right now you're running 8x M500 SSDs but are considering going to 24x 7.2K HDDs?
Yes, they are dual-ported SAS drives (i.e. Seagate Constellation SAS or Ultrastar 7K6000 SAS).
My general (legitimate) questions, as I'm curious how this build works out:
* The current picture shows 1 JBOD being shared by two head units - are the 7.2K spinners going to be dual-ported NL-SAS? Can this be done with single-ported SATA disks? I wouldn't think so.
Yes, we use 24 drives in a RAID-10 config (12 vdevs, each a 2-disk mirror). We use 2 extra drives as hot spares and 2 slots for mirrored ZeusRAMs (a total of 28 3.5'' slots).
* You mention 24 disks, but a 28-disk JBOD - so assuming 4x SSD to handle the caching? I just didn't see the SSDs mentioned. Same dual-port question applies.
Yes, the density is low due to licensing issues for the software running on the VMs (we prefer a few VMs with a lot of resources over many VMs with fewer resources each).
* Your original post suggests you're using this for 1GbE NFS, and it's shared to 10 XenServer hosts running ~20 VMs (really 1-2 VMs per host? That seems like horrible density...). But you point out your desired speeds in sequential MB/sec throughput - which has nothing really to do with running VMs. Largely you're probably going to care a ton more about IOPS and latency of 4K-ish workloads for the virtual machines, unless you're streaming large data, which is probably never going to be sequential on a shared SAN.
The budget was increased to ~$25,000; we realized this was impossible with the previous budget. We have a separate budget for switches. If we manage to get this SAN built for $20,000 we will have $5,000 extra for the switches, which would be nice.
* Does your $15K budget include just the SAN/storage, or does it also include a bump up to 10GbE on the hosts, SAN, and switching? Even with home-lab pricing and equipment, the NICs, cabling, and switches are likely to eat up ~$5,000+ of that (~$200 2x10GbE NICs x 12, with 2x cabling each, 2x 24-port 10GbE switches).
I have not yet picked the switches for this, so I do not know if we can do active/active or active/passive :/
* If you're providing 2x10GbE out to the hosts (and can you do the full 20, or will you be doing 10 with failover?), should you be worried about theoretical limits of 12 or 24 or 48Gbit/sec to the JBODs when you're not able to push that much out through the front hole to the network? Granted, internal SAN tasks don't traverse that, but if you're honestly running at peak that often, you need to build for a size up, no?
Do you have some recommendations on how I should focus my work? Using the 12-vdev RAID-10 array + ZeusRAM I think I have maximized the IOPS, but again I do not know how the IOPS will be affected if we are bottlenecked on sequential throughput.
I'm speaking to all of this from a VMware background, where I use storage appliances (e.g. NetApp, Nimble, etc.) rather than building my own. So I have an honest curiosity about some of the above answers and outcomes, as I'm not sure I know the answers.
I think the biggest question I have is "why are you so focused on sequential read/write, when that's likely _never_ going to be what you actually need?"
I believe a ZFS-based solution without an SSD log device will give you terrible write performance unless you configure it with sync=disabled. But if you go with sync=disabled, HA will not work, because if you have a failover for any reason, the cached write data in memory not yet committed to disk will be gone. The VMs will not notice, because the NFS volume is still available, but up to 5 seconds of data is gone (= a bad day).
We are looking at 7200rpm SAS drives since we require HA capabilities for this build. Sadly, SAS SSDs are too expensive, so the only route forward is HDDs in a shared JBOD enclosure (for OmniOS+RSF-1, Solaris+RSF-1, Nexentastor, Quantastor, Microsoft Storage Spaces).
We have been using Cisco Nexus 5010s as 10Gbit ToR switches for 5 years now and I love them, rock solid. So if I were building a low-cost environment now I would buy used Nexus 5010s (or 5020s) from eBay (I have seen them as low as 1200 EUR).
The budget was increased to ~$25,000; we realized this was impossible with the previous budget. We have a separate budget for switches. If we manage to get this SAN built for $20,000 we will have $5,000 extra for the switches, which would be nice.
Sounds like a good idea.
So I'm just one guy, doing it one way, I'm sure YMMV. But...
- I just about NEVER see anyone maxing out 10GbE for iSCSI/NFS, even big shops. Maybe during a boot storm or "patch day" or something, but even then it's never maxed out. I don't know that you'll ever 'realistically' see 10GbE or 40GbE be your bottleneck. Spend more money on SSD/caching.
You can check gea's post on page 2 about how the ZIL works. Basically ZFS will cache 5 seconds of writes in RAM, thus transforming the writes from small random writes into sequential writes. With a dedicated ZIL, the only thing that should be hitting my disk array is the cached writes that exist in RAM.
- VMs seldom do anything that looks like sequential IO - I'm not sure still why you're focused on that. Unless you're cloning a VM or backing up the entire thing (both of which the hypervisor usually throttles in favor of VM-based IO) and not using differential/CBT-type solutions, there's no reason you'd really need sequential IO for anything, so designing to hit that is usually the wrong answer. I see this typically in folks who run an Atto-style benchmark and go for the maximum 2MB block size numbers, completely ignoring the 4-16KB block range where they actually operate 99% of the time.
- While I'm not a ZFS guy and I'm not sure how the cache plays in, 12 mirrors of 7.2K drives is still only going to yield a maximum of around 1920 IOPS (80 IOPS x 24) - and the mirroring will eat up half of that. So you're building all of this for ~1000 IOPS based on disk - or 100 IOPS/host. That's giving each host the equivalent of a desktop SATA disk of performance (assuming they're all balanced, all the time, etc.).
I am unsure about the workload block-size distribution. I will try to find some DTrace scripts that can show me this, then I will sample for a day or two. I can use zpool iostat for IOPS and sequential bandwidth, but I need to find something that shows me the block sizes hitting my current SAN.
If your workload is 4KB blocks (you don't mention, I'm just making an assumption) then you're going to get about 4MB/sec of 4K random throughput (roughly 32Mbit/sec); even at 64KB blocks you'd only be at about 500Mbit/sec - half of 1GbE. Your problem won't be 6 or 12Gbit SAS or 10/40GbE any time soon. It looks like there is a long way to go before you even max out multi-NIC 1GbE on the SAN side. Especially if you have, say, 10x hosts with 2x1GbE and 2x SAN head units also with 2x1GbE - that'll be your choke point. Adding another 4x1GbE NIC to the SAN would probably help more than changing the infrastructure to 10GbE across the board.
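As a rough sketch of that arithmetic (assuming ~80 random IOPS per 7.2K spindle, as above, and deliberately ignoring any help from ARC/L2ARC or the ZIL, so this is very much a worst case):

```python
# Worst-case random-IO arithmetic for 24 x 7.2K drives in 12 mirror vdevs,
# deliberately ignoring ARC/L2ARC/ZIL effects. ~80 IOPS/spindle is assumed.

DRIVES, IOPS_PER_DRIVE, MIRROR_VDEVS = 24, 80, 12

read_iops  = DRIVES * IOPS_PER_DRIVE        # ~1920: both mirror halves serve reads
write_iops = MIRROR_VDEVS * IOPS_PER_DRIVE  # ~960: each write hits both halves

for block_kb in (4, 8, 64):
    mb_s = write_iops * block_kb / 1024
    print(f"{block_kb:>2} KB random writes: ~{mb_s:6.1f} MB/s (~{mb_s * 8:7.1f} Mbit/s)")
```

Even at 64KB blocks the disk-bound random-write ceiling sits at roughly half a gigabit, which is why the network side looks like the last thing that will hurt here.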
Let me get back to you on this when I have found some way of measuring the workload distribution in OmniOS.
Do you know what your workload profile looks like? Is it all streaming/sequential, even though it's 20 VMs on 10 hosts, virtualized, on a common SAN in an "IO blender"? Even if the VMs _are_ sequential, is that what the SAN sees/feels?
Could anyone enlighten me on the more ZFS-focused aspect of this, and whether it changes the theory much? As stated before, I don't typically roll my own storage, but the concepts seem similar. It seems like this is being built for capacity/throughput and not performance?
Ah yes. We will require sync writes, and with the ZeusRAM this should not be an issue. Good point, though, that HA will require the ZIL and sync=always (which is the default from the XenServer NFS mounts).
I believe a ZFS-based solution without an SSD log device will give you terrible write performance unless you configure it with sync=disabled. But if you go with sync=disabled, HA will not work, because if you have a failover for any reason, the cached write data in memory not yet committed to disk will be gone. The VMs will not notice, because the NFS volume is still available, but up to 5 seconds of data is gone (= a bad day).
We did actually look into several others, but the ZeusRAM is in a league of its own. It's a shame that Solaris is now closed, since it gets a lot of nice new features.
Have you only looked at the ZeusRAM SSDs? Because one nice feature in Solaris 11.2 is the ability to do parallel writes to multiple log devices; if you need more write IOPS, just add another SSD, which makes it possible to use cheaper SSDs (eBay FTW!). Also, from Solaris SRU 11.2.8.4.0 the L2ARC is persistent and survives reboot.
Thanks for the tip. Will put the Nexus on the list of possible candidates!
We have been using Cisco Nexus 5010s as 10Gbit ToR switches for 5 years now and I love them, rock solid. So if I were building a low-cost environment now I would buy used Nexus 5010s (or 5020s) from eBay (I have seen them as low as 1200 EUR).
Yeah, I was trying to explain this to a friend the other day. I'm willing to live (if needed) with sync=disabled in my current setup if I need better performance (the S3700 SLOG device limits me to 180MB/sec writes over 10GbE, whereas sync=disabled can hit 500MB/sec). But I am not in the HA space at this time. The idea that you could silently lose several seconds of writes (which might very well be client filesystem metadata - e.g. corruption, anybody???) is a complete showstopper for me.
I believe a ZFS-based solution without an SSD log device will give you terrible write performance unless you configure it with sync=disabled. But if you go with sync=disabled, HA will not work, because if you have a failover for any reason, the cached write data in memory not yet committed to disk will be gone. The VMs will not notice, because the NFS volume is still available, but up to 5 seconds of data is gone (= a bad day).
I think I have to explain how the ZIL and ZFS handle writes in more detail (as I understand it). If we saturate the 10GbE link for a long time we have a constant rate of 1.25GByte/s being cached in RAM and to the 8GB ZeusRAM. Every 5 seconds this data is written sequentially to disk. Even lots of small random writes will go to RAM and be limited only by the ZeusRAM (which has very fast writes).
"EDIT: I might add a question. What would you say is a good solution? I am cost-wise limited to HDDs when going the SAS shared JBOD way, which is required by MS Storage Spaces, ZFS+RSF-1, Quantastor and Nexentastor."
(Damned phone client...)
So again my 'not a ZFS guy' disclaimer...
It doesn't make a lot of sense to plan for 10GbE and then put on the other end shared storage with an expectation of 1000 IOPS. Yeah, the ZIL will help, but it's pretty small. That 'few seconds of writes' may cache, but what happens when you have minutes and minutes of steady activity? Seconds of cache won't matter. Your current specs sound like they'd feel a lot like the NetApp FAS2040 I have in the lab - 12 SATA or SAS disks and 2GB of memory. It doesn't take much to overwhelm that cache and start being disk-bound.
You mention being budget-limited to HDD vs SSD. I did a search on eBay for enterprise SAS SSDs as I couldn't find a quick post here on the forums for a current deal, and I'm finding 400GB SAS SLC enterprise stuff for $400-600 CAD. Cheaper if I'm picky or careful. I suspect you can fit the SAS SSDs in the budget?
You are absolutely correct, and I overlooked the fact that it is capped to 400MB/s single-port and 800MB/s dual-port.
How does your ZeusRAM handle 1.25GBytes/second??
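To put rough numbers on that (using the 400/800MB/s ZeusRAM figures above and the 5-second transaction-group interval mentioned earlier; the actual flush behaviour depends on tunables, so this is only a sketch):

```python
# Sustained sync-write ceiling: every sync write has to land on the SLOG
# before it is acknowledged, so the SLOG's write bandwidth caps sync NFS
# throughput regardless of link speed. Figures are the ones quoted above;
# the 5 s transaction-group interval is the default being assumed here.

LINK_MB_S      = 1250   # 10GbE line rate, ~1.25 GB/s
SLOG_MB_S      = 800    # ZeusRAM dual-ported (400 MB/s single-ported)
TXG_INTERVAL_S = 5

sync_ceiling  = min(LINK_MB_S, SLOG_MB_S)
dirty_per_txg = sync_ceiling * TXG_INTERVAL_S / 1024   # GB buffered per txg

print(f"Sustained sync writes : ~{sync_ceiling} MB/s (SLOG-bound, not link-bound)")
print(f"Dirty data per {TXG_INTERVAL_S}s txg : ~{dirty_per_txg:.1f} GB (ZeusRAM is 8 GB)")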
Sorry, I misunderstood your earlier post when you said SAS SSDs are too expensive; I thought you meant no SAS SSDs at all, not even for the ZIL.
Ah yes. We will require sync writes, and with the ZeusRAM this should not be an issue. Good point, though, that HA will require the ZIL and sync=always (which is the default from the XenServer NFS mounts).
We did actually look into several others, but the ZeusRAM is in a league of its own. It's a shame that Solaris is now closed, since it gets a lot of nice new features.
We still "only" go with 2 ZeusRAM drives. We will do some experiments with not using sync writes on some workloads and try to "only" force i.e. database VMs to use the ZIL. Will also try the performance using them in mirror vs raid-0.How many ZeusRAM do you plan to use now?