What is the best option for creating a NFS for my entire network from ESXi 8?

aah · Jan 1, 2023

Hi,
I am not even entirely sure this is the best sub-forum for posting my question. Regardless, here it goes:
I have a HP ML350 Gen10 at home that's running 8 SSDs in RAID 6, and 8 12TB HDDs in RAID 6 both managed via a P816i-a Raid Controller. I have installed ESXi 8.0 on the sever. All my VMs are on the SSDs. But, I want my HDDs to be presented both to the server and my LAN as an NFS (similar to a NAS shared folder or folders).
For instance, one way is to install TrueNAS (or UnRaid) as a VM and assign the HDD RAID array to the TrueNAS. But that means no passthrough of the drives, and also TN throwing all sort of fits about seeing one (logical) drive with no actual serial number.
Do you guys have any suggestion that would solve my problem with minimal overhead and hassle? If you need additional info do let me know. Many thanks in advance.

aah · Jan 1, 2023

Any suggestions ?

itronin · Jan 2, 2023

For instance, one way is to install TrueNAS (or UnRaid) as a VM and assign the HDD RAID array to the TrueNAS. But that means no passthrough of the drives, and also TN throwing all sort of fits about seeing one (logical) drive with no actual serial number.

You've really answered it yourself and I think you are leaning towards don't do it that way but...

Do you guys have any suggestion that would solve my problem with minimal overhead and hassle? If you need additional info do let me know. Many thanks in advance.

I don't know what is or isn't minimal overhead and hassle to you.

And I have no idea what else is installed in your ML350 and I am NOT an expert on HPE gear (especially modern gear) so no idea whether this is even feasible...

I think that if you want to do this and run TrueNAS to present your "bulk storage" whether to ESXI or LAN clients then you are looking at adding another HBA: an IT mode HBA.

stepping away from your question for a moment:

When I build an ESXI system (prod, home prod, semi-prod), I use an HBA capable of at least Raid 1 or 10. Sized for boot and enough performant & sized datastore for the critical must be always on VM's plus some for growth (say 400GB to 1TB). If I will deploy a software raid NAS OS as guest (like TN) then I'll add an IT mode HBA. Running a NAS vm guest is done all the time and works well so long as you adhere to the rules of your NAS OS.

8 SSDs in RAID 6, and 8 12TB HDDs in RAID 6 both managed via a P816i-a Raid Controller.

The ESXI portion you have in your current configuration.

And back to your request for suggestions.

The big challenge or what you may be asking is I think - Can I use what I have and get what I want that is both reliable and data safe? I think the answer is YES or NO and it all depends on whether you consider the HW raid controller sufficient to protect your data and defining that pesky overhead and hassle conundrum.

Yes
If you feel the HW raid controller is "good enough" then build a nice guest linux vm with any of the variously available web gui management frameworks and go to town with a big honking VHD (or multiple VHD's if you can logically break up your data) and serve it out via NFS. You're probably getting 80% of what you get with a purpose built software distribution albeit with a bit more effort on your part.

NO
A quick check of the ML350 HPE makes me think you have 2 LFF 8 bay units installed. Whether a SAS expander is in play there or not I do not know. Assuming that your -16i is direct wired to the backplanes (a logical guess) then my suggestions is make sure your SSD's are all in the same bay and connection your RAID controller to the one bay with the SSD's.

Add an IT mode HBA, maybe something simple like an LSI 3008 based card (does HP make one?) and wire it up to the other 8 bay with your large cap spinners.

Build your NAS OS VM (if UNR you'll have to pass through a USB root or USB drive), (if TN then build an appropriate sized VHD for the boot pool - no point in 2 VHDs for a ZFS boot mirror since you're already raiding your SSD's) and pass through the IT mode HBA with the attached spinners and follow your NAS OS best practices for the volume.

Note on growth.

Assuming growth will be large cap spinners & If you have the slots in your system to add another IT mode HBA then an -8i is probably sufficient as you can always add another, if not then you probably want a -16i (9305-16i, 9400-16i) at the outset because you'll either add that third LFF 8 bay or you already have it.

Serving a guest vm's storage back to ESXI.

Simplest way is to just to use the network and make sure you don't create a resource deadlock for any vm's or resources you store on the guest vm's storage. This assumes that the ESXI's network connection bandwidth is sufficient for your need.
There is another way though I consider this advanced for most homelabbers. At a high level it involves vswitches without a physical nic associated, a vmknic for ESXI and a virtual nic on the guest and everything being in the same subnet. This configuration worked under 6.5, 6.7, 7. I have not recreated it under 8 but I imagine it will work there too. This technique has been talked about here a couple of times - I think I may have even written it up and posted a description once upon a time - just would need to find the thread. but we can go there if that seems the path you want to go down.

oneplane · Jan 2, 2023

Like itronin wrote: trash the RAID controller stuff, make sure all disks are behind a PCIe HBA or root complex, forward the PCIe devices to a TrueNAS VM, everything else comes after that. Any other way that is not that is probably not worth your (or anyones) time.

If you had some spare cash to buy lots of hardware, there would be extra options, but I don't think that's the case here (i.e. a 5-node HCI Ceph cluster and 40GbE networking).

itronin · Jan 2, 2023

oneplane said:
If you had come spare cash to buy lots of hardware, there would be extra options, but I don't think that's the case here (i.e. a 5-node HCI Ceph cluster and 40GbE networking).

This!

aah · Jan 2, 2023

OKay! Well, first things first, this answer is very detailed, and I assume it took a while for you to write it. So, thank you for that.

itronin said:
The big challenge or what you may be asking is I think - Can I use what I have and get what I want that is both reliable and data safe? I think the answer is YES or NO and it all depends on whether you consider the HW raid controller sufficient to protect your data and defining that pesky overhead and hassle conundrum.

Yes
If you feel the HW raid controller is "good enough" then build a nice guest linux vm with any of the variously available web gui management frameworks and go to town with a big honking VHD (or multiple VHD's if you can logically break up your data) and serve it out via NFS. You're probably getting 80% of what you get with a purpose built software distribution albeit with a bit more effort on your part.

NO
A quick check of the ML350 HPE makes me think you have 2 LFF 8 bay units installed. Whether a SAS expander is in play there or not I do not know. Assuming that your -16i is direct wired to the backplanes (a logical guess) then my suggestions is make sure your SSD's are all in the same bay and connection your RAID controller to the one bay with the SSD's.

You guessed correct. I have 3 bays installed on this server. 1 SFF, and 2 LFFs. The RAID card doesn't allow for mixing SFFs and LFFs, neither does it allow for mixing SSDs and HDDs, nor SAS and SATA. In BAY1, where my 8 SSDs are sitting, I have configured RAID 10 for max. performance and parity. This bay is purely for deploying my VMs. The two LFF bays, as you put it aptly, are for my bulk storage. Right now, these two bays are combined together in a RAID 6 configuration.

As for the P816i-a card, it has 4 SATA connectors. 2 are connected to the SFF on bay 1. And, 1 SATA connector is occupied for each of the LFF bays. That mean all 4 connectors are engaged. The card also has the capability of delivering physical drives in HBA mode (no idea of the IT mode) simultaneously in conjunction with RAID arrays.

Personally, given the option I tend to shy away from software RAID. This might be due to having limited knowledge base. Overall, to answer the above question: Yes. I do believe this RAID card is more than capable to deliver performance and protect data. Obviously, it comes with its own limitations, such as not having pools, or being able to reconfigure or restructure RAID arrays, etc. But, those are not my concerns for the time being.

All I am looking for is a software that just receives the logical drive from the ESXi datastore, and overlays it with NFS protocol and introduces that drive to the network. This, however, seems to have many little intricacies to it. Having that said, and with taking your inputs into consideration, it appears I have only a couple of options:

1) Easiest: I can tell TN to shush up and deal with having only one (logical) drive. Inside TN: No RAID. No parity. Nothing. Just one pool. One network drive. Implications?
2) Midway: I could delete the RAID 6 array. Take 4 drives back into a RAID 5 on the P816i and hand it back to ESXi for my other purposes. Additionally, have ESXi deliver the other 4 free drives as RDM in passthrough mode to TN. I will not have as much space, what I'd be left with is going to be sufficient for about 2 more years.
3) Go the softRAID route: No HW RAID at all on the LFFs. Deliver all the LFF drives in PT mode to TN. This scenario irks me. As I am dumping all the RAID load processing on the CPUs while I have a dedicated chip sitting inside the machine waiting to take over that very responsibility. Maybe you would shed some light on this?

gea · Jan 3, 2023

aah said:
3) Go the softRAID route: No HW RAID at all on the LFFs. Deliver all the LFF drives in PT mode to TN. This scenario irks me. As I am dumping all the RAID load processing on the CPUs while I have a dedicated chip sitting inside the machine waiting to take over that very responsibility. Maybe you would shed some light on this?

A modern CPU is capable of realtime 3d processing. No need to care about simple raid processing with modern software raid. You should only prefer to offer the storage (disk controller and disks or NVMe) directly to a storage VM via passthrough. With a sufficient CPU and a comparable RAM to a barebone setup you will see nearly the same performance on a virtualized storage VM. There is only one item where a virtualized storage suffers a little and this is encryption on regular write. If you need sync write (this means that every commited write must be on disk at least after next reboot, comparable to the BBU security of hardware raid) encryption is something that you must avoid but this affects a barebone storage setup as well. Without encryption a software raid is faster and in case of ZFS more save than any hardware raid as it is not affected by the write hole phenomenon (defect filesystem or corrupted raid after a crash during write), "Write hole" phenomenon in RAID5, RAID6, RAID1, and other arrays.

When I came up with the software raid idea and AiO (ESXi with a ZFS storage VM) more than 10 years ago, some said what a stupid idea to virtualize storage but within a short time it became the usual approach (All-In-One (ESXi with virtualized Solarish based ZFS-SAN in a box)) to include highend storage in ESXi. I use a minimalistic Solaris based storage VM as this is the most resource efficient ZFS server but you can use any storage VM that can offer its storage via NFS.

Howto, see https://www.napp-it.org/doc/downloads/napp-in-one.pdf
Performance barebone vs virtualized on a highend system, see https://www.napp-it.org/doc/downloads/epyc_performance.pdf

A very good storage performance can be achieved with a modern 4-8 core system and a storage VM with 2-4 vcpu and 6-16 GB RAM.

oneplane · Jan 3, 2023

I would recommend selling the RAID card on eBay as soon as possible and then getting a ZIL SSD for that money, and passing all disks except the boot disk for the hypervisor and TrueNAS through to the TrueNAS VM.

As for the "what if I present a single logical drive to ZFS" question: you would be losing performance and losing durability/reliability since ZFS won't know what data is or isn't written to disk yet.

If you intend to keep the RAID array and the RAID controller, don't even bother with TrueNAS, and if you just want NFS: start a random Linux VM, install NFS and off you go. If you want iSCSI too, add LIO and you're golden. You don't need TrueNAS for NFS, and unless you run ZFS in a configuration where it makes sense, it's better to not run it at all (heck, even LVM would make more sense in a RAID scenario).

RAID cards are bad, they are essentially small computers with their own CPU, RAM, OS, application etc, but instead of being as exchangeable and introspect able as a computer, they are closed and full of bugs that pop up at the worst possible time.

Back when RAID hardware was the only way to get redundancy for availability (not durability, you have backups for that) and speed, it was great because there was no alternative and all other storage methods were essentially less available/less durable. This hasn't been the case for over 15 years.

Regarding experience: when you think you have RAID experience, you actually only have TUI/GUI experience in whatever ROM the card came with ;-) they won't actually let you get any experience, just like diskmgmt.msc won't.

gea · Jan 3, 2023

While I would avoid a hardware raid with ZFS as ZFS + softwareraid is by far superiour, you can use.

The problem is:

- ZFS can detect any errors due checksums but cannot repair as it sees the pool like a single disk without redundancy
- The raid controller cannot repair as it does not know anything about the ZFS checksums

- On a crash during write you are in danger of a corrupt filesystem or raid due incomplete atomic writes.
You MUST use a hardware raid with BBU protection to avoid cache problems. On ZFS you can simply enable sync for cache protection.

- ZFS softwareraid can use several GB of fast mainboard RAM for caching and the mainboard CPU that is faster than any "raidadapter cpu".

In the end, ZFS on hardwareraid is superiour over older filesystems on same hardwareraid but ZFS on softwareraid is better in any way.

aah · Jan 3, 2023

This raid controller does have BBU. Surely, there must be a study somewhere as to attest to the superiority of zfs+ softraid over thousand dollars plus hardware raid controllers as you suggest. Logically, by that notion, all datacenters would have switched to opensource free softraids by now.

MBastian · Jan 3, 2023

The issues I had with bad BBUs and dead controllers over the last 20+ years...
Edit: Datacenters tend to go for for an enterprise SAN setup. Not because they are superior but for the support and ease of use. The really big guys, like Google and Amazon go for inhouse and open source solutions.

gea · Jan 4, 2023

aah said:
This raid controller does have BBU. Surely, there must be a study somewhere as to attest to the superiority of zfs+ softraid over thousand dollars plus hardware raid controllers as you suggest. Logically, by that notion, all datacenters would have switched to opensource free softraids by now.

Such a study cannot exist as there are many software raid solutions and many hardwareraid products (both from crap to highend) . In the end there is even no technical difference. Each raid option needs some software (firmware, driver, operating system), a way to process raid operations (specialiced raid cpu, mainboard cpu) with cache to improve performance (raidadapter ram, mainboard ram), a method to protect data on a crash during write and optionally additional methods to improve data security.

20 years ago every decent solution was only possible with hardwareraid and a specialiced raid cpu, adapter ram and a BBU to protect data as the mainboard cpu was too slow, mainboard ram too low and a crash protection only thinkable with a hardware raid+BBU.

Today situation is different. A current mainboard cpu is much faster than any of the specialiced raid chips, mainboard ram is much higher than any available option on a hardware raid and much faster. Crash protection like ZFS sync write where missing writes after a crash are done after a reboot is not only as secure as a BBU protection but better and does not fail like a BBU over time and can offer performance improvements via a dedicated log device (Slog).

Main reason for superiority of ZFS are not related to software raid vs hardware raid but are filesystem related. If cpu would be as slow as 20 years ago you would see hardware raid today with similar features. The two core improvement of ZFS are Copy on Write and realtime checksums on data and metadata. The first avoids a damaged filesystem after a crash by design as an atomic write (incomplete data + update metadata or raid stripes) is done completely or discarded in the whole. Copy on Write also offers snap creation without initial space consumption without delay as a snap does not require a write but is a freeze of older datablocks.

Checksums allow to decide on any read whether data is valid or corrupted due silent data corruption over time or due failures in the chain memory, driver, controller, cabling, disk. A nice addon ist the option to check data validy and repair online (an ancient fschk or chkdsk requires a array offline over hours or days to check without a real repair option for data only metadata). On a degraded raid-5 array each additional read error means an array lost as there is no way to guarantee data validity then. On a degraded ZFS Z1 an additional readerror only means a corrupted file as ZFS checksums can validate remaining data.

If you go back in time 20 years when Sun started ZFS development, some may remember an incident for the leading German web provider at that time that had used Sun storage (the absolute best at that time). After a sudden crash most of German websites were offline. Some fschk effords failed to restore availability and after quite a week they started to restore data from backup (not all data and not always newer state). I am convinced, Sun developped ZFS with this incident in mind with the specification to avoid such a scenario by filesystem design.

btw
Leading enterprise storage solutions ex Oracle Solaris (ZFS) or NetApp (WAFL, quite similar to ZFS in many aspects) are software raid.

oneplane · Jan 4, 2023

aah said:
This raid controller does have BBU. Surely, there must be a study somewhere as to attest to the superiority of zfs+ softraid over thousand dollars plus hardware raid controllers as you suggest. Logically, by that notion, all datacenters would have switched to opensource free softraids by now.

Only the single/duo node small time "I don't really know what I am doing" stand-alone hosts are still on local hardware raid, especially when on Windows because it sucks at raid booting natively (so it all has to be moved to the controller).

Everyone else is either on a COTS SAN, or on a bespoke SAN, and both are either using compute to do all the work, or offload some calculations or data paths to ASICs, but nothing like the legacy RAID cards that you get with random HP and Dell boxes people buy in small quantities.

If you look closely, this is also why your entire thread exists, getting NFS storage for the hypervisor, which would classically be supplied by a SAN (instead of iSCSI or FC). Granted, NFS doesn't really give a crap about the underlying storage as long as the filesystem semantics are compatible, but the idea is the same: compute and virtualisation in one box, storage in another. In this case the box is virtual, but the separation of concerns is still there.

As for 'superiority', there is no such thing, only the right tool for the job. ZFS on an MTD flash store would be an example of the wrong tool for the job. Just like ZFS would not be great for multi-node Ceph storage. For single node (be it active-passive or truly stand-alone), there really isn't much you can get that is anything like ZFS, except ZFS itself.

Edit: don't get me wrong, the tone of my wording might suggest "haha raid bad, zfs go brrrrrrr", but what I'm aiming for is "sales will try to sell you RAID with BBU because that's all they have". They usually don't have anything else for you because there is a huge gap between local storage and a SAN when it comes to guaranteeing some product. This is also why HP for example would rather sell you a skanky 3PAR system than a ZFS system because they can warrant the former, but have no say over the latter. (And Dell will sell you a PERC or an EMC, but not much in between)

So when you get hardware, it tends to either be designed as a solo classic windows server or as part of a much larger integrated system, and when you are somewhere in between that, practically all vendors just suck at making products that do the right thing by default

Nearly all of this is because of historical reasons, the CPUs used to be so weak that when you have a single application running there wouldn't be any resources left for anything else. You'd "need" lots of controllers elsewhere to do things like SSL, IPSec, DES, TCP/IP etc. because if you'd let the OS run those operations on the CPU your application would perform really bad (or not at all). Right now there is a resurgence of this (in the shape of DPUs), but that is more for density and tenancy optimisation than anything else.

itronin · Jan 4, 2023

@aah, I'm going to add a pair of real world examples counterpoint to each other that show why I primarily use software raid and only use hardware raid when I have two and for limited use cases (like ESXI boot).

1) I've experienced many basically simultaneous multi-disk drive failures attached to hardware raid and here for your entertainment are two horror stories: In one instance, a drive went out, then 15 minutes later a second drive went out. I was on a trip, came back, saw the fault, replaced the drives and recovered from back up and all was well until I discovered I had some bit rot in some files. whoops. In another instance at a private hosting MSP, again near simultaneous multi-disk drive failures (3 drives out of 6 in a Raid 6). MSP goes to restore backup and discovers the backups were bad ( bad MSP stopped performing monthly backup verification to save costs ). In both cases a post mortem analysis of logs showed NO imminent or pending drive failures, and "smart" data looked fine.

2) I have a pair of medium sized (200TB) baremetal TNC systems. One backs up the other. In the past 12 months, I have had 2 drives in the backup server flagged by TNC as bad (checksum calculations). SMART data was perfect. I replaced the drives anyway and then popped them into another system and ran badblocks. Both drives died < 6 hours into BB.

As @oneplane said this is more about using the right tool for the right job.

@gea has (once again) really provided well written and thorough discussions of the pros/cons.

and now I have shared some real world examples.

aah · Jan 4, 2023

Thank you gea, oneplane, itronin, and everyone else. This forum, so far, has turned out to be one of the best that I have joined in a long loong time! Not only haven't I got any of the mundane and generic responses - i.e why donchya search?, ask google., etc - but also, I have received thorough, detailed, and well-composed replies. I tip my hat to you all! Thank you very much.

raw physical drives in passthrough mode it is.

Connorise · Jan 17, 2023

I would only add that both UnRAID and TrueNAS rely on their embedded RAID mechanisms (like ZFS). They do require direct disk access in order to operate with them. But nothing really stops you from deploying a pure Debian/Ubuntu VM on the SSD array and passing either RDM or large VMDK (thick eager zeroed) into the VM and configuring NFS sharing.

Dark · Feb 13, 2023

If the storage is being managed by the host...you could just run a linux or windows(server) VM and create your NFS share to hand out. Seems pretty basic. (edit, as mentioned above)

As already mentioned, truenas/unraid would be ideal if you were passing the drives through to the VM.

BoredSysadmin · Feb 13, 2023

itronin said:
If you feel the HW raid controller is "good enough" then build a nice guest linux vm with any of the variously available web gui management frameworks and go to town with a big honking VHD (or multiple VHD's if you can logically break up your data) and serve it out via NFS. You're probably getting 80% of what you get with a purpose built software distribution albeit with a bit more effort on your part.

I feel like this would be the simplest and most straightforward solution.

oneplane · Feb 13, 2023

BoredSysadmin said:
I feel like this would be the simplest and most straightforward solution.

But it would also put you in hardware raid hell, so that's a downside.

mattventura · Feb 13, 2023

Old but good presentation that explains why you want ZFS over hardware raid.

What is the best option for creating a NFS for my entire network from ESXi 8?

New Member

New Member

Well-Known Member

Well-Known Member

Well-Known Member

New Member

Well-Known Member

Well-Known Member

Well-Known Member

New Member

Active Member

Well-Known Member

Well-Known Member

Well-Known Member

New Member

Member

Active Member

Not affiliated with Maxell

Well-Known Member

Well-Known Member