ZFS without a Server Using the NVIDIA BlueField-2 DPU


i386

Well-Known Member
Mar 18, 2016
Germany
That's one of the best/most interesting articles :D
Will there be a successor trying to get the most performance out of the given storage?

Will there be DPUs with PLP to protect the RAM? :D

I don't know why exactly, but on page 2 I started to think about security and how this could be another vector for attacks...
 

nickf1227

Active Member
Sep 23, 2015
@Patrick
I am trying to understand what this would look like from a reference architecture or best practices deployment guide for a datacenter.

A "traditional datacenter" would look something like this:
[attached diagram]

Or something like this:
[attached diagram]


I guess what I am having trouble grasping is how you would design this type of compute/storage/networking and what benefit it actually has. Like, we have gone from having traditional SANs to having software-defined storage in multiple different ways. However, the concept you showed in your video takes it a step further, whereby the storage is handled in the same physical servers but using the co-processors on the NICs. This is a cool idea, but I don't know how to deploy a system that viably uses this in production in a way that beats existing deployment models.

I know you showed us what you yourself admitted was not something you would recommend doing, by exposing a bunch of iSCSI LUNs to the NIC to build a ZFS pool. While a cool demo, I'm not sure I can think of another way to do what you did that would be production-ready...
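
For anyone curious what the demo's plumbing roughly looks like, here is a minimal sketch run from the DPU's Arm Linux, wrapping the standard open-iscsi and ZFS userland from Python. The portal address, LUN paths, and pool layout are placeholders, not the article's actual setup:

```python
import subprocess

def run(cmd):
    # Echo and execute a command, raising if it fails.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Placeholder iSCSI portal; in the demo this would be whatever host(s)
# expose the LUNs to the DPU over the fabric.
portal = "192.0.2.10"

# Discover and log in to the targets with standard open-iscsi tooling.
run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal])
run(["iscsiadm", "-m", "node", "--login"])

# Build a ZFS pool from the resulting block devices. The by-path names
# below are illustrative; they keep the LUN-to-vdev mapping stable.
luns = [
    "/dev/disk/by-path/ip-192.0.2.10:3260-iscsi-iqn.2022-05.example:lun0-lun-0",
    "/dev/disk/by-path/ip-192.0.2.10:3260-iscsi-iqn.2022-05.example:lun1-lun-0",
]
run(["zpool", "create", "-o", "ashift=12", "tank", "mirror"] + luns)
run(["zpool", "status", "tank"])
```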

On the other hand, the AIC JBOX or something more akin to what Liqid is doing seems like a better way of accomplishing a similar goal, because they are just natively using PCIe. The latency is going to be inherently lower, the speeds faster, and the scalability is plenty good. All that, and you don't need additional CPU cores to accomplish the same goals. For small deployments all you need is a PCIe shelf and some HBA cards... you don't even need a special switch until you scale past 3 nodes.

[attached diagram]

I just don't get the DPU trend????? Offload cards, SmartNICs, and VIC cards all make sense and have use cases with a dedicated purpose. DPUs give you general-purpose compute as a value-add on a card... BUT WHYYYYY lol

There just seem to be more compelling reasons to stick with traditional deployment models or adopt a PCIe fabric technology....
 

klui

Well-Known Member
Feb 3, 2019
It's CDI (composable disaggregated infrastructure). The resource box typically connects to a PCIe switch, and orchestration software can dynamically bind/unbind DPUs, GPUs, FPGAs, etc. together with other resources in the ecosystem (like NVMe drives in a JBOF) for a workload, then release them back into the pool when the job is done. In this case, your target would be a SmartNIC's CPU instead of a traditional server allocated as a composed node.
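
To make that bind/release lifecycle concrete, here is a toy model of an orchestrator's pool; this is a hypothetical illustration of the idea, not any vendor's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ResourcePool:
    """Toy CDI pool: free devices get composed into a node for a job,
    then released back into the pool when the job finishes."""
    free: set = field(default_factory=lambda: {"dpu0", "gpu0", "gpu1", "fpga0", "jbof0/nvme3"})
    composed: dict = field(default_factory=dict)   # job name -> set of bound devices

    def compose(self, job, wanted):
        missing = set(wanted) - self.free
        if missing:
            raise RuntimeError(f"not available: {missing}")
        self.free -= set(wanted)
        self.composed[job] = set(wanted)

    def release(self, job):
        self.free |= self.composed.pop(job)

pool = ResourcePool()
pool.compose("job-42", ["dpu0", "gpu0", "jbof0/nvme3"])  # bind into a composed node
pool.release("job-42")                                    # hand everything back
```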

What is unclear is how the tray of the resource box allows the DPU access to the AICs on that tray. Typically they are available through the MiniSAS HD connectors in the back. Probably all devices connected to that portion of the box share the same PCIe bus, and the DPU is configured as the root while the other installed cards are leaf devices.
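
One way to check how the tray is wired, assuming you have a shell on the DPU's Arm Linux, is to walk sysfs and print each device's upstream port (a generic Linux sketch, nothing BlueField-specific):

```python
import os

PCI_DEVICES = "/sys/bus/pci/devices"

# Print every PCI function with its class code and the bridge/root port
# directly above it, which shows whether the other add-in cards hang off
# the DPU's root port or off a shared switch in the tray.
for bdf in sorted(os.listdir(PCI_DEVICES)):
    real = os.path.realpath(os.path.join(PCI_DEVICES, bdf))
    upstream = os.path.basename(os.path.dirname(real))
    with open(os.path.join(real, "class")) as f:
        dev_class = f.read().strip()
    print(f"{bdf}  class={dev_class}  upstream={upstream}")
```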
 

Patrick

Administrator
Staff member
Dec 21, 2010
With RDMA, latency is not too bad over the network.

Here is the big thing - when you manage via a DPU, then VMs and containers on the host can be untrusted (think public cloud, or enterprise cloud where people are running code from untrusted sources). Services can be provisioned as needed, whether those are accelerators, storage, or network bandwidth. All of the VMs can have an encrypted connection to other infrastructure (think AWS VPC). The other big part is that none of this is running on the host x86 cores, so those are all freed up to allocate to workloads.

Think about it more like this: later this year we will start seeing 200GbE/400GbE to nodes. At those speeds, just to encrypt traffic you need an accelerator, or it will eat most of your CPU resources.
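
A rough back-of-envelope of that point; the per-core AES-GCM figure below is an assumed ballpark for large buffers on a modern x86 core with AES acceleration, not a measured number, and real packet-sized workloads will do considerably worse:

```python
# Assumed ballpark: ~40 Gbps (~5 GB/s) of AES-GCM per core for large
# buffers, ignoring packet processing, TLS framing, and memory copies.
AES_GCM_GBPS_PER_CORE = 40

for line_rate in (100, 200, 400):
    cores = line_rate / AES_GCM_GBPS_PER_CORE
    print(f"{line_rate} GbE at line rate: roughly {cores:.0f}+ cores just for crypto")
```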

I recorded a demo of the Intel IPU on Monday that will go live in a few weeks. The host system just sees block NVMe devices and has no idea that they are actually being delivered over 100GbE fabric, as it is completely transparent.
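
From the host side the illusion is complete; a quick sysfs check on a Linux guest (a generic sketch, not specific to the IPU demo) would just report ordinary PCIe NVMe controllers:

```python
import glob, os

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "?"

# Each NVMe controller the kernel sees, with its model string and transport.
# A DPU/IPU-backed device still reports "pcie" because the card itself
# terminates the fabric and emulates a local NVMe controller.
for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
    name = os.path.basename(ctrl)
    print(name, read(os.path.join(ctrl, "model")), read(os.path.join(ctrl, "transport")))
```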
 
Reactions: nickf1227

nickf1227

Active Member
Sep 23, 2015
With RDMA, latency is not too bad over the network.
Sure, but without specific data it's difficult to compare native PCIe fabrics with HBA cards (which have PLX switch chips that I'm sure also introduce latency?) to RDMA. Perhaps that's an opportunity for you to explore some day for us dear readers :) Native PCIe is still going to be better; we just don't have the data to know how much better from anyone other than the vendors themselves. I don't like non-independent numbers.
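
For anyone who wants to collect their own numbers, a crude way to compare, say, a locally attached NVMe device against an iSCSI- or RDMA-backed one is an O_DIRECT read-latency probe like this (Linux-only sketch; the device paths are placeholders, and it should only be pointed read-only at devices you own):

```python
import mmap, os, statistics, time

def median_read_latency_us(dev, iters=500, block=4096):
    """Median latency of 4 KiB O_DIRECT reads scattered across a block
    device; O_DIRECT keeps the page cache from hiding the fabric."""
    fd = os.open(dev, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, block)                 # page-aligned buffer, as O_DIRECT requires
    try:
        blocks = os.lseek(fd, 0, os.SEEK_END) // block
        samples = []
        for i in range(iters):
            os.lseek(fd, (i * 7919 % (blocks - 1)) * block, os.SEEK_SET)
            t0 = time.perf_counter()
            os.readv(fd, [buf])
            samples.append((time.perf_counter() - t0) * 1e6)
        return statistics.median(samples)
    finally:
        os.close(fd)

# Placeholder device names for illustration only.
for dev in ("/dev/nvme0n1", "/dev/sdb"):
    print(dev, f"{median_read_latency_us(dev):.1f} us")
```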

Here is the big thing - when you manage via a DPU, then VMs and containers on the host can be untrusted (think public cloud, or enterprise cloud where people are running code from untrusted sources). Services can be provisioned as needed, whether those are accelerators, storage, or network bandwidth. All of the VMs can have an encrypted connection to other infrastructure (think AWS VPC).
So like, I get what you are saying, but don't we already have that? I have a server at home that has a Cisco VIC 1225. This is tech from 2013. Cisco UCS Virtual Interface Card 1225 - Cisco
It has ports with physical MAC addresses like any other NIC
[screenshot: physical port MAC addresses]

But it also has vNICs that have their own:
[screenshot: vNIC MAC addresses]

Apparently these are on fire sale now. I can get them for $15 lol
[screenshot: used VIC 1225 listing]

So, theoretically, I can have trusted and untrusted networks riding over the same physical fabric that are totally isolated in hardware. So when I say I hear you, I really do, but doing that type of security doesn't require a dedicated ARM CPU. Speaking of, would you classify this as a SmartNIC, or as exotic on the continuum?

Moving on from my home setup... On the professional side of my life, I am doing this in production without anything even that fancy. Leveraging CAPWAP on Cisco APs, all of my "Guest" and "BYOD" traffic is sent through a GRE tunnel to a WLC that terminates the client traffic at L2 on a switch that's entirely air-gapped from my network.

From there, I have a VMware cluster with a couple of dedicated Intel X540-T2s (dirt cheap!) on a separate vSwitch plugged into that physical switch.
[screenshot: vSwitch configuration]

I use pfSense as a router and NAT and have its WAN IP in my DMZ. Granted, this is all only 10GbE, but I have a relatively high-performance network with tons of traffic riding on this solution all the time. No extra ARM CPUs are required.

[attached network diagram]


The outlined solution doesn't even require anything more than the bread-and-butter Intel X540... which just had its 10-year anniversary. Scaling up doesn't require anything exotic; it would just require an XL710. Anything past that and the whole system would need to be rearchitected anyway in favor of a hardware firewall, so it's sorta irrelevant what NICs my servers have.

Granted, I am using PCIe lanes and slots in our design, but more and more that is less of a problem with modern hardware. Plus, I could always adopt PCIe fabric tech. Maybe it's different in the cloud vs. on-prem in the datacenter and the use case makes more sense there. But for me in "Cloud Last" land... I don't understand the benefit?
The other big part is that none of this is running on the host x86 cores, so those are all freed up to allocate to workloads.
Also, to your point about using the non-x86 cores: sure, point taken. But what about licensing? Why would I use a VMware license on a PCIe card's CPU when I can use it on a 128-core EPYC?

Licensing aside, I think it would make more sense to take the price difference between a BlueField-2 AIC and a normal NIC and instead invest it into a higher-end CPU. If you do the math and compare the CPU-performance-per-dollar ratio, I think it probably ends up in my favor here.

MSRP from Dell for a 100GbE card:
[price screenshot]
MSRP from NVIDIA for BlueField-2:
[price screenshot]

I can go from a 7313P to a 7502 and get double the relative CPU performance in my server node and still save $500.
[CPU price screenshots]
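
Just to spell out the arithmetic being argued, here is a tiny sketch; the prices are round placeholder numbers, not the MSRPs in the screenshots above, so plug in real quotes and benchmark scores before drawing conclusions:

```python
# Hypothetical placeholder figures for illustration only.
plain_100gbe_nic = 1000      # placeholder price for a plain 100GbE NIC
bluefield2_aic   = 2500      # placeholder price for a BlueField-2 card
cpu_upgrade_cost = 1000      # placeholder delta for a bigger host CPU
cpu_perf_gain    = 2.0       # "double the relative CPU performance" claimed above

dpu_premium = bluefield2_aic - plain_100gbe_nic
print(f"DPU premium per node:        ${dpu_premium}")
print(f"CPU upgrade instead:         ${cpu_upgrade_cost} for ~{cpu_perf_gain:.0f}x host compute")
print(f"Left over after the upgrade: ${dpu_premium - cpu_upgrade_cost}")
```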


Think about it more like this: later this year we will start seeing 200GbE/400GbE to nodes. At those speeds, just to encrypt traffic you need an accelerator, or it will eat most of your CPU resources.

I recorded a demo of the Intel IPU on Monday that will go live in a few weeks. The host system just sees block NVMe devices and has no idea that they are actually being delivered over 100GbE fabric, as it is completely transparent.
I hear and agree with what you are saying here. But then why expose the ARM CPU for customers to use at all? Surely dedicated ASICs would be better suited to that task. You yourself proved that when Intel released QAT: Intel QuickAssist Technology and OpenSSL - Benchmarks and Setup Tips (servethehome.com).

Thank you as always for humoring me @Patrick. I always appreciate your insight and I know I am a pain in the butt :)
 
Reactions: Aluminat

Patrick

Administrator
Staff member
Dec 21, 2010
I guess it looks quite a bit different in practice. Like the upcoming Intel IPU demo: in practice, while you can iSCSI boot and so forth, it actually looks like an NVMe SSD, even if it is pulling from 24 SSDs in different chassis across the data center. The data to and from those SSDs gets encrypted and decrypted on the card. Compression happens on that card also, to minimize data transfer over the wire and storage on disk. To the server, all it sees is a normal NVMe device.

On the crypto side, it is not just adding a few cores; it is like 100% CPU utilization to push 400GbE speeds. 200GbE is the 2022 generation of DPUs/IPUs, 400GbE we will see in 2023, and I expect by 2025 we will see 800Gbps DPUs.

The other issue/challenge is who is operating the NIC. With a DPU/IPU, the infrastructure side provides this to be able to isolate each VM, since everything (network and storage) goes to the DPU and then gets encrypted there. The DPU manages the server's firmware to ensure nothing is tampered with, and so forth. This is going beyond networking by quite a bit and is going to fundamentally separate the infrastructure and compute planes.
 

nickf1227

Active Member
Sep 23, 2015
I guess it looks quite a bit different in practice. Like the upcoming Intel IPU demo: in practice, while you can iSCSI boot and so forth, it actually looks like an NVMe SSD, even if it is pulling from 24 SSDs in different chassis across the data center. The data to and from those SSDs gets encrypted and decrypted on the card. Compression happens on that card also, to minimize data transfer over the wire and storage on disk. To the server, all it sees is a normal NVMe device.
The other issue/challenge is who is operating the NIC. With a DPU/IPU, the infrastructure side provides this to be able to isolate each VM, since everything (network and storage) goes to the DPU and then gets encrypted there. The DPU manages the server's firmware to ensure nothing is tampered with, and so forth. This is going beyond networking by quite a bit and is going to fundamentally separate the infrastructure and compute planes.
So let me try and unpack this. Much like in your demo, Intel is literally making the network cards into a SAN at the same time as they are shipping iSCSI LUNs to your compute? How do the cards talk to the NVMe devices; are the cards themselves the PCIe root? Is the DPU/IPU literally taking over the PCIe bus for the entire server? Wouldn't the DPU then need an insane amount of PCIe lanes?

Doing compression and encryption of data in flight is a cool value-add, ESPECIALLY when you are going this fast. But it sounds like you are talking about the IPU literally doing it for the data at rest as well??? Like the SAN is the network and the network is the SAN, and there is no difference?

On the crypto side, it is not just adding a few cores; it is like 100% CPU utilization to push 400GbE speeds. 200GbE is the 2022 generation of DPUs/IPUs, 400GbE we will see in 2023, and I expect by 2025 we will see 800Gbps DPUs.
I don't doubt that at all, but why is a general-purpose ARM processor the solution and not a dedicated ASIC? Wouldn't that be inherently faster?
 

kedzior

Active Member
Mar 21, 2018
Poland
I know that this is an old post, but I'm wondering about getting a PCIe board to power a DPU and add two FPGAs without any x86 processor. I was looking at the GIGABYTE RISER CPBG8A0 2OZ REV 1.0 10x PCIe x16, like this: GIGABYTE RISER CPBG8A0 2OZ REV 1.0 10x PCIe x16 | eBay
but I do not know how to get the smaller/narrow power cable. Maybe @Patrick can send me any tips about it? I'm not sure if something like this has a direct connection between the slots (so it would work without an x86 processor).
[attached image]