How 'switchy' can PCIe switches be?


abufrejoval

Member
Sep 1, 2022
On Ethernet everybody expects that ten 1Gbit ports will easily fit and flow through a 10Gbit uplink, and that packets just naturally get accelerated (or slowed down) as they pass one way (or the other).

So with PCIe now out in the wild in versions 1-5, and a single PCIe 5.0 lane delivering the equivalent of 16 PCIe 1.0 lanes, I wonder whether PCIe 'switch chips' switch packets or lanes.

I guess the main difference would be buffering vs. cut-through processing.

PCIe data really does seem to be passed in packets, which also seem to be limited in size, which should make buffering possible. Yet my impression is that lane counts are negotiated end-to-end and not hop-to-hop, even if speeds seem to be negotiated hop-to-hop. There doesn't seem to be any facility for trading lane counts against data rates, i.e. for aggregation and disaggregation, even though facilities for oversubscription clearly exist.

The background is mostly my frustration with how the few precious PCIe v4/5 lanes on modern SoCs get wasted by older PCIe v2/3 hardware, when a 'true' switch should make that far more efficient.

E.g. an Aquantia AQC107 10GBase-T Ethernet NIC (4 lanes of PCIe 2.0) only requires the bandwidth of a single lane of PCIe 4.0 (as evidenced by the AQC113, which unfortunately nobody seems to sell as an add-on card), but without a switch chip it would grab 4 lanes of PCIe 5.0 on a modern mainboard and use them at a small fraction of their potential.

Or similarly a quad M.2 to PCIe adapter, which currently tends to just grab 16 bifurcated lanes and pass them on trace-by-trace, when a switch-based approach would be far more useful: e.g. a PCIe 5.0 capable x4 link to the slot, while you reuse a set of older PCIe 3.0 based NVMe drives for aggregate capacity and performance (I'm aware of Highpoint-Tech's NVMe RAID adapters, but those also just seem to do lane switching).
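For the back-of-the-envelope arithmetic behind both examples (per-lane figures are the usual post-encoding approximations; the Python sketch ignores protocol overhead):

Python:
# Approximate PCIe bandwidth per lane in GB/s after line encoding
# (8b/10b for Gen1/2, 128b/130b for Gen3 and later); protocol overhead ignored.
LANE_GBPS = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969, 5: 3.938}

def link_gbps(gen, lanes):
    """Approximate one-directional bandwidth of a PCIe link."""
    return LANE_GBPS[gen] * lanes

# AQC107-style NIC: a Gen2 x4 link vs. a single Gen4 lane (10GbE needs ~1.25 GB/s)
print(link_gbps(2, 4), link_gbps(4, 1))        # ~2.0 vs ~2.0 GB/s

# Four Gen3 x4 NVMe drives behind a hypothetical Gen5 x4 uplink
print(4 * link_gbps(3, 4), link_gbps(5, 4))    # ~15.8 vs ~15.8 GB/s

On paper the aggregate of four Gen3 x4 drives is an exact match for a Gen5 x4 uplink, which is exactly what a rate-adapting switch would have to exploit.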

In a way, having X670 chips, downstream devices and expansion 'opportunities' as add-in boards on a Ryzen 7000 backplane seems more sensible than the current static allocations.

I've been trying to find out what's going on by searching the web and reading specs, but articles never go into that level of detail, while specs drown in detail before they outline the concepts.

So please can someone enlighten me?
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
Yeah, if I understand you correctly, it is possible.

I'm not an expert, but I did some research a while ago.

Take a theoretical 5.0 x8 to 5.0 x16 switch, connect four 4.0 SSDs and enjoy full bandwidth.

As far as I know, there are no specific "cross-generation" switch chips (e.g. 5.0 x4 to 3.0 x16), because it wouldn't make sense. Instead you would use a 5.0 x4 to 5.0 x16 switch and connect the 3.0 devices; the switch chip would negotiate down.
The cost of producing a specific cross-generation chip wouldn't be worth it. You just take a big one (like 5.0 to 5.0) and let it negotiate down.


Is that what you wanted to know?

BTW, look at CXL: it's PCIe 5.0 based, and you will have switches with many ports, just like network switches.
 

abufrejoval

Member
Sep 1, 2022
Yeah, if I understand you correctly, it is possible.

I'm not an expert, but I did some research a while ago.

Take a theoretical 5.0 x8 to 5.0 x16 switch, connect four 4.0 SSDs and enjoy full bandwidth.
That's what I'm less sure about: I'm afraid you'd only get 4.0 x8 total bandwidth as the lowest common denominator of lanes and data rate.
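To spell out the two outcomes I have in mind (purely illustrative numbers, same per-lane approximations as usual):

Python:
LANE_GBPS = {3: 0.985, 4: 1.969, 5: 3.938}   # approx. GB/s per lane

def link_gbps(gen, lanes):
    return LANE_GBPS[gen] * lanes

# Four Gen4 x4 SSDs behind a switch with a Gen5 x8 upstream link:
ssd_aggregate = 4 * link_gbps(4, 4)     # ~31.5 GB/s offered by the drives

# Outcome 1: every link negotiates independently and the switch buffers packets
upstream_full = link_gbps(5, 8)         # ~31.5 GB/s -- nothing lost

# Outcome 2: the feared end-to-end lowest common denominator
upstream_lcd = link_gbps(4, 8)          # ~15.8 GB/s -- half the bandwidth

print(ssd_aggregate, upstream_full, upstream_lcd)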
As far as I know, there are no specific "cross-generation" switch chips (e.g. 5.0 x4 to 3.0 x16), because it wouldn't make sense. Instead you would use a 5.0 x4 to 5.0 x16 switch and connect the 3.0 devices; the switch chip would negotiate down.
The cost of producing a specific cross-generation chip wouldn't be worth it. You just take a big one (like 5.0 to 5.0) and let it negotiate down.
Of course, cross-generation makes little sense, unless there were huge benefits in trading data rate for cable length, e.g. in SAN appliances, which seem to see far more PCIe switches than anything else. If you look at the vendor sheets, they seem to cater for just about any variation, but only in terms of port counts, which is one of the reasons I'm afraid it's a fundamental issue rather than just a matter of SKU cost.
Is that what you wanted to know?

BTW, look at CXL: it's PCIe 5.0 based, and you will have switches with many ports, just like network switches.
I am looking very much at CXL and UCIe, and in a way that's also why I wonder whether 'true' switching is part of the picture or something that Infinity Fabric (or even NVLink) does much better. AFAIK IF is somewhat similar to CXL in that it's a protocol spoken on top of the same physical layer as PCIe, except that IF seems to replace PCIe whereas CXL is spoken on top of it.

Perhaps Patrick would like to publish a good primer on the topic?
 

i386

Well-Known Member
Mar 18, 2016
How 'switchy' can PCIe switches be?
So please can someone enlighten me?
If I understand PCIe switches correctly, they can be set up any way you want.
A 96-lane PCIe switch can be configured as anything from 96x x1 up to 6x x16 ports. The application and form factor of the switch determine how you would use the ports (e.g. an x16 add-on card with 20x x4 ports for NVMe SSDs won't work because of space on a PCIe add-on card, but the switch would theoretically support it).
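A toy check of that configurability, assuming only that the total lane budget is the limit (real chips also restrict which port widths each port group supports):

Python:
# Hypothetical 96-lane switch: the same lane budget can be carved up many ways.
TOTAL_LANES = 96

def fits(port_widths, budget=TOTAL_LANES):
    """True if a proposed list of port widths stays within the lane budget."""
    return sum(port_widths) <= budget

print(fits([16] * 6))          # 6x x16                          -> True
print(fits([1] * 96))          # 96x x1                          -> True
print(fits([16] + [4] * 20))   # x16 uplink plus 20x x4 for NVMe -> True (exactly 96)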
 

abufrejoval

Member
Sep 1, 2022
If I understand PCIe switches correctly, they can be set up any way you want.
A 96-lane PCIe switch can be configured as anything from 96x x1 up to 6x x16 ports. The application and form factor of the switch determine how you would use the ports (e.g. an x16 add-on card with 20x x4 ports for NVMe SSDs won't work because of space on a PCIe add-on card, but the switch would theoretically support it).
Please note that this is lane switching only. It's not trading, say, 2 lanes of PCIe 2.0 for 1 lane of PCIe 3.0.
 

i386

Well-Known Member
Mar 18, 2016
This would require more logic -> more expensive
(Expensive is something that people using "ryzen servers" probably don't like to hear.)
 

abufrejoval

Member
Sep 1, 2022
Highpoint Tech is one of those companies that has always been rather fast and creative in offering niche products in the SATA/SAS/NVMe space. Given that PCIe 5.0 is becoming mainstream, but people may still have lots of PCIe 3.0 NVMe M.2 drives left over that a) are still too good to throw away (or sell at a pittance) and b) not what you want in those scarce directly connected PCIe 5.0 M.2 slots, I'd have expected them and others to jump at the opportunity to offer a quad M.2 to PCIe 5.0 x4 (or even PCIe 4.0 x4) adapter to collect and recycle all those lesser M.2 SSDs on a typical desktop board, which only has a quad slot to spare without resorting to bifurcation.

None of that happened; they only offer x16 products for each generation. And while you can put those in one of those x4 slots, I'm afraid that a collection of four PCIe 3.0 M.2 drives will then only deliver the bandwidth of a single M.2 drive, even if configured as RAID 0, because the card only does lane switching and will downgrade, not aggregate, the data rate.

But these days you never know whether the lack of a product is because it's
  • technically impossible (the issue here)
  • politically unwanted (market segmentation)
  • too niche a market
  • supply chain issues
E.g. I have no idea why nobody is selling AQC113 PCIe 4.0 x1 10GBase-T adapters, when the chips are clearly sold on motherboards, while AQC107 PCIe 2.0 x4 cards are easy to get.

...and I can't think of anything more useful to put into all these PCIe 4.0 x1 slots... which are incidentally disappearing from the latest generation of PCIe 5.0 boards.
 

abufrejoval

Member
Sep 1, 2022
Aquantia AQC107 10Gbase-T Ethernet NIC has PCIe 3.0 x4
I know that it supports a PCIe 3.0 data rate, but as a 10GBase-T card it neither requires nor benefits from PCIe 3.0 at four lanes. I didn't want to get into that level of detail, but the desire to connect all my machines to my NBase-T network at near-optimal speeds is a constant.

I believe all AQC10x chips have a PCIe 3.0 IP block and, just like their AQC113 brethren (PCIe 4.0 IP block), offer flexible PCIe version and lane combinations as required to support the Ethernet data rate. The 5GBase-T variant is offered with a PCIe 3.0 x1 interface; I haven't seen a two-lane SKU.

The AQC113 quite officially supports PCIe 4.0 x1, PCIe 3.0 x2 and PCIe 2.0 x4 configurations, and it's that type of flexibility I'd like to see at every level.
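A quick sanity check that all three official link options cover 10GBase-T, which needs roughly 1.25 GB/s of payload each way (per-lane figures are the usual approximations):

Python:
LANE_GBPS = {2: 0.5, 3: 0.985, 4: 1.969}    # approx. GB/s per lane

# The three AQC113 link configurations mentioned above
options = {"Gen4 x1": (4, 1), "Gen3 x2": (3, 2), "Gen2 x4": (2, 4)}

for name, (gen, lanes) in options.items():
    bw = LANE_GBPS[gen] * lanes
    print(f"{name}: ~{bw:.2f} GB/s -> {'enough' if bw >= 1.25 else 'too slow'} for 10GbE")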
 

abufrejoval

Member
Sep 1, 2022
This would require more logic -> more expensive
(Expensive is something that people using "ryzen servers" probably don't like to hear.)
I won't try to correct your bias.

What I want to know is whether additional logic could do it, or whether that type of flexibility simply wasn't designed into the PCIe standard when it was created years ago.

And if it wasn't designed in back then, does that mean it's also missing from its heirs, CXL and UCIe?

Which for me would be a much bigger issue than parsimonious users preferring AMD.
 

RolloZ170

Well-Known Member
Apr 24, 2016
What I want to know is whether additional logic could do it, or whether that type of flexibility simply wasn't designed into the PCIe standard when it was created years ago
The slow data from the PCIe 3.0 M.2 drives must be converted to PCIe 5.0 and sent to the motherboard slot. What if all four M.2 drives want to send at the same time?
In the other direction, the fast data must be buffered and delivered slowly to the target M.2 device.
This is possible of course, but it needs a very complex ASIC or something similar.
A PCIe 5.0 to 16x SATA SSD card is easier to implement because each SATA SSD uses one bidirectional lane, whereas M.2 PCIe drives use 2/4 lanes which are bonded in parallel.
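Rough numbers for that worst case (a sketch only; bandwidth derived from the raw line rate and 128b/130b encoding, protocol overhead ignored):

Python:
# GB/s per lane derived from the raw rate (GT/s) and 128b/130b encoding
def lane_gbps(gt_per_s):
    return gt_per_s * 128 / 130 / 8

# Four Gen3 (8 GT/s) x4 M.2 drives all sending towards the host at once...
ingress = 4 * 4 * lane_gbps(8)     # ~15.75 GB/s offered

# ...through a single Gen5 (32 GT/s) x4 link to the motherboard slot:
egress = 4 * lane_gbps(32)         # ~15.75 GB/s available

# The sums match, so only small per-packet buffers would be needed; sustained
# overload would be handled by PCIe's credit-based flow control throttling
# the senders, not by deep buffering in the switch.
print(round(ingress, 2), round(egress, 2))   # 15.75 15.75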
 

i386

Well-Known Member
Mar 18, 2016
Forget everything that was posted so far; it should be possible with a PCIe switch:
(Image taken from servethehome.com: https://www.servethehome.com/a-new-pcie-switch-option-the-microsemi-switchtec-pcie-3-switch/)

Your NIC (or any other PCIe device) could connect to a stack and would be processed in the core at the configured PCIe version/speed and lane count.
Example with the switch from the picture:
A NIC with 4 PCIe 3.0 lanes connects to stack0.
From stack3, 1 PCIe 5.0 lane could be used to connect the switch to the host.
 

abufrejoval

Member
Sep 1, 2022
The slow data from the PCIe 3.0 M.2 drives must be converted to PCIe 5.0 and sent to the motherboard slot. What if all four M.2 drives want to send at the same time?
In the other direction, the fast data must be buffered and delivered slowly to the target M.2 device.
This is possible of course, but it needs a very complex ASIC or something similar.
A PCIe 5.0 to 16x SATA SSD card is easier to implement because each SATA SSD uses one bidirectional lane, whereas M.2 PCIe drives use 2/4 lanes which are bonded in parallel.
Well, arbitration and oversubscription aren't new topics in PCIe, and when your switch buffers packets (between 128 and 4096 bytes on PCIe, up to 9000 bytes on entry-level Ethernet), there is no fundamental technical issue in aggregating/disaggregating data between different combinations of lanes and data rates: Ethernet does it all the time, and with NBase-T any combination of 1/2.5/5/10 Gbit/s packet switching with fully transparent buffering is quite normal. By 1990 standards that is a complex ASIC; thirty years later it's a workload of very few watts (without the PHYs) on an 8x 10Gbit chip costing a few bucks for Ethernet.
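To put rough numbers on that buffering cost (packet sizes as quoted above; a sketch only):

Python:
# Maximum payload sizes from the paragraph above, in bytes.
PCIE_MAX_PAYLOAD = 4096    # largest PCIe TLP payload (128-512 is more typical in practice)
ETHERNET_JUMBO   = 9000    # jumbo frame on an entry-level Ethernet switch

# Store-and-forward cost for keeping a handful of packets in flight per port:
IN_FLIGHT = 8
print(PCIE_MAX_PAYLOAD * IN_FLIGHT)   # 32768 bytes, i.e. 32 KiB per port
print(ETHERNET_JUMBO * IN_FLIGHT)     # 72000 bytes per port
# Either way the buffer is tiny by modern ASIC standards, which is the point:
# buffering as such is not what makes rate adaptation hard.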

PCIe should not be a lot more complex, but prices there remain high after several waves of vendor mergers.

Actually SATA is a completely different protocol, requiring the full effort of a PCIe interface plus a SATA controller. It's cheap now because the IP blocks are old and available as commodities from various vendors, but if you had to implement a PCIe switch chip and a PCIe SATA controller from a green field, the PCIe switch chip might be cheaper.
 

abufrejoval

Member
Sep 1, 2022
Forget everything that was posted so far; it should be possible with a PCIe switch:
(Image taken from servethehome.com: https://www.servethehome.com/a-new-pcie-switch-option-the-microsemi-switchtec-pcie-3-switch/)

Your NIC (or any other PCIe device) could connect to a stack and would be processed in the core at the configured PCIe version/speed and lane count.
Example with the switch from the picture:
A NIC with 4 PCIe 3.0 lanes connects to stack0.
From stack3, 1 PCIe 5.0 lane could be used to connect the switch to the host.
Everything I see in that article is PCIe 3.0 (obviously backward compatible) and pure lane switching.

It allows oversubscription of lanes via the switch chips, so you can connect, say, six switch chips with 4 NVMe x4 drives each via an x16 interface to a single x16 interface on the SoC, but it doesn't allow you to trade that PCIe 3.0 x16 port for a PCIe 5.0 x4 port (or PCIe 4.0 x8) that offers the same bandwidth.

In Ethernet that's trivial: you can have four 2.5Gbit ports feed a single 10Gbit upstream port and everybody is able to talk to an upstream server at full bandwidth. With PCIe it seems that a transfer from any of the 2.5Gbit nodes to the server would switch the server's lane/port down to a 2.5Gbit rate while they are talking, and if all four of them want to talk to the server at the same time, the aggregate bandwidth is stuck at 2.5 Gbit/s.
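A toy store-and-forward model of that Ethernet case (illustrative numbers only):

Python:
# Four 2.5 Gbit/s ports feeding one 10 Gbit/s uplink, store-and-forward style.
PORT_RATE   = 2.5e9 / 8    # bytes per second per downstream port
UPLINK_RATE = 10e9 / 8     # bytes per second on the uplink
FRAME       = 1500         # bytes per Ethernet frame

t_in  = FRAME / PORT_RATE      # ~4.8 us to receive one frame on a slow port
t_out = FRAME / UPLINK_RATE    # ~1.2 us to forward it on the fast uplink

# While one frame is still trickling in on port A, the uplink can forward frames
# already buffered from ports B, C and D -- so four ports saturate the uplink.
print(4 * PORT_RATE <= UPLINK_RATE)                  # True: the aggregate fits exactly
print(round(t_in * 1e6, 1), round(t_out * 1e6, 1))   # 4.8 1.2 (microseconds)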
 

abufrejoval

Member
Sep 1, 2022
a switch is like an electrical switch: 4x AC 230V in, 4x4 AC 230V out.
You want 4x AC 230V in but 4x4 +12V DC out. Note you have to convert the +12V DC back to AC 230V.
"Switch" unfortunately doesn't have unique well defined semantics in IT, even less so in English: that's the root of the issue.

Ethernet easily switches across line rates and ports, and such switches correspond much more to my understanding of the term, either because they came much earlier in my IT career (starting when 100MBit over RJ45 cables was the hot new tech after 10MBit thick and thin Ethernet), or because I'm German and use "Schalter" to describe an old-fashioned binary light switch, while an Ethernet switch is a "switch" in all four languages that I have to deal with every day.

A train switch ("Weiche" in German and only used in the context of train tracks), would be an example of cut-through line switch operation: once the first pair of wheels has passed it, you'd rather not change the lane configuration.

But once that "conversation" has finished (and the train has passed), the topology can be changed almost completely (in a setup with lots of switches and lanes). In a way trains might best resemble PCIe, even if the individual wagons (packets) could physically be switched to different lanes, they are intrinsically linked (for routing and propulsion) in a manner that precludes topology changes within a series of packets.

(And when you go into AC/DC and voltage conversion, you're talking transformers, some of which these days are actually 'switching' (and no longer inductive), but that's a very muddy metaphor, so let's please let that slink into the abyss for now...)

What enables Ethernet to switch packets independently across line rates is independent packets and the ability to buffer them.

PCIe uses packets, and those are also limited in size; far smaller on average than Ethernet frames incidentally, although not quite as small as the ATM cells that were once quite common in long-distance networking.

Hence my puzzlement and original question: can PCIe support 'real' Ethernet-type packet switching, including line rate adaptation, or is it limited to lane switching, even though it is packet based, but connection oriented, somewhat like X.25 was?

And is it a true technical limitation that carries forward into CXL and UCIe, or is it mostly an implementation issue that could be resolved with better negotiation protocols in those successors?
 

RolloZ170

Well-Known Member
Apr 24, 2016
Switch" unfortunately doesn't have unique well defined semantics in IT
Sure, but in this case you need a PCIe-to-NVMe host controller, not a switch.
The switch chip is not an IT device, it's only a chip on a PCB.
Hence my puzzlement and original question: can PCIe support 'real' Ethernet-type packet switching, including line rate adaptation, or is it limited to lane switching, even though it is packet based, but connection oriented, somewhat like X.25 was?
The PCIe protocol does not provide that. You need some intelligence between the two worlds.
 

abufrejoval

Member
Sep 1, 2022
Sure, but in this case you need a PCIe-to-NVMe host controller, not a switch.
The switch chip is not an IT device, it's only a chip on a PCB.
PCIe<->NVMe is only running traces between different form factors. An M.2 or U.2 connector is really just 4 PCIe lanes and the controller always sits on the SSD itself.

On one of my Xeon-D based systems I've basically swapped the standard allocations: I moved the 10Gbit NIC that used to sit in the main PCIe x16 slot to the single M.2 connector, via an adapter that is little more than a €20 ribbon cable connected to an M.2 2280 form-factor PCB, while the x16 PCIe slot now holds a primitive €50 PCB that offers mounts for up to 4 M.2 NVMe SSDs in a bifurcation configuration. The Xeon-D is responsible for splitting the single x16 slot into four x4 links.

Swapping the M.2 and the PCIe x16 allowed me to add three additional NVMe SSDs and max out all available slot lanes on that Supermicro board, but that's PCIe 3.0 throughout. Perhaps you can see why I'd want a rate-adapting combination of those two to put into a Ryzen 7000 M.2 PCIe 5.0 slot, instead of underutilizing it with a PCIe 3.0 variant or having to repurchase the capacity in a PCIe 5.0 variant: four 970 Evo Plus drives I already own would deliver 8TB of SSD at pretty near PCIe 5.0 speeds with a proper switch chip on a single M.2 port.

I'd consider PCIe switches IT devices, because they speak a protocol and need to be programmed/configured into a topology, typically by the BIOS at boot. But since PCIe supports hot-plug and reconfiguration, there also needs to be run-time support. Even more so with Thunderbolt, which also carries PCIe, just in yet another form factor. Dolphin ICS from Norway has produced external rack-mount PCIe switches and the host adapters to connect to them for many years; they just haven't seen a lot of commercial success against InfiniBand and Ethernet outside niche applications.

(I've always wanted them to produce a variant with TB ports and USB economy, because I operate clusters based on Intel NUCs that come with TB3 ports. Direct-connect TCP/IP networks over TB work and are much cheaper than the TB3-based Aquantia NICs I eventually used, but they are a bit of a mess for lack of static MAC addresses.)

PCIe topologies can be quite complex: they can include redundancy, span multiple hosts (root ports) and contain dozens if not hundreds of devices. In a solid-state storage SAN box they will reach that level of complexity, but switch chips have largely disappeared from server and desktop mainboards, partly because of the price hikes caused by vendor mergers.

I still use a 2.4TB FusionIO card which internally consists of two 1.2TB units connected via a PCIe switch on the card. Each unit is PCIe 2.0 x8, while the card connects to the host via PCIe 2.0 x8, for a 2:1 radix or oversubscription, essentially sharing the bandwidth.

The PCIe protocol does not provide that. You need some intelligence between the two worlds.
It seems 90% there, since it must already provide for topology changes, redundancy/alternate paths, lane count negotiations and data rate adjustments.

And I guess PCIe switches could essentially just lie and fake the ability to support 2x/4x lane counts towards a device that can't support the higher data rates, and internally carry those x8 or x16 PCIe lanes across a physical x4 link at PCIe 4.0 or 5.0 data rates. If PCIe had been designed from the ground up to translate the doubling of data rates in each revision into higher virtual lane counts, we wouldn't have all these issues (but surely others).
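As a pure thought experiment (nothing like this exists in the PCIe spec as far as I know), that 'lying switch' would boil down to a bandwidth-preserving translation:

Python:
LANE_GBPS = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969, 5: 3.938}   # approx. GB/s per lane

def equivalent_width(dev_gen, dev_lanes, host_gen):
    """How many host-side lanes carry the same bandwidth as the device link?
    Purely illustrative of 'trading lanes for data rate'; real PCIe negotiates
    width and speed per link and offers no such translation."""
    needed = LANE_GBPS[dev_gen] * dev_lanes
    return max(1, round(needed / LANE_GBPS[host_gen]))

print(equivalent_width(3, 16, 5))   # 4: a Gen3 x16 device fits behind Gen5 x4
print(equivalent_width(2, 4, 4))    # 1: a Gen2 x4 NIC fits behind a single Gen4 lane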

I guess mostly I'm just astonished that, given the broad range of PCIe revisions and their 1:16 bandwidth spread, this isn't even being discussed: all sorts of technically impossible things get all kinds of discussion, and this one doesn't really seem so far out, especially since PCIe is becoming quite universal now with CXL and UCIe.
 