How 'switchy' can PCIe switches be?


NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,343
819
113
I'm gonna post two links:


A PCIe switch as a Bandwidth Bridge

In the example above, it was determined that the PCIe device connecting to the USB device is integral and cannot be replaced. However, sacrificing performance is not a valid option either. In this case, a PCIe switch can be used as a bandwidth bridge. Consider the PEX 8604, a 4-lane, 2-port PCIe switch with 5.0 GT/s SerDes. Each port implements two PCIe lanes. The PCIe switch can be used to match the bandwidth between both devices by connecting two lanes from one switch port to two 2.5 GT/s lanes on the one device, and a single lane on the second interface to the USB device. The bandwidth of the single 5.0 GT/s lane is matched by the aggregate bandwidth of the dual 2.5 GT/s lanes.

Conclusion

PCIe switches are typically used in fan-out applications where IO expansion is desired. However, the flexibility in the configuration of PCIe switches available from PLX Technology allows system vendors to overcome bandwidth limitations introduced by other components in the system. When the PCIe switch is used as a bandwidth bridge, system designers can take advantage of the full performance capabilities of all the PCIe devices.
-> What I said in my post (e.g. x4 4.0 to 2x x4 3.0) is possible
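As a back-of-the-envelope check of the quoted app note's arithmetic, here is a minimal sketch (my own numbers; it assumes only the raw signaling rates and the 8b/10b encoding that Gen1/Gen2 links use):

```python
# Sketch: usable bandwidth per lane for 8b/10b-encoded PCIe generations
# (Gen1 at 2.5 GT/s and Gen2 at 5.0 GT/s carry 10 bits per payload byte).

def usable_gbytes_per_s(gt_per_s: float) -> float:
    """Approximate usable GB/s of one lane after 8b/10b encoding."""
    return gt_per_s / 10.0

usb_side   = 1 * usable_gbytes_per_s(5.0)  # single Gen2 lane to the USB device
other_side = 2 * usable_gbytes_per_s(2.5)  # two Gen1 lanes to the other device

print(f"USB side  : {usb_side:.2f} GB/s")    # 0.50 GB/s
print(f"Other side: {other_side:.2f} GB/s")  # 0.50 GB/s -> the aggregate matches
```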
 
  • Like
Reactions: i386

i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
The second link describes the same thing that I tried to explain with that Microsemi PCIe switch :D
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,343
819
113
Well the point is that it's still called a PCIe switch, because that's exactly what it does. Like an Ethernet switch, all ports are able to negotiate independently
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,363
1,612
113
Like an Ethernet switch, all ports are able to negotiate independently
Some units do have extra features, but I doubt that this auto-negotiation is part of the definition of "switch".
If it were, all Ethernet switches would have to auto-negotiate from 10M/100M/1000M/1G/10G... (or a specific range, to be fair).
To compare Ethernet with a PCIe x4 port you would have to team 4 LAN ports, then we can talk further.
What is required is a bridge, not a switch.

To prove that, we would just have to take "the switch chip" out of an Ethernet switch unit.
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,343
819
113
Some units do have extra features, but I doubt that this auto-negotiation is part of the definition of "switch".
I do - And I even believe it is mandated by PCIe spec (devices need to be able to negotiate down)


auto-negotiation is part of the definition of "switch".
If it were, all Ethernet switches would have to auto-negotiate from 10M/100M/1000M/1G/10G... (or a specific range, to be fair).
Ethernet switches do that?


to compare Ethernet with PCIe x4 port you have to Team 4 LAN, then we can talk further.
required is a bridge not a switch.
Compare SFP with QSFP:
Almost all switches can split QSFP into 4x SFP (40G to 4x10G), so Ethernet switches are perfectly capable of doing that

Look at it from this perspective:
What is the difference between multiplexing 4x4 3.0 lanes to 8x 3.0 lanes or multiplexing 4x4 3.0 lanes to 4x 4.0 lanes?
There isn't really one - in each case upstream bandwidth is less than the downstream bandwidth.
I doubt a switch chip that can only do the same version on all ports would sell well
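To put rough numbers on that (my own sketch, assuming ~0.985 GB/s usable per Gen3 lane and ~1.969 GB/s per Gen4 lane after 128b/130b encoding):

```python
# Sketch: the oversubscription ratio is the same whether the upstream port
# doubles its lane count (x8 Gen3) or doubles its rate (x4 Gen4).
PER_LANE_GBS = {"3.0": 8.0 * 128 / 130 / 8, "4.0": 16.0 * 128 / 130 / 8}

def aggregate(gen: str, lanes: int) -> float:
    return PER_LANE_GBS[gen] * lanes

downstream = 4 * aggregate("3.0", 4)  # four x4 Gen3 downstream ports
for label, upstream in [("x8 Gen3", aggregate("3.0", 8)),
                        ("x4 Gen4", aggregate("4.0", 4))]:
    print(f"{label}: {upstream:.1f} GB/s up vs {downstream:.1f} GB/s down"
          f" -> {downstream / upstream:.1f}:1 oversubscription")
# Both print ~7.9 GB/s up vs ~15.8 GB/s down, i.e. the same 2:1 ratio.
```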
 

abufrejoval

Member
Sep 1, 2022
39
10
8
Some units do have extra features, but I doubt that this auto-negotiation is part of the definition of "switch".
If it were, all Ethernet switches would have to auto-negotiate from 10M/100M/1000M/1G/10G... (or a specific range, to be fair).
To compare Ethernet with a PCIe x4 port you would have to team 4 LAN ports, then we can talk further.
What is required is a bridge, not a switch.

To prove that, we would just have to take "the switch chip" out of an Ethernet switch unit.
Negotiating lane speed is part of any PCIe device, not just switches. Anything else would break interoperability and backward compatibility.

And it's even being done dynamically, not just to match capabilities, but also for energy savings. Most GPUs seem to fall back to 2.5GT/s when they are idle or doing 2D only.
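You can watch that downtraining happen on a Linux box, since the kernel exposes the negotiated link state in sysfs. A minimal sketch (the device address is just a placeholder for your GPU's bus address):

```python
# Sketch: read the currently negotiated PCIe link speed/width of a device
# from sysfs (Linux). Watch the values change as a GPU goes idle or busy.
from pathlib import Path

def link_state(bdf: str = "0000:01:00.0") -> dict:
    dev = Path("/sys/bus/pci/devices") / bdf
    return {name: (dev / name).read_text().strip()
            for name in ("current_link_speed", "current_link_width",
                         "max_link_speed", "max_link_width")}

if __name__ == "__main__":
    # An idle GPU will typically report 2.5 GT/s here and jump back to
    # its maximum rate once a 3D workload starts.
    print(link_state())
```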

My questions are more like:
  • Are rates negotiated/adjusted hop-to-hop or end-to-end? (there wouldn't be any benefit to running intermediate hops at higher link speeds when lane counts are static)
  • Are lane counts negotiated/adjusted hop-to-hop or end-to-end? (bandwidth bridge potential)
And then perhaps differentiating between what the spec supports vs. what specific switch chips might be capable of.
 

abufrejoval

Member
Sep 1, 2022
39
10
8
...
Look at it from this perspective:
What is the difference between multiplexing 4x4 3.0 lanes to 8x 3.0 lanes or multiplexing 4x4 3.0 lanes to 4x 4.0 lanes?
There isn't really one - in each case upstream bandwidth is less than the downstream bandwidth.
I doubt a switch chip that can only do the same version on all ports would sell well
Switch chips just like any other PCIe device will obviously negotiate lane speed for any port and independently for each if they have more than one.

But PCIe doesn't seem to be purely packet based in the sense that it also implements a higher level link layer end-to-end.
That link layer may be conversation based rather than static (similar to X.25), or it could be a bit more static, because device and topology discovery is expensive and it's modelled after PCI after all, which was much more static.

And that link layer may or may not include rate and lane data. If it doesn't, that would open up more flexibility; if it does, you might need to start a new conversation or essentially go through a plug event.

I've bought a PCIe bible of 1000 pages and I am trying to scan through it...
 

RolloZ170

Well-Known Member
Apr 24, 2016
5,363
1,612
113
I do - And I even believe it is mandated by PCIe spec (devices need to be able to negotiate down)
We were talking about 'switch', not PCIe specs.
If a PCIe device negotiates down, both ends are slow.
Ethernet switches do that?
Like an Ethernet switch, all ports are able to negotiate independently
You said that.
Almost all switches can split QSFP into 4x SFP (40G to 4x10G), so Ethernet switches are perfectly capable of doing that
And there is "one switch chip" inside?
Switch chips just like any other PCIe device will obviously negotiate lane speed
A switch chip is not a device; it's a switch, which cannot negotiate anything with either the PCIe host or the PCIe destination device.
What you mean is called a bridge.
What is the difference between multiplexing 4x4 3.0 lanes to 8x 3.0 lanes or multiplexing 4x4 3.0 lanes to 4x 4.0 lanes?
Translating between different PCIe generations needs data buffers/stacks/transceivers.
4x PCIe 4.0 is ONE device, not four.
 

i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
A PCIe switch is a device and will show up as such (example is from Microchip):
[Attached image: a Microchip PCIe switch showing up as a device]
 

abufrejoval

Member
Sep 1, 2022
39
10
8
Here is an interesting article that Charlie Demerjian wrote in 2015, describing what's possible in terms of topologies and dynamic changes thereof: Avago’s PEX9700 turns the PLX PCIe3 switch into a fabric

PLX has gone through various hands and is now part of Broadcom, while Microsemi is now Microchip, the other major vendor that I can see. I'm afraid that being in the switching business has these companies executing M&As at blinding speeds ;-)

But while there are tons of marketing material out there, an easy answer on the ability to trade line rates vs. lane counts is still eluding me...
 

i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
Transparently? No, the PCIe switch adds latency.

Another try at explaining how it could work :D
[Attached image: Microsemi Switchtec PCIe switch architecture diagram]
 

NablaSquaredG

Layer 1 Magician
Aug 17, 2020
1,343
819
113
We were talking about 'switch', not PCIe specs.
If a PCIe device negotiates down, both ends are slow.
What? A PCIe switch needs to conform to PCIe specs

You said that.
Yes, because you only stated that Ethernet switches *should* do that (hinting that they don't), while in fact they certainly do.

And there is "one switch chip" inside?
Yes. Look e.g. at Broadcom Tomahawk 4
The Broadcom® BCM56990 family is a class of high-radix, high-bandwidth network switching devices supporting up to 64 × 400GbE, 128 × 200GbE, 256 × 100GbE, 256 × 40GbE, 256 × 25GbE, or 256 × 10GbE ports.
Each port can be split on its own (although, because of physical limitations, you wouldn't be able to split a QSFP28 port that's wired as 100G into 8x 10G; you can only split it into 4 lanes, but those could also be 4x 1G)

A switch chip is not a device; it's a switch, which cannot negotiate anything with either the PCIe host or the PCIe destination device.
What you mean is called a bridge.
That is not true. Most switch chips offer various functionality and programmability, including non-transparent bridges, etc...


Translating between different PCIe generations needs data buffers/stacks/transceivers.
4x PCIe 4.0 is ONE device, not four.
You need the same even if you don't translate between generations: 4x PCIe 3.0 x4 > 1x PCIe 3.0 x8, so even when upstream and downstream use the same generation, you need data buffers.
Re transceivers: they are all able to negotiate down
 

abufrejoval

Member
Sep 1, 2022
39
10
8
Transparent and undetectable aren't quite the same. All of virtualization is about lying to software to the point where it doesn't care that it's being lied to.

Of course, you'd need to do some work on the control and data plane that will cost some latency, but nothing like copper Ethernet PHYs.

Buffer sizes seem to be short in PCIe: early-generation chipsets only supported packets in the 128-byte range, the theoretical max is 4k, and I don't have the foggiest idea whether that's common today. Unless cut-through processing is possible, buffer size will have the largest latency impact, but it would be no different from what chipset-connected slots have to suffer already.

What I mean by transparency is whether the devices on both ends of the conversation need to know the lane/rate data for all hops, or should be happy just knowing it for their immediate up/downstream partner.

E.g. you have an Nvidia RTX 2080 Ti GPU that maxes out at PCIe 3.0 x16, but you'd prefer not to waste 16 lanes on your PCIe 5.0-capable Ryzen 7000.

Without a switch chip, the link between the two would be negotiated to the lowest common denominator, PCIe 3.0 x16: no bandwidth constraints for the GPU, but very few lanes left.

If you put a 100Gbit Infiniband or Ethernet SmartNIC onto 'the other' x16 slot on your mainboard, bifurcation will reduce the link to PCIe 3.0 x8, halving the GPU bandwidth and impacting the 100Gbit NIC, if that's also running PCIe 3.0.

If you have a 'bandwidth trading' switch chip in the middle, that could operate at PCIe 3.0 x16 towards the GPU and at PCIe 4.0 x8/PCIe 5.0 x4 towards the CPU SoC, leaving the 2nd set of 8 PCIe 5.0 lanes unaffected. With bits of ribbon cables and creative mounting of the switch chip PCB, neither the mainboard nor the GPU would even need changes and you could in fact use an M.2 connector (PCIe 5.0 x4) to run your RTX 20 GPU without losing bandwidth (only a bit of latency).

But for that to work, the Nvidia GPU shouldn't be aware that you're using PCIe 4.0 or 5.0 data rates in combination with a reduced lane count on the switch-to-SoC hop, nor should the SoC try to talk to the GPU on 16 lanes.

So if the two exchange end-to-end topology information at the start of their conversation (or at discovery), full slots and full-lane switching is the only way the GPU will have full bandwidth while the SoC still has some bandwidth left for similarly high-speed devices, which puts it out of a typical desktop price range.

Break-out adapters with PCIe switches on a small PCB that either plug into M.2 or x4 or x8 PCIe slots and translate to wider/lower rate slot connectors for older generation GPUs and xPUs could be a hot sell in today's transition environment, even if they pose some mounting challenges.
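For what it's worth, the bandwidth side of that GPU scenario works out on paper (my own sketch, using approximate post-encoding per-lane rates, not figures from this thread):

```python
# Sketch: a PCIe 3.0 x16 link carries roughly the same usable bandwidth
# as PCIe 4.0 x8 or PCIe 5.0 x4, which is what a rate/lane-trading switch
# in the middle would rely on.
PER_LANE_GBS = {"3.0": 0.985, "4.0": 1.969, "5.0": 3.938}

def link(gen: str, lanes: int) -> float:
    return PER_LANE_GBS[gen] * lanes

gpu_side = link("3.0", 16)
print(f"GPU side  (3.0 x16): {gpu_side:.1f} GB/s")
for gen, lanes in [("4.0", 8), ("5.0", 4)]:
    print(f"Host side ({gen} x{lanes}) : {link(gen, lanes):.1f} GB/s")
# All three come out at ~15.8 GB/s, so the GPU would keep its full
# bandwidth while the host spends only 8 or 4 lanes on it.
```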
 
Last edited:
  • Like
Reactions: abq

i386

Well-Known Member
Mar 18, 2016
4,245
1,546
113
34
Germany
  • Like
Reactions: abufrejoval

acquacow

Well-Known Member
Feb 15, 2017
787
439
63
42
As someone who has used PLX switches in the past to connect 64-128 individual PCIe cards to a single host, I can say PCIe switches are very "switchy".

In the case above, I was using external PCIe expansion chassis from One Stop Systems, which involve a x16 PCIe HBA on the host side that can be directly connected to an expansion chassis, or split out 4 ways to 4 chassis, etc... Add more HBAs for more chassis, etc...

In the case of a home PC, your good PCIe slots are going to be direct-wired to the CPU in most cases, so there's no switching unless you put an add-in card into a slot that has switches on it to split bandwidth between M.2 or other cards. There are single PCIe cards from Synology and Drobo that include an Aquantia 10-gig NIC and dual M.2 on the same board... they are a good example.

In most gaming systems you aren't working with an HEDT platform with 40 lanes, so you only get one or two slots with dedicated lanes; the rest come off the I/O hub, which splits a few PCIe lanes (2 or 4) out to USB/SATA/PCIe/etc... So depending on what you are doing, you can put your 10-gig NIC into one of those slots and it'll share bandwidth with your USB and SATA devices, which is probably fine.
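Rough numbers behind that "probably fine", as a sketch (the chipset uplink width and device figures are assumptions; they vary by platform):

```python
# Sketch: a typical chipset (I/O hub) uplink of PCIe 3.0 x4 versus the
# devices hanging off it.
uplink  = 4 * 0.985    # ~3.9 GB/s usable on a Gen3 x4 chipset uplink
nic_10g = 10 / 8       # a 10GbE NIC moves at most ~1.25 GB/s
sata    = 0.55         # one SATA SSD tops out around 0.55 GB/s

used = nic_10g + sata
print(f"Uplink ~{uplink:.1f} GB/s, 10G NIC + one SATA SSD ~{used:.1f} GB/s, "
      f"headroom ~{uplink - used:.1f} GB/s")
# Plenty of headroom unless several fast devices burst at the same time.
```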


-- Dave
 

salotelsingha

New Member
Sep 14, 2022
1
0
1
I believe all AQC10x chips have a PCIe 3.0 IP block and, just like their AQC113 brethren (PCIe 4.0 IP block), they offer flexible PCIe version and lane combinations as required to support the Ethernet data rate. The 5GBASE-T variant is offered with a PCIe 3.0 x1 interface; I haven't seen a 2-lane SKU.
 
Last edited:

abufrejoval

Member
Sep 1, 2022
39
10
8
So here's what I've found after starting to read a book on PCIe and digging a bit into HWinfo details on my systems:

  • The striping across the PCIe lanes happens at the very lowest physical level of the protocol, far removed from any software control. And since congestion control/resends etc. (somewhat like TCP for networks, but hop-to-hop only) also happen below the software layer, software simply doesn't care about or determine what rate or number of lanes to use
  • PCIe really is fully packet-switched and buffered. And that's not just the switches, but any device (root, bridge or target), which has to maintain PCIe buffers to enable resends. Those buffer packet sizes tend to be rather small: I've seen 128 bytes/packet on my Xeon E3 (basically a Haswell desktop), 256 bytes on my Xeon E5 (Haswell workstation) and 512 bytes on my Ryzen 5950X (modern workstation)
  • Packet sizes need to match, too, so it's typically the lowest common denominator that's used. And I believe that's not software transparent, so you won't have packet disassembly/reassembly, which might happen on an IP network. E.g. I have an LSI hardware RAID that reports support for a 2k packet size, but since the Xeon E3 only allows 128 bytes, that's what's being used. GPUs also tend to be in the 128-256 byte area (GTX 1080 to RTX 3090); see the sketch below for what those sizes mean for efficiency.
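To give a feel for what those payload sizes cost, a small sketch (my own approximation, assuming roughly 24 bytes of TLP header, sequence number, LCRC and framing overhead per packet):

```python
# Sketch: link efficiency as a function of the maximum payload size (MPS).
OVERHEAD_BYTES = 24  # assumed per-TLP header/framing overhead

for mps in (128, 256, 512, 4096):
    efficiency = mps / (mps + OVERHEAD_BYTES)
    print(f"MPS {mps:>4} bytes -> ~{efficiency:.0%} of link bandwidth left for payload")
# 128 B -> ~84%, 256 B -> ~91%, 512 B -> ~96%, 4096 B -> ~99%
```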
So there should be no technical obstacle for anyone selling a quad M.2 to single M.2/PCIe x4 adapter that can operate, say, four Samsung 970 Evo+ NVMe drives (PCIe 3.0) near their full sustained bandwidth via a PCIe 5.0 x4 connection (slot or M.2 with ribbon extender) on the host, using a 20-lane PCIe 5.0 switch chip. Likewise, 2 PCIe 4.0 NVMe drives should experience few constraints, and it's only when you really start oversubscribing (e.g. using four PCIe 4.0 NVMe drives) that things could start piling up (which still doesn't make it useless).
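The oversubscription math behind that adapter idea, as a minimal sketch (the per-drive throughput figures are rough assumptions, not measurements):

```python
# Sketch: NVMe drives behind a single PCIe 5.0 x4 uplink (~15.8 GB/s usable).
UPLINK_GBS = 4 * 3.938

scenarios = {
    "4x PCIe 3.0 drives (~3.5 GB/s each)": 4 * 3.5,
    "2x PCIe 4.0 drives (~7.0 GB/s each)": 2 * 7.0,
    "4x PCIe 4.0 drives (~7.0 GB/s each)": 4 * 7.0,
}
for name, demand in scenarios.items():
    verdict = ("fits" if demand <= UPLINK_GBS
               else f"oversubscribed {demand / UPLINK_GBS:.1f}:1")
    print(f"{name}: ~{demand:.0f} GB/s vs ~{UPLINK_GBS:.0f} GB/s uplink -> {verdict}")
# Only the last case oversubscribes the uplink, and only when all four
# drives stream at full speed at the same time.
```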

And in a way that would be little different from an external Thunderbolt chassis--for the PCIe parts at least.

Take all of the above with a shaker of salt, because I've only scanned the first 100 pages of the book and might have misunderstood aplenty.

I'd say right now that could be an attractive transitional niche, if a 20-lane PCIe 5.0 switch chip were affordable enough in terms of money and power consumption.

And of course, there'd be nothing to keep you from only using 3 out of 4 M.2 slots for storage and sticking a 10GBASE-T NIC like the AQC107 into the fourth via one of those ribbon cable adapters that offer a PCIe x4 socket to M.2 NVMe form factor conversion (I use that to plug my 10Gbit NIC into an M.2, using the x16 for a bifurcating 4x M.2 board).

It would be a nice thing to use on, say, Ryzen 7000, where hard-wired slot allocations make it a bit difficult to really get all the bandwidth out of the 24 lanes of PCIe 5.0 the chip supports.

But only if the price is right, which could be the real issue with those switch chips.
 
  • Like
Reactions: abq