SXM2 over PCIe

gsrcrxsi · Mar 9, 2024

yeah. P100 isn’t quite half, more like 2/3. 18.8 (4xP100) vs 28 (4xV100)

“Meh” tensor > 0 tensor (P100)

my work doesn’t “need” Volta. But the Volta implementation of MPS is better. MPS defaults to the older version on Pascal and older. So I greatly prefer it Multi-Process Service :: GPU Deployment and Management Documentation

My workload “mostly” fits in ~12GB on the Titan Vs I have now, but I get some 5% of failures that end up running out of memory. The 16GB V100 will solve that. The workload doesn’t use the GPUs in nvlink, it treats them individually. I’m not better off with 2x 3090 because it sucks for FP64. The job takes ~5 mins on a TitanV and like 30-40mins on a 3090.

bayleyw · Mar 9, 2024

The problem with Volta is it is on the verge of being dropped from mainstream frameworks and libraries. I would expect the sunsetting to begin when Blackwell launches, the GPU shortage somewhat eases, and the last V100 clusters get dismantled. There already aren't any V100's in any public clouds which really limits motivation to continue developing libraries, plus GV100 is missing so many mixed precision features - it lacks support for int8, bf16, and fp8, and has much less SRAM than A100 or H100.

Libraries should continue to run for quite some time, but the Volta-Ampere performance gap is only going to grow over time - it started at 1.7x and is now almost 4x for transformer models (and yes, you could fix this by writing Volta-optimized libraries yourself, but that's a lot of work!)

gsrcrxsi · Mar 10, 2024

I’m not sure what your point is anymore. You were initially pushing for the even older P100 on the basis of cost. Now you’re saying V100 isn’t good cause it’s too old and will lose support? When you don’t even know my workload? lol. My workload won’t lose Volta support for the foreseeable future. Not as long as Nvidia supports it in their drivers. I’ve given this a lot of thought for my specific use case.

my work doesn’t need the VRAM where an A100 is warranted. And the cost is still out of reach to replace my fleet with equal power of A100s. I have 19x Titan Vs now. I can replace them all with V100s for less than the market price of the Titan Vs…

the A100 is fast, I’ve tested it on cloud instances. But price still too high, and not as much faster than V100 than the flops increase (along with some computational efficiency boost by being able to run 2-3 tasks at the same time with MPS). It’s maybe 2-3x faster at 30x the cost. And SXM versions are not really an option in the same way that the V100 is. Platforms are all exorbitantly expensive and locked down/proprietary. The whole basis of this thread is about the relatively open/standard nature of the AOM-SXMV and the ability to use them on any normal platform, and that’s what I’m planning to do.

AmengLu · Mar 13, 2024

Is there anyone use ONLY 8611 cables to connect amo-sxmv to motherboard? Now I have two J-pcie to standard pcie cables(designed by RGL, his name even is printed on the riser board), but they only supply 32 pcie lines. I want to get more lanes by take advantage of 8611, but I don't know whather it works.

gsrcrxsi · Mar 13, 2024

AmengLu said:
Is there anyone use ONLY 8611 cables to connect amo-sxmv to motherboard? Now I have two J-pcie to standard pcie cables(designed by RGL, his name even is printed on the riser board), but they only supply 32 pcie lines. I want to get more lanes by take advantage of 8611, but I don't know whather it works.

This was asked a few times earlier in the thread. It apparently doesn’t work with only the 8611 connection. You have to use the PCIe slots.

bayleyw · Mar 16, 2024

gsrcrxsi said:
I’m not sure what your point is anymore. You were initially pushing for the even older P100 on the basis of cost. Now you’re saying V100 isn’t good cause it’s too old and will lose support? When you don’t even know my workload? lol. My workload won’t lose Volta support for the foreseeable future. Not as long as Nvidia supports it in their drivers. I’ve given this a lot of thought for my specific use case.

my work doesn’t need the VRAM where an A100 is warranted. And the cost is still out of reach to replace my fleet with equal power of A100s. I have 19x Titan Vs now. I can replace them all with V100s for less than the market price of the Titan Vs…

the A100 is fast, I’ve tested it on cloud instances. But price still too high, and not as much faster than V100 than the flops increase (along with some computational efficiency boost by being able to run 2-3 tasks at the same time with MPS). It’s maybe 2-3x faster at 30x the cost. And SXM versions are not really an option in the same way that the V100 is. Platforms are all exorbitantly expensive and locked down/proprietary. The whole basis of this thread is about the relatively open/standard nature of the AOM-SXMV and the ability to use them on any normal platform, and that’s what I’m planning to do.

you were saying that you wanted Volta because it had tensor cores that you might use in the future, I'm arguing that by the time you get around to using the tensor cores they might not be supported any more. if you don't need tensor cores P100 is a very cost effective way to get fp64, if you plan on using them there are cost effective ways to get tensor cores that have good support roadmaps and much richer feature sets.
you could build a fleet of 4x SYS-4028GR with 32 P100's good for 150 tflops fp64 in a neat 16U package for $8K all in, which has features like 'pcie switches' and 'supported by the vendor'. V100 is really all about the tensor cores and if you're not using them *right now* they are just not a big leap over P100.

gsrcrxsi · Mar 16, 2024

bayleyw said:
you were saying that you wanted Volta because it had tensor cores that you might use in the future, I'm arguing that by the time you get around to using the tensor cores they might not be supported any more. if you don't need tensor cores P100 is a very cost effective way to get fp64, if you plan on using them there are cost effective ways to get tensor cores that have good support roadmaps and much richer feature sets.
you could build a fleet of 4x SYS-4028GR with 32 P100's good for 150 tflops fp64 in a neat 16U package for $8K all in, which has features like 'pcie switches' and 'supported by the vendor'. V100 is really all about the tensor cores and if you're not using them *right now* they are just not a big leap over P100.

I run a wide variety of workloads. One of which does use tensors right now. And won’t lost support for Volta.

the other consideration is power draw. My loads are 24/7 not intermittent. Volta is a lot more power efficient across the board. One of my loads, the efficiency of my Titan Vs is on par with 4090s. (Half the throughput, but half the power draw).

Volta checks all the boxes for my use cases. I have 19x Titan Vs now, and happy with them. I’m interested in this SXMV setup because I can get into them cheaper than the TitanVs, and gain some extra VRAM in the process.

bayleyw · Mar 17, 2024

does Volta actually show significant efficiency gains over Pascal? TSMC 12nm was an optimized version of TSMC 16nm, not a shrink, so I would expect 2x P100 at 150W each to outperform 1x V100 at 300W.

gsrcrxsi · Mar 18, 2024

except, my Volta cards rarely use more than ~120W, and yes they outperform Pascal P100s

from my testing with V100 rentals, the V100 doesn't really use more power than the Titan V under the same loads, despite having the 300W TDP. boost clocks are only slightly higher than Titan V, and those low clocks are the biggest reason for the good efficiency. under compute loads, the titan V caps at 1335MHz core clock, and will only go higher with higher power draw under 3D applications (games). you CAN unlock the clocks to allow the card to boost to the power limit (~1800MHz, depends on TDP), but this doubles power draw for ~5% performance gain. not worth it, just throwing power away.

resham · Mar 18, 2024

DieBlub said:
Sorry to interject. But looking at the SXM2 PCBs it seems like a quite straight forward conversion to PCIe. SXM3+ is a different story, as they also changed the voltage. Has anyone ever found any pinout diagrams for SXM2/3? I searched for a couple of hours but no dice.

ownCloud

files.opencompute.org

Without cadence (i think), kicad will work. Schematic and Gerbers will be helpful.

I tried to get x1 instead x16 working only with bare minimum pins (12V, GND, RST, PRSNT,PWR_GOOD -- CLK and RX/TX 1.). No heartbeat from the V100 so far. If you get it working please share.

Anyone who might be knowledgable, is it even possible to get this running at X1 to test before sending out the PCB with other lanes ?

gsrcrxsi · Mar 23, 2024

Oh sweet. It works

FIIZiK_ · Mar 24, 2024

@gsrcrxsi I see in your latest photos you haven't used oculink? Is it not needed with the 2x pcie connected? From my understanding of the rest of the page, it was needed either way.

Also did you end up using the PCIE cables for Power or ESP, there was a convo before about what was the correct one to use. Is your 1600w EVGA enough for it? Thinking of doing something similar but with 2x EPYC 7601(128 threads total and 256GB Ram)

Separately, how's the performance? Could get the 2x adapter cables, 4x V100, the supermicro SMX2 motherboard and the powersupply for 1500$. Still worth it today for Stable Diffusion and Transformers? It's pretty much around 1x used 4090 in terms of pricing.

gsrcrxsi · Mar 24, 2024

FIIZiK_ said:
@gsrcrxsi I see in your latest photos you haven't used oculink? Is it not needed with the 2x pcie connected? From my understanding of the rest of the page, it was needed either way.

Also did you end up using the PCIE cables for Power or ESP, there was a convo before about what was the correct one to use. Is your 1600w EVGA enough for it? Thinking of doing something similar but with 2x EPYC 7601(128 threads total and 256GB Ram)

Separately, how's the performance? Could get the 2x adapter cables, 4x V100, the supermicro SMX2 motherboard and the powersupply for 1500$. Still worth it today for Stable Diffusion and Transformers? It's pretty much around 1x used 4090 in terms of pricing.

you don’t need the oculink cables at all. They aren’t used for PCIe communication with the host PC on this platform. Best I could tell, they are used for network access to the GPUs from other hosts, but I don’t think it’s as standard as the direct PCIe connections are. I don’t plan to use this so it’s not needed.

the power connections are EPS, not PCIe/VGA power. I’m using some 2xVGA->EPS adapters since having lots of VGA power is more common than 4x EPS.

the 1600W PSU is enough in my opinion, and especially since under my loads the GPUs only pull like 150-200W each.if your load will max out the GPU TDP of 300W, you should plan to use a dedicated PSU for the GPU board. Right now I have them hooked up with an HP 1200W server PSU and a mining style breakout board to drive 8x PCIe/VGa connections for the board. The motherboard is running on the 1600W PSU pictured

I can’t give an opinion on performance as I don’t use stable diffusion. They perform like V100s perform.

This is just a temporary setup for now to make sure it all works.

Underscore · Mar 24, 2024

FIIZiK_ said:
@gsrcrxsi I see in your latest photos you haven't used oculink? Is it not needed with the 2x pcie connected? From my understanding of the rest of the page, it was needed either way.

Also did you end up using the PCIE cables for Power or ESP, there was a convo before about what was the correct one to use. Is your 1600w EVGA enough for it? Thinking of doing something similar but with 2x EPYC 7601(128 threads total and 256GB Ram)

Separately, how's the performance? Could get the 2x adapter cables, 4x V100, the supermicro SMX2 motherboard and the powersupply for 1500$. Still worth it today for Stable Diffusion and Transformers? It's pretty much around 1x used 4090 in terms of pricing.

Note: A current seller on eBay sold 3x V100 to me for $165e (I offered $150e then we settled on $165e). Open box. I'd take advantage of that before they sell out, cheap and brand new. Guy restocked after I bought 3, so I bet the "4 available"'s just so no one super bulk buys.

Also, most StableDiffusion XL models only need like 13GB of VRAM, so speed-wise they should be double a 4090, but with a buncha unused memory.

mtg · Mar 24, 2024

Underscore said:
Also, most StableDiffusion XL models only need like 13GB of VRAM, so speed-wise they should be double a 4090, but with a buncha unused memory.

Double a 4090 on stable diffusion from a v100??? Raw Performance Ranking of GPUs

gsrcrxsi · Mar 24, 2024

mtg said:
Double a 4090 on stable diffusion from a v100??? Raw Performance Ranking of GPUs

I think he means that all 4 of them would be double a single 4090

FIIZiK_ · Mar 24, 2024

@gsrcrxsi thanks for the reply. thinking of giving it a go around mid april. I already got a PC with a 4090 so just thinking if it would be worth it at the end of the day.

@Underscore yeah found 4x V100 for £177 each with tax and shipping included. Could get that down to £150 maybe, which would be $190.
But yeah I'm doing transformers more than stable diffusion, which is the reason I would need the vram as 16GB and 24GB can be easily filled up. For stable diffusion 16GB is already enough.

@mtg look at the following benchmarks, if the performance scales linearly (which is questionable), then 4x V100 = 2x 4090 but with 64GB HBM2 instead of 48GB GDDR6X. In mixed Precision it can arrive at 2.5x 4090s (also running mixed precision).

WhatsApp Image 2024-03-24 at 16.29.37.jpeg

WhatsApp Image 2024-03-24 at 16.27.40.jpeg

CyklonDX · Mar 24, 2024

Underscore said:
most StableDiffusion XL models only need like 13GB of VRAM

Use comfyui, there's been plenty of optimizations on the platform. You could make due with lot less vram.

In terms of soldering more ram onto Ampere or Ada as of now its impossible. (there's been some chinese selling 3080/90's with double the ram - but as far as i know they didn't share the changes & special bios.) You can easily get 2080Ti upgraded to 48G (though the performance is lacking - money spent on upgrade + inital cost of the gpu will take it close to A4000 that will outperform it in AI in most cases.)

gsrcrxsi · Mar 24, 2024

@CyklonDX since you have experience with this board. Do you know what the LEDs near the NVLink heatsinks mean? I couldn’t find any documentation about that

I’m noticing that my setup is having a strange issue. Two of the GPUs will randomly drop after some random amount of time. And it seems the computation on those two GPUs really drags after a while.

on the board, initially at boot all 4 LEDs light up, then one of them shuts off as it’s booting the OS. Is this normal or should all 4 LEDs be on all the time?

trying to narrow down if the problem might be the board I got, or maybe the motherboard or the riser cables.

FIIZiK_ · Mar 24, 2024

Anyone knows if SXM3 heatsinks are compatible with SXM2 (V100s)? I found some SXM3 heatsinks for about 30% cheaper than SXM2. Looking online at PCB images, it's difficult to understand if the mounting holes align as the overall scale/dimensions could be different.
Has anyone tested this?

SXM2 over PCIe

Active Member

Active Member

Active Member

New Member

Active Member

Active Member

Active Member

Active Member

Active Member

New Member

Active Member

New Member

Active Member

New Member

New Member

Active Member

New Member

Well-Known Member

Active Member

New Member