SXM2 over PCIe


CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
Two of the GPUs will randomly drop after some amount of time
Check the temps on your PLX switches; the board was meant for high airflow.
It was built and designed for constant airflow with the PLX chips below 80°C, and likely has a 100-110°C max before shutdown.
(The PLX chips have aluminum heatsinks on them. Check whether you are losing PLX chips by running lspci when it happens; if that's the case, it means they are shutting down due to heat.)
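
A quick way to check (a sketch; 10b5 is the PLX/Broadcom PCI vendor ID and 10de is NVIDIA's):

    # count the PLX switch ports visible on the bus; run again after a drop
    lspci -d 10b5: | wc -l
    # and the GPUs themselves
    lspci -d 10de: | wc -l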

If you had PCIe errors on the PCIe links, you would see plenty of messages about it in your system logs.
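
For example (the exact strings vary by distro and kernel, so treat this as a sketch):

    # AER / link errors show up in the kernel ring buffer
    dmesg | grep -iE 'aer|pcie bus error|link is down'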

In terms of the LEDs I'm not sure, I never paid attention; the server was closed up before it was powered on.
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
Do you know if there is temperature reporting from the PLX chips? Is that obtainable in the OS via some driver?

But thanks for the tip. I’ll throw a fan on both the heat sinks and try it out. That should at least eliminate one variable if it ends up not being the issue.

I have at least isolated it to being something between GPU3 and GPU4 (left half of the board when viewed from the front). When I disconnect power to 3 and 4 and only run GPU1 and GPU2, everything seems fine.
 

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
Look for them here:
/sys/class/

temp*_input will contain the current temp of the device (but you would still need to find out what kind of device it is).
(screenshot attached)
Or set up Telegraf with InfluxDB and Grafana; that should make it easier to export all the values.
(Or you can give it a shot with the 'sensors' tool; sometimes it's capable of picking up stuff.)

I'd say it's 50/50: depending on the kernel, you may or may not be able to pick anything up.
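
Something like this will dump whatever the kernel does expose (a minimal sketch; temp*_input values are in millidegrees C):

    # walk every hwmon device, print its driver name and its temps
    for hw in /sys/class/hwmon/hwmon*; do
        echo "$hw: $(cat "$hw"/name)"
        cat "$hw"/temp*_input 2>/dev/null
    done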
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
@gsrcrxsi I see in your latest photos you haven't used OCuLink? Is it not needed with the 2x PCIe connected? From my understanding of the rest of the page, it was needed either way.

Also, did you end up using the PCIe cables for power, or EPS? There was a convo earlier about which was the correct one to use. Is your 1600W EVGA enough for it? I'm thinking of doing something similar but with 2x EPYC 7601 (128 threads total and 256GB RAM).

Separately, how's the performance? I could get the 2x adapter cables, 4x V100, the Supermicro SXM2 motherboard, and the power supply for $1500. Still worth it today for Stable Diffusion and transformers? It's pretty much at the price of 1x used 4090.
probably better off with 2x 3090 at the same price. quoting myself from earlier:

The problem with Volta is that it is on the verge of being dropped from mainstream frameworks and libraries. I would expect the sunsetting to begin when Blackwell launches, the GPU shortage somewhat eases, and the last V100 clusters get dismantled. There already aren't any V100s in any public clouds, which really limits the motivation to continue developing libraries, plus GV100 is missing so many mixed-precision features: it lacks support for int8, bf16, and fp8, and has much less SRAM than A100 or H100.

Libraries should continue to run for quite some time, but the Volta-Ampere performance gap is only going to grow over time: it started at 1.7x and is now almost 4x for transformer models (and yes, you could fix this by writing Volta-optimized libraries yourself, but that's a lot of work!)
and
FlashAttention2 was never ported to Volta (the SM layout is different), so for transformers you eat an additional 2x difference in performance. Unless your workload fits in 64GB but not 48GB, you are better off with 2x 3090 + NVLink, which supports the latest optimizations and costs about the same
also, training resnet50 at high batch size is not a good benchmark in 2024 and basically offers no information on what real-world performance is like on modern models: the network is (1) fully convolutional and (2) very small, which gives it a completely different compute footprint versus post-2022 transformer models

the real value king for AI right now is the 2080Ti-22G. for the same price as 4x V100 with trimmings you get 88GB of memory, quiet triple-fan coolers instead of the insane 3U SXM2 heatsinks, int8 on the tensor cores, and you can even play games on them.
 

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
probably better off with 2x 3090 at the same price. quoting myself from earlier:
The user was interested in fp64 loads. The only cards that can deliver more are AMD's MI210, MI250, and MI300, NVIDIA's H100, and the future Blackwell ones (potentially Intel's GPU Max 1100 too).

(to illustrate it, comparison chart attached)
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
The user was interested in fp64 loads. The only cards that can deliver more are AMD's MI210, MI250, and MI300, NVIDIA's H100, and the future Blackwell ones (potentially Intel's GPU Max 1100 too).
the user I was replying to is clearly interested in mixed-precision AI work (transformers and Stable Diffusion)... I did have a lively debate a few posts up with an fp64 user though :rolleyes:
 

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
the user I was replying to is clearly interested in mixed-precision AI work (transformers and Stable Diffusion)... I did have a lively debate a few posts up with an fp64 user though :rolleyes:
my apologies if that's the case.
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
somewhat off topic but speaking of mixed precision on a budget, check out these beauties (eBay link):

(listing photo attached)

for $165 you get 2x16 + 1x8 out of each socket at 3-slot spacing, which lets you build hives of 2x NVLink plus shared-memory communication to a NIC (not as good as RDMA, but at least you are on a single root). combine it with the triple-slot 2080Ti-22G and a mining case and you could have an air-cooled 88GB rig for somewhere between $2000 and $2200 depending on your choice of case and power supply. you also get linear scaling across at least pairs with the right framework for transformer models

I don't know how well these work for training (as opposed to LLM inference): it seems plausible that even without RDMA (but with NVLink) you could get usable performance scaling. you would need 2x nodes to do a full fine-tune of a 7B model, but given how accessible these are (runs off 110V with minimal engineering needed, possible to make quiet, air-cooled, etc.) it would be really cool if it worked
 

FIIZiK_

New Member
Nov 15, 2022
15
3
3
@bayleyw I mean, if we go off topic, at that point you can get any EPYC mobo and dual CPUs (for $700 you have 2x EPYC 7601 with the motherboard, 128 threads between the two, and up to ~1TB of RAM; 256GB would go for about $500 separately), and then you have 2x PCIe 3.0 x16 and 3x PCIe 3.0 x8, as long as you are happy with risers. And at that point you have a proper motherboard with iKVM, IPMI, OCuLink, and all the bells and whistles.

In fact I was running 2x 4090s on the mentioned mobo before moving things over to my main rig with a Ryzen 7950X, as I needed the IO of a Gen5 drive for what I was doing.

My hope is to wait for Blackwell to be released and hopefully get 4x V100s for $100 each. Heatsinks for $200, and the SXM2 mobo + PCIe cables are around $250. That would be $850 for a setup that can be used with any PC as long as it has 2 PCIe slots free.

Also, I don't know how you calculated that 4x V100s with trimmings = 4x 2080Tis. The GPU you linked is $425 each; if you order 4 you get 1% off, which is nothing, so at the end of the day you pay $1700. Compared to the V100s with trimmings, which should be $800-1000 depending on luck, that's almost half price. And you have HBM2, not GDDR6.
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
@CyklonDX thanks for the input. I'll poke around and see if I can find the temp somewhere.

I'm thinking it's not temp related specifically. I'm still seeing the problem with a fan directly on the heatsink. I thought it was workload dependent, but I had another failure with a different load too. I think something weird is happening with the NVLinks between GPUs when it hits a certain kind of calculation: something gets wonky with the two GPUs on the sketchy PLX chip (3&4, left side) and the calculation just drags on at about 1/2 to 1/3 the speed. It's like it's stuck in some kind of cyclical loop, jumping between the two GPUs on the bad side, just spinning its wheels. I even swapped the SXM2 modules around to put 1/2 in the 3/4 spots and vice versa; the problem stayed on GPU 3/4 running the "other" two modules, which were fine in the 1/2 spots. And the problem still happens when running only GPU3 or GPU4 alone.

The only PCIe errors in the kernel log are from when the device craps out. It does look like the PLX device dropping is the cause; it takes down two GPUs with it, and soon after the other two as well.

My hunch is that one of the NVLinks between 3&4 has a problem, and that might be why one LED is not lit, but I can't be sure since there's no documentation about what they indicate. Not sure if disabling NVLink would help that, but it seems impossible to disable NVLink anyway.
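
(For what it's worth, the driver can at least report per-link state and error counters; a sketch, assuming it behaves the same on these SXM2 modules:

    nvidia-smi nvlink --status   # link state per GPU
    nvidia-smi nvlink -e         # error counters per link

maybe that will show something on the bad side.)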

If the problem is truly in the one PLX, I'm not sure there's an easy/cheap way to solve it; board repair isn't cheap in the US.
 

FIIZiK_

New Member
Nov 15, 2022
15
3
3
@gsrcrxsi you've mentioned that you tried swapping the modules around; have you tried swapping the 2x PCIe cables around? That way you know whether it's a cable issue or a board issue.
 

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
@CyklonDX thanks for the input. I'll poke around and see if I can find the temp somewhere.
snip
sketchy PLX chip (3&4, left side) and the calculation just drags on at about 1/2 to 1/3 the speed.
snip
That sounds like the PLX overheating/failing and dropping its clocks, but check the power states of each card before they go dark; you can keep exporting nvidia-smi output to a file every couple of seconds, e.g. with 'watch'. It could be some kernel power-state bug (disabling ASPM may help if that's the case, but it does sound like the PLX chip overheating).
There's also the potential that you are unable to supply enough power to the second half of the board on the PSU rail it's connected to.
(The NVLink is handled by the PLX chips.)
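
Something along these lines (a sketch; the field names can be checked with nvidia-smi --help-query-gpu):

    # append power state, clocks, temp and draw to a csv every 2 seconds
    nvidia-smi --query-gpu=timestamp,index,pstate,clocks.sm,temperature.gpu,power.draw --format=csv -l 2 >> gpu_log.csv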
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
@gsrcrxsi you've mentioned that you tried swapping the modules around; have you tried swapping the 2x PCIe cables around? That way you know whether it's a cable issue or a board issue.
Yes, I swapped the PCIe cables; the problem stayed with GPU3-4. The cables aren't/weren't the issue.

That sounds like the PLX overheating/failing and dropping its clocks, but check the power states of each card before they go dark; you can keep exporting nvidia-smi output to a file every couple of seconds, e.g. with 'watch'. It could be some kernel power-state bug (disabling ASPM may help if that's the case, but it does sound like the PLX chip overheating).
There's also the potential that you are unable to supply enough power to the second half of the board on the PSU rail it's connected to.
(The NVLink is handled by the PLX chips.)
The PSU is a single-rail design. It's an EVGA 1600 T2, plenty stout enough for what I'm asking of it.
I also ran a dedicated 1200W HP server PSU, with the same results.

However, I think I just figured out the problem, or at least a workaround. I was about to give up, and I wanted to at least see what happens when I plug the x16 riser cable into an x8 slot: would it still recognize the GPUs, etc.? My workloads are not PCIe-bound enough to be impacted by ~Gen3 x4 bandwidth. So I moved the cable to one of the x8 slots (ASRock EPYCD8 board, Naples 7551 CPU), and boom, everything is back to normal on all 4 GPUs.

So either there's something sketchy with this board's x16 slot when running the PLX bridge on it, or something is wrong with lanes 9-16 coming from the PLX chip and using the x8 link cuts out the bad ones. I won't be able to confirm which is the issue until I plug this AOM-SXMV into another board to check again. If it is the leftmost PLX chip having the issue, and I can't run it on an x16 link for whatever reason, that sucks, but I can work with it for my purposes.

I'm still not convinced it's thermal related though. The other PLX chip, the one running GPU 1&2, is totally fine without any fan on it.
 

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
You can try getting a laser thermometer, or just feel the heatsinks.
Working at lower PCIe throughput does put less pressure on the PLX chip (it doesn't have to work as hard anymore; it spends half the time it previously needed to serve the link).

If we look at how a PLX chip works in the patent, you have a buffer and logic that translate the IO addresses; but in my experience the PLX acts more like a switch that serves each IO address range at a timed frequency, with logic abstracting the connected devices and serving them at the same timed frequency on each refresh in a daisy chain. Thus going from x16 to x8 cuts the workload on the PLX by half.


(Swapping the thermal interface may be a good idea; I think they used thermal tape on the PLXs. If you keep having problems and the chip does run significantly hotter, the board may need a reflow: some pads could have lost contact, and then the chip requires more power to keep functioning.)
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
Yeah, I've felt the heatsinks. Both only feel slightly warm to the touch with no fan (they get some airflow from the fan on the GPU heatsink), and putting a fan on the troublesome PLX made no difference in behavior despite the direct cooling.
 

CyklonDX

Well-Known Member
Nov 8, 2022
848
279
63
Yeah, I've felt the heatsinks. Both only feel slightly warm to the touch with no fan (they get some airflow from the fan on the GPU heatsink), and putting a fan on the troublesome PLX made no difference in behavior despite the direct cooling.
Then I think you may have a faulty board (I'd recommend reflowing). Check the power states, and see if disabling ASPM does anything.
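
For ASPM, a sketch assuming a GRUB-based distro:

    # check the current policy first
    cat /sys/module/pcie_aspm/parameters/policy
    # to disable: add pcie_aspm=off to GRUB_CMDLINE_LINUX in /etc/default/grub,
    # then run update-grub and reboot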
 

Underscore

New Member
Oct 21, 2023
6
0
1
at the end of the day you pay $1700
I think the upsides of the 2080Ti-22G setup are the rendering aspects (specifically ray tracing) and the Turing compatibility (for longer support). With 3x 22GB you can get 66GB for about $275 more than the 4x V100, though installing them will of course involve a few extra costs and workarounds due to the increased size of the cards. Of course, the fact that a 2080Ti is generally slower than a V100, added to the fact that you'd only get 3 instead of 4, does make it an iffy choice.

That would be $850 for a setup that can be used with any PC as long as it has 2 PCIe slots free.
Also note that you can install an x8x8 bifurcation riser for a measly €50 even if you have only one x16 slot free, quite easily thanks to the thinness of the ribbon cables. I haven't tested this myself, but I plan to once I get my new server board.
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
@bayleyw I mean, if we go off topic, at that point you can get any EPYC mobo and dual CPUs (for $700 you have 2x EPYC 7601 with the motherboard, 128 threads between the two, and up to ~1TB of RAM; 256GB would go for about $500 separately), and then you have 2x PCIe 3.0 x16 and 3x PCIe 3.0 x8, as long as you are happy with risers. And at that point you have a proper motherboard with iKVM, IPMI, OCuLink, and all the bells and whistles.

In fact I was running 2x 4090s on the mentioned mobo before moving things over to my main rig with a Ryzen 7950X, as I needed the IO of a Gen5 drive for what I was doing.

My hope is to wait for Blackwell to be released and hopefully get 4x V100s for $100 each. Heatsinks for $200, and the SXM2 mobo + PCIe cables are around $250. That would be $850 for a setup that can be used with any PC as long as it has 2 PCIe slots free.

Also, I don't know how you calculated that 4x V100s with trimmings = 4x 2080Tis. The GPU you linked is $425 each; if you order 4 you get 1% off, which is nothing, so at the end of the day you pay $1700. Compared to the V100s with trimmings, which should be $800-1000 depending on luck, that's almost half price. And you have HBM2, not GDDR6.
800 for the GPUs, plus the AOM-SXMV, risers, heatsinks, fans, and whatever contraption you build to mechanically hold it all together gets pretty close to 4x 2080Ti (the $150 V100s aren't repeatable right now), and I'd definitely pay a bit more to get 88GB instead of 64GB. for 60 bucks more you can get blower-style cards, which lets you pack 4 cards in anything with four PCIe slots, and the 88GB is meaningful, especially if you care about context length, which really starts eating memory at large model sizes

you don't want the EPYC 7601 as a machine learning host because the PCIe topology is utter chaos. from each socket you get 64 lanes, but in the form of 16 lanes per quadrant, so _any_ inter-device communication has to traverse the IF links, and communication between GPUs on different sockets has to traverse the IF links twice. the commercial EPYC-based ML systems put everything behind PCIe switches to avoid this problem, but that's not a luxury you get on a consumer rig
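
you can see it for yourself (a sketch; the legend is printed by the tool, where PIX/PXB means the path stays behind PCIe bridges and NODE/SYS means it crosses the fabric):

    # print the GPU interconnect matrix; on a 7601 expect NODE/SYS
    # between most pairs instead of PIX/PXB
    nvidia-smi topo -m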
 

FIIZiK_

New Member
Nov 15, 2022
15
3
3
800 for the GPUs, plus the AOM-SXMV, risers, heatsinks, fans, and whatever contraption you build to mechanically hold it all together gets pretty close to 4x 2080Ti (the $150 V100s aren't repeatable right now), and I'd definitely pay a bit more to get 88GB instead of 64GB. for 60 bucks more you can get blower-style cards, which lets you pack 4 cards in anything with four PCIe slots, and the 88GB is meaningful, especially if you care about context length, which really starts eating memory at large model sizes

you don't want the EPYC 7601 as a machine learning host because the PCIe topology is utter chaos. from each socket you get 64 lanes, but in the form of 16 lanes per quadrant, so _any_ inter-device communication has to traverse the IF links, and communication between GPUs on different sockets has to traverse the IF links twice. the commercial EPYC-based ML systems put everything behind PCIe switches to avoid this problem, but that's not a luxury you get on a consumer rig
You misunderstood. It's not $800 for the GPUs; it's $800 for the whole thing, besides the "contraption to hold everything together". So yes, that's a huge difference; in fact it's equal to 2x of the 2080Ti 22GB cards you tagged.

Right now you can get 2x PCIe cables for 65 USD, the AOM-SXMV for 130 USD, heatsinks for 70 (4 of them, 70 total), and V100s for around 600-700 USD depending on luck. The whole thing is 865 USD at the lowest.
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
You misunderstood. It's not $800 for the GPUs; it's $800 for the whole thing, besides the "contraption to hold everything together". So yes, that's a huge difference; in fact it's equal to 2x of the 2080Ti 22GB cards you tagged.

Right now you can get 2x PCIe cables for 65 USD, the AOM-SXMV for 130 USD, heatsinks for 70 (4 of them, 70 total), and V100s for around 600-700 USD depending on luck. The whole thing is 865 USD at the lowest.
if you go by 'median trustworthy-looking' eBay/Taobao pricing, it is 750 for the GPUs, 300 for the board not counting shipping, and 200ish for the heatsinks shipped from China (or 300 from the US): 1250 total.

also, I am not sure if folks here have actually used an SXM2-based system (I've dealt with them extensively in the past), but the 3U SXM2 nodes are *extremely loud* and designed for high-static-pressure front-to-back airflow. the DGX-1 class systems are basically vacuum cleaners and pull a significant vacuum at the front grille under high load

now, water is an option, but at 179 a block it's not cheap...