SXM2 over PCIe


gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
My board was about $200 from China and that seemed normal across many sellers, I’d love to know where I can get it for $130 since I’m trying to buy like 4-5 more.

My heatsinks were about $15 each from China, so that tracks.

Shipping from China for the board, 4 heatsinks, and 4 waterblocks was less than $100 for 3-day air mail. It would have been a lot cheaper (like $20) by boat, but I just didn’t want to wait a month to get it.

I’m running mine on air now and it’s really not loud at all. I’m using some beefy 80x38mm fans I had lying around for testing, just zip-tied directly to the heatsinks, and the board seems to run them at fairly low speed since it doesn’t need more. The GPUs are in the 40-50C range. I will water-cool them in the future since I bought the blocks too (I want the heat load to be more mobile), but I don’t see why you couldn’t use some Noctuas or something instead, or 3D print some shrouds to duct air from some 120mm Noctua iPPCs. It’s not that much of an issue IMO. Since my uses are fairly low power and I’m power limiting for efficiency and power-draw purposes anyway, I’m just gonna strap some normal 80x25mm fans on for air-cooled setups and let temps sit at 60-70C. This might change if your workload uses all 300W of the TDP, but mine doesn’t.
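For reference, power limiting is done with nvidia-smi’s `-i` (GPU index) and `-pl` (power limit in watts) flags; the little wrapper below is just a hypothetical sketch around them, not anything board-specific.

```python
import subprocess

def power_limit_cmd(gpu_index, watts):
    """argv for capping one GPU's power limit via nvidia-smi (needs root)."""
    # -i selects the GPU index, -pl sets the limit in watts; both are
    # standard nvidia-smi flags. The wrapper itself is illustrative.
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

def apply_power_limits(watts, gpu_indices=(0, 1, 2, 3), dry_run=True):
    """Build (and optionally run) the power-cap command for each GPU."""
    cmds = [power_limit_cmd(i, watts) for i in gpu_indices]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # requires nvidia-smi on PATH
    return cmds

print(apply_power_limits(200))  # dry run: show what would be executed
```

Run with `dry_run=False` to actually apply the caps.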
 

RGL

New Member
Oct 24, 2020
8
10
3
@CyklonDX, since you have experience with this board: do you know what the LEDs near the NVLink heatsinks mean? I couldn’t find any documentation on them.

I’m noticing that my setup is having a strange issue. Two of the GPUs will randomly drop after some random amount of time. And it seems the computation on those two GPUs really drags after a while.

on the board, initially at boot all 4 LEDs light up, then one of them shuts off as it’s booting the OS. Is this normal or should all 4 LEDs be on all the time?

trying to narrow down if the problem might be the board I got, or maybe the motherboard or the riser cables.
The LEDs next to the PLX are the PLX upstream and downstream LEDs. One PLX has one upstream port and three downstream ports. During normal operation, three LEDs on each side are always on. If the upstream LED goes out during operation, it means there is a problem with the adapter cable. If the downstream LED goes out, it means there is a problem with the AOM-SXMV or graphics cards. If they all go out at the same time, it means there is overheating or a power supply problem with the PLX.
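The LED diagnosis can be cross-checked from software: on Linux, `sudo lspci -vv` prints a `LnkCap:` (capability) and `LnkSta:` (negotiated state) line per PCIe device, and a link that trained below its capability points at the same cable/contact problems. A rough checker, with parsing based on the common lspci text format (an assumption, not anything AOM-SXMV-specific):

```python
import re

def find_downgraded_links(lspci_output):
    """Scan `lspci -vv` text and report devices whose negotiated link
    (LnkSta) is below their advertised capability (LnkCap)."""
    # usage: find_downgraded_links(subprocess.check_output(["lspci", "-vv"], text=True))
    downgraded = []
    device = None
    cap = None
    for line in lspci_output.splitlines():
        # A device header starts at column 0, e.g. "03:00.0 3D controller: ..."
        if line and not line[0].isspace():
            device, cap = line.split()[0], None
        m = re.search(r"LnkCap:.*Speed (\S+), Width x(\d+)", line)
        if m:
            cap = (m.group(1), int(m.group(2)))
        m = re.search(r"LnkSta:.*Speed (\S+)(?: \(downgraded\))?, Width x(\d+)", line)
        if m and cap:
            sta = (m.group(1), int(m.group(2)))
            if sta != cap:
                downgraded.append((device, cap, sta))
    return downgraded
```

An empty result means every link trained at full speed and width; any entry it returns names the device plus its capable vs. actual link.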
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
My board was about $200 from China and that seemed normal across many sellers, I’d love to know where I can get it for $130 since I’m trying to buy like 4-5 more.

My heatsinks were about $15 each from China, so that tracks.

Shipping from China for the board, 4 heatsinks, and 4 waterblocks was less than $100 for 3-day air mail. It would have been a lot cheaper (like $20) by boat, but I just didn’t want to wait.

I’m running mine on air now and it’s really not loud at all. I’m using some beefy 80x38mm fans I had lying around for testing, just zip-tied directly to the heatsinks, and the board seems to run them at fairly low speed since it doesn’t need more. The GPUs are in the 40-50C range. I will water-cool them in the future since I bought the blocks too (I want the heat load to be more mobile), but I don’t see why you couldn’t use some Noctuas or something instead, or 3D print some shrouds to duct air from some 120mm Noctua iPPCs. It’s not that much of an issue IMO. Since my uses are fairly low power and I’m power limiting for efficiency and power-draw purposes anyway, I’m just gonna strap some normal 80x25mm fans on for air-cooled setups and let temps sit at 60-70C. This might change if your workload uses all 300W of the TDP, but mine doesn’t.
ok, thanks. So about $1200 all in for the GPUs and trimmings, not counting the case, or $1900 if you add the waterblocks (are there any cheaper than the Bykskis?)
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
The LEDs next to the PLX are the PLX upstream and downstream LEDs. One PLX has one upstream port and three downstream ports. During normal operation, three LEDs on each side are always on. If the upstream LED goes out during operation, it means there is a problem with the adapter cable. If the downstream LED goes out, it means there is a problem with the AOM-SXMV or graphics cards. If they all go out at the same time, it means there is overheating or a power supply problem with the PLX.
Thanks! That’s helpful. Can you relate the specific LED numbers to which are upstream and downstream?

on the right side of the board I have
LED5
LED4
LED3
LED2

and on the left side I have
LED9
LED8
LED7
LED6

at first boot up, all 8 LEDs illuminate, but once it posts, it drops to just [5,4,*,2] and [9,8,*,6] on, with 3 and 7 off. Is this the “normal” situation?

when I had the dropout, I think it was LED8 turning off, and LED6 would be flashing. I never had a situation where all the LEDs turned off. And I could tell there was a problem before the PLX dropped out because computation would be really slow for the GPUs on the left side
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
ok, thanks. So about $1200 all in for the GPUs and trimmings, not counting the case, or $1900 if you add the waterblocks (are there any cheaper than the Bykskis?)
the Bykskis are the only standardized consumer waterblocks I found. They are considerably cheaper from China; I think it was around $60 each. The US market seems sold out anyway.
 

RGL

New Member
Oct 24, 2020
8
10
3
Thanks! That’s helpful. Can you relate the specific LED numbers to which are upstream and downstream?

on the right side of the board I have
LED5
LED4
LED3
LED2

and on the left side I have
LED9
LED8
LED7
LED6

at first boot up, all 8 LEDs illuminate, but once it posts, it drops to just [5,4,*,2] and [9,8,*,6] on, with 3 and 7 off. Is this the “normal” situation?

when I had the dropout, I think it was LED8 turning off, and LED6 would be flashing. I never had a situation where all the LEDs turned off. And I could tell there was a problem before the PLX dropped out because computation would be really slow for the GPUs on the left side
LED[5,4,*,2] and [9,8,*,6] on with 3 and 7 off is normal. LEDs 2 and 6 are upstream and the rest are downstream. LED8 turning off means that one graphics card is lost; LED6 flashing means that the upstream link has downgraded from PCIe gen3 to gen2. The loss of a graphics card may be caused by poor contact in the SXM slot, and PCIe degradation may be caused by poor contact in the PCIe slot or excessive PCIe signal loss.
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
LED[5,4,*,2] and [9,8,*,6] on with 3 and 7 off is normal. LEDs 2 and 6 are upstream and the rest are downstream. LED8 turning off means that one graphics card is lost; LED6 flashing means that the upstream link has downgraded from PCIe gen3 to gen2. The loss of a graphics card may be caused by poor contact in the SXM slot, and PCIe degradation may be caused by poor contact in the PCIe slot or excessive PCIe signal loss.
Thanks again for the detailed info.

I shuffled the GPUs around and remounted them several times. Same results every time, and always with the same GPU positions (3 and 4 on the board) regardless of which actual SXM2 module was installed there. And when it dropped, I would lose all GPUs and the whole system essentially locked up.

the problem has totally gone away after connecting the left-side PCIe cable to one of the x8 PCIe slots on the motherboard, though. It has run for about 24hrs non-stop without any slowdowns, PCIe errors, or anything else.
 

RGL

New Member
Oct 24, 2020
8
10
3
Thanks again for the detailed info.

I shuffled the GPUs around and remounted them several times. Same results every time, and always with the same GPU positions (3 and 4 on the board) regardless of which actual SXM2 module was installed there. And when it dropped, I would lose all GPUs and the whole system essentially locked up.

the problem has totally gone away after connecting the left-side PCIe cable to one of the x8 PCIe slots on the motherboard, though. It has run for about 24hrs non-stop without any slowdowns, PCIe errors, or anything else.
This may be due to different loss values of different PCIe slots on the motherboard, or oxidation of the contacts of the PCIe slot.
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
This may be due to different loss values of different PCIe slots on the motherboard, or oxidation of the contacts of the PCIe slot.
I’m going to do some more in-depth testing of the PCIe riser cables.
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
do you get any MCEs in dmesg before the devices drop off the bus? If so, that's definitely telltale of bad PCIe signaling.
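A quick way to fish those out of the kernel log is to filter `dmesg` output for the usual error markers. The match list below is a rough set of substrings, not an exhaustive one; exact message wording varies by kernel version.

```python
import re

# Substrings that commonly flag PCIe/machine-check trouble in the kernel log.
# Treat this as a starting point, not an exhaustive list.
SUSPECT = re.compile(r"AER|Machine check|mce:|Hardware Error|PCIe Bus Error")

def suspect_lines(log_text):
    """Return the kernel-log lines that look like PCIe or machine-check errors."""
    # usage: suspect_lines(subprocess.check_output(["dmesg"], text=True))
    return [ln for ln in log_text.splitlines() if SUSPECT.search(ln)]
```

If this turns up corrected AER errors well before a device drops, that is already a sign the link is marginal.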
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
So I took the nuclear option and moved the whole GPU board to a different motherboard and setup: from my EPYCD8 motherboard (which has had issues with PCIe in the past, but worked fine with 4x Titan V in slots 1/3/5/7) to my X99-based test bench. Though the test bench has its own quirks, being the ridiculous ASUS X99-E-10G WS with two PLX 8747s and some QSWs running the PCIe slots.

I plugged the AOM-SXMV into the X99 board in slots 7 and 5, with a Titan V in slot 1 for video out. These slots should all register x16 in this configuration according to the manual. Connected with the same 2x Silverstone RC-04B 40cm risers I was using on the EPYCD8 system previously. Everything is working fine now and even crunching a bit faster. No signs of issues, but I will let it run another 24hrs.

I also tried some GLOTRENDS 60cm risers, but they are garbage: immediate PCIe errors when used on this board.

Maybe @RGL can give some insight. I know he previously mentioned that it was hard to find good risers, but is the AOM-SXMV especially sensitive to signal issues? Like, even more than a GPU is? I've used tons of cheap sketchy risers with relative success on PCIe 3.0 in lengths from 20cm-40cm, but it seems like the AOM-SXMV needs even tighter signaling, beyond what PCIe gen3 risers tend to get away with for GPUs. I'm trying not to have to go with expensive gen4 risers. Maybe I'll grab the ones from RGL if he can source some that plug straight into the motherboard rather than from the back of the I/O panel, and maybe in lengths longer than 40cm.
 

RGL

New Member
Oct 24, 2020
8
10
3
So I took the nuclear option and moved the whole GPU board to a different motherboard and setup: from my EPYCD8 motherboard (which has had issues with PCIe in the past, but worked fine with 4x Titan V in slots 1/3/5/7) to my X99-based test bench. Though the test bench has its own quirks, being the ridiculous ASUS X99-E-10G WS with two PLX 8747s and some QSWs running the PCIe slots.

I plugged the AOM-SXMV into the X99 board in slots 7 and 5, with a Titan V in slot 1 for video out. These slots should all register x16 in this configuration according to the manual. Connected with the same 2x Silverstone RC-04B 40cm risers I was using on the EPYCD8 system previously. Everything is working fine now and even crunching a bit faster. No signs of issues, but I will let it run another 24hrs.

I also tried some GLOTRENDS 60cm risers, but they are garbage: immediate PCIe errors when used on this board.

Maybe @RGL can give some insight. I know he previously mentioned that it was hard to find good risers, but is the AOM-SXMV especially sensitive to signal issues? Like, even more than a GPU is? I've used tons of cheap sketchy risers with relative success on PCIe 3.0 in lengths from 20cm-40cm, but it seems like the AOM-SXMV needs even tighter signaling, beyond what PCIe gen3 risers tend to get away with for GPUs. I'm trying not to have to go with expensive gen4 risers. Maybe I'll grab the ones from RGL if he can source some that plug straight into the motherboard rather than from the back of the I/O panel, and maybe in lengths longer than 40cm.
Theoretically the AOM-SXMV is not sensitive to signal issues, but PCIe signal loss is cumulative across the system. Most of the loss comes from the connectors and the length of the cable. There are two PCIe connectors and one JPCIE connector in this system, so there is not much acceptable loss left for the cable. I am designing a solution with a retimer, which can solve this problem very well and provide a maximum cable length of about 1.5m, but it may not be mass-produced because the price of the V100 SXM has increased and demand has decreased.
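The cumulative-loss point can be put in rough numbers. PCIe gen3 channel budgets are often quoted around 22 dB at the 4 GHz Nyquist frequency of 8 GT/s; every per-segment figure below is an illustrative placeholder, not a measurement of the AOM-SXMV or any specific cable.

```python
# Back-of-the-envelope PCIe gen3 loss budget, in dB at 4 GHz.
# All segment values are illustrative placeholders, not measured data.
BUDGET_DB = 22.0           # rough total channel budget often quoted for gen3

segments = {
    "host board traces":    3.0,
    "PCIe slot connector":  1.5,
    "riser cable (40 cm)":  6.0,   # scales roughly with length
    "riser connector":      1.5,
    "adapter board traces": 3.0,
    "JPCIE connector":      1.5,
}

total = sum(segments.values())
margin = BUDGET_DB - total
print(f"total loss {total:.1f} dB, margin {margin:.1f} dB")
# Doubling the riser to 80 cm would add roughly another 6 dB and push the
# channel past the budget.
```

A retimer retrains the signal mid-channel, effectively starting a fresh budget on its far side, which is how it buys the much longer runs mentioned above.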
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
Naples CPU (or at least THIS particular CPU) seems to be the problem.

swapped in a Milan CPU, plugged everything up the same, even the same motherboard.

everything working normally. no errors, no slowdowns. running much faster than before even.
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
were the GPUs that were falling off the bus the ones connected to the slot further from the CPU by chance?
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
Yes and no.

yes because that’s usually the config I was running, but no because I tried both ways and I had the problem either way.
 

FIIZiK_

New Member
Nov 15, 2022
15
3
3
if you go by 'median trustworthy looking' eBay/Taobao pricing it is $750 for the GPUs, $300 for the board not counting shipping, and $200-ish for the heatsinks shipped from China (or $300 from the US), $1250 total.

also, I am not sure if folks here have actually used an SXM2-based system (I've dealt with them extensively in the past), but the 3U SXM2 nodes are *extremely loud* and designed for high-static-pressure front-to-back airflow. The DGX-1-class systems are basically vacuum cleaners and pull a significant vacuum at the front grille under high load.

now, water is an option but at $179 a block it's not cheap...
don't know exactly where you find your prices, but on Taobao the heatsinks I found were 15-20 USD each, the motherboard 130 USD, and the risers 60 USD. That's 250 USD; shipping is at worst 50 USD, so 300 total. GPUs with shipping included can be found locally (for me) and are around 750 USD (converted from my currency). That's 1050 USD total for everything.
In the US you can get V100s shipped for 600 USD (all 4 total), so 900 USD for the whole system.

As for the case, you are talking as if it's some unholy beast. Get a regular mini-ATX case for under 50 USD and drill some extra holes to fit everything in it. Bonus: you even have fan and PSU mounting. Better yet, you can create a 3D-printed bracket, move the PCIe ports to the back of the case where the PCIe slots are, do something similar to any PC you want, and now you have a GPU box to connect between the two with 2 short PCIe cables (male to male).
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
don't know exactly where you find your prices, but on Taobao the heatsinks I found were 15-20 USD each, the motherboard 130 USD, and the risers 60 USD. That's 250 USD; shipping is at worst 50 USD, so 300 total. GPUs with shipping included can be found locally (for me) and are around 750 USD (converted from my currency). That's 1050 USD total for everything.
In the US you can get V100s shipped for 600 USD (all 4 total), so 900 USD for the whole system.

As for the case, you are talking as if it's some unholy beast. Get a regular mini-ATX case for under 50 USD and drill some extra holes to fit everything in it. Bonus: you even have fan and PSU mounting. Better yet, you can create a 3D-printed bracket, move the PCIe ports to the back of the case where the PCIe slots are, do something similar to any PC you want, and now you have a GPU box to connect between the two with 2 short PCIe cables (male to male).
Post a link to a $130 AOM-SXMV on Taobao. I’ve looked and don’t really see anything. I see some board listings on Xianyu, but most sellers tell me they don’t have any in stock and only have the AOM-SXM2 board, which only supports the P100 and not the V100. Also, you can't go by the listing price most of the time unless the seller has confirmed it: most times they have a fake low price listed, but when you ask about it, they tell you it's actually higher.
 

bayleyw

Active Member
Jan 8, 2014
302
99
28
don't know exactly where you find your prices, but on Taobao the heatsinks I found were 15-20 USD each. The motherboard 130 USD and the risers 60 USD. That's 250 USD, shipping is at worse case scenario 50 USD, so 300 total. GPUs with shipping included can be found locally (For me) and are around 750 USD (converted from my currency). That's 1050$ total for everything.
In US you can get V100s shipped for 600 (all 4 total) so 900$ whole system.

As for the case, you are talking as if it's some unholy beast. Get a regular mini-atx case for under 50USD and drill some extra holes to fit everything it. Bonus, you even have fan and PSU mounting. Better yet, you can create a 3D printed bracket and move the PCIE ports to the back of the case where the PCIE slots are, and do something similar to any PC you want, and now you have. GPU box to connect between the 2 with 2 short pcie cables (male to male).
you're really underestimating the flakiness of Taobao sellers, shipping, and cooling (and also, apparently, the flakiness of the PCIe signals...)
 

gsrcrxsi

Active Member
Dec 12, 2018
303
103
43
PCIe has been fine ever since I swapped out the Naples CPU; I’m thinking something wrong with the CPU itself caused my problems there.

but yeah, I think FIIZiK is taking the listed price at face value. I contacted several sellers listing the AOM-SXMV for 1000 RMB (~138 USD) and they all first asked me what board I’m looking for, then quoted prices closer to 3000 RMB.

these are generic listings covering a wide variety of boards, and you have to confirm the price with the seller. The Chinese marketplaces are not as strictly structured or controlled as eBay or Amazon here; sellers do shadier things, like listing a really low price for attention.