Instability in 6x3090 workstation


pramaLLC

New Member
I am running a 6x3090 workstation using six server power supplies and breakout boards. The power supplies are rated at 800 W each; every GPU gets its own supply with a breakout board and is fed by four 6-pin PCIe cables. The computer runs off four 1800 W enterprise double-conversion UPSes, each on its own 120 V 20 A circuit. I mainly use the machine for training image segmentation models, where it crashes roughly every 24 hours of training. When rendering in Blender it crashes every few minutes. What might be causing this? Let me know if you need more information.

Parts list:
6x NVIDIA RTX 3090 Founders Edition
AMD Ryzen Threadripper PRO 3955WX
ASUS Pro WS WRX80 SAGE SE motherboard
256 GB (8x32 GB) DDR4-2133 ECC RAM
be quiet! Straight Power 12 1200 W
6x HP 754381-001 800 W server power supply
X21 breakout boards from Parallel Miner
 

pramaLLC

New Member
> Why do you pinpoint crash to PSUs in the first place?
I believe it is the PSUs because it's not a software-specific issue: I get instability in PyTorch and in Blender. I assume that points at the power delivery, but maybe I'm missing something. My theory is that a large amount of current spikes very quickly (from roughly 50 W idle per GPU to 350 W, so about 1800 W total, within a few milliseconds) and drains the capacitors. I ran the system on consumer PSUs previously and had even more instability, and consumer PSUs have smaller capacitors. Let me know what you think.
 

pimposh

hardware pimp
> that 1200W PSU ain't cutting it for 6x3090s, you'd need at least 1600W.
OP stated six separate PSUs, 800 W each, one per GPU. Or did I misunderstand?
PCI Express slots can by design deliver up to 75 W. With six GPUs, even going with an unrealistic maximum slot draw, that is 75 W x 6 = 450 W. Since the motherboard PSU is 1200 W, there is still plenty of headroom. Hence my question why the PSUs would be the issue here.
 

Wasmachineman_NL

Wittgenstein the Supercomputer FTW!
> OP stated six separate PSUs, 800 W each, one per GPU. Or did I misunderstand?
> PCI Express slots can by design deliver up to 75 W. With six GPUs, even going with an unrealistic maximum slot draw, that is 75 W x 6 = 450 W. Since the motherboard PSU is 1200 W, there is still plenty of headroom. Hence my question why the PSUs would be the issue here.
3090s don't pull 75W each, you're looking at ~300W each during inference.
 

T_Minus

Build. Break. Fix. Repeat
> might just be that one of your GPUs is bad. Try removing all but one and see if the instability persists.
Could be, but I sure as hell wouldn't want to troubleshoot 7 PSUs and 6 PDUs in a hacked DIY build.

As well as one GPU being bad, it could also be one PSU that's not holding voltage. It's easy enough to check a single PSU's voltage under load, but a lot more time consuming to test six PSU/PDU combos. Could also be a bad cable, a loose cable, or similar.

Too many failure points in this DIY build, with so many PSUs and PDUs from the datacenter pressed into a DIY case.

note - I don't mean hacked as a negative thing here, just noting that it's a very abnormal, robust DIY build :)
 

senso

New Member
How are you tying all those grounds together? That must be noise city in there...

I would star-ground all the PSUs to the chassis: grab two 0 V wires from each PSU and connect them all with a crimped lug to the case.
Then there is the matter of those PDUs as well; I would take them out of the loop for starters.

And make sure you are using both phases (hots) so you don't sag a 110 V leg into brownout territory.
 

CyklonDX

Well-Known Member
Test each GPU separately, then test all GPUs with just a single memory stick.

Record all the data to a file using HWiNFO (since you are crashing, make sure logging is on so you can check the file afterwards).

[attached screenshot: HWiNFO logging settings]

What's the mobo you are using? Does it have enough wattage per slot, or are you using external boards with their own power?
A GPU crash typically won't cause the whole system to crash. Also make sure you run UPSes: with so many PSUs and loads this big you can create a lot of interference on a single power line, causing instability. If you aren't willing to spend much, at the very least put a UPS on your mobo/CPU.
 

pramaLLC

New Member
> Could be, but I sure as hell wouldn't want to troubleshoot 7 PSUs and 6 PDUs in a hacked DIY build.
>
> As well as one GPU being bad, it could also be one PSU that's not holding voltage. It's easy enough to check a single PSU's voltage under load, but a lot more time consuming to test six PSU/PDU combos. Could also be a bad cable, a loose cable, or similar.
>
> Too many failure points in this DIY build, with so many PSUs and PDUs from the datacenter pressed into a DIY case.
>
> note - I don't mean hacked as a negative thing here, just noting that it's a very abnormal, robust DIY build :)
I will probably just pick up a load tester and check each server PSU and PDU combo. I appreciate the advice.
 

pramaLLC

New Member
> Test each GPU separately, then test all GPUs with just a single memory stick.
>
> Record all the data to a file using HWiNFO (since you are crashing, make sure logging is on so you can check the file afterwards).
>
> What's the mobo you are using? Does it have enough wattage per slot, or are you using external boards with their own power?
> A GPU crash typically won't cause the whole system to crash. Also make sure you run UPSes: with so many PSUs and loads this big you can create a lot of interference on a single power line, causing instability. If you aren't willing to spend much, at the very least put a UPS on your mobo/CPU.
I am running a WRX80 SAGE SE WIFI mobo. It provides 75 W to each slot, and the rest comes from the server PSUs on breakout boards. Good to know that a GPU crash typically won't take the whole system down - I had just assumed it would work that way if the GPUs saw voltages outside a preset range (as in, the voltage dropping when a spike drains the capacitors). I will use HWiNFO when I get back to running the system; I have everything pulled apart right now, but new power supplies and a load tester are on their way so I can troubleshoot this.

I am running four double-conversion enterprise UPSes. This means they don't operate on a switching mechanism like most line-interactive ones do: they fully convert 120 VAC to DC and then back to 120 VAC, so there should be no interference. The only concern in my mind is that they are not synchronized with each other, but that costs too much to deal with, so I figure it's worth ignoring. I'll check out HWiNFO and let you know. Thanks.
 

pramaLLC

New Member
> What does "crash" mean here? What happens, specifically?
I should have specified this. The computer functions properly right up until the screen goes black and the monitor indicates it is looking for a signal; after about 10 seconds the computer displays the Linux lock screen. The GPUs' lights also go off during those 10 seconds. Let me know if you have any more questions.
 

unwind-protect

Active Member
The lockscreen? That's unusual for a crash.

Is the monitor connected to one of the cruncher GPUs?

What happens when you re-authorize on the lockscreen?

Anything in `dmesg`?
 

pramaLLC

New Member
> The lockscreen? That's unusual for a crash.
>
> Is the monitor connected to one of the cruncher GPUs?
>
> What happens when you re-authorize on the lockscreen?
>
> Anything in `dmesg`?
Yes, the lockscreen is odd. The monitor is connected to a GPU that is being used by the programs we are running. Once you enter credentials on the lockscreen, the screen is blank and you have to re-open whatever program you were in, e.g. Blender or VS Code. I have not checked dmesg before; I'll take a look.