Supermicro H12SSL-i, 4x RTX 3090s failing under 100% power draw


Beta2099

New Member
May 9, 2024
My training machine runs Ubuntu 24.04 on an AMD EPYC 7282 and a Supermicro H12SSL-i motherboard, with 4TB of NVMe storage and 256GB of RAM. I have installed 4 GPUs in it, all variants of the NVIDIA RTX 3090 (Aorus, Zotac and Gigabyte).



- A Gigabyte AORUS XTREME RTX 3090 - max power draw of 420W. Connected to PCIe Slot 6. CUDA Device 0.

- A Gigabyte black RTX 3090 - max power draw of 370W. Connected to PCIe Slot 1. CUDA Device 2.

- A Gigabyte white RTX 3090 - max power draw of 370W. Connected to PCIe Slot 7. CUDA Device 1.

- A Zotac RTX 3090 - max power draw of 350W. Connected to PCIe Slot 3. CUDA Device 3.



I have two 1000W power supplies in this system. Power supply 1 feeds GPUs 0 and 1 only, plus one 8-pin connector to the motherboard. Power supply 2 feeds the motherboard, the CPU and GPUs 2 and 3, through one 8-pin connector to the motherboard and the main ATX power connector. My CPU has a max TDP of 280W.
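The split described above can be tallied per power supply. This is only a rough sketch using the max power figures listed for each card. It assumes all GPU power comes through the PCIe power cables, so it ignores the up to 75W a card may draw through its slot, the CPU and motherboard load on power supply 2, and any short spikes above the rated limits. The names GPU_MAX_W, GPU_PSU and psu_load are made up for this example.

```python
from itertools import combinations

# Max board power quoted in the post (watts), keyed by CUDA device index.
GPU_MAX_W = {0: 420, 1: 370, 2: 370, 3: 350}
# Which power supply feeds each GPU's power cables, per the post.
GPU_PSU = {0: 1, 1: 1, 2: 2, 3: 2}


def psu_load(active_gpus):
    """Sum the quoted max GPU draw per power supply for the active GPUs."""
    load = {1: 0, 2: 0}
    for gpu in active_gpus:
        load[GPU_PSU[gpu]] += GPU_MAX_W[gpu]
    return load


# Print the per-PSU GPU load for every 3-GPU combination.
for combo in combinations(sorted(GPU_MAX_W), 3):
    print(combo, psu_load(combo))
```

On these figures alone, running GPUs 0, 1 and 3 puts 790W on power supply 1 and 350W on power supply 2, while GPUs 0, 1 and 2 put 790W and 370W on them, so the quoted GPU totals by themselves do not separate those two cases.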

I am running stress tests with gpu-burn (GitHub - wilicc/gpu-burn: Multi-GPU CUDA stress test). Curiously, some of them fail, and I am trying to understand why.



Experiments Run:

1. I ran a stress test on all GPUs individually. All of these passed, with no errors.

2. I ran a stress test on sets of 2 GPUs at a time (all combinations). These also passed, with no errors.

This means that tests on

CUDA GPUs 0 and 1 work perfectly

CUDA GPUs 1 and 2 work perfectly

CUDA GPUs 2 and 3 work perfectly

CUDA GPUs 0 and 3 work perfectly

etc.

Next, I ran stress tests on 3 GPUs together. Curiously, only one combination failed.

- When I ran the stress test on CUDA GPUs 0, 1 and 2, the test worked perfectly.

- When I ran the stress test on CUDA GPUs 1, 2 and 3, the test worked perfectly.

- When I ran the stress test on CUDA GPUs 0, 2 and 3, the test worked perfectly.

However:

- When I ran the stress test on CUDA GPUs 0, 1 and 3, the test failed, and the system crashed and rebooted.
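For anyone who wants to repeat the single, pair and triple runs above, here is a minimal sketch that only prints one command per combination. The post does not say how the GPU subsets were selected, so using the CUDA_VISIBLE_DEVICES environment variable is an assumption, as are the ./gpu_burn path and the 60 second duration. GPU_IDS, BURN_SECONDS and burn_commands are names invented for this example.

```python
from itertools import combinations

GPU_IDS = [0, 1, 2, 3]
BURN_SECONDS = 60        # assumed test duration, not stated in the post
GPU_BURN = "./gpu_burn"  # assumed path to the built gpu-burn binary


def burn_commands(group_size):
    """Build one shell command per GPU combination of the given size."""
    cmds = []
    for combo in combinations(GPU_IDS, group_size):
        visible = ",".join(str(gpu) for gpu in combo)
        cmds.append(f"CUDA_VISIBLE_DEVICES={visible} {GPU_BURN} {BURN_SECONDS}")
    return cmds


# Print the commands for single GPUs, pairs and triples.
for size in (1, 2, 3):
    for cmd in burn_commands(size):
        print(cmd)
```

Printing the commands instead of running them keeps the script harmless on a machine that reboots under load; each line can be pasted into a shell one at a time.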

When I cap every GPU at 100W, all 4 run together without problems. However, when I raise the cap to 150W, the test fails. This doesn't make sense, because the uncapped 0, 1, 2 combination passes.
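To put numbers on that comparison, here is a small sketch that adds up the quoted max draw of the active GPUs, clamped to a per-GPU cap where one is set. It assumes the cards actually sit at their cap or rated max during the test, which the post does not confirm. GPU_MAX_W and total_draw are names invented for this example.

```python
# Max board power quoted in the post (watts), keyed by CUDA device index.
GPU_MAX_W = {0: 420, 1: 370, 2: 370, 3: 350}


def total_draw(active_gpus, cap_w=None):
    """Sum quoted GPU draw, clamping each card to cap_w if a cap is given."""
    total = 0
    for gpu in active_gpus:
        watts = GPU_MAX_W[gpu]
        if cap_w is not None:
            watts = min(watts, cap_w)
        total += watts
    return total


print("all 4 capped at 100W:", total_draw([0, 1, 2, 3], cap_w=100))
print("all 4 capped at 150W:", total_draw([0, 1, 2, 3], cap_w=150))
print("GPUs 0, 1, 2 uncapped:", total_draw([0, 1, 2]))
```

By this estimate the failing 150W run asks for about 600W across the GPUs, while the passing uncapped 0, 1, 2 run asks for about 1160W, so total GPU wattage on its own does not predict the crash.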

Mainly, I'm curious why the 0, 1, 2 combination works, which spans both power supplies, but 0, 1, 3 doesn't, even though the power setup is nearly identical. Do I need to change which slots my GPUs are in? Do I need more power supplies? Is there an issue with the power delivery in my commercial apartment?
 

skipper ohms

Member
Jan 24, 2024
It's likely your power supplies don't provide enough amperage on the 12V rail(s). The rated 1000 watts is split across several voltages, so you don't get the full 1000 watts on the 12V rail. What does the label on the PSU say?
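As a rough way to read a PSU label against this, watts on a rail are volts times amps. The sketch below converts in both directions. The 790W figure is the quoted max of GPUs 0 and 1 from the first post and is an upper bound, since part of a card's power can come through the slot. The 70A label value is purely hypothetical and is not taken from this thread. amps_needed and rail_watts are names invented for this example.

```python
RAIL_VOLTS = 12.0


def amps_needed(watts, volts=RAIL_VOLTS):
    """Current a load of the given wattage pulls from the rail."""
    return watts / volts


def rail_watts(label_amps, volts=RAIL_VOLTS):
    """Power available on a rail rated for label_amps."""
    return label_amps * volts


# GPUs 0 and 1 on power supply 1: 420W + 370W quoted max.
print(f"12V current for 790W: {amps_needed(420 + 370):.1f} A")

# HYPOTHETICAL label value, for illustration only.
print(f"12V power at a 70A rating: {rail_watts(70):.0f} W")
```

If the label lists several separate 12V rails, the same check applies to each rail and the connectors wired to it.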

Another possibility is that you have plugged both power supplies into the same 15 amp wall socket.
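A quick sketch of the wall-side arithmetic, with several assumptions that are not in the thread: 120V mains, a 0.8 derating factor that is often applied to continuous loads, and 90% PSU efficiency. Only the GPU max figures come from the first post, and CPU, motherboard and drive load are left out. circuit_watts and the ASSUMED_ constants are names invented for this example.

```python
ASSUMED_MAINS_VOLTS = 120      # assumption: not stated in the thread
ASSUMED_CONTINUOUS = 0.8       # assumption: common continuous-load derating
ASSUMED_PSU_EFFICIENCY = 0.90  # assumption: depends on the actual PSUs

# Max board power quoted in the first post (watts).
GPU_MAX_W = [420, 370, 370, 350]


def circuit_watts(volts, breaker_amps, factor=1.0):
    """Power a circuit can supply at the given voltage and breaker rating."""
    return volts * breaker_amps * factor


gpu_dc_watts = sum(GPU_MAX_W)
gpu_wall_watts = gpu_dc_watts / ASSUMED_PSU_EFFICIENCY

print("15A circuit peak:", circuit_watts(ASSUMED_MAINS_VOLTS, 15))
print("15A circuit continuous:",
      round(circuit_watts(ASSUMED_MAINS_VOLTS, 15, ASSUMED_CONTINUOUS)))
print("4 GPUs at quoted max, DC side:", gpu_dc_watts)
print("4 GPUs at quoted max, wall side:", round(gpu_wall_watts))
```

Changing the assumed voltage and breaker rating to match the actual building wiring gives the figure to compare against the measured draw at the wall.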