My training cluster runs Ubuntu 24.04 on an AMD EPYC 7282 with a Supermicro H12SSL-i motherboard, a 4TB NVMe drive, and 256GB of RAM. I have installed 4 GPUs, all of them variants of the NVIDIA RTX 3090 (Aorus, Zotac and Gigabyte):
- A Gigabyte XTREME AORUS RTX 3090 - max power draw of 420W. Connected to PCIe Slot 6. CUDA Device 0.
- A Gigabyte black RTX 3090 - max power draw of 370W. Connected to PCIe Slot 1. CUDA Device 2.
- A Gigabyte white RTX 3090 - max power draw of 370W. Connected to PCIe Slot 7. CUDA Device 1.
- A Zotac RTX 3090 - max power draw of 350W. Connected to PCIe Slot 3. CUDA Device 3.
I have two 1000W power supplies in this system. Power supply 1 feeds GPUs 0 and 1 only, and it also connects to the motherboard through one 8-pin connector. Power supply 2 feeds the motherboard, the CPU, and GPUs 2 and 3; it connects to the motherboard through one 8-pin connector and the main ATX power connector. My CPU has a max TDP of 280W.
I am running stress tests with gpu-burn (wilicc/gpu-burn on GitHub, a multi-GPU CUDA stress test). Curiously, some of them fail, and I am trying to understand why.
Experiments Run:
1. I ran a stress test on all GPUs individually. All of these passed, with no errors.
2. I ran a stress test on sets of 2 GPUs at a time (all combinations). These also passed, with no errors.
This means that tests on the following pairs work perfectly:
- CUDA GPUs 0 and 1
- CUDA GPUs 1 and 2
- CUDA GPUs 2 and 3
- CUDA GPUs 0 and 3
- etc.
Next, I ran stress tests on 3 GPUs at a time. Curiously, only one combination failed.
- When I ran the stress test on CUDA GPUs 0, 1 and 2, the test worked perfectly.
- When I ran the stress test on CUDA GPUs 1, 2 and 3, the test worked perfectly.
- When I ran the stress test on CUDA GPUs 0, 2 and 3, the test worked perfectly.
However:
- When I ran the stress test on CUDA GPUs 0, 1 and 3, the test failed, and the system crashed and rebooted.
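For anyone who wants to reproduce the test matrix, here is a minimal sketch of how the combinations could be enumerated. It only builds and prints the commands; it does not launch anything. The `./gpu_burn` path and the 60-second duration are assumptions for illustration, and GPU selection is done through the standard `CUDA_VISIBLE_DEVICES` environment variable rather than any gpu-burn option.

```python
from itertools import combinations

GPU_IDS = [0, 1, 2, 3]  # CUDA device numbers as used in this post


def burn_plan(group_size, seconds=60, binary="./gpu_burn"):
    """Return (combo, env, argv) for every GPU combination of the given size.

    `binary` and `seconds` are illustrative assumptions; adjust them to
    match the local gpu-burn build and the desired test length.
    """
    plan = []
    for combo in combinations(GPU_IDS, group_size):
        # Restrict the process to the chosen GPUs only.
        env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in combo)}
        plan.append((combo, env, [binary, str(seconds)]))
    return plan


if __name__ == "__main__":
    for combo, env, argv in burn_plan(3):
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, " ".join(argv))
```

Running one combination per boot, and noting which one was active when the machine reset, keeps the results easy to compare.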
When I limit the power draw to 100W per GPU, all 4 GPUs run together without problems. However, when I raise the limit to 150W per GPU, the 4-GPU run fails. This doesn't make sense to me, because the 0, 1, 2 combination passes with no power limit at all.
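The power-limit experiment can be sketched as follows, assuming the limits are applied with `nvidia-smi -i <index> -pl <watts>` (which needs root). The helper names are hypothetical, and the script defaults to a dry run that only prints the commands. One caveat: `nvidia-smi` indexes GPUs in PCI bus order, which may not match the CUDA device numbers unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set, so it is worth confirming which card each index refers to.

```python
import subprocess


def power_limit_cmd(gpu_index, watts):
    """Build the nvidia-smi command that caps one GPU's power limit."""
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def query_cmd():
    """Build the nvidia-smi command that reports current draw and limit."""
    return [
        "nvidia-smi",
        "--query-gpu=index,name,power.draw,power.limit",
        "--format=csv",
    ]


def apply_limits(watts, gpu_indices=(0, 1, 2, 3), dry_run=True):
    """Apply the same limit to every listed GPU; print only when dry_run."""
    cmds = [power_limit_cmd(i, watts) for i in gpu_indices]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    apply_limits(100)            # the setting that passed with all 4 GPUs
    print(" ".join(query_cmd()))  # check what was actually applied
```

Stepping the limit between 100W and 150W in small increments, with the query command run during the test, would show how close to the limit each card actually sits when the failure occurs.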
Mainly, I'm curious because the 0, 1, 2 combination, which draws from both power supplies, works, but 0, 1, 3 doesn't, even though the power setup is nearly identical. Do I need to change which slots my GPUs are in? Do I need more power supplies? Is there an issue with the power delivery in my commercial apartment?
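To make the comparison concrete, here is a rough sketch of the worst-case load per power supply for each 3-GPU combination, using only the numbers stated in this post. It makes strong simplifying assumptions: each GPU draws its full rated maximum entirely from its assigned supply, the CPU draws its full 280W from power supply 2, and motherboard, drive, fan, and PCIe slot power are ignored. These are sustained nameplate figures, not measurements.

```python
from itertools import combinations

# Rated maximum draw per CUDA device, in watts (from the list above).
GPU_MAX_W = {0: 420, 1: 370, 2: 370, 3: 350}
# Which power supply feeds each GPU (from the description above).
GPU_PSU = {0: 1, 1: 1, 2: 2, 3: 2}
CPU_MAX_W = 280  # stated max TDP; the CPU is fed by power supply 2
CPU_PSU = 2
PSU_RATING_W = 1000


def psu_load(active_gpus, include_cpu=True):
    """Sum the rated draw on each power supply for the given active GPUs."""
    load = {1: 0, 2: 0}
    for gpu in active_gpus:
        load[GPU_PSU[gpu]] += GPU_MAX_W[gpu]
    if include_cpu:
        load[CPU_PSU] += CPU_MAX_W
    return load


if __name__ == "__main__":
    for combo in combinations(sorted(GPU_MAX_W), 3):
        load = psu_load(combo)
        print(combo, {psu: f"{w}W of {PSU_RATING_W}W" for psu, w in load.items()})
```

Under these assumptions, the failing 0, 1, 3 combination puts 790W on power supply 1 and 630W on power supply 2, while the passing 0, 1, 2 combination puts 790W and 650W on them. The rated totals alone therefore do not separate the failing case from the passing one, which is exactly what puzzles me.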