EPYCD8-2T Serious issues


balnazzar

Active Member
Mar 6, 2019
Hi mates. I think that my EPYCD8-2T has failed on me (at the worst moment, of course).

My plight started two days ago, when I added the fourth GPU. The board did not boot properly (no screen output), and after the usual 30-40 seconds the power draw skyrocketed to almost 300 W. No SSH server was reachable, so the OS had not booted. The BMC was unreachable too. All the fans ran at flank speed.
I couldn't see the Dr. Debug display, since they intelligently placed it under the fourth GPU, so I detached that GPU. No success, but at least now I could see the display. Pretty useless, since it cycled through a vast number of codes and ended with a countdown from 90 to 01 (see the video; it starts at 0:19).
I detached all the GPUs save one, then all of them. No success. Lower power consumption, but still high (>200 W) 30-45 s after boot.

I noticed another strange thing, which makes me suspect that the board is indeed broken. While powered off, it always drew 7-8 W, probably because of the BMC (a normal PC draws 1-2 W on average). Now it draws 20-22 W, which is almost impossible even considering the BMC. Furthermore, the heatsink over the 10Gb NICs is so hot that you can hold a finger on it for only an instant. It was NOT like that before.

Of course I cleared the CMOS and even replaced the CMOS battery. No success.

Any clue?? :-/

 

wallop

New Member
Nov 6, 2018
Do I understand correctly that you had a 12 V PCIe power cable connected to the board's 12 V graphics power connector?
 

balnazzar

Active Member
Mar 6, 2019
Thanks for your reply. Yes, if you mean the 6-pin PCI Express connector: I connected it even before purchasing the fourth graphics card, specifically so I wouldn't forget it.
 

lpallard

Member
Aug 17, 2013
1. Have you figured this out yet? It's May, and your post dates back to February... I may be late...
2. Have you pulled everything (except, of course, what's required to boot a system) and proceeded by elimination?
3. Have you contacted ASRock? When I had a freak occurrence with my FreeNAS server (Supermicro hardware), I contacted them and they figured out pretty quickly what was going on. $48 USD shipping back and forth, 10 days, and the mainboard was back, fully QA checked. It's been rock solid for 7 years now.

Sounds like you definitely have an electrical issue somewhere...
 

balnazzar

Active Member
Mar 6, 2019
1. Have you figured this out yet? It's May, and your post dates back to February... I may be late...
2. Have you pulled everything (except, of course, what's required to boot a system) and proceeded by elimination?
3. Have you contacted ASRock? When I had a freak occurrence with my FreeNAS server (Supermicro hardware), I contacted them and they figured out pretty quickly what was going on. $48 USD shipping back and forth, 10 days, and the mainboard was back, fully QA checked. It's been rock solid for 7 years now.

Sounds like you definitely have an electrical issue somewhere...
Point 2: Yes.

The board stopped working altogether during the subsequent weeks. I returned it to the vendor and took an Intel board as a replacement, since I was in a hurry to get a working system. I also, reluctantly, returned the EPYC 7282 to Amazon.

It's a shame. The EPYC had stellar performance with a low power draw. My experience with ASRock has been dismal, but maybe I have just been unlucky.
 

lpallard

Member
Aug 17, 2013
That's too bad; at least you got it sorted out...

I myself try to avoid manufacturers whose bread and butter is gaming components. Gamers and people like us are two different crowds with different needs... I only buy Supermicro. Quality-wise, I've had boards running for 12 years without a hiccup, and when I did have problems they charged nothing for a series of tests on an H8DCL-iF (I only had to pay shipping to them from Canada).
 

balnazzar

Active Member
Mar 6, 2019
My problem with Supermicro was that I needed a board capable of supporting four GPUs, and the only EPYC Rome board that could do that was the ASRock one. Indeed, I'd like to give EPYC another try, and ASRock has even released a beefed-up version of that board (ROMED8), but I just don't want to buy another ASRock product.

Since there are so few boards with four mechanical x16 slots despite EPYC's monstrous number of PCIe lanes, it seems that AMD wants you to buy a Threadripper for workstation use. For me that's a non-option: I need RDIMM support and low power consumption, and the TR provides neither.
 

hakabe

Member
Jul 6, 2016
I'm considering buying this motherboard solely for multi-GPU purposes. Did you try to reset the BIOS? I'm using a Supermicro product and it gave me a headache until it finally managed to boot once and threw me back into the BIOS with:
!!!!PCI Resource ERROR!!!!

PCI OUT OF RESOURCES CONDITION:

Error: Insufficient PCI Resources Detected!!!
and suggested going to BIOS > Advanced > PCIe/PCI/PnP Configuration > Above 4G Decoding and setting it to Enabled.
After that, no issues.

Just wondering whether this board has a similar setting for supporting multiple GPUs, and whether it's unable to POST without it.
 

gsrcrxsi

Active Member
Dec 12, 2018
This board definitely CAN support multi-GPU. I have 3 of these boards, but in the non-"2T" variant with only 1Gb LAN (the 2T has 10Gb).

I think the OP didn't have Above 4G Decoding turned on. That's the classic misconfiguration that leads people to have POST/boot issues with multiple GPUs.

My three GPU compute systems are as follows:

1.
ASRock Rack EPYCD8 board
EPYC 7402P 24-core CPU
Six [6] RTX 2080 Ti GPUs
(Using PCIe 3.0 x16 risers)

2.
ASRock Rack EPYCD8 board
EPYC 7402P 24-core CPU
Eight [8] RTX 2070 GPUs
(Using PCIe 3.0 x16 risers)
(Using a C_Payne custom x8/x8 riser to bifurcate slot #7, adding two GPUs with x8 lanes each)
The slot needs to be bifurcated in the BIOS for this to work.

3.
ASRock Rack EPYCD8 board
EPYC 7502 32-core CPU
Seven [7] RTX 2080 GPUs
Each GPU watercooled to single slot size, and plugged directly into the motherboard.

In all cases everything works, with all lanes from the motherboard operating as expected. The only problems I ever have are with the risers, or needing to reseat/reinstall a riser if not all lanes get detected.

I did return one motherboard that had a flaky PCIe slot. Slot #3 would consistently drop the GPU from the driver and cause system crashes or lock-ups. I spent a lot of time troubleshooting, moving GPUs and risers around, and the only thing that stayed constant was that PCIe slot on the motherboard. It could have been either a motherboard or a CPU problem; replacing the motherboard made the problem go away.

All in all, these are great boards for multi-GPU setups that don't need PCIe 4.0 (and most don't). They have been the most stable boards for my GPU compute systems by far.
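
For anyone who wants to check that every card actually trained at the expected width (the "not all lanes get detected" case above), a minimal sketch along these lines works on Linux. It just reads the standard sysfs link-width attributes for every NVIDIA device (vendor ID 0x10de), so treat it as an illustration rather than a polished tool:

from pathlib import Path

# List negotiated vs. maximum PCIe link width for every NVIDIA GPU.
for dev in sorted(Path("/sys/bus/pci/devices").glob("*")):
    try:
        if (dev / "vendor").read_text().strip() != "0x10de":   # NVIDIA vendor ID
            continue
        cur = (dev / "current_link_width").read_text().strip()
        mx = (dev / "max_link_width").read_text().strip()
        note = "" if cur == mx else "  <-- fewer lanes than the card supports, check/reseat the riser"
        print(f"{dev.name}: x{cur} of x{mx}{note}")
    except OSError:
        continue

(Cards behind a bifurcated x8 slot will of course report fewer lanes than their maximum; that's expected, not a fault.)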
 

hakabe

Member
Jul 6, 2016
(Using PCIe 3.0 x16 risers)
Out of curiosity, what kind of riser cables do you use, and how long are they? I've also ordered a bunch of 18 to 30 cm x16 risers, but I'm not sure whether there are any length limitations for GPUs (because of power draw)?
 

gsrcrxsi

Active Member
Dec 12, 2018
Out of curiosity, what kind of riser cables do you use, and how long are they? I've also ordered a bunch of 18 to 30 cm x16 risers, but I'm not sure whether there are any length limitations for GPUs (because of power draw)?
I use these EZDIY-FAB ones from Amazon, 20 cm long, but I bought a couple of 30 cm ones for cards that need a bit longer reach. They do not have a power input, so slot power is provided by the motherboard. Be sure to plug in the 6-pin VGA power connector on the front of the board when using lots of GPUs.

EZDIY-FAB New PCI Express PCIe3.0 16x Flexible Cable Card Extension Port Adapter High Speed Riser Card (20cm 180 Degree)-Upgrade Version

 

balnazzar

Active Member
Mar 6, 2019
I think the OP didn't have Above 4G Decoding turned on. That's the classic misconfiguration that leads people to have POST/boot issues with multiple GPUs.

...

Each GPU watercooled to single slot size, and plugged directly into the motherboard.

...

Your setups are quite impressive. My compliments :) May I ask you about:

1. The details/components used in the water loop for the machine with the seven 2070s.

2. Would you tell me/us more about the Above 4G Decoding setting that prevents using multiple GPUs when it's not turned on?

3. What do you do with these machines? (Just out of curiosity.)

Thanks!
 

gsrcrxsi

Active Member
Dec 12, 2018
It's actually 7x RTX 2080.

7x ASUS RTX2080 (single slot I/O, ref pcb)
7x EK waterblocks
7-GPU EK waterblock bridge
1x Watercool MO-RA3 360x360 radiator
2x D5 pumps
9x Noctua NF-F12 iPPC-2000 fans

Above 4G Decoding allows devices to map more than 4 GB of memory address space. When you have many GPUs, you need it enabled or the motherboard won't boot. Every board draws the line somewhere different: some need it for 3+ GPUs, some for 4+, some for 5+, etc. It probably depends on how much memory each GPU has, too.
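
To put a rough number on "more than 4 GB of memory space": each GPU requests big MMIO windows (BARs), and the legacy window below 4 GB can't hold many of them. A little sketch for Linux that adds up what the NVIDIA cards ask for (the sysfs paths and the 0x10de vendor ID are standard; 0x200 is the flag bit marking memory regions in the resource file):

from pathlib import Path

IORESOURCE_MEM = 0x200   # flag bit marking MMIO regions in sysfs "resource"
total = 0

for dev in Path("/sys/bus/pci/devices").glob("*"):
    try:
        if (dev / "vendor").read_text().strip() != "0x10de":   # NVIDIA vendor ID
            continue
        for line in (dev / "resource").read_text().splitlines():
            start, end, flags = (int(x, 16) for x in line.split())
            if start and (flags & IORESOURCE_MEM):             # count only populated MMIO BARs
                total += end - start + 1
    except OSError:
        continue

print(f"MMIO space requested by the GPUs: {total / 2**30:.1f} GiB")
# With several cards this easily exceeds what fits below 4 GB,
# which is exactly what Above 4G Decoding is there to fix.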

I use these systems mostly for BOINC computing, contributing to projects like GPUGRID, Einstein@Home, Universe@Home, and World Community Grid - OpenPandemics COVID-19.
 

balnazzar

Active Member
Mar 6, 2019
1x Watercool MO-RA3 360x360 radiator
I somewhat imagined a MO-RA, but I thought you had gone for the 420. It's remarkable that even the 360 can handle something like 1500 W worth of heat.

2x D5 pumps
Is it the EK-revo dual?

Also, it would be very useful to know which case you used for each of these systems, if I may ask. The case is a delicate matter when you have to deal with so many components and the associated heat output.

7-GPU EK waterblock bridge
Do you use the bridge to cool them in parallel?

Above 4G Decoding allows devices to map more than 4 GB of memory address space. When you have many GPUs, you need it enabled or the motherboard won't boot.
That's very good to know, and I wasn't aware of that. Thanks!
 

gsrcrxsi

Active Member
Dec 12, 2018
I was on the fence about whether one 360 MO-RA3 would be enough. But I've power-limited each 2080 to 185 W (1295 W total) and overclocked them to gain back most of the lost performance, so they are pretty efficient. And I have the radiator mounted in the window, blowing the hot air directly outside, so the room isn't heated by the radiator. Under lighter loads like Einstein gamma-ray tasks, the GPUs stay below 50 C. Under heavier loads like GPUGRID, they might get up to 65-70 C in the warmer months. Warm for watercooling, but more than manageable. Yes, it's the revo dual housing for the pumps, mounted to the radiator itself.
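
Spelling out the arithmetic from the numbers above (nothing measured here, just the figures from the post; the per-card limit is the kind of thing you'd set with nvidia-smi -pl 185):

# 7x RTX 2080 at a 185 W power limit, one MO-RA3 360 with 9x 120 mm fans
gpus = 7
power_limit_w = 185
fans = 9

total_heat_w = gpus * power_limit_w
print(f"Total heat dumped into the loop: {total_heat_w} W")        # 1295 W
print(f"Per 120 mm fan section: {total_heat_w / fans:.0f} W")      # ~144 W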

The bridge is semi-parallel: four GPUs in parallel, then in series into the last three GPUs, which are also in parallel. Only the GPUs are in the loop; the CPU is not. It just uses a Supermicro 4U air cooler.

The case is a Rosewill RSV-L4500. I had to cut a slot out to make room for the fittings on the side of the bridge. The case is rackmounted, and there are quick-disconnects between the case and the radiator to make things easier when I want to move the system around.

pic:
 

balnazzar

Active Member
Mar 6, 2019
That's a damn good setup!

65-70C in the warmer months. Warm for watercooling, but more than manageable.
May I ask what temperature the coolant reaches during the hot months?

Mounted to the radiator itself.
Is there a mounting frame, or did you just build it yourself?

I have the radiator mounted in the window, blowing the hot air directly outside
That's very interesting and wise; indeed, many people recommend doing so. May I ask you for a pic, so that I can have a look at the mounting frame? Thanks!!
 

gsrcrxsi

Active Member
Dec 12, 2018
Sorry, I don't monitor the coolant temp, only component temps. I really don't care what temperature the coolant is as long as the components are kept in check. The mount for the D5 pumps is basically the universal mount for the EK dual revo.

Pic of the radiator setup:

I just mounted the radiator to a piece of plywood cut to fit the window, then mounted the whole thing in the window frame.
 