T620 GPU issue

hhoria

New Member
Mar 1, 2022
Hi everyone,
Not sure where I should post this, since I couldn't find a GPU server board. Here's what's happening:

For a couple of months now, the T620 has been running just fine with 2 Tesla K80s (mostly machine learning stuff), with the fans on full blast.

A couple of days ago, I got a Quadro RTX 6000 and replaced one of the Teslas with it. Here's what happens:

1st boot: new PCIe device found, reconfiguring and rebooting.

On the 2nd boot, it tells me: A bus fatal error was detected on a component at slot 7.

I switched the card to another slot: same error after the 2nd restart, just reporting a different slot.

I also went a little deeper and replaced the Quadro with a 2070 Super just to see if that would work: same fatal error, even for the 2070.

The weird thing about the Teslas is that they use that weird CPU-style power connector, which splits into 2 PCIe power connectors, and that works fine. However, the other cards take one 8-pin and one 6-pin PCIe connector and use less power than the Teslas, yet the server doesn't like them for some reason.

I also tested a Vega 64, and that worked on the 1st try ...

It's running the XCP-ng hypervisor. The Teslas do show up with lspci | grep 3D, but the Quadro and the 2070 don't show up at all.
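
A minimal sketch for double-checking enumeration, assuming the host's lspci (pciutils) supports -nn. Tesla boards typically enumerate with the class "3D controller", while cards with display outputs, such as a Quadro or GeForce, usually appear as "VGA compatible controller", so grep 3D would hide them even if they were detected. Matching on NVIDIA's PCI vendor ID (10de) lists every NVIDIA function regardless of class. The helper name nvidia_pci_devices is made up for illustration. This does not explain the bus fatal error; it only rules out the grep filter as the reason the cards seem to be missing.

```shell
# Filter `lspci -nn` output (read from stdin) down to NVIDIA functions.
# 10de is NVIDIA's PCI vendor ID; matching on it catches both
# "3D controller" and "VGA compatible controller" entries.
# nvidia_pci_devices is a hypothetical helper name, not a standard tool.
nvidia_pci_devices() {
  grep -i '\[10de:[0-9a-f]\{4\}\]'
}

# Only query real hardware when lspci is available on this machine.
if command -v lspci >/dev/null 2>&1; then
  lspci -nn | nvidia_pci_devices || echo "no NVIDIA devices enumerated"
fi
```

If a card is still absent from this list, the host never enumerated it on the PCIe bus, which points at the firmware or hardware level rather than at XCP-ng.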

System Information
Description PowerEdge T620
BIOS Version 2.9.0
Service Tag GC8CZ42
Node Id GC8CZ42
Express Service Code 35568153794
Host Name xcp1
OS Name XCP-ng
OS Version release 8.2.0 (xenenterprise) Kernel 4.19.0+1 (x86_64)
System Revision I
Lifecycle Controller Firmware 2.65.65.65
IDSDM Firmware Version N/A

Any thoughts on what I should check/test?

NB: The Dell forums weren't helpful, other than telling me the GPU is too new and the T620 is too old...
 

mrpasc

Well-Known Member
Jan 8, 2022
Munich, Germany
Well, I used a T620 some years ago, and as far as I remember they are very picky about which GPUs they accept. If you use more than 1 GPU, they must all be the same model/chip. This is clearly stated in the owner's manual. Additionally, there are some limitations if you have a PERC SAS HBA or RAID card installed, and for 2 GPUs (300 W each) you must have the 1100 W power supply, if I remember correctly.
So, what happens if you have just the RTX 6000 installed, i.e. without the Teslas?