Looking for advice for my first Deep Learning system

LenE

New Member
Jan 29, 2020
28
8
3
That’s great info on NVLINK compatibility!

I saw articles on the hashing technique to bypass GPU compute. It’s neat, but it is still research and has not yet been implemented in any general-purpose framework.

I have simultaneously scaled down and scaled up my proposed system build to use a maximum of 2 bigger GPUs and a Threadripper 3960X. I have awful timing though. It seems that the Covid-19 fiasco has pushed prices from list to list + $400 since I started this thread. I have seen drops on the consumer Ryzen chips and slight increases on Epyc, but the TRs got really expensive. Maybe after nVidia announces whatever they are announcing at their now-virtual GTC, a drop on current RTX cards will partially offset the CPU’s exorbitant rise in price.
 

LenE

What a difference a day makes! The 3960X is now down to list price at Amazon for early April delivery and only $165 over list at Newegg. The 3990X is now $100 under list.
 

LenE

With nVidia’s postponement and then cancellation of GTC, I decided that I couldn’t stand waiting any longer. Amazon brought the 3960X’s price back down to list, so I went on a spending spree in anticipation of some Covid-19 induced down time. I’m hoping to get all of the bits I need by the first week of April.

Thanks to everyone who provided advice, information, and inspiration!
 

LenE

I got my machine built, and now I’m doing the bleeding edge hardware and software dance. Ubuntu 18.04, 19.10, and the nightly of 20.04 all have many crashes and panics. Lots of things aren’t playing nice yet. As a public service announcement, most motherboard makers with TRX40 boards only suggest Windows 10. If you want to run Linux, you need to pass “mce=off” as a kernel command line argument in Grub. I couldn’t even boot the installer without this.

I’m probably going back to 18.04 to limit some of the bleeding edge-ness. I have only found people praising the Threadripper 39xx series for Linux prowess, or others completely frustrated trying to get it to work at all. I’m currently between the two camps.
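
For anyone who wants the “mce=off” workaround to survive reboots once the system is installed, here is a minimal sketch of one way to do it. It assumes the usual Ubuntu layout, where kernel arguments live on the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub. The function name add_kernel_arg and the sample text are made up for illustration, and the code only transforms a string, it does not touch the real file.

```python
import re


def add_kernel_arg(grub_text, arg="mce=off"):
    """Return grub_text with `arg` present in GRUB_CMDLINE_LINUX_DEFAULT."""
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def patch(match):
        args = match.group(2).split()
        if arg not in args:  # do not add the argument twice
            args.append(arg)
        return match.group(1) + " ".join(args) + match.group(3)

    new_text, count = pattern.subn(patch, grub_text)
    if count == 0:
        # No such line in the file: append one at the end.
        new_text = grub_text.rstrip("\n") + '\nGRUB_CMDLINE_LINUX_DEFAULT="%s"\n' % arg
    return new_text


# Hypothetical file contents, for illustration only.
sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_arg(sample))
```

After writing the result back to /etc/default/grub as root, running `sudo update-grub` regenerates the boot menu. For the installer itself, pressing `e` at the Grub menu and appending the argument to the line that starts with `linux` does the same thing for a single boot.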
 

Cixelyn

Researcher
Nov 7, 2018
48
30
18
San Francisco
Glad to hear you finally got the parts in!

For deep learning I'd definitely recommend sticking with 18.04 for now. iirc CUDA Toolkit is only officially released for 16.04 and 18.04 -- it's not really worth the hassle to debug random CUDA / Driver issues. Getting your deep learning models to converge will already be a big enough headache ;)
 

LenE

Figuring out the versions that work for CUDA and TensorFlow 2.x with the RTX cards under Linux seems to be a challenge of its own. It feels like I’m in my own platformer arcade game from the 1980s, where I get so far, and then run into a problem that requires backing up and starting all over again. Now if I can just find a driver version that CUDA 10.0 is happy with...
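
A rough sketch of the driver check described above, in case it saves someone a restart. The minimum driver numbers in the table are the ones I recall from NVIDIA's CUDA release notes for Linux, so treat them as assumptions and verify them against the release notes before relying on them. The names MIN_LINUX_DRIVER, parse_version, and driver_supports are made up for this example.

```python
# Minimum Linux driver version per CUDA toolkit release.
# ASSUMPTION: values recalled from NVIDIA's CUDA release notes; verify them.
MIN_LINUX_DRIVER = {
    "10.0": "410.48",
    "10.1": "418.39",
    "10.2": "440.33",
}


def parse_version(text):
    """Turn '440.33.01' into (440, 33, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_version, cuda_version):
    """True if the installed driver meets the minimum for that CUDA release."""
    minimum = MIN_LINUX_DRIVER[cuda_version]
    return parse_version(driver_version) >= parse_version(minimum)


# Example: check one driver version against every CUDA release in the table.
for cuda in sorted(MIN_LINUX_DRIVER):
    print(cuda, driver_supports("430.50", cuda))
```

The installed driver version is the one reported in the header of `nvidia-smi`. Newer drivers generally keep working with older CUDA releases, so the check only needs a lower bound.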
 

Cixelyn

One solution we've been using for the above problem is to install the latest on the machines (CUDA 10.2), and then run the exact version of TF (1.15) + CUDA (10.0) that we need in a Singularity / Docker container. From my understanding, you can pass through and run an older version of CUDA as long as the host's version of CUDA is the same or later.

NVIDIA publishes official CUDA container images on Docker Hub, which make this easier.
I suspect this is the only way to make the dependencies sane.
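
To make the container approach above concrete, here is a small sketch that builds the Docker command line for a quick GPU check inside a TF 1.15 image. It assumes Docker 19.03 or later with the NVIDIA Container Toolkit installed (that is what provides the `--gpus` flag), and that the `tensorflow/tensorflow:1.15.0-gpu-py3` tag on Docker Hub is the image you want. The helper name docker_gpu_command is made up for this example. As far as I know, what has to be new enough on the host is the NVIDIA driver, since the CUDA libraries themselves come from the image.

```python
import shlex


def docker_gpu_command(image, inner_command):
    """Build a `docker run` argument list that exposes all host GPUs."""
    return ["docker", "run", "--rm", "--gpus", "all", image] + list(inner_command)


# Command to run inside the container: print the TF version and GPU visibility.
check = [
    "python",
    "-c",
    "import tensorflow as tf; print(tf.__version__, tf.test.is_gpu_available())",
]

# ASSUMPTION: this image tag matches the TF 1.15 + CUDA 10.0 combination.
cmd = docker_gpu_command("tensorflow/tensorflow:1.15.0-gpu-py3", check)

# Print a copy-pasteable shell command instead of executing it.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it directly. With Singularity, the rough equivalent is `singularity exec --nv` against the same image.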