DeepLearning12 NVLink

Discussion in 'Machine Learning, Deep Learning, and AI' started by Patrick, Jul 1, 2018.

  1. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
    I have been getting restless to do another deep learning build. Today I invested in some Tesla P100 16GB GPUs.

    Instead of going with PCIe cards, I decided SXM2 with NVLink.

    Next items:
1. Need to do some research on whether I can put P100s in V100 trays. The V100, I believe, has 6x 50GB/s links. The P100 has 4x 40GB/s. That is a big difference, but if the trays work with both, I will want the newer V100 trays.
2. I think this is going to be Skylake based. The E5 generation would have been less expensive, since both the CPUs and the motherboards cost less.
3. Skylake memory sizing is somewhat strange. A common rule of thumb is 2x total GPU memory for system RAM. Each P100 has 16GB, so that is 64GB in a 4x GPU system or 128GB in an 8x GPU system, which at 2x means 128GB or 256GB of system RAM. With Skylake's six memory channels per socket, the realistic options are 96GB, 192GB, or 384GB. With E5's four channels, 128GB or 256GB would be easier.
    4. CPUs. What to use?

    Many questions. Likely a few weeks from answers.
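The RAM sizing tension in point 3 is easy to see with a quick sketch. This assumes the common 2x-GPU-memory rule of thumb and standard dual-socket DIMM populations; the capacities are illustrative, not a vendor spec:

```python
# Sketch of the system RAM sizing from point 3, assuming the common
# rule of thumb of 2x total GPU memory for system RAM.

def target_system_ram(num_gpus, gpu_mem_gb=16, multiplier=2):
    """Total GPU memory times the rule-of-thumb multiplier."""
    return num_gpus * gpu_mem_gb * multiplier

def nearest_config(target_gb, options):
    """Smallest standard capacity that meets or exceeds the target."""
    return min((o for o in options if o >= target_gb), default=max(options))

# Dual-socket Skylake-SP: 12 channels total, so 12 identical DIMMs
# give 96/192/384GB with 8/16/32GB modules.
skylake = [96, 192, 384]
# Dual-socket E5: 8 channels, so 128/256GB fall out naturally.
e5 = [128, 256]

for gpus in (4, 8):
    t = target_system_ram(gpus)  # 128GB for 4x P100, 256GB for 8x
    print(gpus, t, nearest_config(t, skylake), nearest_config(t, e5))
```

On Skylake you end up buying 192GB to cover a 128GB target, where E5 hits it exactly.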
     
    #1
    William and vv111y like this.
  2. MiniKnight

    MiniKnight Well-Known Member

    Joined:
    Mar 30, 2012
    Messages:
    2,760
    Likes Received:
    780
    popcorn time
     
    #2
  3. Jaket

    Jaket Member

    Joined:
    Jan 4, 2017
    Messages:
    62
    Likes Received:
    10
I would love to see how AMD CPUs work with AI, deep learning, etc. We've been building a lot of Intel systems with 8x 1080 Ti's; however, nothing with AMD as of yet.
     
    #3
  4. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
@Jaket are you doing single root or dual root? Tried 10x 1080 Ti yet?
     
    #4
    Jaket likes this.
  5. Jaket

    Jaket Member

    Joined:
    Jan 4, 2017
    Messages:
    62
    Likes Received:
    10
We haven't tried running 10x 1080 Ti's yet; it's mostly for one of our clients, and they've only requested 8 cards so far. We have mostly used SM for them; however, this is the next system we will be building out.
    G481-HA0 (rev. 100) | High Performance Computing System - GIGABYTE B2B Service

All of the storage options in this system seem like a great fit for their requirements.

Have you found a big advantage using 10 cards over 8? Might be interesting to bring up to our client.
     
    #5
  6. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
The major benefits are that you save 15-20% per GPU on the initial installation, and some on the ongoing costs, since you are amortizing the chassis across more GPUs.
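That 15-20% figure is just chassis amortization arithmetic. A quick sketch with made-up dollar amounts (neither the platform nor the GPU price below is a real quote):

```python
# Back-of-the-envelope amortization behind the per-GPU savings claim.
# All dollar amounts are hypothetical, for illustration only.

def per_gpu_cost(chassis_cost, gpu_cost, num_gpus):
    """Shared platform cost amortized across the GPUs in the chassis."""
    return gpu_cost + chassis_cost / num_gpus

platform, gpu = 20000.0, 700.0          # hypothetical numbers
cost8 = per_gpu_cost(platform, gpu, 8)   # 3200.0 per GPU
cost10 = per_gpu_cost(platform, gpu, 10) # 2700.0 per GPU
savings = 1 - cost10 / cost8             # 0.15625, i.e. ~15.6% per GPU
```

The bigger the fixed platform cost relative to the card cost, the bigger the per-GPU saving from packing in two more cards.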

    I am really interested in the build-out of that Gigabyte server. It is a dual root design so are you planning to use 2x Mellanox cards and avoid the NUMA penalty?
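The "2x Mellanox to avoid the NUMA penalty" idea is just keeping each GPU's network traffic on its own CPU's PCIe root. A minimal sketch, assuming a hypothetical device-to-NUMA-node mapping (on a real box this would come from `nvidia-smi topo -m` or sysfs, and the `mlx5_*` names are placeholders):

```python
# Sketch of why dual root systems want one NIC per root: map each GPU
# to a NIC on the same CPU's PCIe root so traffic avoids the
# inter-socket (NUMA) hop. The mapping below is hypothetical.

gpu_numa = {f"gpu{i}": 0 if i < 4 else 1 for i in range(8)}  # 4 GPUs per root
nic_numa = {"mlx5_0": 0, "mlx5_1": 1}  # one 100Gb NIC per socket

def local_nic(gpu):
    """Pick the NIC sharing a NUMA node with this GPU."""
    node = gpu_numa[gpu]
    return next(nic for nic, n in nic_numa.items() if n == node)

print(local_nic("gpu0"))  # mlx5_0: same root, no QPI/UPI crossing
print(local_nic("gpu7"))  # mlx5_1
```

With only one NIC, the four GPUs on the other socket would pay the inter-socket hop on every transfer.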
     
    #6
  7. cactus

    cactus Moderator

    Joined:
    Jan 25, 2011
    Messages:
    798
    Likes Received:
    67
The block diagram shows you are stuck with a built-in X550-AT2 on CPU1, and only one non-GPU x16 slot off CPU0. The spec page suggests it's designed for dual Omni-Path CPUs.
     
    #7
    Patrick likes this.
  8. Revrnd

    Revrnd New Member

    Joined:
    Jan 2, 2018
    Messages:
    27
    Likes Received:
    1
    If you had a really good use case and some extra cash laying around you could always opt for one of these...

    Nvidia DGX-2

    On a side note, I'd love to see how these would go rendering some really intense scenes like in Ready Player One or some other CGI intense movie.

    But on a serious note, just out of interest, do you guys hire these things out? Or do you use them for data analytics etc?

    Love your work on the other Deep Learning machines though Patrick. Keep up the good work.
     
    #8
  9. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
    @Revrnd we are testing allowing people to hire the big GPU systems
     
    #9
    Revrnd likes this.
  10. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
    DeepLearning12 update 8x NVIDIA SXM2 16GB 800px.jpg
     
    #10
    William, ideabox, Tha_14 and 5 others like this.
  11. Revrnd

    Revrnd New Member

    Joined:
    Jan 2, 2018
    Messages:
    27
    Likes Received:
    1
Mmm, that's nice seeing over 80 TFLOPS there. Would be great to see how something like that goes with some well-rounded benchmarks.

    Keep up the good work Patrick, looking forward to seeing the article when you get this machine up and running.

    Would it be possible to run OctaneBench 3.x on your Deep Learning machines for a comparison by any chance? Would be great for those with an interest in 3D Rendering and render farms.
     
    #11
  12. gigatexal

    gigatexal I'm here to learn

    Joined:
    Nov 25, 2012
    Messages:
    2,472
    Likes Received:
    437
    All this hardware porn I’m feeling guilty subbing this thread ;)
     
    #12
  13. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
Running like crazy this week with travel. Update: we have an 8x SXM2 server confirmed. It is in production and should hopefully ship and arrive here next week.

    GPUs: Check
    CPUs: Check
    RAM: Check
    NVMe SSDs: Check
    Boot SSDs: Check
    Mellanox 100Gb: Check
    Server: Inbound!
     
    #13
    Marsh and Revrnd like this.
  14. nrtc

    nrtc New Member

    Joined:
    Dec 3, 2015
    Messages:
    15
    Likes Received:
    2
Our supplier didn't want to deliver the SYS-4028GR-TRT2 with 10x GPUs, since they said it was not a configuration supported by SM.

    In any case, I'm looking forward to DeepLearning12 and performance of the V100's. What DL frameworks and benchmarks do you intend to run?
     
    #14
  15. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
Likely MLPerf, but we may do our Keras + TF GAN as well.

    Who was the vendor BTW? Feel free to PM.
     
    #15
  16. nrtc

    nrtc New Member

    Joined:
    Dec 3, 2015
    Messages:
    15
    Likes Received:
    2
Does MLPerf support multi-GPU benchmarking? It'd be interesting to see how much NVLink helps when scaling up training. TensorFlow's benchmarks are actually quite straightforward and quick to run, although image-classification centric.
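One simple way to frame "how much NVLink helps" is scaling efficiency: measured multi-GPU throughput divided by ideal linear scaling of the single-GPU number. A sketch with placeholder throughputs (the images/sec figures are not measured results):

```python
# Scaling-efficiency calculation for comparing multi-GPU benchmark runs,
# e.g. NVLink vs PCIe all-reduce. Throughput numbers are placeholders.

def scaling_efficiency(single_gpu_ips, n_gpus, multi_gpu_ips):
    """Fraction of ideal linear scaling actually achieved."""
    return multi_gpu_ips / (single_gpu_ips * n_gpus)

# Hypothetical images/sec for a ResNet-style run on 8 GPUs.
eff = scaling_efficiency(220.0, 8, 1540.0)
print(eff)  # 1540 / 1760 = 0.875, i.e. 87.5% of linear
```

Running the same workload with NCCL over NVLink vs over PCIe and comparing this one number would make the interconnect's contribution easy to see.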

    A vendor in Europe. (pm sent)
     
    #16
    Tha_14 and Patrick like this.
  17. ideabox

    ideabox Member

    Joined:
    Dec 11, 2016
    Messages:
    68
    Likes Received:
    23
I cannot find an SXM2 server barebone ;(

Looking forward to this. Does single/dual root matter with NVLink?
     
    #17
  18. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
More on this Wednesday/Thursday on STH for single root servers.

    SXM2 installation is borderline scary. Servers are only sold with Teslas. You can sometimes get them with 4 of 8 populated.

    Also, an update on the project coming Thursday.
     
    #18
    Fzdog2 and Revrnd like this.
  19. Patrick

    Patrick Administrator
    Staff Member

    Joined:
    Dec 21, 2010
    Messages:
    11,156
    Likes Received:
    4,114
Well, time to get started.

DeepLearning12 Box in Data Center.jpg
     
    #19
    Fzdog2, Tha_14, William and 3 others like this.
  20. MiniKnight

    MiniKnight Well-Known Member

    Joined:
    Mar 30, 2012
    Messages:
    2,760
    Likes Received:
    780
    What system is that?
     
    #20