Supermicro SuperServer 8x A100 80GB SXM


Jelle458

New Member
Oct 4, 2022
Hey,

I am lucky enough to have the privilege of playing with a machine like this, and I would like to load up all the GPUs. My first instinct is to run some kind of rendering in Blender using CUDA.

Maybe the cards just don't want to do this, I don't know, but I have one hell of a headache here.

I have installed Pop!_OS with NVIDIA drivers. This works, and I can use nvidia-smi to see all 8 GPUs on the GPU board. I am running NVIDIA driver 575.57.08 with CUDA toolkit 12.9.

I tried installing Blender and running it as sudo, where I ran into problems with XDG_RUNTIME_DIR. I put all the GPU device nodes in a group called video, and my user (called power) is also in that group.

The PATH and LD_LIBRARY_PATH environment variables are set to the correct values according to, funnily enough, an AI. I didn't use AI for much, and it could be wrong, but I don't think the environment variables are what's stopping me here.

Blender still refuses to see the CUDA devices; I get a CudaInit error: "Unknown CUDA error value".

Running deviceQuery from the toolkit samples gives me "system not yet initialized", and I think this is my problem. It might just be that these A100s don't want to do what I want.
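
[Editor's note] On 8-GPU A100 SXM baseboards the GPUs hang off NVSwitches, and CUDA reports exactly this "system not yet initialized" error until the NVIDIA Fabric Manager service has configured the switch fabric. A sketch of the check and fix, assuming an Ubuntu-based distro like Pop!_OS; the package version is an assumption and must match the installed driver branch:

```shell
# CUDA returns "system not yet initialized" on NVSwitch systems until the
# fabric manager is running. Package name/version are assumptions: pick the
# one matching your driver branch (here 575).
sudo apt install nvidia-fabricmanager-575
sudo systemctl enable --now nvidia-fabricmanager

# Verify: the service should report active, and on NVSwitch systems
# nvidia-smi -q should show a Fabric section with State: Completed.
systemctl is-active nvidia-fabricmanager
nvidia-smi -q | grep -i -A2 fabric
```

Plain PCIe GPU boxes have no NVSwitch and don't need the service, which is why this never comes up with ordinary multi-GPU machines.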

The end goal was basically just to evaluate the cooling needed for a production environment, where this OS does not matter, I just want to load up the GPUs.

Maybe I need some kind of AI model to be loaded up on them. Can I do that in pop OS? Is there an easy and quick way with limited setup for me to just load up the GPUs?
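
[Editor's note] For pure thermal load with no rendering or model stack at all, one option is the community gpu-burn tool, which runs sustained CUDA matrix multiplies on every visible GPU. A sketch, assuming nvcc from the CUDA toolkit is on PATH:

```shell
# gpu-burn stresses all visible CUDA devices with large matrix multiplies.
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make
./gpu_burn 3600   # run for 3600 seconds; watch temps/power with nvidia-smi
```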

This is the first time I have had an SXM board in hand. I have multiple PCIe GPUs, but I have never load-tested them.

I was considering a Windows VM per GPU running FurMark, but the GPUs probably don't want to do that either.

What would be the coolest is to have an AI model make a picture for me, so I can hang it up at the workplace, but I don't know the setup needed, or whether Pop!_OS can run any AI model - I have no experience with this.

I have some Linux experience, but it is limited, so I might be a lost cause that needs to be spoon-fed the solution.

Any clues here? I would be very grateful!
 

TrashMaster

New Member
Sep 8, 2024
Grab the NVIDIA CUDA toolkit and run the bandwidth test samples. Poke ChatGPT for help compiling or running them.
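
[Editor's note] Current toolkit packages no longer ship prebuilt sample binaries; a sketch of building the P2P test from NVIDIA's separate samples repo (CMake layout of recent releases; the per-sample target name and build output path are assumptions):

```shell
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
mkdir build && cd build
cmake ..
# Recent releases expose one CMake target per sample; older tags instead
# carry per-sample Makefiles. Plain `make` builds everything (slow).
make -j"$(nproc)" p2pBandwidthLatencyTest
# Binary lands under the mirrored Samples/<category>/<name>/ tree:
./Samples/*/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest
```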

# /usr/share/doc/nvidia-cuda-toolkit/examples/bin/x86_64/linux/release/p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 3090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 3090, pciBusID: 21, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 3090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 3090, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CANNOT Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CANNOT Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CANNOT Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
D\D 0 1 2 3
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 831.56 11.38 11.40 11.48
1 11.43 832.89 11.50 11.36
2 11.50 11.38 833.33 11.39
3 11.46 11.45 11.41 835.11
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3
0 834.22 11.57 11.47 11.43
1 11.39 834.22 11.39 11.46
2 11.44 11.47 833.78 11.40
3 11.41 11.45 11.34 835.11
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 839.38 16.98 16.98 17.02
1 16.89 838.93 16.92 16.87
2 16.93 16.83 838.70 16.91
3 16.80 16.97 16.97 840.05
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 839.81 16.87 16.96 16.89
1 16.97 838.93 16.90 16.98
2 16.90 16.84 839.83 16.88
3 16.95 17.01 16.89 840.01
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3
0 1.54 15.22 15.26 15.08
1 15.98 1.51 16.13 14.28
2 15.94 16.59 1.51 12.83
3 15.34 14.79 15.59 1.55

CPU 0 1 2 3
0 2.26 6.86 6.35 6.31
1 6.60 2.24 6.21 6.09
2 6.32 6.27 2.16 6.12
3 6.32 6.33 6.15 2.12
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3
0 1.53 14.43 11.17 10.78
1 17.22 1.52 13.18 10.53
2 16.31 15.55 1.52 16.49
3 16.39 15.88 14.96 1.54

CPU 0 1 2 3
0 2.16 6.57 6.27 6.13
1 6.55 2.23 6.15 6.15
2 6.30 6.14 2.11 6.16
3 6.37 6.15 6.26 2.11

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE NODE SYS 0-31,64-95 0 N/A
GPU1 NODE X NODE NODE SYS 0-31,64-95 0 N/A
GPU2 NODE NODE X NODE SYS 0-31,64-95 0 N/A
GPU3 NODE NODE NODE X SYS 0-31,64-95 0 N/A
NIC0 SYS SYS SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
 

Jelle458

New Member
Oct 4, 2022
Thank you very much for the reply!

I compiled and ran p2pBandwidthLatencyTest, but I still get the error "system not yet initialized". It makes sense that Blender doesn't see any CUDA devices in Cycles if even the CUDA samples can't see them.

I used ChatGPT a bit to help here; it believes my driver isn't built for the kernel I am running.

However, I checked with:
modinfo nvidia | grep vermagic
uname -r

Both come back with 6.12.10.
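
[Editor's note] Besides vermagic, the other mismatch worth ruling out is the user-space driver libraries versus the loaded kernel module - a partial upgrade (the "partial driver installation" theory) leaves them out of sync. A quick check:

```shell
# Version of the loaded kernel module
cat /proc/driver/nvidia/version
# Version of the user-space stack as reported through NVML
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# These should agree exactly (e.g. both 575.57.08); a mismatch
# means the user-space and kernel pieces came from different installs.
```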

Now the AI wants me to do a full re-install because it believes the problem is a partial driver installation, or a broken runtime.

I believe Pop!_OS came with a 525-series NVIDIA driver. When I updated the driver I also updated CUDA, and although the versions reported by nvidia-smi "work together", it still doesn't want to initialize CUDA.

Should I do a full re-install, and not upgrade any drivers?

EDIT:
ChatGPT is leading me down a rabbit hole, and I feel like I am getting nowhere. Checking the logs, I get some errors during startup from the NVIDIA DRM module. ChatGPT tells me DRM is used for displays, and that I get the errors because the GPUs are headless.
Errors are:
[drm] No compatible format found
[drm] Cannot find any crtc or sizes

They are spammed a few times during startup. Blacklisting the DRM module didn't help; I still get the errors. I think I am done with this project now and must admit defeat. I don't like how heavy Linux is to work with, and I'm getting really tired after days of trying to get this to work :(
 
Last edited:

Jelle458

New Member
Oct 4, 2022
If anyone finds this, here's an update.

I scrapped the idea of Linux. I don't know why I wanted Linux to begin with, but apparently there is an NVIDIA data center driver for the A100 that runs on Windows Server. I installed Windows Server 2022, updated it, and installed the drivers.

GPU-Z was able to see all 8 GPUs, and CUDA just WORKS. Blender needs OpenGL, which the basic display adapter doesn't provide, so it doesn't want to start. Installing Stable Diffusion wasn't a big problem, however the models I was able to find are old and can't really understand human anatomy. I am getting the early-days problems with hands and fingers.

I was sure that 2048x2048 wouldn't be a problem, but I get really weird hallucinations when going above 768x768, and one time I got a VRAM error, so I guess even 80 GB of VRAM isn't enough; I didn't think it would be that crazy. I was running FP32 because, well, 80 GB had to be enough.
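
[Editor's note] Two separate effects are likely at play here: SD 1.x models are trained at 512x512, so sampling much above ~768 produces the classic duplicated-body hallucinations regardless of VRAM; and naive FP32 self-attention in the UNet scales with the square of the latent token count, which is where the memory goes. A rough sketch of that scaling (the head count and the assumption that the full attention matrix is materialized are illustrative, not measured on this setup):

```python
# Back-of-the-envelope: why naive FP32 Stable Diffusion 1.x attention blows up
# at 2048x2048. Assumptions (not from the thread): the VAE downsamples 8x, and
# the UNet's self-attention materializes a full n_tokens x n_tokens matrix.

def attention_matrix_gib(image_px: int, vae_downscale: int = 8,
                         bytes_per_el: int = 4, heads: int = 8) -> float:
    """FP32 self-attention scratch memory for one layer, in GiB."""
    latent = image_px // vae_downscale   # latent spatial size after the VAE
    tokens = latent * latent             # attention sequence length
    return tokens * tokens * bytes_per_el * heads / 2**30

for px in (512, 768, 2048):
    print(px, round(attention_matrix_gib(px), 1))
# 512  -> 0.5 GiB, 768 -> 2.5 GiB, 2048 -> 128.0 GiB: past 80 GB from
# a single layer, before weights or activations. Real implementations use
# chunked/flash attention, but the quadratic trend is why resolution bites.
```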

Stable Diffusion apparently can't run across more than one GPU, but by starting 8 instances manually and having each one generate 100 pictures, I got all 8 GPUs under load at the same time. They got hot, and the server was at 110 dB.
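
[Editor's note] The eight-instances trick can be scripted rather than started by hand: one process per GPU, each pinned with CUDA_VISIBLE_DEVICES so whatever framework runs inside only ever sees "its" device. A stdlib sketch (the generate.py script and its flags are hypothetical):

```python
# Launch one generator process per GPU, pinned via CUDA_VISIBLE_DEVICES.
# generate.py and its --n-images flag are placeholders for whatever
# Stable Diffusion front end is actually installed.
import os
import subprocess

def launch_per_gpu(script: str, n_gpus: int = 8, dry_run: bool = True):
    procs, cmds = [], []
    for gpu in range(n_gpus):
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)  # pin this instance to one GPU
        cmd = ["python", script, "--n-images", "100"]
        cmds.append((cmd, env["CUDA_VISIBLE_DEVICES"]))
        if not dry_run:                         # only spawn on the real box
            procs.append(subprocess.Popen(cmd, env=env))
    return cmds, procs

cmds, _ = launch_per_gpu("generate.py")
print(len(cmds), cmds[3][1])   # -> 8 3
```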