NCCL testing on Supermicro SYS-4028GR-TR


dhenzjhen

Member
Sep 14, 2016
San Jose, California
GitHub - NVIDIA/nccl: Optimized primitives for collective multi-GPU communication
Fast Multi-GPU collectives with NCCL


System Configuration:
Motherboard: X10DRG-O / Product Name: SYS-4028GR-TR
BIOS: 7/27/2016
IPMI: 3.44
CPU: E5-2689 v4 3.1 GHz x 2
Memory: Samsung 16GB x 24
GPU: Tesla M40 x 8
OS: Red Hat Enterprise Linux 7.2 x64
CUDA: 7.5
Software used: NCCL


Testing results:
[root@localhost nccl]# ./build/test/single/broadcast_test 10000000 8
# Using devices
# Rank 0 uses device 0 [0x04] Tesla M40
# Rank 1 uses device 1 [0x06] Tesla M40
# Rank 2 uses device 2 [0x07] Tesla M40
# Rank 3 uses device 3 [0x08] Tesla M40
# Rank 4 uses device 4 [0x0c] Tesla M40
# Rank 5 uses device 5 [0x0d] Tesla M40
# Rank 6 uses device 6 [0x0e] Tesla M40
# Rank 7 uses device 7 [0x0f] Tesla M40

# bytes N type root time algbw busbw delta
10000000 10000000 char 0 1.068 9.37 9.37 0e+00
10000000 10000000 char 1 1.134 8.82 8.82 0e+00
10000000 10000000 char 2 1.143 8.75 8.75 0e+00
10000000 10000000 char 3 1.145 8.73 8.73 0e+00
10000000 10000000 char 4 1.099 9.10 9.10 0e+00
10000000 10000000 char 5 1.150 8.70 8.70 0e+00
10000000 10000000 char 6 1.158 8.63 8.63 0e+00
10000000 10000000 char 7 1.162 8.61 8.61 0e+00
10000000 2500000 int 0 1.073 9.32 9.32 0e+00
10000000 2500000 int 1 1.138 8.79 8.79 0e+00
10000000 2500000 int 2 1.147 8.72 8.72 0e+00
10000000 2500000 int 3 1.148 8.71 8.71 0e+00
10000000 2500000 int 4 1.106 9.04 9.04 0e+00
10000000 2500000 int 5 1.160 8.62 8.62 0e+00
10000000 2500000 int 6 1.161 8.62 8.62 0e+00
10000000 2500000 int 7 1.156 8.65 8.65 0e+00
10000000 5000000 half 0 1.073 9.32 9.32 0e+00
10000000 5000000 half 1 1.136 8.80 8.80 0e+00
10000000 5000000 half 2 1.143 8.75 8.75 0e+00
10000000 5000000 half 3 1.147 8.72 8.72 0e+00
10000000 5000000 half 4 1.100 9.09 9.09 0e+00
10000000 5000000 half 5 1.153 8.68 8.68 0e+00
10000000 5000000 half 6 1.160 8.62 8.62 0e+00
10000000 5000000 half 7 1.157 8.64 8.64 0e+00
10000000 2500000 float 0 1.071 9.34 9.34 0e+00
10000000 2500000 float 1 1.137 8.80 8.80 0e+00
10000000 2500000 float 2 1.142 8.76 8.76 0e+00
10000000 2500000 float 3 1.145 8.73 8.73 0e+00
10000000 2500000 float 4 1.104 9.06 9.06 0e+00
10000000 2500000 float 5 1.154 8.67 8.67 0e+00
10000000 2500000 float 6 1.159 8.63 8.63 0e+00
10000000 2500000 float 7 1.154 8.66 8.66 0e+00
10000000 1250000 double 0 1.071 9.34 9.34 0e+00
10000000 1250000 double 1 1.137 8.79 8.79 0e+00
10000000 1250000 double 2 1.146 8.73 8.73 0e+00
10000000 1250000 double 3 1.148 8.71 8.71 0e+00
10000000 1250000 double 4 1.104 9.06 9.06 0e+00
10000000 1250000 double 5 1.153 8.68 8.68 0e+00
10000000 1250000 double 6 1.158 8.64 8.64 0e+00
10000000 1250000 double 7 1.154 8.66 8.66 0e+00
10000000 1250000 int64 0 1.073 9.32 9.32 0e+00
10000000 1250000 int64 1 1.140 8.77 8.77 0e+00
10000000 1250000 int64 2 1.143 8.75 8.75 0e+00
10000000 1250000 int64 3 1.149 8.71 8.71 0e+00
10000000 1250000 int64 4 1.106 9.04 9.04 0e+00
10000000 1250000 int64 5 1.152 8.68 8.68 0e+00
10000000 1250000 int64 6 1.162 8.60 8.60 0e+00
10000000 1250000 int64 7 1.159 8.63 8.63 0e+00
10000000 1250000 uint64 0 1.072 9.33 9.33 0e+00
10000000 1250000 uint64 1 1.142 8.75 8.75 0e+00
10000000 1250000 uint64 2 1.147 8.72 8.72 0e+00
10000000 1250000 uint64 3 1.149 8.70 8.70 0e+00
10000000 1250000 uint64 4 1.108 9.03 9.03 0e+00
10000000 1250000 uint64 5 1.155 8.66 8.66 0e+00
10000000 1250000 uint64 6 1.163 8.60 8.60 0e+00
10000000 1250000 uint64 7 1.157 8.65 8.65 0e+00

[root@localhost nccl]# ./build/test/single/all_gather_test 10000000 8
# Using devices
# Rank 0 uses device 0 [0x04] Tesla M40
# Rank 1 uses device 1 [0x06] Tesla M40
# Rank 2 uses device 2 [0x07] Tesla M40
# Rank 3 uses device 3 [0x08] Tesla M40
# Rank 4 uses device 4 [0x0c] Tesla M40
# Rank 5 uses device 5 [0x0d] Tesla M40
# Rank 6 uses device 6 [0x0e] Tesla M40
# Rank 7 uses device 7 [0x0f] Tesla M40

# bytes N type time algbw busbw delta
10000000 10000000 char 7.827 8.94 8.94 0e+00
10000000 2500000 int 7.818 8.95 8.95 0e+00
10000000 5000000 half 7.816 8.96 8.96 0e+00
10000000 2500000 float 7.807 8.97 8.97 0e+00
10000000 1250000 double 7.828 8.94 8.94 0e+00
10000000 1250000 int64 7.826 8.94 8.94 0e+00
10000000 1250000 uint64 7.815 8.96 8.96 0e+00

[root@localhost nccl]# ./build/test/single/all_reduce_test 10000000 8
# Using devices
# Rank 0 uses device 0 [0x04] Tesla M40
# Rank 1 uses device 1 [0x06] Tesla M40
# Rank 2 uses device 2 [0x07] Tesla M40
# Rank 3 uses device 3 [0x08] Tesla M40
# Rank 4 uses device 4 [0x0c] Tesla M40
# Rank 5 uses device 5 [0x0d] Tesla M40
# Rank 6 uses device 6 [0x0e] Tesla M40
# Rank 7 uses device 7 [0x0f] Tesla M40

# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
10000000 10000000 char sum 1.973 5.07 8.87 0e+00 2.059 4.86 8.50 0e+00
10000000 10000000 char prod 1.963 5.09 8.91 0e+00 1.993 5.02 8.78 0e+00
10000000 10000000 char max 1.970 5.08 8.89 0e+00 1.986 5.03 8.81 0e+00
10000000 10000000 char min 1.964 5.09 8.91 0e+00 1.991 5.02 8.79 0e+00
10000000 2500000 int sum 1.967 5.08 8.90 0e+00 1.989 5.03 8.80 0e+00
10000000 2500000 int prod 1.968 5.08 8.89 0e+00 1.985 5.04 8.81 0e+00
10000000 2500000 int max 1.963 5.09 8.92 0e+00 1.987 5.03 8.81 0e+00
10000000 2500000 int min 1.966 5.09 8.90 0e+00 1.987 5.03 8.81 0e+00
10000000 5000000 half sum 2.202 4.54 7.95 1e-02 2.229 4.49 7.85 1e-02
10000000 5000000 half prod 2.203 4.54 7.94 7e-04 2.235 4.47 7.83 7e-04
10000000 5000000 half max 1.956 5.11 8.94 0e+00 1.985 5.04 8.82 0e+00
10000000 5000000 half min 1.961 5.10 8.93 0e+00 1.976 5.06 8.86 0e+00
10000000 2500000 float sum 1.950 5.13 8.98 1e-06 1.970 5.08 8.88 1e-06
10000000 2500000 float prod 1.952 5.12 8.96 9e-08 1.989 5.03 8.80 9e-08
10000000 2500000 float max 1.969 5.08 8.89 0e+00 1.991 5.02 8.79 0e+00
10000000 2500000 float min 1.964 5.09 8.91 0e+00 1.992 5.02 8.79 0e+00
10000000 1250000 double sum 1.971 5.07 8.88 0e+00 2.002 4.99 8.74 0e+00
10000000 1250000 double prod 1.975 5.06 8.86 1e-16 2.000 5.00 8.75 1e-16
10000000 1250000 double max 1.967 5.08 8.90 0e+00 1.991 5.02 8.79 0e+00
10000000 1250000 double min 1.964 5.09 8.91 0e+00 1.992 5.02 8.79 0e+00
10000000 1250000 int64 sum 1.972 5.07 8.87 0e+00 1.995 5.01 8.77 0e+00
10000000 1250000 int64 prod 1.963 5.09 8.91 0e+00 1.990 5.03 8.79 0e+00
10000000 1250000 int64 max 1.962 5.10 8.92 0e+00 1.981 5.05 8.83 0e+00
10000000 1250000 int64 min 1.969 5.08 8.89 0e+00 1.986 5.04 8.81 0e+00
10000000 1250000 uint64 sum 1.962 5.10 8.92 0e+00 1.978 5.06 8.85 0e+00
10000000 1250000 uint64 prod 1.969 5.08 8.89 0e+00 1.995 5.01 8.77 0e+00
10000000 1250000 uint64 max 1.965 5.09 8.90 0e+00 1.985 5.04 8.81 0e+00
10000000 1250000 uint64 min 1.965 5.09 8.91 0e+00 1.988 5.03 8.80 0e+00
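
A note on the all_reduce numbers above: busbw comes out at roughly 1.75x algbw in every row. That ratio is what a ring all-reduce on 8 GPUs would give, since each rank sends 2*(n-1) chunks of size S/n. The sketch below is a pure-Python simulation written for this thread under that ring assumption; it is not NCCL code, and `ring_all_reduce` is a made-up name.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring of len(buffers) ranks.

    Each buffer must hold exactly one value per rank (one "chunk" each).
    Returns (final buffers, chunks sent by each rank).
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("each buffer needs exactly one chunk per rank")
    data = [list(b) for b in buffers]  # work on copies
    chunks_sent_per_rank = 0

    # Phase 1 (reduce-scatter): after n-1 steps, rank r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing values so all sends in a step happen "at once".
        outgoing = [(r, (r - step) % n, data[r][(r - step) % n])
                    for r in range(n)]
        for r, chunk, value in outgoing:
            data[(r + 1) % n][chunk] += value
        chunks_sent_per_rank += 1

    # Phase 2 (all-gather): pass the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                    for r in range(n)]
        for r, chunk, value in outgoing:
            data[(r + 1) % n][chunk] = value
        chunks_sent_per_rank += 1

    return data, chunks_sent_per_rank


n = 8
buffers = [[rank * 10 + c for c in range(n)] for rank in range(n)]
result, sent = ring_all_reduce(buffers)
expected = [sum(b[c] for b in buffers) for c in range(n)]
print(all(row == expected for row in result))  # every rank has the full sum
print(sent / n)  # fraction of the buffer each rank sends: 14/8 = 1.75
```

So with 8 ranks each GPU pushes 1.75x the buffer size over its link, which lines up with rows like 5.07 algbw vs 8.87 busbw.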

[root@localhost nccl]# ./build/test/single/reduce_scatter_test 10000000 8
# Using devices
# Rank 0 uses device 0 [0x04] Tesla M40
# Rank 1 uses device 1 [0x06] Tesla M40
# Rank 2 uses device 2 [0x07] Tesla M40
# Rank 3 uses device 3 [0x08] Tesla M40
# Rank 4 uses device 4 [0x0c] Tesla M40
# Rank 5 uses device 5 [0x0d] Tesla M40
# Rank 6 uses device 6 [0x0e] Tesla M40
# Rank 7 uses device 7 [0x0f] Tesla M40

# out-of-place in-place
# bytes N type op time algbw busbw delta time algbw busbw delta
10000000 10000000 char sum 8.289 1.21 8.44 0e+00 8.286 1.21 8.45 0e+00
10000000 10000000 char prod 8.336 1.20 8.40 0e+00 8.330 1.20 8.40 0e+00
10000000 10000000 char max 8.244 1.21 8.49 0e+00 8.260 1.21 8.47 0e+00
10000000 10000000 char min 8.233 1.21 8.50 0e+00 8.240 1.21 8.49 0e+00
10000000 2500000 int sum 8.175 1.22 8.56 0e+00 8.196 1.22 8.54 0e+00
10000000 2500000 int prod 8.173 1.22 8.56 0e+00 8.193 1.22 8.54 0e+00
10000000 2500000 int max 8.198 1.22 8.54 0e+00 8.216 1.22 8.52 0e+00
10000000 2500000 int min 8.230 1.22 8.51 0e+00 8.259 1.21 8.48 0e+00
10000000 5000000 half sum 8.539 1.17 8.20 2e-02 8.525 1.17 8.21 2e-02
10000000 5000000 half prod 8.489 1.18 8.25 1e-03 8.539 1.17 8.20 1e-03
10000000 5000000 half max 9.832 1.02 7.12 0e+00 10.085 0.99 6.94 0e+00
10000000 5000000 half min 9.893 1.01 7.08 0e+00 9.906 1.01 7.07 0e+00
10000000 2500000 float sum 8.238 1.21 8.50 1e-06 8.243 1.21 8.49 1e-06
10000000 2500000 float prod 8.208 1.22 8.53 9e-08 8.231 1.21 8.50 9e-08
10000000 2500000 float max 8.228 1.22 8.51 0e+00 8.257 1.21 8.48 0e+00
10000000 2500000 float min 8.222 1.22 8.51 0e+00 8.244 1.21 8.49 0e+00
10000000 1250000 double sum 8.193 1.22 8.54 0e+00 8.212 1.22 8.52 0e+00
10000000 1250000 double prod 8.196 1.22 8.54 3e-16 8.222 1.22 8.51 3e-16
10000000 1250000 double max 8.273 1.21 8.46 0e+00 8.292 1.21 8.44 0e+00
10000000 1250000 double min 8.260 1.21 8.47 0e+00 8.283 1.21 8.45 0e+00
10000000 1250000 int64 sum 8.287 1.21 8.45 0e+00 8.302 1.20 8.43 0e+00
10000000 1250000 int64 prod 8.330 1.20 8.40 0e+00 8.346 1.20 8.39 0e+00
10000000 1250000 int64 max 8.287 1.21 8.45 0e+00 8.295 1.21 8.44 0e+00
10000000 1250000 int64 min 8.268 1.21 8.47 0e+00 8.294 1.21 8.44 0e+00
10000000 1250000 uint64 sum 8.281 1.21 8.45 0e+00 8.286 1.21 8.45 0e+00
10000000 1250000 uint64 prod 8.322 1.20 8.41 0e+00 8.338 1.20 8.40 0e+00
10000000 1250000 uint64 max 8.295 1.21 8.44 0e+00 8.330 1.20 8.40 0e+00
10000000 1250000 uint64 min 8.271 1.21 8.46 0e+00 8.282 1.21 8.45 0e+00
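
For reading the tables above: the algbw and busbw columns can be recomputed from the bytes and time columns. The sketch below does that in Python. Two things are assumptions on my part, inferred from the numbers in this post rather than from the NCCL test source: that time is in milliseconds and bandwidth in GB/s, and the per-collective scaling factors. `bandwidths` is a hypothetical helper name.

```python
def bandwidths(collective, nbytes, time_ms, nranks):
    """Return (algbw, busbw) in GB/s for one result row.

    Factors are inferred from the logged results in this thread.
    """
    base = nbytes / (time_ms * 1e-3) / 1e9  # bytes/s converted to GB/s
    if collective == "broadcast":
        return base, base
    if collective == "all_gather":
        bw = base * (nranks - 1)  # both columns match in the log
        return bw, bw
    if collective == "all_reduce":
        return base, base * 2 * (nranks - 1) / nranks
    if collective == "reduce_scatter":
        return base, base * (nranks - 1)
    raise ValueError(f"unknown collective: {collective}")


# First row of each table above (out-of-place timings), 8 GPUs.
rows = [
    ("broadcast", 10000000, 1.068),
    ("all_gather", 10000000, 7.827),
    ("all_reduce", 10000000, 1.973),
    ("reduce_scatter", 10000000, 8.289),
]
for name, nbytes, time_ms in rows:
    algbw, busbw = bandwidths(name, nbytes, time_ms, nranks=8)
    print(f"{name:15s} algbw={algbw:.2f} busbw={busbw:.2f}")
```

The recomputed values land within rounding of the logged columns. The useful takeaway is that busbw sits around 8.4 to 9.4 GB/s for all four collectives, with the half-precision rows somewhat lower in places.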
 

William

Well-Known Member
May 7, 2015
I think I have run this server before :)
You have better GPUs, though :)

So what does all this mean, in simple terms? :)
 

MiniKnight

Well-Known Member
Mar 30, 2012
NYC
@dhenzjhen I had never heard of NCCL, but it looks really interesting for those of us playing with the GRID M40s if it gives us a dirt cheap way to better utilize those CPUs. Is there a machine learning application?
 

dhenzjhen

William said:
I think I have run this server before :)
You have better GPUs, though :)

So what does all this mean, in simple terms? :)

Hey there bossman, it's too complicated to explain, which is why I just put the NCCL link at the very top of the OP.
Yeah, it's a very noisy machine. Hate it :D


MiniKnight said:
@dhenzjhen I had never heard of NCCL, but it looks really interesting for those of us playing with the GRID M40s if it gives us a dirt cheap way to better utilize those CPUs. Is there a machine learning application?

Hey MiniKnight, sorry, I haven't come across any machine learning application yet. I just ran the benchmark as a special request for a customer because they want to see some data.