8x 4090 motherboard recommendations


rtech

Active Member
8 × 16 is 128 lanes, and no board actually exposes that many PCIe lanes unless it's helped by PCIe switches, so do you need the full x16 per card?
Practically speaking you have two options: PCIe switches, or running each card at x8, which you can accomplish with retimers / splitters.
Another option would be waiting for Turin; it is supposed to have a ~50% bump in available PCIe lanes.

RTX 4090 gaming GPUs are pretty big cards. I do not think you can even fit one into a 4U case, so you will have to use some other solution; perhaps you could get one of the mining rig frames for this.
 

bayleyw

Active Member
Use case? A lot depends on whether you need x16 to each GPU. Training does, and furthermore benefits from a topology where all 8 GPUs share an x16 uplink to the CPUs via two levels of PCIe switches. That's an expensive setup - you need to put down $3K apiece for the special Chinese dual-slot 4090s, plus $15K for a PCIe 4.0 host with the right topology. But if you can fit your workload in 192GB, the whole rig is about the same price as a single H200 and about 4x as fast. You might be able to get it to work with the right risers and Rome or Genoa, but I am not sure what the PCIe root complexes look like on Rome. geohotz claims he'll sell you a large computer with 6x 4090 on a Rome-based board, but I have no idea what the allreduce performance looks like on it.

If you don't need x16 4.0 to each GPU, you can pick up something like a SYS-4028 ($600 used) and dangle the GPUs off risers. But unless you are hosting eight individual application instances, performance may be pathological - a SYS-4028 with the default X9DRG-O-PCIE mezzanines runs four GPUs off each socket with two GPUs behind each switch, so inter-GPU communication is mediocre.

If you are hosting eight individual instances, you are better off building two 4-GPU nodes - it should be possible to fit a 4090 in a 4U server with a little effort, and 4x 4090 in a 4U node comes out to about 500W/U, which is decently dense.

Finally, if you can't saturate the 4090's FLOPs anyway (you're bandwidth-bound), you can save a bunch of money by buying 3090s, which have roughly the same memory bandwidth. You lose out on fp8 support, but by the time fp8 support really picks up Blackwell will probably be out anyway, making it a moot point.
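Rough arithmetic behind that call, using approximate public specs (ballpark numbers, not vendor-verified):

```
# Back-of-envelope compute/bandwidth ratio, approximate public specs.
specs = {
    "RTX 3090": {"fp16_tflops": 71,  "mem_gbps": 936},
    "RTX 4090": {"fp16_tflops": 165, "mem_gbps": 1008},
}
for name, s in specs.items():
    flops_per_byte = s["fp16_tflops"] * 1e12 / (s["mem_gbps"] * 1e9)
    print(f"{name}: needs ~{flops_per_byte:.0f} FLOPs per byte of memory traffic to stay compute-bound")
# If your kernels' arithmetic intensity sits well below ~76 FLOPs/byte, both
# cards are limited by the same ~1 TB/s of GDDR6X and the 3090 is the cheaper buy.
```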
 

adaboost

New Member
It is indeed for training models - BF16/FP16 mixed precision, parameters that fit in VRAM (< 1B), with PCIe comms of O(GB) of gradients syncing to the "main GPU" (randomly chosen when the job starts) every ~5 seconds (with gradient accumulation). The 4090s are fortunately saturated at a batch size of 4 (anything below that and they are bottlenecked by gradient syncs - i.e. average power sits at 200W rather than 350W).
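For context, this is roughly what that accumulate-then-sync pattern looks like in PyTorch DDP - just a sketch, with the model, loader, and hyperparameters as placeholders, and note that DDP all-reduces rather than pushing to a single "main GPU":

```
import contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(build_model().to(rank), device_ids=[rank])   # build_model() is a placeholder
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                                          # micro-batches per optimizer step

for step, (x, y) in enumerate(loader):                   # loader is a placeholder DataLoader
    x, y = x.to(rank), y.to(rank)
    syncing = (step + 1) % accum_steps == 0
    # no_sync() suppresses the gradient all-reduce on accumulation steps,
    # so gradients only cross PCIe once per accumulation window.
    ctx = contextlib.nullcontext() if syncing else model.no_sync()
    with ctx:
        loss = model(x, y) / accum_steps                 # assumes the model returns a loss
        loss.backward()
    if syncing:
        opt.step()
        opt.zero_grad(set_to_none=True)
```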

I'm leaning towards a bitcoin-mining-style rig with a WRX80E-SAGE (7 PCIe 4.0 slots @ x8) and a Threadripper Pro 3975WX/5955WX with risers. I also have a SYS-4028, but single-core performance seems to be the bottleneck there (the Threadripper/SAGE setup is about ~15% faster at training, which matches the peak GHz difference between the processors). It's just that running 8 x16 risers (with one slot bifurcated) is so messy - sounds like there aren't any newer boards with SlimSAS PCIe 4.0 x8 support :confused:
 

BlueFox

Legendary Member Spam Hunter Extraordinaire
New systems are going to be MCIO, and there are plenty around with enough connectors to do that. I've seen EPYC Genoa boards with 20x MCIO and 4th/5th-gen Xeon Scalable boards with 22x MCIO, each connector good for PCIe 5.0 x8. That should get you what you need. Non-proprietary boards have fewer connectors, but still enough for 8 GPUs.
 

serverhardware

New Member
I have a server on the ROME2D32GM-2T. Yes, it will support 8 video cards. I have almost finished mine, which will have 4x 3090 (currently 3). With 8 cards I would run into power supply issues here, and it is just too costly.
I think you can find resellers for this mobo where you live; if not, contact ASRock Rack sales and they will help you find one or ship it directly.

I use two Thermaltake CTE C750 cases: one has only the mobo and SFF-8654 adapter boards for NVMe disks (4 boards, 8 disks). The other case has only the 3 RTX 3090s (I believe it could hold up to 9 3090 cards with some creative mounting) and a second power supply unit. If you buy the CTE C750 Air and the CTE C750 glass version, you can join the two cases perfectly with no gaps: just remove the front cover of the glass case and the rear cover of the 'Air' one.
And if you want to use this case for this motherboard, you will have to drill new mounting holes; the mobo is very big and has a 'proprietary' form factor.
 

bayleyw

Active Member
The loss of performance on the SYS-4028 is probably because the host is limited to PCIe 3.0 speeds, so your gradient reductions are slower. Have you measured scaling going from 4 to 8 cards on your existing hardware? I somewhat suspect that, due to the improper topology, you will see less-than-stellar scaling, but I may be wrong if your model architecture has a high FLOPs-to-weights ratio.
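One quick way to check is a bare all-reduce benchmark before touching the training code - a sketch (launch with torchrun at 4 and then 8 processes; the buffer size is arbitrary):

```
# Rough all-reduce bandwidth probe; launch with e.g.
#   torchrun --nproc_per_node=4 allreduce_bench.py
# and again with --nproc_per_node=8 to compare 4- vs 8-GPU scaling.
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

n = 256 * 1024 * 1024                           # 256M fp16 elements ~= 512 MB, roughly gradient-sized
buf = torch.ones(n, dtype=torch.float16, device="cuda")

for _ in range(5):                              # warmup
    dist.all_reduce(buf)
torch.cuda.synchronize()

iters = 20
t0 = time.time()
for _ in range(iters):
    dist.all_reduce(buf)
torch.cuda.synchronize()
dt = (time.time() - t0) / iters

# a ring all-reduce moves ~2*(world-1)/world of the buffer per rank
world = dist.get_world_size()
gb = buf.numel() * buf.element_size() * 2 * (world - 1) / world / 1e9
if rank == 0:
    print(f"{world} GPUs: {dt*1e3:.1f} ms/all-reduce, ~{gb/dt:.1f} GB/s bus bandwidth")
```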

If I'm reading between the lines correctly, you are doing data-parallel training with a per-GPU batch size of 4, and each batch (32 samples across 8 cards) takes 5 seconds? How big are the samples? Are they sequences or images?

Also, if this is for work, I *highly recommend* getting a proper 4U server and dual-slot 4090s. You really don't want to be the guy responsible for a $15K contraption that inevitably stops working halfway through a training run. NCCL does not do error checking and does not like flaky hardware; your training job will just stall and leave you very confused and angry.
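If you do go the franken-rig route, it's worth at least making NCCL fail loudly instead of hanging; a sketch of the knobs I'd reach for (env var names shift a bit between PyTorch versions, so double-check against your install):

```
# Sketch: turn silent NCCL stalls into visible errors.
import os
from datetime import timedelta
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")                    # verbose NCCL logging
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")  # surface async NCCL failures
                                                               # (older PyTorch: NCCL_ASYNC_ERROR_HANDLING)

# The default NCCL collective timeout is 30 minutes; a shorter one makes a
# wedged collective raise instead of stalling the whole job for half an hour.
dist.init_process_group("nccl", timeout=timedelta(minutes=10))
```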
 

adaboost

New Member
Good guess! Yes - sequences, 4k tokens max (T5/GPT2 models). Batch size is somewhere between 32 and 512 depending on gradient accumulation settings. It's the non-flash-attention T5 implementation from Hugging Face.

Tell me more about this NCCL error checking business. You've also predicted correctly that I am frustrated with my models underfitting, so I'm increasing parameters and supplementing with data augmentation to combat the inevitable overfitting. Would NCCL errors manifest as bad gradient updates (and consequently poor model performance), or is this just a hardware error where the training job gets stuck?

As an update, I ended up doing what you suggested and got multinode working (2x 6U boxes with the GPUs split between them) over 10GbE, and I lose about 10% performance vs. having them all run off the same motherboard. I'm going to try a 100GbE connection to see whether this is an inter-node bandwidth issue or just the inherent cost of multinode - though it could be anything, as the boxes are not homogeneous. Still, this seems like an easier problem to solve than fighting motherboard topologies.
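For anyone trying the same thing, the usual knobs are pinning NCCL to the fast NIC and launching with torchrun on each box - a sketch, with addresses, interface name, and script name as placeholders:

```
# Sketch for a 2-node Ethernet run; launch on each box with something like
#   torchrun --nnodes=2 --node_rank=<0|1> --nproc_per_node=4 \
#            --master_addr=10.0.0.1 --master_port=29500 train.py
# (addresses, interface name, and script name are placeholders).
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_SOCKET_IFNAME", "enp1s0f0")  # pin NCCL to the 10/100GbE interface
os.environ.setdefault("NCCL_DEBUG", "INFO")              # confirms which transport NCCL actually picked

dist.init_process_group("nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up")
```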
 

bayleyw

Active Member
How many tokens per second do you get per GPU? As a reference, I get about 8K tokens per second training phi-1.5 using Huggingface's device_map="auto" (so only one GPU runs at a time) on two V100-16GB. 4090s should be about twice as fast.
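If it helps, a comparable per-GPU number only needs a token counter around the existing loop (a sketch; loader and training_step are placeholders):

```
# Crude per-GPU tokens/sec counter around an existing training loop (sketch).
import time

tokens_seen, t0 = 0, time.time()
for step, batch in enumerate(loader):            # loader is a placeholder
    loss = training_step(batch)                  # your existing fwd/bwd/optimizer step
    tokens_seen += batch["input_ids"].numel()    # tokens processed on this rank
    if step and step % 50 == 0:
        print(f"~{tokens_seen / (time.time() - t0):.0f} tokens/s on this GPU")
```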

I've never seen a NCCL error give wrong results before - usually, the collective times out after 1800s. If you haven't gotten any timeouts you're probably safe.

A 4K sequence length almost certainly requires Flash/SDPA attention to be efficient; otherwise you materialize and write out the 16M-entry (64MB) attention matrix for *every attention head*. The original GPT2 context length was 1024 tokens.
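The arithmetic and the usual fix, as a sketch (the attn_implementation kwarg exists in recent transformers releases, but architecture support varies, so treat the commented line as an assumption):

```
# 4096 x 4096 fp32 attention scores = 4096 * 4096 * 4 bytes = 64 MiB per head,
# materialized in HBM by the eager implementation. SDPA/Flash never writes it out.
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 16, 4096, 64, dtype=torch.float16, device="cuda")  # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused kernel, no full score matrix

# Recent transformers versions can request it at load time (support varies by model):
# model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="sdpa")
```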

If you're OK with an autoregressive model (GPT2 is autoregressive, T5 is not), there are much better ~1B-param models now; TinyLlama is probably the one with the closest pretraining profile, since it is trained on filtered but still natural web data. The Phi models are also very good, but as they are distilled from GPT4 (almost all of the pretraining data was synthesized via GPT4), they might behave differently in your application.
 

adaboost

New Member
Not sure about tokens per second; last I checked (last year) I could reproduce the numbers in the exllama repo. I'm actually not running a language model in the traditional sense but tokenizing audio embeddings (HuBERT, AudioMAE) and using those to predict RVQs (a la MusicGen / EnCodec). I've been able to somewhat reproduce Bark but haven't had success with other datasets (hence rolling my own clusters/tokens on top of AudioMAE embeddings).

Thanks for the pointers to the latest models! I'll see if I can adopt one of the ones you've mentioned. I'm partial to the Hugging Face infra (i.e. I take all the academic repos that I think might be useful and port them over), though it shouldn't take much to switch over to PyTorch Lightning (TinyLlama), other than my laziness, since it will be a two-step process: 1) confirm the code works end-to-end by training on a known dataset and seeing it converge, then 2) train it on my custom dataset (resuming where I am right now). But if it's going to save me tons of time / electricity cost then it makes 100% sense.

GPT2/T5 both work, as I don't do anything with the encoder hidden states - though I wonder whether it's actually equivalent, since the prompt tokens (vocab: 10k) are semantically different from the response tokens (vocab: 1k); but I can technically train it decoder-only style anyway?
 

bayleyw

Active Member
TinyLlama should be available on HF as well (internally it's just a LlamaForCausalLM). I wouldn't bother with the optimized trainers until you prove out that you have your dataset wrangled and your model architecture defined - the HF Trainer is slow but very reliable, and for an exotic domain like this, that reliability helps until you confirm the architecture actually works.
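A minimal skeleton of that path, for reference (a sketch - the repo id is from memory, and the dataset/collator are placeholders rather than the audio-token setup discussed above):

```
# Bare-bones HF Trainer setup (sketch; train_ds / collator are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # repo id from memory - double-check on the Hub
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")  # LlamaForCausalLM under the hood

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,
    logging_steps=50,
)
Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator).train()
```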

If you're converting audio tokens to quantized codes, does initializing the model with something trained on English text actually help?
 

adaboost

New Member
There is no official one, but you can buy modded ones from China. I haven't gone that route either, because I don't think I would get good resale value for it later.
 

bayleyw

Active Member
I wouldn't be so sure about that. Especially as hardware gets discontinued, the more obscure variants are worth more. I bet the SFF guys would pay top dollar for a dual slot 4090 from a US seller!
 

iron-bound

New Member
Ignoring the labor cost, there was a story about shops in China buying pallet loads of cards, stripping them down, and replacing the memory and heatsinks.

Given the cost of data center cards, this could be a good way for small AI companies or universities to build a cluster.
 

bayleyw

Active Member
Actually building a cluster of 4090s would be incredibly confusing, because the GPUs only have a single x16 link, which is shared between intra-node and inter-node comms. During FSDP each GPU sends 8 * (number of parameters) bytes per iteration; the theoretical upload speed is 32GB/sec, so if you are training an 8B-parameter model you are going to spend 2-3 seconds per iteration communicating. FSDP can overlap the comms and computation to a large extent, but the 4090 is really fast; you could probably get 2000 tokens/sec/card on an optimized framework, so you are looking at, at minimum, 6-8K tokens per GPU per iteration to hide the comms latency.
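The back-of-envelope behind those numbers, taking the 8 bytes/parameter and 32 GB/s figures above at face value:

```
# Rough FSDP comms time per iteration over a PCIe 4.0 x16 uplink,
# using the figures quoted above.
params = 8e9                        # 8B-parameter model
bytes_per_iter = 8 * params         # ~64 GB sent per GPU per iteration
pcie_bytes_per_s = 32e9             # theoretical x16 PCIe 4.0 upload
print(f"{bytes_per_iter / pcie_bytes_per_s:.1f} s of pure communication per step")  # ~2.0 s
```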

The high per-device batch size causes a few problems: it basically forces activation checkpointing (a ~30% performance penalty), though that might not be so bad since checkpointing is mandatory for the biggest models anyway. It also prevents you from scaling to high GPU counts: say you have nodes with 4x GPU and 4x NIC running 8K tokens per GPU; if you want a global batch size of 256K tokens, you can't go over 8 nodes. If you do, the global batch size becomes too large and convergence suffers.

The biggest problem is actually going to be finding a collectives library that supports your weird non-RDMA, non-p2p topology. My guess is that NCCL is not going to put up with it. Don't get me wrong, it's almost certainly possible to do large-scale pretraining using 4090s, but it's not going to be as seamless as doing it on 8x A100 nodes, which are a battle-tested configuration. Architectures with high FLOPs:weight ratios, like classic residual networks, would work fine, since you can checkpoint the activations to save memory and the weights + optimizer states are small.

Everything is much easier if you are building a single node, since you can omit the GPU<>NIC comms, use geohotz's hacked peermem driver to pass data between the cards, and get a host that exposes 128 PCIe 4.0 lanes (8 x16 slots) from a single root.
 