For better or for worse, I figured it out: it's the Amphenol G846A24211T4EU with mating receptacle G842AH242T4EU. Of course, the receptacle is out of stock everywhere, so the adventure continues...
Anyone know the part number of the front panel header on the S5652? Tyan seems to have switched to a higher density connector on the most recent boards...
If you're just getting started, I would not go SFF. You pay a massive premium over a normal card (the 4090 is something like four times faster than the RTX 4000 Ada...) and performance is crap because you're limited to 70W.
If you must use a 70W card for personal reasons, get an RTX 2000 Ada. You...
So this could be a memory issue... I have a consumer Alder Lake system that is overclocked via an external clockgen, and it really does not like POSTing. After a ton of fiddling around, it turned out to be a memory issue - even with manual timings there was something causing DRAM training to fail...
I feel like the lack of ReBAR and bifurcation isn't the board's fault, given they never claimed those features were supported. As for the 10GbE dropping: once every several days means it's not a thermal problem, since nothing on the board should have a thermal time constant of several days.
Actually building a cluster of 4090s would be incredibly confusing, because the GPUs only have a single x16 link, which is shared between intra-node and inter-node comms. During FSDP, each GPU sends 8 * (number of parameters) bytes per iteration; the theoretical upload speed is 32 GB/sec, so if you...
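To make that concrete, here's a back-of-the-envelope sketch of the math. The 7B parameter count is a made-up example; the 8 bytes/param and 32 GB/sec figures are the ones from above:

```python
# Back-of-the-envelope FSDP comms floor per GPU per iteration.
# Model size is an illustrative assumption.
params = 7e9                 # hypothetical 7B-parameter model
bytes_per_param = 8          # per-iteration traffic per GPU (figure above)
link_bytes_per_sec = 32e9    # PCIe 4.0 x16 unidirectional, theoretical

traffic = params * bytes_per_param          # bytes sent per GPU per step
comm_time = traffic / link_bytes_per_sec    # seconds, link fully saturated

print(f"{traffic / 1e9:.0f} GB per iteration -> "
      f"{comm_time:.2f} s of comms per step at best")
# ~56 GB -> ~1.75 s per step just moving params/gradients
```

And that's the best case with the link saturated and nothing else contending for it.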
I wouldn't be so sure about that. Especially as hardware gets discontinued, the more obscure variants are worth more. I bet the SFF guys would pay top dollar for a dual slot 4090 from a US seller!
TinyLlama should be available on HF as well (it's internally just a LlamaForCausalLM). I wouldn't bother with the optimized trainers until you've proven out your dataset wrangling and model architecture - the HF Trainer is slow but very reliable, and for an exotic domain like...
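Something like this is all it takes to get going. A minimal sketch, assuming the usual TinyLlama repo name on HF (check the exact id) and a toy in-memory dataset standing in for your real corpus:

```python
# Minimal HF Trainer sketch for TinyLlama - a starting point, not a tuned
# recipe. Model id, dataset, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token        # Llama tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)  # a LlamaForCausalLM

# Toy corpus standing in for your wrangled domain data.
ds = Dataset.from_dict({"text": ["example document one",
                                 "example document two"]}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Once that loop runs end to end on your data, then it's worth looking at the faster trainers.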
How many tokens per second do you get per GPU? As a reference, I get about 8K tokens per second training phi-1.5 using Hugging Face's device_map="auto" (so only one GPU runs at a time) on two V100-16GB. 4090s should be about twice as fast.
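Roughly how I measure that, for reference - a sketch, not a rigorous benchmark; batch size, sequence length, and step count are arbitrary choices here:

```python
# Rough tokens/sec for training with device_map="auto" (naive pipeline:
# layers are spread across GPUs, only one is active at a time).
import time
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5", device_map="auto", torch_dtype=torch.float16)

batch, seq_len, steps = 4, 1024, 10
input_ids = torch.randint(0, model.config.vocab_size,
                          (batch, seq_len), device="cuda:0")

torch.cuda.synchronize()
start = time.time()
for _ in range(steps):
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    model.zero_grad(set_to_none=True)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"{batch * seq_len * steps / elapsed:,.0f} tokens/sec")
```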
I've never seen an NCCL error give wrong results before -...
The loss of performance on the SYS-4028 is probably because the host is limited to PCIe 3.0 speeds, so your gradient reductions are slower. Have you measured scaling going from 4 to 8 cards on your existing hardware? I somewhat suspect that due to improper topology, you will have less than...
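If you haven't measured it, a quick all-reduce bandwidth check will tell you a lot. A sketch with torch.distributed - run it with torchrun and vary --nproc_per_node between 4 and 8:

```python
# All-reduce bandwidth check to compare 4- vs 8-card scaling.
# Launch (single node): torchrun --nproc_per_node=8 allreduce_bench.py
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

tensor = torch.randn(256 * 1024 * 1024 // 4, device="cuda")  # 256 MB fp32
for _ in range(5):                       # warmup
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = time.time() - start

if rank == 0:
    gb = tensor.numel() * 4 * iters / 1e9
    n = dist.get_world_size()
    # busbw applies the ring all-reduce factor 2*(N-1)/N, as nccl-tests does
    print(f"algbw {gb / elapsed:.1f} GB/s, "
          f"busbw {gb / elapsed * 2 * (n - 1) / n:.1f} GB/s")
dist.destroy_process_group()
```

If the 8-card busbw comes in well under twice the 4-card number, the topology is the bottleneck.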
Use case? A lot depends on whether you need x16 to each GPU. Training does, and furthermore benefits from a topology where all 8 GPUs share an x16 uplink to the CPUs via two levels of PCIe switches. That's an expensive setup - you need to put down $3K apiece for the special Chinese dual slot...
Heh, that's a big change from "the RAM usage will be pretty low" a few posts up. If your requirements are 512GB RAM and 16 lanes of NVMe, then yes, you are forced onto an enterprise platform. Insofar as you accept that performance is bad anyway compared to the desktop platforms, Skylake-SP is not a...
Oof, did you buy the Epyc already? The 7313 is a full 2 GHz and one major generation (~15% IPC deficit) behind the 7950X in single-core performance, and ~1 GHz behind in multicore. It's not a good use of money unless you are looking for RDIMMs and lots of PCIe lanes - the target use case is high...
7950X. Much more efficient than a 14900K, no E-cores to cause you scheduler trouble, and no ISA disadvantage since the consumer Intel parts do not support AVX-512.
But we need more details. Is your Jupyter work GPU-accelerated? How much RAM do your databases need? The consumer parts only...
Titan V/Quadro GV100 are the last fp64-capable cards with display outputs, so they have some value for scientific simulations, especially if you're a researcher running commercial software that doesn't like living in the cloud. For language modeling, in rough order of viability:
3090/3090 Ti...
Language models can be partitioned across multiple GPUs *with the caveat* that only one GPU is active at any one time. This is a huge caveat, because for regular mortals (and even minor startups) this caps your memory bandwidth at about 1 Tbyte/sec and therefore puts an upper limit on your token...
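To put a number on that ceiling - a rough estimate, assuming a hypothetical 70B-parameter fp16 model and the ~1 Tbyte/sec figure above:

```python
# Bandwidth ceiling on decode speed: every generated token has to stream
# all the weights through one GPU's memory. Model size is an illustrative
# assumption; 1 TB/s is the figure from above.
params = 70e9                    # hypothetical 70B-parameter model
bytes_per_param = 2              # fp16 weights
bandwidth = 1e12                 # ~1 Tbyte/sec of memory bandwidth

tokens_per_sec = bandwidth / (params * bytes_per_param)
print(f"<= {tokens_per_sec:.1f} tokens/sec")   # ~7 tokens/sec upper bound
```

And that's before attention, KV cache traffic, or any inter-GPU transfer time.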