Given the rush to buy GPU nodes for AI and the massive bandwidth required to scale, I'd expect a bunch of 100G gear to hit the market at some point as people upgrade.
Make sure the Nvidia user-space package matches the kernel module version (550.54.15 here), and skip the kernel module install with the flag:
`sh ./NVIDIA-Linux-[...].run --no-kernel-modules`
Otherwise you need to merge the Multi-GPU Tinygrad Patch to use a newer one.
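To check whether the two halves actually match, the loaded kernel module version and the user-space driver version can be compared directly. A quick sketch using standard paths/tools, nothing specific to this setup:

```shell
# Version of the nvidia kernel module currently loaded
cat /proc/driver/nvidia/version

# Version of the user-space stack (nvidia-smi ships with the .run package)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

If the two disagree you'll typically get "Driver/library version mismatch" errors until they line up.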
Ring Attention with a modified softmax splits the KV cache across devices. It also does a better job than FSDP, since it lets all devices compute while passing results around the ring.
https://arxiv.org/abs/2311.09431
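The softmax modification is the usual online-softmax rescaling trick: each device attends over its local KV shard, and the partial results are merged as they move around the ring. A minimal single-process sketch (plain NumPy standing in for the devices; all names are mine, not from the paper or tinygrad):

```python
import numpy as np

def local_attn(q, k, v):
    # Attention over one KV shard, returned in "online softmax" form:
    # unnormalised output o, row-wise max m, and sum-of-exps l.
    s = q @ k.T / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    return p @ v, m, p.sum(axis=-1, keepdims=True)

def ring_attention(q, kv_shards):
    # Merge shard-local results with the online-softmax rescaling trick.
    # In a real ring each (k, v) lives on a different device and is
    # passed to the neighbour; here we just iterate over the shards.
    o, m, l = local_attn(q, *kv_shards[0])
    for k, v in kv_shards[1:]:
        o2, m2, l2 = local_attn(q, k, v)
        m_new = np.maximum(m, m2)
        a, b = np.exp(m - m_new), np.exp(m2 - m_new)
        o = o * a + o2 * b      # rescale both partial outputs
        l = l * a + l2 * b      # and both partial normalisers
        m = m_new
    return o / l                # final softmax normalisation
```

Merging three shards this way gives exactly the same result as one big softmax over the full KV cache, which is what lets the cache be split across devices in the first place.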
Ignoring the labor cost, there was a story about companies in China buying pallet loads of cards, stripping them down, and replacing the memory and heatsinks.
Given the cost of data center cards this could be a good way for small AI companies or universities to build a cluster.
Solved the power issue: aftermarket EPS cables don't use a standard pinout.
https://www.corsair.com/uk/en/p/pc-components-accessories/cp-8920284/600w-pcie-5-0-12vhpwr-type-4-psu-power-cable-cp-8920284
I ordered a low-power GPU, so I'll test that soon; I'm guessing populating only the 2nd slot isn't a problem. Could also borrow a multimeter and test the GPU power rails.
Hey, so I got this Gigabyte G292-Z20 server with the aim of running 2-4 GPUs via the eight PCIe slots.
Unfortunately, when testing with a single PCIe card, both the cards I have (a 7900 XTX and a 4090) fail to show up in lspci… so I'm stuck.
It's running Microsemi PCIe switches, which I hope...
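For anyone debugging the same symptom, a few generic commands to see whether the switches themselves enumerate and to force a bus rescan after reseating a card (standard Linux PCI tooling, nothing board-specific):

```shell
# Show the full PCIe tree; the Microsemi switches and their downstream
# ports should appear even if no GPU enumerates behind them
lspci -tvnn

# List only AMD (vendor 1002) and NVIDIA (vendor 10de) devices
lspci -d 1002:
lspci -d 10de:

# Ask the kernel to re-enumerate the bus after reseating a card
echo 1 | sudo tee /sys/bus/pci/rescan

# Look for link-training or BAR allocation errors
sudo dmesg | grep -i pcie
```

If the switch's downstream ports show up but nothing sits behind them, that points at the slot/riser side rather than the switch config.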