Use case? A lot depends on whether you need x16 to each GPU. Training does, and furthermore benefits from a topology where all 8 GPUs share an x16 uplink to the CPUs via two levels of PCIe switches. That's an expensive setup - you need to put down $3K apiece for the special Chinese dual-slot 4090s, plus $15K for a PCIe 4.0 host with the right topology. But if you can fit your workload in 192GB, the whole rig costs about the same as a single H200 and is roughly 4x as fast. You might be able to get it to work with the right risers and Rome or Genoa, but I'm not sure what the PCIe root complexes look like on Rome. geohot claims he'll sell you a large computer with 6x 4090 on a Rome-based board, but I have no idea what the allreduce performance looks like on it.
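If you do get your hands on one of these boxes, it's worth measuring allreduce yourself before committing a training run to it. Here's a rough sketch assuming PyTorch + NCCL, launched with torchrun; the buffer size and iteration counts are arbitrary, and the bus-bandwidth formula is the usual ring-allreduce convention from nccl-tests:

```python
# Rough all-reduce bandwidth check for a multi-GPU box (sketch, not a
# definitive benchmark). Launch with e.g.:
#   torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    world = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # 1 GiB of fp16 per rank; shrink if you're short on VRAM.
    numel = 512 * 1024 * 1024
    x = torch.randn(numel, dtype=torch.float16, device="cuda")

    # Warm up so NCCL builds its rings before we time anything.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # Ring all-reduce moves roughly 2 * (world - 1) / world of the buffer
    # per GPU, which is the usual "bus bandwidth" convention.
    bytes_moved = x.numel() * x.element_size() * 2 * (world - 1) / world
    if dist.get_rank() == 0:
        print(f"all-reduce: {elapsed * 1e3:.1f} ms/iter, "
              f"~{bytes_moved / elapsed / 1e9:.1f} GB/s bus bandwidth")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On a topology where everything hangs off switches with an x16 4.0 uplink you'd hope to see numbers in the ballpark of the link bandwidth; if it comes out far below that, the root complex layout is probably biting you.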
If you don't need x16 4.0 to each GPU, you can pick up something like a SYS-4028 ($600 used) and dangle the GPUs off risers. But unless you're hosting eight individual application instances, performance may be pathological - a SYS-4028 with the default X9DRG-O-PCIE mezzanines runs four GPUs off each socket with two GPUs behind each switch, so inter-GPU communication is mediocre.
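You can see where the switches sit with `nvidia-smi topo -m`, and a quick pairwise copy test makes the penalty concrete. A sketch assuming PyTorch; caveat that if peer access isn't enabled the copies stage through host memory, so treat the results as relative between GPU pairs rather than as exact link speeds:

```python
# Pairwise GPU-to-GPU copy bandwidth check (sketch). Pairs sharing a PCIe
# switch should look noticeably faster than pairs that have to cross the
# socket interconnect.
import time

import torch


def copy_bandwidth(src_dev: int, dst_dev: int, mib: int = 512) -> float:
    """Return GB/s for a device-to-device copy of `mib` MiB."""
    src = torch.empty(mib * 1024 * 1024, dtype=torch.uint8,
                      device=f"cuda:{src_dev}")
    dst = torch.empty_like(src, device=f"cuda:{dst_dev}")

    # Warm up (also triggers peer-access setup where it's available).
    dst.copy_(src)
    torch.cuda.synchronize(src_dev)
    torch.cuda.synchronize(dst_dev)

    iters = 10
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize(src_dev)
    torch.cuda.synchronize(dst_dev)
    elapsed = (time.perf_counter() - start) / iters
    return src.numel() / elapsed / 1e9


if __name__ == "__main__":
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                print(f"GPU {i} -> GPU {j}: {copy_bandwidth(i, j):.1f} GB/s")
```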
If you are hosting eight individual instances, you're better off building two 4-GPU nodes - it should be possible to fit a 4090 in a 4U server with a little effort, and 4x 4090 in a 4U node comes out to about 500W/U (four ~450W cards plus CPU and platform overhead is roughly 2kW in 4U), which is decently dense.
Finally, if you can't saturate the 4090's FLOPs anyway (you're bandwidth bound), you can save a bunch of money by buying 3090s, which have nearly the same memory bandwidth (936 vs 1008 GB/s). You lose out on fp8 support, but by the time fp8 adoption really picks up, Blackwell will probably be out anyway, making it a moot point.
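If you're not sure which side of that line you're on, a napkin roofline check usually settles it. The spec numbers below are ballpark figures I'm assuming (check the datasheets for your precision and accumulate mode); the point is just comparing your kernel's FLOPs-per-byte to the GPU's compute/bandwidth ratio:

```python
# Back-of-the-envelope roofline check (sketch; spec numbers are rough
# assumed figures, not authoritative). If your workload's arithmetic
# intensity is below the GPU's FLOPs/bandwidth ratio, you're bandwidth
# bound and the 4090's extra compute mostly sits idle.

# Assumed rough specs: (peak dense fp16 tensor TFLOPs, memory GB/s).
GPUS = {
    "RTX 3090": (71.0, 936.0),
    "RTX 4090": (165.0, 1008.0),
}


def machine_balance(tflops: float, gbps: float) -> float:
    """FLOPs per byte the GPU can feed before bandwidth becomes the limit."""
    return tflops * 1e12 / (gbps * 1e9)


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity of an (m x k) @ (k x n) matmul in FLOPs/byte."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


if __name__ == "__main__":
    # Example: a skinny batch-1 decode matmul vs a fat training matmul.
    for shape in [(1, 4096, 4096), (4096, 4096, 4096)]:
        print(f"matmul {shape}: {matmul_intensity(*shape):.0f} FLOPs/byte")
    for name, (tflops, gbps) in GPUS.items():
        print(f"{name}: balance point ~{machine_balance(tflops, gbps):.0f} FLOPs/byte")
```

A batch-1 decode matmul lands around 1 FLOP/byte, far under either card's balance point, which is exactly the regime where a 3090 keeps up with a 4090.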