Ring Attention uses a modified (online, blockwise) softmax to split the KV cache across devices. It also does a better job than FSDP here, since every device keeps computing on its local KV block while the blocks are passed around the ring, so communication overlaps with computation instead of idling devices.
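
A minimal single-process sketch of that pattern, assuming no causal mask and no real collectives: the sequence is split into per-"device" query and KV shards, KV shards rotate around the ring, and each device merges partial results with the online-softmax rescaling trick. The function name, shapes, and shard count are illustrative, not the paper's implementation.

```python
# Ring-attention pattern, simulated in one process: each "device" i owns one
# query block; KV blocks rotate around the ring; partial softmax statistics
# (running max m, denominator l, numerator acc) are merged incrementally.
import numpy as np

def ring_attention(q, k, v, num_devices):
    d = q.shape[-1]
    q_blocks = np.split(q, num_devices)   # per-device query shards
    k_blocks = np.split(k, num_devices)   # per-device KV shards
    v_blocks = np.split(v, num_devices)

    outputs = []
    for i, q_i in enumerate(q_blocks):
        m = np.full(q_i.shape[0], -np.inf)   # running row-wise max of logits
        l = np.zeros(q_i.shape[0])           # running softmax denominator
        acc = np.zeros_like(q_i)             # running unnormalised numerator
        for step in range(num_devices):
            j = (i + step) % num_devices     # KV shard currently at device i
            s = q_i @ k_blocks[j].T / np.sqrt(d)     # local attention logits
            m_new = np.maximum(m, s.max(axis=-1))    # updated running max
            scale = np.exp(m - m_new)                # rescale old statistics
            p = np.exp(s - m_new[:, None])           # unnormalised local probs
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v_blocks[j]
            m = m_new
        outputs.append(acc / l[:, None])     # normalise after all shards passed
    return np.concatenate(outputs)

# Sanity check against full-sequence softmax attention.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
s = q @ k.T / np.sqrt(8)
ref = np.exp(s - s.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
assert np.allclose(ring_attention(q, k, v, 4), ref)
```

In the real distributed version the inner loop's shard lookup is a send/recv to ring neighbours, so each device only ever holds one KV block at a time.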

arxiv.org

Striped Attention: Faster Ring Attention for Causal Transformers
To help address the growing demand for ever-longer sequence lengths in transformer models, Liu et al. recently proposed Ring Attention, an exact attention algorithm capable of overcoming per-device memory bottlenecks by distributing self-attention across multiple devices. In this paper, we...
