blang (New Member), Feb 27, 2024, #1:
Does anyone know the approximate failure rate for GPUs in a cluster?
bayleyw (Active Member), Mar 1, 2024, #2:
The best public data I know of is the OPT-175B training logbook, and it's rough: metaseq/projects/OPT/chronicles/OPT175B_Logbook.pdf at main · facebookresearch/metaseq. My understanding is that it's not so much the GPUs failing as transient communication errors causing collective operations to hang.