
The issue is that distributed training needs high bandwidth and very low latency to be efficient. In a single computer you can fit about 8-10 GPUs, or, if you go to extremes like the system in the article, maybe 16. To scale beyond that, you connect multiple computers in the same rack via InfiniBand (an optical-fibre network technology; the system in the article comes with a 400G InfiniBand network adapter).

But systems that can host many GPUs tend to be expensive, and electricity is expensive too, so at that scale the expensive datacenter GPUs make economic sense. For a homebrew solution you can stick four consumer GPUs in a case and might save a buck.
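
To make the bandwidth/latency point concrete, here's a minimal sketch of what multi-node data-parallel training looks like with PyTorch. This isn't from the article, just an illustration: it assumes the NCCL backend (which uses InfiniBand via RDMA when available) and the environment variables that torchrun sets for each process.

    # Minimal multi-node setup sketch with torch.distributed.
    # Assumes launch via torchrun, which sets RANK, LOCAL_RANK,
    # WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each process.
    import os
    import torch
    import torch.distributed as dist

    def main():
        # NCCL handles GPU-to-GPU communication; over multiple
        # nodes it rides on the network (InfiniBand, if present).
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Each training step ends with an all-reduce of gradients
        # across every GPU in the job; its speed is bounded by the
        # slowest link, which is why inter-node bandwidth matters.
        grads = torch.ones(1024, device="cuda")
        dist.all_reduce(grads, op=dist.ReduceOp.SUM)

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

You'd launch it on two hypothetical 8-GPU nodes with something like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=host1:29500 train.py` (hostname and port are illustrative). The key takeaway: that all-reduce happens every step, so a slow interconnect stalls all 16 GPUs at once.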


