Compare this to Nvidia’s H100, where there is 900GB/s from every GPU to every other GPU. This is because H100 GPUs do not connect directly to each other, but instead connect through the NVSwitch.
The worrying point is that AMD’s GPU-to-GPU IO is limited and non-uniform because the GPUs connect directly to each other. Nvidia uses a separate chip, the NVSwitch, to provide full IO speed from any GPU to any other GPU. Nvidia’s solution enables most developers to treat an 8-GPU server as one massive GPU for AI. How MI300’s topology complexities will affect software remains opaque.
Nvidia’s bandwidth advantage also stems from its use of Ethernet-style SerDes, which are faster, denser, and longer reach than the PCIe-style SerDes that AMD uses. We believe this is an engineering tradeoff mistake for AMD, though it could be argued the other way: AMD’s solution enables lower latency, but the bandwidth offered is significantly worse. Furthermore, Nvidia can extend NVLink out of the node, while AMD is limited to going over much slower Ethernet or InfiniBand.
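To make the topology difference concrete, here is a minimal sketch of per-peer bandwidth in a direct point-to-point mesh (AMD-style xGMI) versus a switched fabric (NVSwitch-style). The all-to-all 8-GPU mesh and the use of the on-paper 1152GB/s aggregate discussed below are our illustrative assumptions; AMD has not detailed MI300’s actual per-link allocation.

```python
# Illustrative sketch: per-peer bandwidth in a direct point-to-point mesh
# vs. a switched fabric. The 1152 GB/s aggregate and the all-to-all 8-GPU
# mesh are assumptions for illustration, not confirmed MI300 topology.

GPUS_PER_NODE = 8

def mesh_per_peer_bw(total_io_gbs: float, num_gpus: int = GPUS_PER_NODE) -> float:
    """Direct mesh: each GPU's IO budget is carved into num_gpus - 1 links."""
    return total_io_gbs / (num_gpus - 1)

def switched_per_peer_bw(link_bw_gbs: float) -> float:
    """Switched fabric: any GPU reaches any peer at the full link rate."""
    return link_bw_gbs

print(f"Direct mesh, per peer:   {mesh_per_peer_bw(1152):.0f} GB/s")    # ~165 GB/s
print(f"NVSwitch, per peer:      {switched_per_peer_bw(900):.0f} GB/s") # 900 GB/s
```

This per-peer gap is the crux of why a switch lets software treat the whole node as one big GPU, while a direct mesh forces the software to care about which GPU is talking to which.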
What AMD gains is supreme flexibility with their IP. Because each AID has 36 lanes of PCIe 5.0 SerDes, they can be configured flexibly for xGMI (GPU to GPU) or PCIe. How AMD deploys these lanes is incredibly important for closing the gap with Nvidia’s H100.
On paper, with all these lanes, AMD can offer a total of 1152GB/s of IO.
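As a rough sanity check of that figure, assuming four AIDs (a packaging detail not stated in this section) and nominal PCIe 5.0 signaling of 32 GT/s per lane (~4 GB/s per direction, ~8 GB/s bidirectional), the lane count multiplies out to exactly that number:

```python
# Back-of-the-envelope check on the 1152 GB/s figure. The 4-AID count is an
# assumption; 36 lanes per AID is from the text. PCIe 5.0 runs at 32 GT/s
# per lane, ~4 GB/s per direction after 128b/130b encoding overhead,
# i.e. ~8 GB/s bidirectional.

AIDS = 4                    # assumed number of Active Interposer Dies
LANES_PER_AID = 36          # PCIe 5.0 SerDes lanes per AID (from the text)
GBS_PER_LANE_BIDIR = 8.0    # ~4 GB/s each way at PCIe 5.0 rates

total_lanes = AIDS * LANES_PER_AID            # 144 lanes
total_io = total_lanes * GBS_PER_LANE_BIDIR   # 1152 GB/s
print(f"{total_lanes} lanes -> {total_io:.0f} GB/s aggregate IO")
```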