• DevAnalyzeOperate@alien.topB

    Intel’s accelerator strategy and focus on memory bandwidth are paying off in a big way.

    It’s the first time in a while I’ve seen Intel execute something well and catch AMD with their pants down, despite Sapphire Rapids being a lemon in most respects.

    • BatteryPoweredFriend@alien.topB

      It’s more likely MS didn’t think taking a whole bunch of Genoa systems out of use elsewhere would be worth it. The backlog for Genoa throughout most of H1 made them nearly as unobtainable as H100s.

      Most of the CPU time in this sort of system is usually taken up by relatively basic PCIe traffic management. More likely, SPR and Genoa are basically interchangeable as far as that is concerned, and SPR Xeon just had the lower opportunity cost.

      If there were actually any special sauce that made a tangible difference with this type of setup, there would be an epic bum-rush by everyone to buy up SPR Xeons to host all their H100s, but that clearly isn’t happening. Nvidia would also have made a far bigger and more public stink over Intel’s failure to deliver SPR on time, given that DGX H100 depends on it.

    • HippoLover85@alien.topB

      AMD DC sales are taking off and Intel is still struggling.

      SPR is not a competitive product for the vast majority of workloads. It’s fine here because nobody cares about the CPU performance. Cloud providers are probably paying a premium for Bergamo chips, and you don’t need a powerful 128-core part here.

      • GrandDemand@alien.topB

        This^

        SPR is cheaper than Genoa and Bergamo, and supply of those EPYC chips has not been as abundant as SPR’s.

        There are advantages to SPR over Zen 4 EPYC in ML/AI workloads, and while MI300X will be doing the brunt of the training and inference, some model weights/parameters could be offloaded to the CPU with minimal performance loss in the event the VRAM buffer overflows to system memory (a rough sketch of that kind of offload is included after the thread). CPU-only inference could also be tested for model performance on weaker hardware, or be utilized if all MI300X are busy and there are unused CPU cycles (which is likely for these workloads). SPR generally outperforms Genoa in inference, so there’s some merit to selecting it over the latter.

        Regardless, though, this decision by Microsoft just boils down to cost and availability.
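
For concreteness, here is a minimal sketch of the kind of CPU weight offload mentioned above, using Hugging Face Accelerate’s device_map support in transformers. The checkpoint name and memory caps are placeholders, and this illustrates the general technique rather than anything Microsoft or AMD actually runs.

```python
# Minimal sketch of weight offload: layers that exceed the GPU memory cap are
# kept in system RAM and streamed to the accelerator as they are needed.
# The model ID and memory limits below are placeholders, not details from the thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-70b-model"  # hypothetical checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",                         # let Accelerate place layers per device
    max_memory={0: "64GiB", "cpu": "512GiB"},  # overflow past the GPU cap lands in host RAM
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

ROCm builds of PyTorch (as used for MI300X) still expose accelerators under the `cuda` device name, so the same sketch applies there; how aggressively to cap GPU memory versus spill to host RAM is exactly where host memory bandwidth and CPU inference throughput start to matter.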