At the annual Supercomputing 2021 event, Intel shared some interesting details about its Ponte Vecchio HPC graphics card (accelerator). The graphics core, memory, cache subsystem, I/O, and the process nodes powering the next-gen HPC GPU were detailed. The Sapphire Rapids-SP lineup was also touched upon: Intel highlighted the differences between the standard and HBM variants and showed how SPR and PV will soon power the fastest supercomputer ever designed.
Starting with Ponte Vecchio, each GPU has 128 Xe cores (compute units) for a total of 1,024 vector units per GPU. These are paired with 1,024 matrix engines, 128 ray-tracing units, 64MB of L1 cache, and a massive 408MB of L2 cache. In terms of memory, the GPU is paired with 128GB of HBM2e memory (across 8 stacks?). Each GPU is connected to seven others via Xe Link, a high-speed coherent unified fabric (Intel’s counterpart to AMD’s Infinity Fabric). There’s support for PCIe Gen 5, and the various compute, memory, and cache tiles are connected using Foveros 3D stacking and the EMIB high-speed interconnect. Finally, there’s the matter of the process nodes. Some of the tiles (most likely the compute tiles) are fabbed on Intel’s 7nm node, while the rest will be fabbed using TSMC’s 5nm (N5) and 7nm (N7) process nodes. It’s worth noting that Intel’s node is comparable to N7 while being somewhat inferior to N5.
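The per-core breakdown implied by these totals can be worked out with a little arithmetic. The sketch below uses only the figures quoted above and assumes the units and caches are spread evenly across the 128 Xe cores, which Intel has not confirmed here; the variable names are purely illustrative.

```python
# Totals quoted in the article for one Ponte Vecchio GPU.
XE_CORES = 128
VECTOR_UNITS = 1024
MATRIX_ENGINES = 1024
L1_CACHE_MB = 64
L2_CACHE_MB = 408

# Assumption: resources are divided evenly across all Xe cores.
vector_per_core = VECTOR_UNITS // XE_CORES
matrix_per_core = MATRIX_ENGINES // XE_CORES
l1_per_core_kb = L1_CACHE_MB * 1024 // XE_CORES
l2_per_core_mb = L2_CACHE_MB / XE_CORES

print(f"Vector units per Xe core:   {vector_per_core}")
print(f"Matrix engines per Xe core: {matrix_per_core}")
print(f"L1 cache per Xe core:       {l1_per_core_kb} KB")
print(f"L2 cache per Xe core:       {l2_per_core_mb:.2f} MB")
```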
On the CPU side, we have the 4th Gen Xeon Scalable Sapphire Rapids processor. With a total core count of up to 56 and a quad-chiplet design, it’ll also leverage on-package HBM2e memory, connected to the I/O and cores using EMIB. The HBM variants of SPR will pack up to 64GB of HBM2e memory, distributed across four 8-Hi stacks of 16GB each, with an overall bandwidth of up to 1.640 TB/s. Each SPR CPU will be paired with four Ponte Vecchio compute tiles in the Aurora Supercomputer using EMIB.
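As a quick sanity check, the HBM figures above can be broken down per stack and per core. This is only arithmetic on the numbers quoted in this article; the even split of bandwidth across stacks and of capacity across cores is an assumption, and the names are illustrative.

```python
# Figures quoted in the article for the HBM variant of Sapphire Rapids.
HBM_STACKS = 4
GB_PER_STACK = 16
TOTAL_BANDWIDTH_TB_S = 1.640  # aggregate, "up to"
MAX_CORES = 56

total_capacity_gb = HBM_STACKS * GB_PER_STACK

# Assumption: bandwidth is shared evenly across the four stacks.
bandwidth_per_stack_gb_s = TOTAL_BANDWIDTH_TB_S * 1000 / HBM_STACKS

# Assumption: capacity is viewed as an even share per core on a 56-core part.
hbm_per_core_gb = total_capacity_gb / MAX_CORES

print(f"Total HBM2e capacity: {total_capacity_gb} GB")
print(f"Bandwidth per stack:  {bandwidth_per_stack_gb_s:.0f} GB/s")
print(f"HBM per core:         {hbm_per_core_gb:.2f} GB")
```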
Sapphire Rapids-SP can be run in three modes. The first is HBM-only, which uses just the on-package HBM as the main system memory, ignoring the available DDR5 memory. This is ideal for workloads requiring 64GB of physical memory or less.
The second is Cache mode. Here, the HBM memory is used as a cache for the DDR5 system memory, effectively acting as a last-level cache (LLC). It’s not visible to the software and, as such, doesn’t require any additional programming.
The last is Flat mode, or standard mode. It combines the DDR5 and HBM memory into a single contiguous addressable memory space. The system will likely fill the HBM memory first, and if the requirements cross the 64GB mark, the remainder is placed in the DDR5 system memory.
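The three modes can be summarized with a toy model. This is a minimal sketch of the behavior described above, not Intel’s implementation: the `Mode` enum, the function names, and the 512GB DDR5 capacity in the usage lines are all hypothetical, and the "fill HBM first" policy in Flat mode is the article’s expectation rather than a confirmed detail.

```python
from enum import Enum
from typing import Tuple

HBM_GB = 64  # HBM2e capacity quoted in the article


class Mode(Enum):
    HBM_ONLY = "hbm-only"
    CACHE = "cache"
    FLAT = "flat"


def addressable_memory_gb(mode: Mode, ddr5_gb: int) -> int:
    """Memory visible to software in each mode (toy model)."""
    if mode is Mode.HBM_ONLY:
        return HBM_GB            # DDR5 is ignored
    if mode is Mode.CACHE:
        return ddr5_gb           # HBM is a transparent cache, not addressable
    return HBM_GB + ddr5_gb      # Flat: one contiguous address space


def flat_mode_placement(working_set_gb: float) -> Tuple[float, float]:
    """Split a working set into (HBM, DDR5), assuming HBM is filled first."""
    in_hbm = min(working_set_gb, HBM_GB)
    return in_hbm, working_set_gb - in_hbm


# Usage with a hypothetical 512GB of DDR5 installed.
for mode in Mode:
    print(mode.value, addressable_memory_gb(mode, ddr5_gb=512), "GB")
print(flat_mode_placement(100))
```

In this model, only Flat mode changes what an application has to think about: a working set larger than 64GB ends up split across two memory types with very different bandwidth.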
The Sapphire Rapids-SP CPUs and Ponte Vecchio accelerators will together power the Aurora Supercomputer. More than 18,000 Xeon processors will be paired with 54,000+ Ponte Vecchio GPUs. Let’s look at Intel, AMD, and NVIDIA’s accelerators side by side and analyze how they stack up against one another:
| | Intel Ponte Vecchio | AMD MI250X | NVIDIA A100 80GB |
|---|---|---|---|
| Compute Units | 128 | 2 x 55 | 108 |
| Matrix Cores | 128 | 2 x 440 | 432 |
| INT8 Tensor | ? | 383 TOPS | 624 TOPS |
| FP16 Matrix | ? | 383 TFLOPS | 312 TFLOPS |
| FP64 Vector | ? | 47.9 TFLOPS | 9.5 TFLOPS |
| FP64 Matrix | ? | 95.7 TFLOPS | 19.5 TFLOPS |
| L2/L3 Cache | 2 x 204 MB | 2 x 8 MB | 40 MB |
| VRAM Capacity | 128 GB | 128 GB | 80 GB |
| VRAM Type | 8 x HBM2e | 8 x HBM2e | 5 x HBM2e |
| Bandwidth | ? | 3.2 TB/s | 2.0 TB/s |
| Process Node | Intel 7 | TSMC N6 | TSMC N7 |
What stands out almost immediately is the amount of L2 cache leveraged by Ponte Vecchio: 408MB vs just 16MB on the Instinct MI250X and 40MB on the A100. However, in terms of raw compute, AMD has far more vector units: 7,040 across 110 CUs, resulting in an overall FP64 matrix throughput of 95.7 TFLOPS, compared to just 19.5 TFLOPS on the NVIDIA A100. That said, each of Intel’s CUs should be better fed, thanks to much higher cache hit rates and wider XMX matrix units. The MI250X has an 8192-bit wide bus paired with 128GB of HBM2e memory capable of transfer rates of up to 3.2TB/s. Intel hasn’t shared any details about the bus width or memory configuration of PVC just yet.
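To put these gaps in perspective, the ratios can be computed directly from the figures in the table. The sketch below is plain arithmetic on the article’s rounded numbers, so the results are approximate; in particular, the per-pin data rate is merely what 3.2TB/s over an 8192-bit bus implies, not a figure published by AMD.

```python
# Figures taken from the comparison table in this article.
l2_cache_mb = {"Ponte Vecchio": 408, "MI250X": 16, "A100 80GB": 40}
fp64_matrix_tflops = {"MI250X": 95.7, "A100 80GB": 19.5}

# How much more L2 cache Ponte Vecchio has than each competitor.
l2_ratio = {
    name: l2_cache_mb["Ponte Vecchio"] / l2_cache_mb[name]
    for name in ("MI250X", "A100 80GB")
}
for name, ratio in l2_ratio.items():
    print(f"Ponte Vecchio L2 vs {name}: {ratio:.1f}x")

# FP64 matrix throughput advantage of the MI250X over the A100.
fp64_ratio = fp64_matrix_tflops["MI250X"] / fp64_matrix_tflops["A100 80GB"]
print(f"MI250X FP64 matrix vs A100: {fp64_ratio:.1f}x")

# Per-pin data rate implied by the MI250X bus width and bandwidth.
BUS_WIDTH_BITS = 8192
BANDWIDTH_TB_S = 3.2  # terabytes per second
per_pin_gbit_s = BANDWIDTH_TB_S * 1e12 * 8 / BUS_WIDTH_BITS / 1e9
print(f"Implied MI250X rate per pin: {per_pin_gbit_s:.3f} Gbit/s")
```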