Technology News

NVIDIA Launches NVSHMEM 3.0 with Enhanced Multi-Node GPU Communication

08 September 2024

Paikan Begzad

Summary

NVIDIA has released NVSHMEM 3.0, the latest version of its parallel programming interface for efficient communication across NVIDIA GPU clusters. NVSHMEM is part of NVIDIA Magnum IO and is based on OpenSHMEM. The new version focuses on scalability, application portability, and compatibility across systems.

The standout features in NVSHMEM 3.0 are multi-node, multi-interconnect support, host-device ABI backward compatibility, and CPU-assisted InfiniBand GPUDirect Async (IBGDA).

With this version, NVSHMEM connects GPUs within a node over NVIDIA NVLink or PCIe, and GPUs on different nodes over RDMA interconnects such as InfiniBand and RDMA over Converged Ethernet (RoCE). Multi-node, multi-interconnect support now also covers multi-node NVLink systems, such as NVIDIA GB200 NVL72 racks, that are connected to one another by RDMA networks.
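The interconnect split described above can be pictured with a small host-side model. This is only a sketch, not NVSHMEM code: `Path`, `classify_path`, `describe`, and the `pes_per_node` parameter are hypothetical names, and the model assumes processing elements (PEs) are assigned to nodes in contiguous blocks.

```cpp
#include <string>

// Hypothetical model, not an NVSHMEM API: which kind of interconnect
// carries traffic between two PEs.
enum class Path { Local, IntraNode, InterNode };

// Assumption: PEs are numbered so that each node holds a contiguous
// block of `pes_per_node` PEs (PE 0..n-1 on node 0, and so on).
Path classify_path(int src_pe, int dst_pe, int pes_per_node) {
    if (src_pe == dst_pe) return Path::Local;
    const bool same_node =
        (src_pe / pes_per_node) == (dst_pe / pes_per_node);
    return same_node ? Path::IntraNode : Path::InterNode;
}

// Map the classification onto the interconnects named in the article.
std::string describe(Path p) {
    switch (p) {
        case Path::Local:     return "same GPU, no interconnect";
        case Path::IntraNode: return "NVLink or PCIe";
        case Path::InterNode: return "RDMA (InfiniBand or RoCE)";
    }
    return "unknown";
}
```

With four PEs per node, `classify_path(0, 3, 4)` falls in the intra-node case and `classify_path(3, 4, 4)` crosses a node boundary, so it would travel over an RDMA network.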

Another key highlight of NVSHMEM 3.0 is backward compatibility across minor versions. An application built against one minor release can run on a system with a newer minor release installed, without recompiling, which makes library updates easier as infrastructure evolves.
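One way to read the compatibility rule is as a simple version check. The sketch below is a hypothetical model, not NVSHMEM's actual mechanism: the `Version` struct and `abi_compatible` function are invented for illustration, and it assumes that compatibility holds only within a single major version and only when the installed library is at least as new as the one the application was built against.

```cpp
// Hypothetical version record; NVSHMEM's own version macros and
// runtime checks are not modeled here.
struct Version {
    int major_ver;
    int minor_ver;
};

// Assumed rule: the installed library must share the major version
// with the build-time library and must not be older than it.
bool abi_compatible(Version built_against, Version installed) {
    return built_against.major_ver == installed.major_ver &&
           installed.minor_ver >= built_against.minor_ver;
}
```

Under these assumptions, an application built against 3.0 would run with a 3.1 library, while the reverse direction and a jump to a different major version would both require a rebuild.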

Additionally, CPU-assisted IBGDA allows control plane responsibilities to be split between the GPU and CPU, easing adoption on non-coherent platforms and minimizing configuration requirements in large-scale deployments.

NVSHMEM 3.0 also brings a new object-oriented programming (OOP) framework that streamlines memory management across different types of symmetric heaps, including static and dynamic device memory. This framework simplifies integration with advanced features and enhances data encapsulation for better organization in large-scale applications.
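The idea of managing several heap types behind one interface can be sketched with ordinary C++ inheritance. Everything here is an assumption made for illustration: the class names `SymmetricHeap`, `StaticDeviceHeap`, and `DynamicDeviceHeap` and the `reserve` method are hypothetical, the code only does host-side bookkeeping of offsets, and it does not reflect NVSHMEM's internal classes.

```cpp
#include <cstddef>
#include <new>
#include <string>

// Hypothetical base class: common bookkeeping for any symmetric heap.
class SymmetricHeap {
public:
    virtual ~SymmetricHeap() = default;
    virtual std::string kind() const = 0;
    // Reserve `bytes` and return the offset of the reservation.
    // Using the same offset on every PE is what makes the heap symmetric.
    virtual std::size_t reserve(std::size_t bytes) = 0;
    std::size_t used() const { return used_; }

protected:
    std::size_t used_ = 0;
};

// Fixed-capacity heap: reservations fail once the capacity is exhausted.
class StaticDeviceHeap : public SymmetricHeap {
public:
    explicit StaticDeviceHeap(std::size_t capacity) : capacity_(capacity) {}
    std::string kind() const override { return "static"; }
    std::size_t reserve(std::size_t bytes) override {
        if (used_ + bytes > capacity_) throw std::bad_alloc();
        const std::size_t offset = used_;
        used_ += bytes;
        return offset;
    }

private:
    std::size_t capacity_;
};

// Growable heap: this sketch only advances the offset; a real
// implementation would map additional device memory on demand.
class DynamicDeviceHeap : public SymmetricHeap {
public:
    std::string kind() const override { return "dynamic"; }
    std::size_t reserve(std::size_t bytes) override {
        const std::size_t offset = used_;
        used_ += bytes;
        return offset;
    }
};
```

Code that works through a `SymmetricHeap&` does not need to know which heap type it holds, which is the kind of encapsulation the paragraph above describes.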

Further improvements in this release include optimizations in IBGDA setup, block-scoped on-device reductions, system-scoped atomic memory operations (AMO), and team management functionalities.

With these upgrades, NVSHMEM 3.0 gives developers and administrators of NVIDIA GPU clusters more flexibility in two main ways: communication that spans multiple nodes and interconnects, and library updates that do not require recompiling applications. Both should simplify management of large-scale parallel computing environments.