InfiniBand Fabrics for AI

Solution overview

This lab provides a guided exploration of an AI-focused InfiniBand fabric managed through NVIDIA Unified Fabric Manager (UFM). Rather than performing configuration tasks, participants examine how a modern InfiniBand environment operates and how UFM provides visibility into fabric health, topology, telemetry, and management functions.

The lab begins by establishing the architectural differences between InfiniBand and Ethernet. InfiniBand relies on a Subnet Manager (SM) to automatically discover devices, assign addresses, compute routes, and program switch forwarding tables. As a result, once a valid Clos topology is physically cabled, much of the fabric configuration occurs automatically. This contrasts with Ethernet environments, which typically require extensive manual or orchestrated configuration of addressing, routing, and traffic-management policies.
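The sweep described above (discover, assign addresses, compute routes, program forwarding tables) can be illustrated with a small toy model. This is a minimal sketch and not how a production Subnet Manager is implemented: the function name `sweep_fabric`, the cable-tuple format, the `sw` naming convention for switches, and the plain shortest-path routing are all assumptions made for illustration. A real SM uses management datagrams and a configurable routing engine.

```python
from collections import deque


def is_switch(node):
    # Assumption of this sketch: switch names start with "sw".
    return node.startswith("sw")


def sweep_fabric(links, sm_node):
    """Toy SM sweep: discover nodes, assign LIDs, build per-switch LFTs.

    links: list of (node_a, port_a, node_b, port_b) tuples, one per cable.
    Returns (lids, lfts), where lfts[switch][dest_lid] = egress port.
    """
    # Adjacency: node -> list of (local_port, neighbor, neighbor_port).
    adj = {}
    for a, pa, b, pb in links:
        adj.setdefault(a, []).append((pa, b, pb))
        adj.setdefault(b, []).append((pb, a, pa))

    # 1. Discovery: breadth-first walk outward from the node running the SM.
    order, seen, queue = [], {sm_node}, deque([sm_node])
    while queue:
        node = queue.popleft()
        order.append(node)
        for _, nbr, _ in sorted(adj.get(node, [])):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)

    # 2. Addressing: hand out LIDs in discovery order, starting at 1.
    lids = {node: lid for lid, node in enumerate(order, start=1)}

    # 3. Routing: for each destination, walk outward through switches only
    #    and record, on every switch reached, the port that leads back
    #    toward the destination. Hosts never forward traffic.
    lfts = {node: {} for node in order if is_switch(node)}
    for dest in order:
        visited, queue = {dest}, deque([dest])
        while queue:
            node = queue.popleft()
            for _, nbr, nbr_port in sorted(adj[node]):
                if nbr in visited or not is_switch(nbr):
                    continue
                visited.add(nbr)
                lfts[nbr][lids[dest]] = nbr_port
                queue.append(nbr)
    return lids, lfts


# Hypothetical two-leaf, one-spine fabric with one host per leaf.
example_links = [
    ("host1", 1, "sw-leaf1", 1),
    ("host2", 1, "sw-leaf2", 1),
    ("sw-leaf1", 2, "sw-spine1", 1),
    ("sw-leaf2", 2, "sw-spine1", 2),
]
lids, lfts = sweep_fabric(example_links, sm_node="host1")
for switch, table in sorted(lfts.items()):
    print(switch, table)
```

Nothing in the example is configured per switch by hand: the cable list is the only input, which is the point the comparison with Ethernet is making.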

Participants then review key InfiniBand concepts including GUIDs, LIDs, Linear Forwarding Tables (LFTs), PKeys, Service Levels (SLs), Virtual Lanes (VLs), Host Channel Adapters (HCAs), routing engines, Adaptive Routing, and Congestion Control. The lab also highlights AI-specific capabilities such as GPUDirect RDMA, SHARP in-network computing, and self-healing network technologies that enable efficient large-scale GPU communication.
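Of the concepts listed above, PKeys lend themselves to a short worked example. A PKey is a 16-bit value whose top bit marks full membership and whose lower 15 bits identify the partition; two ports can exchange traffic only if the partition bits match and at least one side is a full member. The sketch below shows that rule in isolation. The helper names are hypothetical and are not part of any UFM or driver API.

```python
FULL_MEMBER_BIT = 0x8000  # top bit set: full member; clear: limited member
PARTITION_MASK = 0x7FFF   # lower 15 bits identify the partition


def partition_of(pkey):
    """Return the 15-bit partition identifier of a PKey."""
    return pkey & PARTITION_MASK


def is_full_member(pkey):
    """Return True if the PKey carries the full-membership bit."""
    return bool(pkey & FULL_MEMBER_BIT)


def can_communicate(pkey_a, pkey_b):
    """Apply the partition rule to two PKeys.

    Traffic is allowed only when both keys name the same partition and
    at least one of them is a full member.
    """
    if partition_of(pkey_a) != partition_of(pkey_b):
        return False
    return is_full_member(pkey_a) or is_full_member(pkey_b)


# Default partition: 0xFFFF is the full-member key, 0x7FFF the limited one.
print(can_communicate(0xFFFF, 0x7FFF))  # full with limited
print(can_communicate(0x7FFF, 0x7FFF))  # limited with limited
print(can_communicate(0x8001, 0x8002))  # different partitions
```

The limited-with-limited case is the one that tends to surprise: two limited members of the same partition can each reach a full member but not each other.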

The primary focus of the exercise is the UFM interface. Through a series of read-only observations, participants examine fabric topology, device inventory, ports, cables, partitions, telemetry, alarms, system health, job management, and administrative settings. Special attention is given to the Subnet Manager configuration, telemetry collection, topology validation, and detection of non-optimal links, demonstrating how UFM transforms a highly automated fabric into an observable and manageable platform.
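The idea behind flagging non-optimal links can be sketched as a comparison between what a link negotiated and what it was expected to negotiate. The example below is an illustration only: the `LinkState` record, its field names, and the sample link names are hypothetical and do not reflect how UFM stores or exposes this data. The per-lane figures are the nominal rates for EDR, HDR, and NDR.

```python
from dataclasses import dataclass

# Nominal data rate per lane, in Gb/s.
LANE_GBPS = {"EDR": 25, "HDR": 50, "NDR": 100}


@dataclass
class LinkState:
    """Hypothetical record for one link; not a UFM data structure."""
    name: str
    active_speed: str
    active_width: int      # number of active lanes, e.g. 4 for a 4x link
    expected_speed: str
    expected_width: int


def bandwidth_gbps(speed, width):
    """Nominal link bandwidth for a speed generation and lane count."""
    return LANE_GBPS[speed] * width


def find_non_optimal(links):
    """Return (name, active_gbps, expected_gbps) for each degraded link."""
    findings = []
    for link in links:
        active = bandwidth_gbps(link.active_speed, link.active_width)
        expected = bandwidth_gbps(link.expected_speed, link.expected_width)
        if active < expected:
            findings.append((link.name, active, expected))
    return findings


sample_links = [
    LinkState("leaf1:p1 <-> spine1:p1", "NDR", 4, "NDR", 4),  # healthy
    LinkState("leaf1:p2 <-> spine2:p1", "HDR", 4, "NDR", 4),  # lower speed
    LinkState("leaf2:p1 <-> spine1:p2", "NDR", 2, "NDR", 4),  # fewer lanes
]
for name, active, expected in find_non_optimal(sample_links):
    print(f"{name}: {active} Gb/s active, {expected} Gb/s expected")
```

A link that comes up at a lower speed or with fewer lanes still passes traffic, which is why it is easy to miss without a view that compares active against expected values.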

By the end of the lab, participants should understand how InfiniBand fabrics are discovered and managed, how AI and HPC workloads benefit from native RDMA and in-network acceleration, and how UFM provides the operational visibility required to monitor, troubleshoot, and maintain large-scale GPU clusters.

Lab diagram

Goals and objectives

The goal of this lab is to introduce the architecture, operation, and management of AI-focused InfiniBand fabrics through a guided exploration of NVIDIA Unified Fabric Manager (UFM). Participants will examine how modern GPU clusters use InfiniBand to deliver high-performance, low-latency communication and how UFM provides the visibility and operational tools required to manage these environments at scale.

Upon completion of this lab, you will be able to:

Explain the architectural differences between InfiniBand and Ethernet fabrics, particularly in the context of AI and HPC workloads.
Describe the role of the Subnet Manager (SM) and its responsibilities for discovery, addressing, routing, and fabric configuration.
Identify and explain core InfiniBand concepts, including GUIDs, LIDs, Linear Forwarding Tables (LFTs), PKeys, Service Levels (SLs), Virtual Lanes (VLs), and Host Channel Adapters (HCAs).
Understand how InfiniBand delivers high-performance GPU communication through native RDMA, transport offload, lossless transport, Adaptive Routing, and Congestion Control.
Recognize the purpose and benefits of AI-focused technologies such as GPUDirect RDMA, SHARP in-network computing, and self-healing network capabilities.
Navigate the NVIDIA UFM interface and locate key operational views including topology maps, device inventories, ports, cables, partitions, telemetry, events, alarms, and system health dashboards.
Interpret fabric health information, link status, telemetry metrics, and fault conditions using UFM monitoring and diagnostic tools.
Understand how UFM supports topology validation, change tracking, performance monitoring, and troubleshooting in large-scale GPU environments.
Identify common indicators of fabric issues, such as unhealthy ports, degraded links, cable problems, firmware inconsistencies, and non-optimal link negotiations.
Describe how administrators use UFM to monitor and operate highly automated InfiniBand fabrics rather than manually configuring network behavior.

By completing this lab, participants will gain a practical understanding of how modern InfiniBand fabrics are deployed and managed in AI environments and how UFM provides the operational visibility necessary to maintain performance, reliability, and scalability across large GPU clusters.

Hardware and software

NVIDIA UFM appliance: ufm_appliance UFMAPL_1.13.2.3_UFM_6.22.2.3, 2025-08-28 07:59:39, x86_64

NVIDIA MQM9700-NS2F switch: x86_64, software version 3.12.6000

Ubuntu Server 24.04 (for the network traffic generators)
