Solution overview

This lab provides a guided exploration of an AI-focused InfiniBand fabric managed through NVIDIA Unified Fabric Manager (UFM). Rather than performing configuration tasks, participants examine how a modern InfiniBand environment operates and how UFM provides visibility into fabric health, topology, telemetry, and management functions.

The lab begins by establishing the architectural differences between InfiniBand and Ethernet. InfiniBand relies on a Subnet Manager (SM) to automatically discover devices, assign addresses, compute routes, and program switch forwarding tables. As a result, once a valid Clos topology is physically cabled, much of the fabric configuration occurs automatically. This contrasts with Ethernet environments, which typically require extensive manual or orchestrated configuration of addressing, routing, and traffic-management policies.
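The four Subnet Manager steps named above (discover, assign addresses, compute routes, program forwarding tables) can be illustrated with a toy model. This is a minimal sketch, not how any real Subnet Manager is implemented: the five-node fabric, the node names, and the helper functions are all hypothetical, LIDs are simply handed out in discovery order, and routes come from a plain shortest-hop search rather than a production routing engine.

```python
from collections import deque

# Hypothetical fabric: node -> {local port number: neighbor node}.
LINKS = {
    "spine1": {1: "leaf1", 2: "leaf2"},
    "leaf1": {1: "spine1", 2: "hca1"},
    "leaf2": {1: "spine1", 2: "hca2"},
    "hca1": {1: "leaf1"},
    "hca2": {1: "leaf2"},
}
SWITCHES = {"spine1", "leaf1", "leaf2"}


def discover(links, start):
    """Step 1: breadth-first sweep outward from the node hosting the SM."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for port in sorted(links[node]):
            nbr = links[node][port]
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order


def assign_lids(order):
    """Step 2: give each discovered node a LID (toy policy: 1, 2, 3, ...)."""
    return {node: lid for lid, node in enumerate(order, start=1)}


def hops_to(links, switches, dest):
    """Hop count from every node to dest; only switches forward traffic."""
    dist, queue = {dest: 0}, deque([dest])
    while queue:
        node = queue.popleft()
        if node != dest and node not in switches:
            continue  # end nodes (HCAs) do not forward packets
        for nbr in links[node].values():
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def build_lfts(links, switches, lids):
    """Steps 3 and 4: per-switch table mapping destination LID -> exit port."""
    lfts = {sw: {} for sw in switches}
    for dest, lid in lids.items():
        dist = hops_to(links, switches, dest)
        for sw in switches:
            if sw == dest or sw not in dist:
                continue  # own LID omitted in this simplified model
            for port in sorted(links[sw]):
                nbr = links[sw][port]
                forwards = nbr == dest or nbr in switches
                if forwards and dist.get(nbr) == dist[sw] - 1:
                    lfts[sw][lid] = port  # lowest port on a shortest path
                    break
    return lfts


order = discover(LINKS, "hca1")
lids = assign_lids(order)
lfts = build_lfts(LINKS, SWITCHES, lids)
print(lids)
print(lfts["leaf1"])
```

In this toy fabric, leaf1 ends up sending traffic for hca1's LID out of its host-facing port and everything else up to spine1. No per-switch configuration was written by hand, which is the point of the contrast with Ethernet drawn above.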

Participants then review key InfiniBand concepts including GUIDs, LIDs, Linear Forwarding Tables (LFTs), PKeys, Service Levels (SLs), Virtual Lanes (VLs), Host Channel Adapters (HCAs), routing engines, Adaptive Routing, and Congestion Control. The lab also highlights AI-specific capabilities such as GPUDirect RDMA, SHARP in-network computing, and self-healing network technologies that enable efficient large-scale GPU communication.
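Of the concepts listed above, PKeys are easy to show concretely. A P_Key is a 16-bit value: the low 15 bits identify the partition and the top bit marks full (1) or limited (0) membership. Two ports can exchange traffic only when their partition bits match and at least one side is a full member. The sketch below illustrates that rule; the function name and the sample values other than the default partition (0x7FFF) are hypothetical.

```python
MEMBERSHIP_BIT = 0x8000  # top bit: 1 = full member, 0 = limited member
BASE_MASK = 0x7FFF       # low 15 bits: partition identifier


def pkeys_match(a: int, b: int) -> bool:
    """Return True if two P_Keys allow their ports to communicate."""
    same_partition = (a & BASE_MASK) == (b & BASE_MASK)
    either_full = bool((a | b) & MEMBERSHIP_BIT)
    return same_partition and either_full


# Default partition: a full member (0xFFFF) can talk to a limited member (0x7FFF).
full_to_limited = pkeys_match(0xFFFF, 0x7FFF)
# Two limited members of the same partition cannot talk to each other.
limited_to_limited = pkeys_match(0x7FFF, 0x7FFF)
# Full members of different (hypothetical) partitions are isolated.
cross_partition = pkeys_match(0x8001, 0x8002)
print(full_to_limited, limited_to_limited, cross_partition)
```

This is the check that makes partitions useful for tenant isolation: limited members can reach shared services that hold full membership, but not one another.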

The primary focus of the exercise is the UFM interface. Through a series of read-only observations, participants examine fabric topology, device inventory, ports, cables, partitions, telemetry, alarms, system health, job management, and administrative settings. Special attention is given to the Subnet Manager configuration, telemetry collection, topology validation, and detection of non-optimal links, demonstrating how UFM transforms a highly automated fabric into an observable and manageable platform.

By the end of the lab, participants should understand how InfiniBand fabrics are discovered and managed, how AI and HPC workloads benefit from native RDMA and in-network acceleration, and how UFM provides the operational visibility required to monitor, troubleshoot, and maintain large-scale GPU clusters.

Lab diagram


Technologies