Learning path
InfiniBand for AI Fabrics
Skill Level
Fundamentals
Duration 1 hour 30 minutes
Updated Jun 15, 2026
About this learning path
Understand InfiniBand AI fabrics through their lossless architecture, SHARP in-network computing, and real-world economics. Then work through a full operational lifecycle, from day-zero design through UFM deployment and predictive maintenance, reinforced with hands-on lab practice. Learn how self-driving operations and InfiniBand technologies are shaping the next generation of AI factories.
Your instructors
Craig Kemmerer, World Wide Technology, Tech Solutions Arch II, ATC
Chirag Keshav Patel, World Wide Technology, Tech Solutions Eng III, ATC
Chris Nugent, World Wide Technology, Tech Solutions Arch III, ATC
Prerequisites
- A basic understanding of core networking concepts such as VLANs, IP addresses, and gateways.
- Conceptual knowledge of network fabrics and fabric design is preferred.
What you'll learn
- Explain InfiniBand's architectural advantages over Ethernet for AI and HPC workloads, including the Subnet Manager's role in discovery, addressing, and routing, and core constructs such as GUIDs, LIDs, LFTs, PKeys, Virtual Lanes, and HCAs.
- Describe how InfiniBand delivers high-performance GPU communication through native RDMA, hardware transport offload, lossless transport, Adaptive Routing, and Congestion Control.
- Recognize the purpose and benefits of AI-focused fabric capabilities, including GPUDirect RDMA, SHARP in-network computing, and self-healing network resilience.
- Navigate the NVIDIA UFM interface and locate key operational views, including topology maps, device inventories, port and cable status, partitions, telemetry, events, alarms, and system health dashboards.
- Interpret fabric health information, link status, telemetry metrics, and fault conditions to identify common indicators of fabric issues such as degraded links, cable problems, firmware inconsistencies, and suboptimal link negotiation.
- Understand how UFM supports topology validation, change tracking, performance monitoring, and troubleshooting to enable automated, scalable operation of large-scale InfiniBand fabrics in AI environments.