ATC
Foundations Lab

NVIDIA GPU Operator

Solution overview

This lab provides an intensive, hands-on deep dive into the NVIDIA GPU Operator and Multi-Instance GPU (MIG) partitioning, designed to bridge the gap between standard Kubernetes administration and high-performance AI/ML infrastructure. Participants turn a single physical GPU into a flexible, multi-tenant resource, maximizing hardware utilization while maintaining strict performance isolation between workloads.

What This Lab Offers

End-to-End GPU Lifecycle Management: Participants move from raw infrastructure to a fully automated GPU stack using the NVIDIA GPU Operator to manage drivers, container toolkits, and device plugins.
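As a sketch of the installation flow the lab automates, the Operator is typically deployed from NVIDIA's Helm repository; the namespace and release names below are illustrative defaults, not lab-specific values:

```shell
# Add NVIDIA's Helm repository and refresh the local chart index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Install the GPU Operator; it then manages the driver, container
# toolkit, and device plugin as DaemonSets on GPU nodes
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Watch the operand pods come up
kubectl get pods -n gpu-operator --watch
```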

Hardware-Level Partitioning (MIG): Practical experience in reconfiguring A100/H100 GPUs into up to seven independent hardware instances, providing dedicated compute and memory for varied workloads.
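A sketch of how that partitioning is requested declaratively: with the Operator's MIG manager running, labeling a node selects a predefined MIG profile, which the manager then applies. The node name is a placeholder, and the available profiles depend on the GPU model (for example, all-1g.5gb on a 40 GB A100 versus all-1g.10gb on an 80 GB A100 or H100):

```shell
# Request the "all 1g" partitioning profile on one node
kubectl label node <node-name> nvidia.com/mig.config=all-1g.10gb --overwrite

# After reconfiguration completes, confirm the MIG resources the node advertises
kubectl describe node <node-name> | grep nvidia.com/mig
```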

Production-Ready Workload Patterns: Hands-on deployment of PyTorch batch jobs and persistent inference services, complete with resource limit optimization and node targeting.
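As an illustrative example of the batch-job pattern, the manifest below runs a short PyTorch command against a single MIG slice. The job name and image tag are assumptions, and the exact resource name exposed (here nvidia.com/mig-1g.10gb) depends on the configured MIG profile and strategy:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: torch-batch            # illustrative name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.05-py3   # assumed tag; pin a real one
        command: ["python", "-c", "import torch; print(torch.cuda.get_device_name(0))"]
        resources:
          limits:
            nvidia.com/mig-1g.10gb: 1   # one MIG slice, not a whole GPU
```

Because each MIG profile is advertised as its own extended resource, the limits block doubles as both a scheduling constraint and a hard isolation boundary.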

Infrastructure-as-Code Mastery: Use of Helm and Kubernetes Custom Resources (the ClusterPolicy CRD) to define and enforce GPU configurations across a cluster.
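A minimal sketch of the ClusterPolicy knob that governs how MIG devices are advertised, assuming the Operator's standard CRD fields: single exposes every slice as a generic nvidia.com/gpu, while mixed advertises each profile as its own named resource:

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  mig:
    strategy: mixed   # or "single"
```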

Lab diagram