AI Infrastructure Engineers

NVIDIA DGX BasePOD

This learning path covers NVIDIA's DGX systems and BasePOD infrastructure, detailing the setup, licensing, and management of Base Command Manager and DGX OS for high-performance AI workloads. It explains hardware requirements, network configurations, and system provisioning, with an emphasis on efficient resource management, scalability, and optimized AI model training on NVIDIA's computing platforms.

Learning Path

NVIDIA AI Enterprise

NVIDIA AI Enterprise (NVAIE) offers a suite of AI tools for applications including reasoning, speech and translation, biomedical, content generation, and route planning. It features community, NVIDIA, and custom models. NVAIE provides NIM microservices and CUDA-X libraries, along with security advisories, enterprise support, cluster management, and infrastructure optimization. Designed for cloud, data center, workstation, and edge environments, NVAIE supports scalable, secure, and efficient AI deployment.

Learning Path

NVIDIA DGX SuperPOD and DGX BasePOD Day 2 Operations

This Learning Series was created for NVIDIA DGX admins and operators to explore the tasks they will perform on Day 2 when administering NVIDIA DGX SuperPOD and BasePOD environments with Base Command Manager (BCM). It details how to update firmware, patch systems, run jobs against the infrastructure, and integrate other components into BCM (switches, Active Directory, cloud, and so on).

Learning Path

High Performance AI/ML Networking

Today, network engineers, especially in the data center space, must acquire AI/ML infrastructure skills and be able to explain the required infrastructure upgrades, and the reasoning behind them, to upper management. At WWT, we are committing $500 million to help our customers with AI/ML, and we have launched a new series of Learning Paths to help readers navigate complex AI topics. By mastering these topics, data center network engineers can contribute effectively to implementing and managing advanced AI and HPC infrastructure, aligning technological capabilities with business objectives while maintaining a robust and secure network environment.

Learning Path

Building Cisco RoCE fabric for AI/ML using Nexus Dashboard

In this learning path, you will learn the components of RoCE (RDMA over Converged Ethernet) and why it is essential for clean, fast, and reliable AI/ML compute communication.

Learning Path

AI High-Performance Computing

High-performance computing (HPC) is a rapidly evolving field that enables researchers, scientists and engineers to solve complex problems and drive innovation across various domains. As the demand for computational power continues to grow, professionals with skills in HPC are becoming increasingly valuable in today's job market. This learning path is designed to provide you with a comprehensive understanding of HPC concepts, technologies and best practices, empowering you to harness the power of supercomputers and parallel processing to tackle the most challenging computational tasks.

Learning Path

High Performance Storage for AI

Explore the critical role of high-performance storage in AI infrastructure. Gain insights into storage requirements for AI/ML workloads, architectures like distributed file systems and all-flash arrays, and strategies to optimize storage for model training and inference. Stay ahead with emerging trends shaping the future of AI storage solutions.

Learning Path

NVIDIA Run:ai for Platform Engineers

Welcome to the NVIDIA Run:ai for Platform Engineers Learning Path! This learning path is designed to build both foundational knowledge and practical skills for platform engineers and administrators responsible for managing GPU resources at scale. It begins by introducing learners to the key components of the NVIDIA Run:ai platform, including its Control Plane and Cluster, and explains how NVIDIA Run:ai extends Kubernetes to orchestrate AI workloads efficiently. The learning path then covers essential topics such as authentication and role-based access, organizational management through projects and departments, and workload operations using assets, templates, and policies. Learners will also explore GPU fractioning to understand how NVIDIA Run:ai maximizes GPU utilization and ensures fair resource allocation across teams. All this builds toward a hands-on lab experience designed to reinforce your learning and give you practical experience working directly with NVIDIA Run:ai.

Learning Path

NVIDIA DGX SuperPOD and DGX BasePOD Day 3 Operations

This Learning Series was created for NVIDIA DGX admins and operators to explore the tasks they will perform on Day 3 when administering NVIDIA DGX SuperPOD and BasePOD environments with Base Command Manager (BCM). It covers advanced topics including cmshell, cloud bursting from BCM, high availability (HA) for head nodes, InfiniBand (IB) setup and testing of worker nodes, and Active Directory integration, as well as advanced workload topics such as deploying Kubernetes from Base Command Manager.

Learning Path

Introduction to NVIDIA NIM for LLM

This learning path introduces NVIDIA NIM for LLM microservices, covering its purpose, formats, and benefits. You'll explore deployment options via API Catalog, Docker, and Kubernetes, and complete hands-on labs for Docker and Kubernetes-based inference workflows. Along the way, you'll build the skills to deploy, scale, and integrate GPU-optimized LLMs into enterprise applications.

Learning Path