27 results found
Using PFC and ECN queuing methods to create lossless fabrics for AI/ML
Widely available GPU-accelerated servers, combined with improved hardware and popular programming languages like Python and C/C++, along with frameworks such as PyTorch, TensorFlow and JAX, simplify the development of GPU-accelerated ML applications. These applications serve diverse purposes, from medical research to self-driving vehicles, relying on large datasets and GPU clusters for training deep neural networks. Inference frameworks then apply knowledge from trained models to new data, using clusters optimized for performance.
The learning cycles involved in AI workloads can take days or weeks, and high-latency communication between server clusters can significantly lengthen completion times or cause outright failure. AI workloads demand low-latency, lossless networks, requiring appropriate hardware, software features, and configurations. This article explains the advanced queueing solutions, ECN and PFC, implemented in the Network Operating Systems (NOS) of all the major OEMs.
Article
•Jun 25, 2024
6 Steps to Understanding Cisco ACI
Once understood, these six concepts will help anyone new to ACI follow a more detailed technical discussion.
Article
•Jun 28, 2023
Understanding Data Center Quantized Congestion Notification (DCQCN)
RoCEv2 is a solution for achieving swift data throughput and minimal delay in modern data centers. It incorporates features like Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) to establish a lossless network environment. PFC manages data flow at the interface level, while ECN detects and mitigates congestion before PFC activation becomes necessary. The combination of ECN and PFC, known as Data Center Quantized Congestion Notification (DCQCN), optimizes congestion management in RDMA networks. Careful tuning of queue thresholds is crucial to prevent hot spots and ensure low Job Completion Times (JCTs). The use of ECN and PFC is necessary for maintaining a lossless fabric in GPU-to-GPU communication during AI/ML training runs.
Article
•Jun 16, 2024
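The DCQCN summary above describes an ordering of mechanisms: ECN marks packets to slow senders before queue depth ever reaches the point where PFC must pause the link. A minimal sketch of that ordering follows; the threshold names and values are illustrative assumptions for this example, not vendor defaults or a real NOS API.

```python
# Hypothetical illustration of DCQCN threshold ordering: ECN marking
# engages at lower queue depths than the PFC pause (XOFF) backstop.
# All threshold values below are assumptions chosen for the example.

ECN_MIN_KB = 150     # begin probabilistic CE marking above this depth
ECN_MAX_KB = 3000    # mark every packet above this depth
PFC_XOFF_KB = 4000   # last resort: send a pause frame upstream

def queue_action(depth_kb: int) -> str:
    """Return the congestion action taken at a given egress queue depth."""
    if depth_kb >= PFC_XOFF_KB:
        return "pfc-pause"        # lossless backstop: halt the sender
    if depth_kb >= ECN_MAX_KB:
        return "ecn-mark-all"     # every packet marked; sender backs off hard
    if depth_kb >= ECN_MIN_KB:
        return "ecn-mark-some"    # probabilistic marking; gentle slowdown
    return "forward"              # no congestion action

for depth in (100, 1000, 3500, 4500):
    print(depth, queue_action(depth))
```

The point the article makes about tuning follows directly from this structure: if the ECN thresholds sit too close to the PFC XOFF level, senders are paused before they ever receive marking feedback, creating the hot spots and inflated Job Completion Times the blurb warns about.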
Introduction to Arista's AI/ML GPU Networking Solution
AI workloads require significant data and computational power, with billions of parameters and complex matrix operations. Inter-network communication accounts for a significant portion of job completion time. Traditional network architectures are insufficient for large-scale AI training, necessitating investments in new network designs. Arista Networks offers high-bandwidth, low-latency and scalable connectivity for GPU servers, with features like Data Center Quantized Congestion Notification and intelligent load balancing. Arista's AI Leafs and Spines provide high-density, high-performance switches for AI networking. Different network designs are recommended based on the size of the AI application, and a dedicated storage network is recommended to handle the large datasets used in AI training. Arista's CloudVision Portal and AI Analyzer tools provide automated provisioning and deep flow analysis. Arista's IP/Ethernet switches are well-suited for AI/ML workloads, offering energy-efficient interconnects and simplified network management.
Article
•Jun 25, 2024
Introduction to NVIDIA's AI/ML GPU networking solutions
This article discusses the importance of deploying AI applications and training models using distributed computing and the need for significant computational resources. It highlights the role of network efficiency and scalability in large-scale AI deployments.
Article
•Aug 6, 2024
eBook: Infrastructure Built for Tomorrow
An in-depth guide for IT decision makers navigating the complex IT landscape.
eBook
•Oct 27, 2023
Use the Nexus Dashboard Free Trial to Proactively Monitor Your ACI Fabric
Learn how to create a 90-day POC to verify fabric performance, troubleshoot issues and validate the usefulness of the Nexus Dashboard and Day 2 operations suite.
White Paper
•Apr 9, 2025
The Risk of End of Support (EoS) Infrastructure in Your Data Center
This article examines what End of Support/End of Life means, how it can affect your business and the steps for building a plan to refresh your data center.
Article
•May 2, 2023
MP-BGP EVPN VXLAN for the beginner
The article below covers VXLAN encapsulation and how MP-BGP is used to learn and forward Layer 2 and Layer 3 traffic across it. To get the most from the content, we recommend that readers have prior knowledge of BGP and the MP-BGP routing protocol. For those unfamiliar with these concepts, we suggest first reading the "MP-BGP for the beginner" article in our foundational learning path. MP-BGP EVPN VXLAN may seem intimidating at first, but this beginner's guide will help you clearly understand how it works. It's a technology that has gained popularity over the last few years, with many companies adopting it.
Article
•Apr 23, 2025
Segmenting Complex Environments Using Cisco ACI
ACI is a powerful technology offering rich SDN features, including application-centric security segmentation, automation and orchestration in the data center.
White Paper
•Jun 3, 2023
The Future of Intent-based Networking and Multi-domain Architectures: Part II
The second in our series, this article explores what intent-based networking (IBN) is and how organizations can leverage it to build multi-domain architectures.
Article
•Apr 9, 2025
Optical Data Center Interconnect: Connecting Your Data Centers With Private DWDM Technology
Optical Data Center Interconnect (DCI) provides a cost-saving, high-density, flexible alternative to leased circuits for connecting geographically separated data centers.
Article
•Aug 28, 2023