Priority Flow Control: The Stop Light of the AI Networking Highway
Introduction
You may recall that in my last post, on Explicit Congestion Notification (ECN), we talked about slowing down to speed up and how that applies to Ethernet networking for Artificial Intelligence (AI)/High-Performance Computing (HPC) environments.
Now, we are going to talk about completely stopping the flow of traffic to help manage network congestion. Not all traffic is stopped; only specific flows are halted. Think of it like a traffic light controlling multiple lanes of traffic. At times, you may have traffic flowing in one lane (left turn), while everyone else is stopped. This is done to allow people to turn left; otherwise, there is too much congestion to do so.
This is where Priority Flow Control (PFC) comes into play. PFC is a mechanism designed to prevent data loss during network congestion by pausing specific data flows. It acts like a traffic light system, controlling which data packets can pass during peak times to prevent collisions and bottlenecks. PFC is a key component in creating lossless Ethernet fabrics, which are critical for applications such as Remote Direct Memory Access (RDMA).
This is the third in a series of blogs covering the challenges of using Ethernet for GPU networking. The first two blogs targeted issues around ECMP and ECN; this will focus on PFC.
Priority Flow Control (PFC) history
PFC is an enhancement to the existing Ethernet flow control pause command. The original IEEE 802.3x Ethernet PAUSE mechanism, defined in 1997, would halt all traffic on a link when a buffer was full, regardless of the traffic type. This could negatively impact flows that were not causing congestion.
Think of it as a red traffic light for all lanes in all directions. All we get to do is look, honk, or yell at each other because we want to get to our destination. To me, this can be as bad as every light being green and trying to figure out how to get through the intersection without being clobbered. That just makes me think of a scene in "Live Free or Die Hard" with Bruce Willis, but I digress.
To address this limitation, the IEEE 802.1Qbb standard, formalized in 2011 as part of the Data Center Bridging (DCB) suite, introduced PFC. PFC allows a device to send a pause frame for a specific traffic class, identified by its 802.1p priority level, rather than pausing all traffic on the link. This means you can have up to eight separate virtual links on a single physical link and pause any of them independently to create a no-drop class of service. I recall working with this before the official standard was released to support Fibre Channel over Ethernet (FCoE) back in the day.
In the context of AI network architecture, PFC offers several significant benefits. One of the primary advantages is its ability to provide a granular, link-level flow control mechanism, which prevents packet loss on a per-priority basis. This is particularly important in AI workloads, where large datasets and GPU clusters are used for training deep neural networks and any packet loss can degrade performance. By temporarily pausing traffic for specific classes of service in response to congestion, PFC helps to maintain lossless networks, which are crucial for AI networks that utilize RoCE for storage and backend traffic.
As I mentioned in the ECN blog, there are complexities in deploying some of the congestion control mechanisms. PFC does not have nearly the same complexities as ECN. The Meta study I referred to in previous blogs even showed how they ended up turning ECN off and running only PFC. Having that lossless network was more important than the congestion notification features.
How does PFC work?
Priority Flow Control (PFC) is an enhancement to the original Ethernet flow control mechanism, designed to create lossless Ethernet fabrics. While Explicit Congestion Notification (ECN) is a reactive end-to-end congestion notification system, PFC takes a different approach by temporarily stopping the flow of traffic. Think of PFC as a traffic light system. It doesn't stop all traffic on a link; instead, it only halts specific data flows (or "lanes") that are causing congestion. This mechanism is crucial for applications like RDMA where data loss must be prevented.
PFC-enabled ports have several defined thresholds. One is xOFF, which defines the buffer level at which a pause frame is sent for that class of traffic, shown by the dotted red line in Figure 1. The other is xON, the threshold at which pause frames stop being sent. Once pause frames are sent, the buffer will eventually drain to the point where traffic can resume flowing normally; xON is the threshold that allows that to happen and is represented by the green line.
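To make the hysteresis between the two thresholds concrete, here is a minimal Python sketch of how a PFC-enabled queue might apply them. The cell counts, names, and print statements are illustrative assumptions for the sketch, not vendor defaults.

```python
# Minimal sketch of xOFF/xON hysteresis on a single PFC-enabled queue.
# Threshold values and units (cells) are illustrative, not vendor defaults.

class PfcQueue:
    def __init__(self, xoff_cells=800, xon_cells=400):
        self.xoff = xoff_cells      # buffer depth that triggers a pause frame
        self.xon = xon_cells        # buffer depth at which traffic may resume
        self.depth = 0              # current buffer occupancy (in cells)
        self.paused = False         # True while we are asserting pause upstream

    def enqueue(self, cells):
        self.depth += cells
        if not self.paused and self.depth >= self.xoff:
            self.paused = True
            print("xOFF crossed: send PFC pause frame for this priority")

    def dequeue(self, cells):
        self.depth = max(0, self.depth - cells)
        if self.paused and self.depth <= self.xon:
            self.paused = False
            print("xON crossed: send pause frame with zero quanta (resume)")
```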
Traffic is classified into various classes using both Layer 2 and Layer 3 methods. Layer 2 classification uses 802.1p Class of Service (CoS), a three-bit field in the Ethernet header with eight values (0-7). Layer 3 classification uses the Differentiated Services Code Point (DSCP), a six-bit field in the IP header. In many cases, RoCEv2 traffic is assigned to Priority Class 3, which is defined as lossless; Priority Class 3 also equates to 802.1p CoS 3. That means RoCEv2 traffic matching the DSCP value mapped to Priority Class 3 is treated differently from HTTPS traffic carrying a lower-priority DSCP value. The HTTPS traffic is handled as best effort and is susceptible to packet drops and lossy behavior.
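As a rough illustration of that classification step, the short Python sketch below maps DSCP values to CoS and flags which classes are lossless. The DSCP 24 to CoS 3 mapping mirrors the common RoCEv2 example above; the names and the fallback to CoS 0 are assumptions made for the sketch.

```python
# Simplified sketch of classifying packets into priority classes.
# DSCP 24 -> CoS 3 (lossless) mirrors the common RoCEv2 mapping; everything
# else falls into CoS 0 (best effort).

DSCP_TO_COS = {24: 3}          # RoCEv2 traffic marked DSCP 24 maps to CoS 3
LOSSLESS_COS = {3}             # CoS values PFC protects with pause frames

def classify(dscp: int) -> tuple[int, bool]:
    """Return (cos, lossless) for a packet's DSCP value."""
    cos = DSCP_TO_COS.get(dscp, 0)      # unmatched traffic is best effort
    return cos, cos in LOSSLESS_COS

print(classify(24))   # (3, True)  -> RoCEv2, protected by PFC
print(classify(10))   # (0, False) -> e.g. HTTPS, lossy best effort
```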
In our example, traffic flows from hosts A and C to host B, as shown in Figure 2. Everything is going well from the leaf switches up to the spine, but once all that traffic reaches leaf 2, we start to see congestion. The buffer for CoS 3 begins to build and reaches the xOFF threshold. The switch reacts by sending a PFC pause frame, but only for that CoS, to pause the RDMA frames; any other traffic is subject to packet drops. At this point, S1 stops sending packets classified in CoS 3 down that uplink.
Now that S1 has stopped sending CoS 3 classified packets to L2, they only have one place to go: the buffer. That will build until it hits the xOFF threshold on S1. Once that happens, it will send PFC frames down to the leaf switches that are sourcing that CoS 3 traffic. That is shown below in Figure 3.
As you can guess, this same process happens at leaf 1 and leaf 3 as the buffers fill on those ports. This means a PFC pause frame is sent to the original senders. Both hosts A and C then stop sending traffic in that priority class up to the leaf switches, as shown in Figure 4.
As the PFC pause frames work through the fabric, the buffers eventually clear as packets are delivered to the destination. This process lets us send RDMA packets in a lossless manner, because pause frames are sent at a threshold rather than when the buffer is full, allowing packets to drain from the buffers instead of overflowing them and being dropped. Once a buffer drains down to the xON threshold, a PFC pause frame with a zero time value is sent to the host or neighboring switch to let it know traffic for that CoS can restart.
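The "pause" and "resume" messages described above are the same 802.1Qbb frame carrying different quanta values. The sketch below assembles that frame in Python; the field layout (reserved destination MAC 01:80:C2:00:00:01, MAC Control EtherType 0x8808, opcode 0x0101, a priority-enable vector, and eight 16-bit pause quanta) follows the standard, but the helper itself is an illustrative toy, not a full frame builder (no minimum-size padding or FCS).

```python
import struct

PFC_DMAC = bytes.fromhex("0180C2000001")   # reserved MAC Control address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101                        # classic 802.3x PAUSE uses 0x0001

def pfc_frame(src_mac: bytes, quanta_per_priority: dict[int, int]) -> bytes:
    """Build a bare PFC frame (no padding/FCS).

    quanta_per_priority maps a priority (0-7) to a pause time in 512-bit
    quanta. A quanta value of 0 for an enabled priority tells the peer it
    may resume sending that priority.
    """
    enable_vector = 0
    quanta = [0] * 8
    for prio, time in quanta_per_priority.items():
        enable_vector |= 1 << prio
        quanta[prio] = time
    header = PFC_DMAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    return header + struct.pack("!HH8H", PFC_OPCODE, enable_vector, *quanta)

pause_cos3 = pfc_frame(bytes(6), {3: 0xFFFF})   # pause priority 3 (max quanta)
resume_cos3 = pfc_frame(bytes(6), {3: 0})       # zero quanta: priority 3 resumes
```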
The implementation of PFC is critical for achieving the lossless transport required by RoCEv2, yet this hop-by-hop backpressure mechanism inherently introduces stability risks. Persistent congestion or specific traffic patterns can lead to a condition known as a PFC pause storm or, in the worst case, a PFC-induced deadlock, where queues across multiple switches become permanently stalled and the low-latency fabric collapses. The PFC Watchdog is therefore a mandatory operational component and the final line of defense. It continuously monitors PFC-enabled queues for extended periods of inactivity while a pause is asserted. Upon detecting this stalled condition, the Watchdog performs a mitigating action, typically disabling the affected queue and dropping its packets, to intentionally break the deadlock or pause-storm cycle, restoring data path movement and preserving the overall health and availability of the RDMA network. Each of the configuration examples below will show how to implement the PFC Watchdog.
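Before getting into vendor CLI, here is a minimal sketch of the watchdog idea in Python. It assumes a hypothetical queue object with paused/depth state and is_draining(), disable_pfc(), and flush() methods; the polling interval mirrors the 100 ms default mentioned in the Cisco example below, while the method names and mitigation steps are illustrative, not any vendor's implementation.

```python
# Minimal sketch of PFC watchdog logic: if a lossless queue stays paused and
# does not drain for longer than the configured interval, assume a pause
# storm or deadlock and break it. Real implementations vary by vendor.

WATCHDOG_INTERVAL_S = 0.1   # 100 ms, matching the default in the Cisco example

def poll_watchdog(queue, stalled_since, now, interval=WATCHDOG_INTERVAL_S):
    """One polling pass; returns the updated 'stalled since' timestamp."""
    if queue.paused and queue.depth > 0 and not queue.is_draining():
        if stalled_since is None:
            return now                    # stall detected; start the timer
        if now - stalled_since >= interval:
            queue.disable_pfc()           # stop honoring pause on this queue
            queue.flush()                 # drop its packets to break the stall
            return None
        return stalled_since
    return None                           # queue is moving again; reset timer
```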
How do we implement this?
The following sections will explore how to implement PFC for Cisco, NVIDIA, and merchant silicon provider networking hardware and infrastructure. These are not comparisons of the technologies or switch vendors. We are referring to their publicly available configuration guides on implementing the technology. No single technology or OEM is considered better than the others in this section.
Merchant silicon providers
Numerous merchant silicon providers exist, and each switch can support different capabilities. Here we will show a basic PFC configuration on one such switch. First, we inform the switch that queues three and four will be lossless queues, with all other traffic mapped to queue 0, the best-effort queue. Those two queues map directly to 802.1p code points, or Class of Service (CoS), 011 and 100, respectively. We must map them with the classifiers command so that incoming traffic is classified into the proper lossless classes. Next, we create a congestion-notification-profile (CNP) lossless-cnp that maps the 802.1p code-point priority of three (011) to queues three and four, and configure the PFC watchdog to prevent PAUSE storms. Lastly, we assign the CNP to an interface, as shown in Figure 6.
Cisco
Deploying PFC on a Cisco switch is well documented, dating back to the days of FCoE. It uses the Modular QoS CLI (MQC). Here we will focus on the PFC configuration. The ECN configuration was covered in my ECN blog and is abbreviated here in Figure 7 to keep the configuration brief while still providing context.
First, we have to define the network-qos so that queue three supports traffic from 802.1p CoS 3. This will state that anything in queue three has a CoS of three and will use pause frames as required to provide a lossless fabric. Please note that DSCP 24 (RoCE traffic) is getting classified into CoS 3. Then we will apply the network-qos policy to the system QoS. Lastly, we have to attach the QoS service policy to the interface and enable the interface for PFC support with a watchdog using the default of 100 ms. The example is shown in Figure 8 below.
Implementing ECN and PFC in Nexus Dashboard is quite easy. Note that certain versions of Nexus Dashboard require you to include the PFC commands explicitly, as illustrated in Figure 9.
NVIDIA
NVIDIA® Cumulus® supports both PFC and traditional 802.3x link pause. It is important to follow the documentation closely as 802.3x is not what you want to be using. Remember, 802.3x pauses all traffic, not a specific queue! Before you configure PFC, you will want to configure the buffer pool memory allocation, as shown in Figure 10 below.
To configure PFC, we first define the switch priority that will use PFC; in this case, we use three to support CoS 3. Next, we instruct the switch to enable the sending and receiving of pause frames. In many cases, a cable-length setting of 50 is sufficient for smaller AI networks; when the network spans multiple pods, you should revisit this setting. Lastly, we configure the PFC Watchdog. Remember, it is always important to prevent pause storms on a PFC-enabled network. The example is shown in Figure 11.
As discussed in my previous ECN blog, NVIDIA's Spectrum™-X Ethernet architecture has helped address some of the challenges of configuring and tuning the QoS parameters to support RoCEv2. The nv set qos roce command has made deployment significantly easier, as shown in Figure 12.
Summary
PFC is a crucial part of deploying a lossless AI/ML network. Enabling it allows traffic in specific queues or classes of service to be paused, eliminating packet loss rather than relying on retransmission mechanisms within the protocols themselves.