Sometimes you have to slow down to speed up

Many years ago, I raced superbikes around the Midwest. One of the key pieces of advice I received then was the concept of slowing down to speed up. Sometimes, going as fast as you can into corners leads to either exiting too slowly or ultimately crashing. 

Ethernet can have the same problem when congestion sets in, and that congestion can kill performance on an artificial intelligence (AI) or high-performance computing (HPC) network. On the track, you back off the pace a bit before the corner to keep better control. Ethernet does the same thing: it slows the sending of packets to help alleviate network congestion and improve performance. This is where Explicit Congestion Notification (ECN) comes into play.

With ECN, we have a tool to help the network react to congestion points within the fabric. Those of us who were around in the Frame Relay days may recall a similar function with FECNs and BECNs. ECN is often paired with Priority-based Flow Control (PFC); the combination is frequently called Data Center Quantized Congestion Notification (DCQCN).

This is the second in a series of blogs covering the challenges of using Ethernet for GPU networking. The first blog targeted issues around ECMP; this one focuses on ECN. PFC will come up occasionally, but it will be covered in depth in a separate blog in the near future.

History of Explicit Congestion Notification (ECN)

ECN is an extension to the Internet Protocol (IP) and Transmission Control Protocol (TCP), allowing end-to-end network congestion notifications without dropping packets. ECN was initially established for TCP/IP networks in RFC 3168, published in 2001. The primary goal of ECN is to improve network performance by signaling congestion before packet loss occurs, allowing for more efficient and stable data transmission. 

In the context of AI network architecture, ECN offers several significant benefits. One of the primary advantages is its ability to provide long-term congestion feedback, which facilitates the adaptive regulation of data transmission rates. This is particularly important in AI workloads, where large datasets and GPU clusters are used for training deep neural networks. By modulating traffic flow in response to congestion signals, ECN helps to maintain low-latency, lossless networks, which are crucial for the efficient training and inference of AI models.

Implementing ECN in AI network architecture also presents several challenges despite its benefits. One of the main challenges is the complexity of configuring and tuning ECN parameters to achieve optimal performance. This requires a deep understanding of network behavior and careful calibration of ECN thresholds and marking algorithms. In many cases, ECN is paired with PFC to help address congestion control in AI networks. A Meta study showed they experienced this complexity when testing a cluster with 24k GPUs.

Ultimately, they turned ECN off and found the cluster to be more performant without it. The challenge of implementing ECN has driven AI fabric vendors to introduce different controls to reduce this complexity. Some vendors are extending ECN to the NIC, where the AI/HPC workload resides, so that it can participate in end-to-end telemetry and monitoring. This allows the fabric to actively update ECN values to address hotspots in the network.

How does ECN work?

ECN is a reactive, end-to-end congestion notification mechanism between a pair of Ethernet endpoints, which means both the sender and receiver must support ECN. It is enabled on the switches' output queues, which use packet buffers to manage congestion. These output queues should also have weighted random early detection (WRED) enabled so that packets are dropped proactively instead of being tail-dropped en masse once the queue reaches its high-water mark. With ECN, however, the switch marks packets rather than dropping them as congestion builds, and the sending endpoint responds by reducing its packet rate.

How is this accomplished?

The IP packet's Type of Service (ToS) field is eight bits in length. Differentiated Services Code Point (DSCP) uses six of those bits to support various levels of prioritization. That leaves two bits for ECN (Figure 1).

Figure 1: IP Packet Header

 

Two bits give ECN four possible values, as shown in Table 1, although only three meanings are defined. If the ECN bits are 00, the packet is not ECN-capable. When the bits are 01 or 10, the sender is ECN-capable. The bits are re-marked to 11 when congestion is experienced.

ECN Bits    Definition
00          The packet is not ECN-capable
01          The packet is ECN-capable
10          The packet is ECN-capable
11          Congestion experienced

Table 1 – ECN Bit Values

 

I found Cisco's Data Center Networking Blueprint for AI/ML Applications to have a fantastic visual that helps explain how ECN works. The first diagram below (Figure 2) shows hosts A and B communicating with host X. Leaf switches L1 and L2 have no congestion, and neither does spine switch S1. During this traffic flow towards host X, the ECN bits remain set to 10 until the packets reach switch LX. Unfortunately, both flows aggregate at leaf switch LX, creating congestion on the output buffer to host X. Once that buffer crosses the WRED minimum threshold, switch LX re-marks the ECN bits to 11 and forwards the packets to host X.

Figure 2: ECN Congestion to Host X

 

Figure 3 shows the response once host X receives a packet with the ECN bits set to 11. Host X transmits a Congestion Notification Packet (CNP) back to hosts A and B. This is where the fun kicks in. If only a few CNPs are received here and there, the hosts continue to transmit as they were. Once the WRED maximum buffer threshold is reached, all packets are marked 11. This generates significantly more CNPs, which causes each sending host to react to the congestion and reduce its transmission rate until the number of CNPs it receives falls off.

Figure 3: Host X is sending CNP packets to senders

 

How do we implement this?

The following sections explore how to implement ECN on networking hardware and infrastructure from Cisco, NVIDIA, and merchant silicon providers. These are not comparisons of the technologies or switch vendors; we are simply working from their publicly available configuration guides on implementing the technology. No single technology or OEM is considered better than the others in this section.

Merchant silicon providers

Numerous merchant silicon providers exist, and each switch can support different capabilities. In most cases, they will support the use of ECN. The question will be whether they support static and/or dynamic ECN. 

In this first example, we will demonstrate the configuration of static ECN. To do this, we first establish a WRED drop profile, defined with a starting fill level of 25 percent and an ending fill level of 75 percent, with the drop probability ramping from zero percent up to a maximum of 60 percent. Next, we create a scheduler for ECN and associate it with the WRED profile we just created, giving it a buffer size and transmit rate of 25 percent. Then we map that ECN scheduler (ecn-sched) to a best-effort forwarding class (ecn-map). Lastly, we add the best-effort forwarding class to the ETS forwarding-class profile group and associate the scheduler map with it. Below is an example of the static ECN configuration (Figure 4). As you can see, configuring static ECN is not for the faint of heart.

Figure 4: Merchant silicon static ECN configuration
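As a rough sketch of the steps above, here is what such a configuration might look like in Junos-style CLI (one common NOS on merchant silicon switches). The profile, scheduler, map, and interface names are illustrative assumptions, and exact keywords vary by vendor and release, so treat this as the shape of the configuration rather than a copy-and-paste recipe.

# Illustrative Junos-style sketch; names (ecn-wred, ecn-sched, ecn-map, ets-fc-set, ets-tcp) are assumptions.
# WRED drop profile: drop probability interpolated from 0 to 60 percent between 25 and 75 percent queue fill.
set class-of-service drop-profiles ecn-wred interpolate fill-level [ 25 75 ] drop-probability [ 0 60 ]
# Scheduler with ECN enabled, 25 percent buffer and transmit rate, tied to the WRED profile.
set class-of-service schedulers ecn-sched buffer-size percent 25
set class-of-service schedulers ecn-sched transmit-rate percent 25
set class-of-service schedulers ecn-sched explicit-congestion-notification
set class-of-service schedulers ecn-sched drop-profile-map loss-priority any protocol any drop-profile ecn-wred
# Map the scheduler to the best-effort forwarding class.
set class-of-service scheduler-maps ecn-map forwarding-class best-effort scheduler ecn-sched
# Add best-effort to an ETS forwarding-class set and apply the scheduler map to an interface.
set class-of-service forwarding-class-sets ets-fc-set class best-effort
set class-of-service traffic-control-profiles ets-tcp scheduler-map ecn-map
set class-of-service interfaces et-0/0/1 forwarding-class-set ets-fc-set output-traffic-control-profile ets-tcp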

Another option is to use dynamic ECN. Unlike the static approach above, which sets fixed trigger thresholds, dynamic ECN adjusts its thresholds based on real-time conditions. Below is an example of dynamic ECN (Figure 5). First, you configure it like the static ECN example in Figure 4; then, you define the dynamic ECN profile and assign it to interfaces.

Figure 5: Merchant silicon dynamic ECN configuration

Cisco

The Cisco Validated Design (CVD) documents the ECN configuration supporting Cisco's blueprint for AI/ML applications. First, we need to classify the traffic into two classes. RoCEv2 traffic is classified using a DSCP value of 24 (CS3), while CNP traffic is classified as DSCP 48 (CS6). Since CNP traffic is the ECN control plane traffic, it needs a higher quality of service associated with it (Figure 6).

Figure 6: Cisco ECN traffic classification example
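A hedged sketch of what such a classification policy might look like in NX-OS syntax follows. The class names and qos-group assignments here are assumptions for illustration; take the exact values from the CVD for a real deployment.

! Illustrative NX-OS sketch; class names and qos-group numbers are assumptions.
class-map type qos match-all ROCEv2
  match dscp 24
class-map type qos match-all CNP
  match dscp 48
policy-map type qos QOS_classification_policy
  class ROCEv2
    set qos-group 3
  class CNP
    set qos-group 7
  class class-default
    set qos-group 0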

Next, we will configure the ECN components shown in Figure 7 (see below). Queue 7 receives strict priority queuing, as it carries CNP traffic. Queue 3 is allocated 60 percent of the bandwidth, with WRED configured for a minimum threshold of 150 KB, a maximum threshold of 3,000 KB, and a drop probability of 7 percent. The default queue receives the remaining 40 percent of the bandwidth and is treated as best effort.

Figure 7: Cisco ECN queuing and WRED configuration example
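A sketch of that queuing policy, again in NX-OS style and assuming the default 8-queue class names, might look like the following; the thresholds and percentages mirror the values cited above.

! Illustrative NX-OS sketch of the queuing policy described above.
policy-map type queuing custom-8q-out-policy
  ! Strict priority for the CNP queue.
  class type queuing c-out-8q-q7
    priority level 1
  ! RoCEv2 queue: 60 percent of the remaining bandwidth with WRED/ECN marking.
  class type queuing c-out-8q-q3
    bandwidth remaining percent 60
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn
  ! Everything else is best effort.
  class type queuing c-out-8q-q-default
    bandwidth remaining percent 40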

Now, we need to attach the queuing policy we created above (custom-8q-out-policy) as a system-wide QoS policy, as shown in Figure 8. Lastly, we attach the QoS classification policy (QOS_classification_policy) to the interfaces that need to support the RoCEv2 traffic. You will notice that some commands in the blueprint have been omitted; those are specific to the PFC configuration and will be covered in the next blog post.

 Figure 8: Cisco ECN QoS policy attachment example
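In NX-OS terms, attaching the two policies might look like this sketch; the interface range is a placeholder.

! Illustrative NX-OS sketch; the interface range is a placeholder.
system qos
  service-policy type queuing output custom-8q-out-policy
interface Ethernet1/1-32
  service-policy type qos input QOS_classification_policy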

Implementing this configuration using Nexus Dashboard Fabric Controller (NDFC) is even easier. When editing a fabric, you select the Advanced tab, where you can choose pre-defined QoS policies for an AI cluster. This provides just about everything you need. Remember that you must still include a few CLI commands in the additional configuration box. Doing this allows NDFC to push the QoS configuration for ECN and PFC to the entire fabric in one operation, as shown in Figure 9.
 

Figure 9: Cisco NDFC QoS Configuration

 

Keep an eye out for even more developments from Cisco to help deploy these features. Soon, you will see Hyperfabric AI deploy entire fabrics and end-to-end congestion control in just a few clicks, much like NDFC. Cisco also recently announced a partnership with NVIDIA to bring some of NVIDIA's capabilities into Cisco switching architectures. We do not have details available for public consumption yet, but we look forward to discussing them as they become available.

NVIDIA

Figure 10 below is an example of an ECN configuration on the NVIDIA Spectrum SN5600 switch running Cumulus Linux. First, you have the switch trust DSCP values and map DSCP 26, which RoCEv2 traffic uses, into switch priority 4. The default-global parameter applies this QoS configuration across all ports. Next, we set the minimum and maximum buffer threshold values; once the buffer crosses the minimum threshold, the switch begins marking the ECN bits in packets. Lastly, we enable Random Early Detection (RED). By default, Cumulus tail-drops packets when the buffer is full; enabling RED allows the switch to randomly drop packets above the minimum buffer threshold, which can improve performance compared to simply tail-dropping.

Figure 10: NVIDIA Cumulus ECN Configuration Example
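A sketch of those steps in NVUE syntax might look like the following. The threshold values (in bytes) and the traffic class are assumptions, so verify the exact keywords against the Cumulus Linux documentation for your release.

# Illustrative NVUE sketch; thresholds (in bytes) and the traffic class are assumptions.
# Trust DSCP markings and map DSCP 26 (RoCEv2) to switch priority 4 on all ports.
nv set qos mapping default-global trust l3
nv set qos mapping default-global dscp 26 switch-priority 4
# Mark ECN and randomly drop (RED) between the minimum and maximum buffer
# thresholds instead of tail-dropping once the queue is full.
nv set qos congestion-control default-global traffic-class 4 min-threshold 150000
nv set qos congestion-control default-global traffic-class 4 max-threshold 1500000
nv set qos congestion-control default-global traffic-class 4 ecn enable
nv set qos congestion-control default-global traffic-class 4 red enable
nv config apply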

The NVIDIA Spectrum™-X architecture has helped address some of the challenges of configuring all the QoS parameters needed to support AI/HPC workloads. The nv set qos roce command has made deployment significantly easier. The most common change from the default RoCE configuration is adjusting the percentage of buffer space reserved for RoCEv2 traffic relative to everything else. In the example below, we configure the switch buffer at 90 percent for RoCE traffic and 10 percent for the rest (Figure 11). Note that in the Spectrum-X architecture, the endpoints, equipped with NVIDIA BlueField®-3 SuperNICs or NVIDIA ConnectX®-8 SuperNICs, also participate in adaptive routing and congestion control, which means the SuperNICs are configured to support ECN and PFC as well. Those configurations are outside the scope of this blog but will be covered in a future one.

Figure 11: NVIDIA Cumulus RoCEv2 QoS Configuration
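As a minimal sketch, enabling the prebuilt RoCE profile and verifying the result might look like this; the buffer-percentage adjustment referenced in Figure 11 involves additional parameters that are not reproduced here.

# Enable the prebuilt RoCE QoS profile; lossless mode relies on PFC and ECN.
nv set qos roce
nv set qos roce mode lossless
nv config apply
# Inspect the generated RoCE/ECN settings.
nv show qos roce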

Summary

ECN is a key (though not mandatory) component of the DCQCN architecture, providing reactive congestion control. When congestion occurs, the ECN-enabled receiver or switch fabric notifies the sender of the congestion in the hope of reducing the amount of traffic put onto the fabric until the congestion is alleviated. As mentioned, ECN can be challenging to configure and fine-tune, depending on the Ethernet fabric's size and architecture. Many switch vendors provide configuration tools or scripts that implement the QoS parameters according to their best practices for the Ethernet fabric design and hardware being used.

The next part of this journey will cover the details of PFC, followed by scheduled Ethernet fabrics. In the future, Ultra Ethernet will help address some of these complexities and simplify the deployment of lossless fabrics. These articles will be followed up with testing in WWT's AI Proving Ground. The testing will show how Ethernet performs with traditional ECMP and apply those same tests to each of the different features that help make Ethernet a viable GPU networking option.
