In this ATC Insight

Summary

Imagine standing up a new data center. The phase where you start transitioning applications from your legacy data center, home to your old makeshift server farm, is in full effect. You are moving these applications to your brand new, shiny, dedicated high-tech data center, but things are not going well.

In fact, users are complaining that applications are taking 3 to 10 times longer to load than before. Some of the applications are freezing up and are basically unusable. The network monitoring tools you are using are not pointing to equipment failures or any performance issues. Further inspection indicates there is a significant amount of packet loss, and it appears to be within the data center.

Has this ever happened to you? Most architects and engineers run into situations like this at some point in their careers. At World Wide Technology, we are able to troubleshoot and isolate issues efficiently and effectively by leveraging the Advanced Technology Center (or ATC). The scenario above is a real case in which one of our engineering organizations helped a customer work through this exact issue.

ATC Insight

The Problem

Using an open source network performance measurement tool installed on laptops, it was determined that the network was dropping the jumbo packets used for server-to-server communications specifically within the data center. More precisely, only packets larger than 5500 bytes appeared to be dropping. With this being the case, logic said it must be some universal underlying issue, such as an infrastructure problem.
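
The sketch below shows one way such a first-pass check can be reproduced: sweep probe sizes between two hosts and report the largest size that survives the path. It is a minimal illustration only, not the tool used in the actual troubleshooting, and it assumes a Linux host with the iputils ping utility plus a hypothetical target address.

```python
#!/usr/bin/env python3
"""Minimal path-MTU sweep sketch. This is NOT the open source tool used in
the original troubleshooting; it only illustrates how "packets larger than
~5500 bytes are dropped" could be observed between two hosts. Assumes a
Linux host with the iputils ping utility and a hypothetical target address."""

import subprocess

TARGET = "10.0.16.20"  # hypothetical far-end host in the data center


def probe(payload_size: int) -> bool:
    """Send one ICMP echo with the Don't Fragment bit set.
    Returns True if a reply came back, False if the probe was lost."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", "-M", "do", "-s", str(payload_size), TARGET],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_max_payload(low: int = 1400, high: int = 8972) -> int:
    """Binary-search the largest ICMP payload that still gets through.
    8972 bytes = a 9000-byte MTU minus 20 (IP) and 8 (ICMP) header bytes."""
    best = 0
    while low <= high:
        mid = (low + high) // 2
        if probe(mid):
            best, low = mid, mid + 1
        else:
            high = mid - 1
    return best


if __name__ == "__main__":
    size = find_max_payload()
    print(f"Largest ICMP payload that passed: {size} bytes "
          f"(a {size + 28}-byte IP packet)")
```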

(Figure 1) Depiction of Customer Data Center Design

Although the data center (Figure 1) was engineered to support multi-context security zones with high throughput capability, it did not contain that much hardware. Or, more precisely, it did not contain many different types of hardware.

Within the data center were the following components:

  • 1 pair of Cisco Nexus 7700s
  • Aggregation/Spine VDC with 12 VRFs
  • 5 pairs of Cisco Nexus 5600s (Leaf)
  • Legacy Servers Pod with Nexus 2000 FEX
  • Storage Segment Pod (NetApp)
  • UCS Pod (Cisco UCS)
  • A/B side Pod redundancy
  • Services Layer Pod
  • 4 Cisco ASA 5585s (clustered)
  • Each ASA with 2x10G uplinks and 2x10G cluster links
  • 1 pair of F5 BIG-IP (load balancers)

As you can see, there was nothing unusual here. Nothing stood out as incapable of supporting jumbo frames. The only pause for thought was the Cisco ASA firewalls. Note: at the time of this testing, clustering Cisco ASAs was a newer feature and a possible concern. The clustered ASAs not only controlled all northbound and southbound traffic, they were also responsible for eastbound and westbound traffic between the data center security domains. This was accomplished by making each security domain its own Virtual Routing and Forwarding instance (or VRF). Routing between the VRFs was handled by the Layer 3 functions of the Cisco Nexus 7700s, and the ASAs (in transparent cluster mode) inspected and controlled all traffic in and out of each security zone. For example, traffic initiated on a VRF in the legacy servers pod and destined for the storage pod would traverse the clustered ASAs, because all routing between security zones occurred on the Nexus 7700s. So this became the focus of what was tested and validated.


The ATC Mock-Up

Much of what we design in the ATC depends on understanding the ultimate result we are looking for. This is especially important because the ATC does not contain unlimited resources (labor, hardware, software, space, cooling, testing tools). It was not realistic, or even necessary, to duplicate the entire data center design in the ATC, because initial troubleshooting determined that the area failing to pass jumbo packets larger than 5500 bytes was contained within the security services pod, where the clustered ASAs and F5 load balancers lived.

The ATC created a mock-up of the data center's security services pod as well as the aggregation core, because all of the routing between the security zones occurred on the Nexus 7700s.

After reviewing the hardware available to the ATC, only one issue arose: the ATC could not come up with 4 ASA 5585s, each with 4x10Gig interfaces, within the desired timeline. The ATC did have 2 ASA 5585s, each with 2x10Gig and 4x1Gig interfaces. Best practice for clustering ASAs is to use 2 interfaces for the cluster data link (forwarding) and 2 interfaces for the cluster control link, with all of these links at the same speed. This limited us to using the 4x1Gig interfaces. That was not an issue, because the test was about jumbo packets traversing the ASA cluster, not about maximum throughput.

To minimize the differences between the ATC mock-up lab and the customer data center design, all devices were loaded with the same operating systems and connected on the same interfaces. We then copied each device configuration from the data center to its appropriate counterpart in the ATC. Only minor changes were made, for things such as management access controls. See Figure 2.

 

(Figure 2) ATC Mock-UP design built in the ATC


Testing and Results

Note: Testing for this customer occurred in the ATC back in 2015. As you can see, the ATC has a rich history at World Wide Technology of being a lynchpin for the engineering teams.

See what we've tested over the years with our customers! PoCs

The testing that occurred in the ATC revealed a different kind of truth. The MTU issues seen in the initial troubleshooting at the customer data center were not present in the ATC testing. Other issues were ultimately proven to be causing the application performance problems after the data center migration. The test cases and troubleshooting results are covered in detail in the Test Plan/Test Case section below.

Conclusion

As a result of testing a mock-up of the data center in the ATC with dedicated testing equipment (Ixia), it was determined that the open source testing software and laptops used in the initial testing at the customer data center simply were not capable of testing accurately with the parameters used.

Once confidence in the data center hardware and design was restored (proven out), troubleshooting was refocused on the slow applications. Many things were discovered, such as bad Java code and some data storage connections that were not properly load balanced, as well as the main culprit: a specific load balancer. This was not the F5 load balancer in the data center, but one at the primary facility where many power users worked. The user traffic at this primary facility was unknowingly being forced through this load balancer, one that was not designed for these types of applications. Once the load balancer was reconfigured and removed from the path for certain applications, user experience drastically improved.

Migrations are stressful, exhausting, and usually bring some issues to light. Imagine what the cost and impact would have been if confidence in the data center design had never been regained. In this situation the ATC was able to validate that the customer's data center design was sound and more than capable of supporting the applications necessary to run the business.

Test Plan/Test Case

Testing

As mentioned in The Problem section of the ATC Insight above, testing at the data center used an open source network performance tool loaded on 2 laptop PCs. The laptops were connected to unused 1Gb/s interfaces on one of the services pod Cisco Nexus 5600s, as shown in Figure 2. The laptops were assigned to various VLANs associated with the ASA, the F5, and a network one router hop away, and were then configured with IP addresses appropriate to the assigned VLANs. Packets were sent from one laptop to the other with various MTU lengths. The big question that loomed over this troubleshooting was whether the laptops (and the open source tool used) produced results reliable enough to diagnose the issue.
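
For a concrete picture of what that laptop-to-laptop methodology looks like, the sketch below uses plain Python sockets to send fixed-size UDP datagrams from one host and count arrivals on the other. It is a stand-in for illustration only, not the open source tool that was actually used; the address, port, and packet count are hypothetical.

```python
#!/usr/bin/env python3
"""Minimal sketch of a laptop-to-laptop UDP loss test, assuming plain Python
sockets rather than the open source tool actually used. Addresses, port, and
packet counts are hypothetical. Run receive() on one laptop and send() on the
other, then compare the number of datagrams sent versus received."""

import socket
import time

PORT = 5001      # hypothetical test port
COUNT = 1000     # datagrams per test run


def receive(bind_addr: str = "0.0.0.0") -> None:
    """Count datagrams that arrive, stopping after a 5-second idle window."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, PORT))
    sock.settimeout(5.0)
    got = 0
    try:
        while True:
            sock.recvfrom(65535)
            got += 1
    except socket.timeout:
        pass
    print(f"received {got}/{COUNT} datagrams ({COUNT - got} lost)")


def send(dest_addr: str, payload_bytes: int = 8000) -> None:
    """Send COUNT datagrams of a given size. On jumbo-enabled interfaces
    (MTU 9000) an 8000-byte datagram leaves as a single frame; losses that
    appear only above a certain size are what implicate jumbo-frame handling."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * payload_bytes
    for _ in range(COUNT):
        sock.sendto(payload, (dest_addr, PORT))
        time.sleep(0.001)  # light pacing so the sending laptop is not the bottleneck
```

Comparing sent versus received counts at a few payload sizes (for example 1400, 5500, and 8000 bytes) gives the same kind of size-dependent loss signal described above.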

Testing within the ATC duplicated the scenarios and testing performed at the data center. This allowed us to compare results between the tests from the customer data center and the tests from the ATC. There was, however, one big difference: the use of purpose-built testing hardware from Ixia. Four Ixia endpoints were connected to unused 10 Gb/s interfaces, one on each of the Cisco Nexus switches, and assigned to the same VLANs and IP addressing that the laptops had used. See Figure 2.

Test Results

All Ixia tests were configured to use UDP packets with a 9000 MTU data load. For a logical layout of the testing scenarios, refer to Figure 3.

(Figure 3) Logical depiction of the environment built in the ATC.

Test 1-A tests switch (7700) to switch (5600) connectivity within VLAN 16

  • Drops/Errors 0/0, 9750 Mb/s throughput. Expected for 10Gig interfaces.
  • Note: The laptops only had 1G interfaces, so they only achieved 926 Mb/s of throughput.

Test 1-B tests switch to switch connectivity within VLAN 116. VLANs 16 and 116 are on the same network; they are the outside/inside interfaces of the passive ASAs.

  • Drops/Errors 0/0, 9750 Mb/s

Test 2 tests traversing the ASA (VLAN 16 ↔ 116)

  • Drops/Errors 0/0, 975 Mb/s

Test 3 tests traversing a L3 routed hop (VLAN 48 ↔ 16)

  • Drops/Errors 0/0, 975 Mb/s

Test 4 tests traversing a L3 routed hop and the ASA (VLAN 48 ↔ 116)

  • Drops/Errors 0/0, 975 Mb/s

Test 5 tests traversing a L3 routed hop through the F5 (VLAN 116 ↔ 17)

  • Drops/Errors 0/0, 975 Mb/s

Test 6 tests traversing the ASA and a L3 routed hop through the F5 (VLAN 16 ↔ 17)

  • Drops/Errors 0/0, 975 Mb/s

Test 7 Bonus Test (not done at data center)

  • Full mesh test between all Ixia
    • VLAN 48 ↔ 16, 116, 17
    • VLAN 16 ↔ 48, 116, 17
    • VLAN 116 ↔ 16, 48, 17
    • VLAN 17 ↔ 16, 116, 48
  • All tests: Drops/Errors 0/0 @ 400 Mb/s. See Figure 4.
(Figure 4) IXIA Toolset Test Results

As you can see, there were no indications that any of the hardware was unable to process jumbo packets larger than 5500 bytes.

Test Tools

Ixia IXNetwork

Used to simulate real L2/L3 workloads in the ATC mock-up of the customer data center troubleshooting environment.

Technologies