UCS 5108 Power and Thermal Testing in the ATC for Data Center Capacity
We had a customer who wanted us to do thermal and power testing around a compute solution. The Proof of Concept (POC) was set up to verify or validate power and thermal readings for the Cisco Unified Computing System (UCS) components (specifically the 5108 chassis, 2408 IO Modules, and 6454 Fabric Interconnects) of the FlashStack design.
In This Insight
Our customer wanted to know what power consumption looks like on a Cisco UCS 5108 chassis fully loaded with B200 M5 blade servers, and how that consumption changes at different load levels. They also needed to understand the thermal behavior of the gear in the rack, which would help them with their physical design. Based on the power consumption and thermal readings, they could work out how many chassis fit in a rack, which determines how many racks go in a row, and finally the capacity of the solution in their full data center design.
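The chassis-to-rack-to-row reasoning comes down to simple arithmetic. Here is a minimal Python sketch of it, assuming a per-rack power budget, a per-chassis draw at load, and a rack count per row. All three inputs in the example are hypothetical placeholders, not measurements from this test, and the function names are made up for illustration. It also ignores the rack space and power used by the Fabric Interconnects and any other gear.

```python
import math

BLADES_PER_CHASSIS = 8   # a 5108 fully loaded with B200 M5 blades, as in this test
CHASSIS_RACK_UNITS = 6   # the UCS 5108 is a 6RU chassis


def chassis_per_rack(rack_budget_watts, chassis_watts, usable_rack_units=42):
    """Chassis that fit in one rack, limited by power or by space, whichever is lower."""
    if chassis_watts <= 0:
        raise ValueError("chassis_watts must be positive")
    by_power = math.floor(rack_budget_watts / chassis_watts)
    by_space = usable_rack_units // CHASSIS_RACK_UNITS
    return min(by_power, by_space)


def blades_per_row(rack_budget_watts, chassis_watts, racks_per_row):
    """Blade capacity of one row of identical racks."""
    per_rack = chassis_per_rack(rack_budget_watts, chassis_watts)
    return per_rack * BLADES_PER_CHASSIS * racks_per_row


if __name__ == "__main__":
    # Hypothetical inputs: 8 kW per rack, 3 kW per chassis at load, 10 racks per row.
    print(chassis_per_rack(8000, 3000))    # 2
    print(blades_per_row(8000, 3000, 10))  # 160
```

In practice you would plug in the measured wattage for the load level you expect to run, and apply the same kind of limit for cooling capacity.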
Specifics of the Testing Environment
To level set around testing, it is important to know the factors that went into this test.
- All BIOS settings were left at the Cisco platform default. What does that mean? In UCS Manager there is a BIOS policy where you can change specific BIOS settings such as Processor C States, Hyperthreading, etc. For more specifics on BIOS tuning for B200s, Cisco has a pretty good article that describes how to tune for certain scenarios. Cisco Whitepaper Reference
- It is also important to note that the thermal pictures taken for this testing come from a data center with an open-air design: a raised floor and no hot aisle containment, but chillers in each row. Why mention this? If you were using a hot aisle containment style data center, you could see higher temperatures on the exhaust or colder temperatures on the intake.
- Load for the servers was generated by PXE booting the B200 M5 blades and using software called BurnInTest Pro to drive memory and CPU workloads on each server. No storage workload was applied to the servers during this testing.
- All power tests with load were run with the N + 1 power setting in UCS, except for one test that compared Grid vs. N + 1 at 100% load.
- SolarWinds was used to monitor our Emerson Smart PDUs. This gave us a third-party view of the actual wattage, taken straight from the PDU, instead of relying on UCS/Cisco-based graphs and power readings.
- The servers used during the test can drastically change the average wattage, depending on the CPUs and the amount of RAM. For our testing, the specific hardware is documented just below the HLD.
Below is a very simple HLD (High Level Design) for the environment in question.
Results of Power and Thermal Testing
Now that we have a level set on the testing criteria, let's dive right into the results. For this testing, we increased load in 10% increments and held each load level for 30 minutes. To get a good baseline, we powered off all blades in the chassis and recorded the power draw of the chassis sitting at idle with no blade workload running.
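To make the test plan concrete, here is a minimal Python sketch of the load sweep as a schedule. It assumes the steps run back to back with no cool-down between them, which this write-up does not state. It also leaves out the 0% baseline, since the baseline duration is not given. The function name is hypothetical.

```python
def load_schedule(step_pct=10, max_pct=100, minutes_per_step=30):
    """Return (load_pct, start_minute, end_minute) for each step of the sweep."""
    schedule = []
    start = 0
    for load in range(step_pct, max_pct + 1, step_pct):
        schedule.append((load, start, start + minutes_per_step))
        start += minutes_per_step
    return schedule


if __name__ == "__main__":
    for load, start, end in load_schedule():
        print(f"{load:>3}% load: minute {start} to {end}")
```

Under those assumptions the sweep has ten steps and takes 300 minutes of loaded runtime.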
0% Load Test aka Baseline
As you can see from the chart, the average power draw of a chassis sitting at idle is 435 watts. That is pretty substantial, but keep in mind that the chassis is still powering the 2 x 2408 IO modules and internal parts such as fans. Below are the thermal images of both the intake and the exhaust.
Now that we have a baseline, let's see what load does to not only the wattage but the thermals as well.
10% Load Test
50% Load Test
100% Load Test
N + 1 All Tests Averages
This chart was created by taking a 30-minute sample from each PDU while loads were running. We then added up all the recorded data and divided by 30 to get the average wattage drawn by the chassis over the sample window.
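The averaging step can be sketched in a few lines of Python. This is a minimal sketch, assuming each PDU reports one wattage reading per minute, so a 30-minute window gives 30 readings per PDU. The function name, the PDU names, and the readings in the example are made up for illustration and are not data from this test.

```python
def average_chassis_watts(pdu_samples):
    """Average total chassis wattage over the sample window.

    pdu_samples maps a PDU name to its list of per-minute watt readings.
    All PDUs must cover the same window (same number of readings).
    """
    lengths = {len(readings) for readings in pdu_samples.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every PDU needs the same, non-zero number of readings")
    sample_count = lengths.pop()
    # Add the PDUs together minute by minute, then average over the window.
    per_minute_totals = [sum(minute) for minute in zip(*pdu_samples.values())]
    return sum(per_minute_totals) / sample_count


if __name__ == "__main__":
    # Synthetic readings: two PDUs, 30 one-minute samples each.
    samples = {"pdu-a": [100.0] * 30, "pdu-b": [150.0] * 30}
    print(average_chassis_watts(samples))  # 250.0
```

Summing across PDUs before averaging matters because the chassis draws from several PDUs at once, so a single PDU's average understates the total.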
N + 1 Resiliency Testing
The chart below shows that with N + 1, one power supply remains in power save mode until it is needed because of a power supply failure. Once an active power supply is removed, the passive power supply becomes active. A few minutes later we reinserted the power supply and then removed another active power supply, and the wattage flip-flopped back.
Grid vs N + 1
As you can see in the above charts, Grid and N + 1 operate very differently when it comes to power distribution. As both charts show, N + 1 draws all of its power from three of the PDUs and keeps one power supply in power save mode, while Grid uses all of the power supplies and distributes the load across all 4 pretty evenly.
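The difference between the two modes can be illustrated with an idealized Python model. This is only a sketch based on the behavior described above. It assumes a perfectly even split across the active power supplies and zero draw on the supply in power save mode, whereas real readings are only roughly even and a standby supply may still draw a small amount. The function name and mode strings are hypothetical and are not UCS Manager settings.

```python
def psu_distribution(total_watts, mode, psu_count=4):
    """Idealized per-power-supply share of the chassis load.

    mode "grid": every power supply is active and shares the load.
    mode "n+1":  one power supply sits in power save mode at 0 W.
    """
    if mode == "grid":
        active = psu_count
    elif mode == "n+1":
        active = psu_count - 1
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    share = total_watts / active
    return [share] * active + [0.0] * (psu_count - active)


if __name__ == "__main__":
    # 1200 W is an arbitrary example load, not a measured value.
    print(psu_distribution(1200, "grid"))  # [300.0, 300.0, 300.0, 300.0]
    print(psu_distribution(1200, "n+1"))   # [400.0, 400.0, 400.0, 0.0]
```

For the same chassis load, each active supply carries more under N + 1 than under Grid, which is worth keeping in mind when sizing the circuits behind each PDU.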
To sum up, power testing depends on a lot of different variables, including the data center type (e.g. hot aisle containment vs. raised floor), the CPU and RAM combination, and the load you expect to put on the servers. While this ATC Insight is not going to give you an exact answer for your specific use case, it can give you a glimpse of what to expect from a UCS chassis and an idea of what we can test for you in the WWT Advanced Technology Center (ATC).
The customer wanted to see what the wattage requirements would be to run a Cisco 5108 chassis (loaded in a specific way) in their data center. Based on the data we delivered, they could potentially plan a scaled-out design of the solution in their data centers.
- Cisco UCS 5108
- 6454 Fabric Interconnects
- 2408 IO Modules
- 8 x B200 M5 blades
SolarWinds Network Performance Monitor (NPM) software for power consumption and PDU monitoring.
BurnInTest Pro software was used to drive memory and CPU load on the blades from 10% to 100%.