As the volume of data generated and stored in data centers grows, developing efficient ways to support extensive data operations has become paramount.

Creating technology to support sustainable data centers has become a growing interest for many prominent industry leaders and researchers. In this post, we focus on a key efficiency metric for data centers, known as Power Usage Effectiveness (PUE). We propose a methodology based on thermodynamic principles that calculates PUE at a more granular level without the need to install additional power meters.

After the PUE of the data centers is evaluated, our study explores optimizing this metric by improving the cooling system's efficiency with the help of machine learning algorithms. We aim to develop a well-performing model that predicts PUE from cooling system attributes, and to incorporate simplified AI techniques for a better understanding of power consumption patterns in data centers.

Background

Extreme climate incidents are increasing in frequency and intensity across the world. With these rising risks, there is renewed focus on and demand for doing business sustainably. This need is particularly acute in the technology space given the increasing thirst for computing power. For instance, energy consumption from the latest technologies, such as 5G-related networks, is projected to rise 37% over today's levels by 2030. Data center energy usage is also projected to skyrocket with those growing computing needs: data centers use an estimated 200 terawatt hours (TWh) each year and are projected to account for 14% of the world's emissions by 2040.

Businesses across the globe have an opportunity to reduce their emissions and realize significant cost savings through targeted energy efficiency measures. An existing metric, Power Usage Effectiveness (PUE), measures the ratio of the total power consumed by the data center, including the power needed to maintain the facility, to the power consumed by its IT equipment.

However, while PUE provides an easy metric to measure energy efficiency, it is limited in its ability to provide granular and actionable information.  For instance, the virtualization of servers would be more energy efficient but would not be reflected in PUE unless there is a change in IT load or infrastructure. Similarly, redundancy might be a business need that increases availability but reduces load across multiple systems, which is less energy efficient but necessary from an operational perspective. Hence, PUE is an essential energy efficiency measure, but more factors and granular data are needed to make an informed decision. 

Data centers and cooling systems

The first step in evaluating and implementing data center energy efficiency metrics is measurement. The framework developed was tested in the Advanced Technology Center (ATC) at WWT. Through experimentation on the ATC data centers, a methodology for measuring energy efficiency was formulated, and certain recommendations for data points to be measured were developed.

Data centers owned by different companies have different layouts and infrastructures. The efficiency of any data center depends on its heat management policies and cooling system layout. The cooling system and heat management are a unique part of each setup and mainly consist of components like fans, vents, water supply systems, condensers, etc. A sample view of a data center in WWT's ATC labs is shown in Figure 1, where the components with alphanumeric labels are the server racks and the intensity of each rack's color represents its utilization.

   

Figure 1: Schematic view of ATC DC 2 in building 56 (Source: wwt.com)

Data from ATC data centers were utilized for the present analysis. These data centers fulfill customer needs on an ad-hoc basis and are used internally for research and development purposes. Their behavior, unlike that of a production data center with a consistent output, provides a good testing ground to understand and model how various factors impact changes in power consumption.

The selected experimental setup in the ATC consisted of a single building with three built-in data centers and a single chiller system. The chiller system cools the air in all the data centers by running cold water through the pipes connected to each data center. The main contributors to the total power consumed by the building are the power consumed by electrical devices in each data center and the power consumed by the chiller system.

Power and cooling system-related data are stored mainly in two data sources: (1) a Splunk system and (2) an internal building automation system (BAS) database. Along with data from these sources, electricity bills from the service providers were also compiled, which provided information on the total energy (kWh) consumed by the building. This electricity data was considered the ground truth for the total energy consumed by the facility. More details on the data set collected and used are provided in further sections.

Power Usage Effectiveness (PUE) formulation

The most common method to measure power usage effectiveness (PUE), used by multiple data center owners in the industry, involves evaluating the total power consumed by the data center per unit of power consumed by the IT equipment, where IT equipment means all units/servers supporting compute operations for a user.

At first glance, this evaluation is simple to perform if the data center managers have installed accurate power meters, which measure the power consumed by the data center (i.e., HVAC system and IT equipment). Given below is the basic formulation of PUE used in the industry today.

  • Total power used to run the data center: PowerTotal.
  • Total power used by IT equipment in the data center: PowerIT_Equipment.
  • Power Usage Effectiveness (PUE) = PowerTotal / PowerIT_Equipment.
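The formulation above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study; the function name and the sample readings are hypothetical.

```python
def power_usage_effectiveness(total_power_kw: float, it_power_kw: float) -> float:
    """Return PUE = total facility power / IT equipment power."""
    if it_power_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_power_kw < it_power_kw:
        raise ValueError("total power cannot be lower than IT equipment power")
    return total_power_kw / it_power_kw


# Hypothetical readings: 1,500 kW for the whole facility, 1,000 kW for IT equipment.
pue = power_usage_effectiveness(total_power_kw=1500.0, it_power_kw=1000.0)
print(f"PUE = {pue:.2f}")  # prints "PUE = 1.50"
```

A PUE of 1.0 would mean that all power goes to IT equipment; anything above 1.0 is overhead such as cooling.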

The PUE was evaluated for the entire building using the electric bill data and the sub-meters installed in the HVAC system of the building. However, the challenge is to isolate the power consumed by the cooling system for each individual data center, especially when multiple data centers share a common cooling system.

To address this challenge, WWT developed a methodology based on the laws of thermodynamics, enabling us to estimate the power consumed by the cooling system for each data center. It is also important to mention that the type of data available is a huge factor in determining the methodology for power estimations and PUE calculations. 

The methodology uses the inflow and outflow water temperatures and the volumetric flow rate of water, both for the HVAC system as a whole and for each data center sharing that HVAC system. These measurements enabled evaluation of the amount of heat removed from each data center, which was used to estimate the HVAC power attributable to each data center. Assuming that the total power consumed by the building (the utility meter reading) consisted only of the power consumed by IT equipment in each data center and the power consumed by the HVAC system, this methodology was used to estimate the power usage effectiveness for each data center in the building.
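One way to picture this calculation is the sketch below. It is not the exact formulation used in the study: it assumes that the heat carried away by the water loop follows Q = density x flow rate x specific heat x temperature difference, and that shared HVAC power is split across data centers in proportion to that heat. The function names, water constants, and sample values are all hypothetical.

```python
# Approximate properties of water (assumed constants for this sketch).
WATER_DENSITY_KG_PER_M3 = 997.0
WATER_SPECIFIC_HEAT_KJ_PER_KG_K = 4.186


def heat_removed_kw(flow_m3_per_s, supply_temp_c, return_temp_c):
    """Heat carried away by the chilled-water loop, in kW (kJ/s)."""
    delta_t = return_temp_c - supply_temp_c
    mass_flow_kg_per_s = WATER_DENSITY_KG_PER_M3 * flow_m3_per_s
    return mass_flow_kg_per_s * WATER_SPECIFIC_HEAT_KJ_PER_KG_K * delta_t


def per_data_center_pue(hvac_power_kw, data_centers):
    """Estimate PUE per data center.

    Assumption: shared HVAC power is allocated in proportion to the heat
    removed from each data center.
    """
    heat = {
        name: heat_removed_kw(d["flow_m3_per_s"], d["supply_temp_c"], d["return_temp_c"])
        for name, d in data_centers.items()
    }
    total_heat = sum(heat.values())
    if total_heat <= 0:
        raise ValueError("total heat removed must be positive")

    pue = {}
    for name, d in data_centers.items():
        cooling_kw = hvac_power_kw * heat[name] / total_heat
        pue[name] = (d["it_power_kw"] + cooling_kw) / d["it_power_kw"]
    return pue


# Hypothetical sensor values for two data centers sharing one chiller system.
example = {
    "DC-1": {"flow_m3_per_s": 0.010, "supply_temp_c": 7.0, "return_temp_c": 13.0, "it_power_kw": 200.0},
    "DC-2": {"flow_m3_per_s": 0.010, "supply_temp_c": 7.0, "return_temp_c": 11.0, "it_power_kw": 100.0},
}
for name, value in per_data_center_pue(hvac_power_kw=100.0, data_centers=example).items():
    print(f"{name}: PUE = {value:.2f}")
```

Because only the ratio of heat loads matters for the allocation, the water constants cancel out when every data center is served by the same loop; they are kept here so the intermediate heat values are in physical units.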

Below is a schematic diagram of the data centers and the cooling system of the building considered. The installed sub-meters gave us the total power consumed by each data center and by the HVAC system in the building.

Figure 2: Schematic diagram depicting data sources utilized.

The technique described above helps in estimating PUE at the data center level. Measuring PUE at this level is a good starting point for data center owners, as it requires minimal infrastructure and data collection. Further sections in this study aim to understand PUE at a deeper level and to identify the factors affecting the efficiency of a specific data center, i.e., infrastructural layout, temperature setpoints, rack utilization, server load, etc.

PUE estimation using machine learning 

In the previous section, a way to estimate PUE at a data center level was developed to help understand the cooling system's efficiency for different data centers. But to optimize the cooling system installed in data centers, an understanding of the dependence of PUE on cooling system parameters is necessary. 

To achieve this, a machine learning model was used to estimate the PUE of the data center (or the entire building) by using internal parameters of the cooling system as input variables. It is also important to realize that this exercise is only feasible if relevant data is captured and stored for cooling systems. With the high availability of data points from sensors in the data center, this study assumes that a machine learning (ML) model can be used to estimate the PUE of the data center using these data points as input variables. 

This study adopts a neural network-based ML model, which is a favored approach to effectively manage the number of model parameters necessary to achieve the required accuracy score. The network is trained with a supervised learning approach on data from 16 different sensors installed in our data center. These data points include outdoor temperature readings, temperatures of the incoming and outgoing water flow in the cooling system, the volumetric flow rate of cold water, the deployed capacity of each chiller in the cooling system, pressure readings in each chiller, etc.

Since sensor data were available at different sampling frequencies, the sampling interval of the least frequently recorded variable in the dataset was chosen as the granularity of estimation. Power-meter readings were available more frequently (every 5 minutes) than data from the other sensors, and this was also taken into account.

The power-meter readings were used to calculate the PUE for the entire building, which served as the ground truth dataset for our estimation model. The model was trained on this calculated building-level PUE and estimated PUE every 15 minutes using the chosen feature variables as input.
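Aligning 5-minute power-meter readings to a 15-minute grid can be sketched as follows. The post does not state how readings were aggregated, so averaging within each window is an assumption here, and the function names and sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def floor_to_interval(ts, minutes=15):
    """Round a timestamp down to the start of its window.

    Works for window sizes that divide 60 evenly (5, 10, 15, 30, ...).
    """
    discard = timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )
    return ts - discard


def resample_mean(readings, minutes=15):
    """Average (timestamp, value) readings within each window (assumed rule)."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[floor_to_interval(ts, minutes)].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical 5-minute power readings in kW.
start = datetime(2023, 1, 1, 0, 0)
values = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
readings = [(start + timedelta(minutes=5 * i), v) for i, v in enumerate(values)]

for window_start, avg in resample_mean(readings).items():
    print(window_start.isoformat(), avg)
```

With the sample data, the first three readings fall into the 00:00 window and the last three into the 00:15 window.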

Figure 3: Schematic diagram depicting Neural Network Model predicting PUE.

The TensorFlow library was used to build the neural network model. Each successive layer in the model has fewer neurons than the one before it, converging to a single neuron in the final layer, and the model was trained by backpropagation using a loss function such as Mean Squared Error (MSE) or log MSE.

Data cleaning and standardization were performed before training. The dataset was divided into training and test sets, and part of the training set was held out as a validation set. The validation loss computed on this set after each training epoch was used to compare model performance. During training, TensorFlow callbacks were used, including early stopping, a learning rate scheduler, and model checkpointing.

Hyper-parameter tuning

To get the best neural network model with the most optimized set of parameters, hyperparameter tuning was performed using an open-source framework named Optuna.

Optuna uses eager search spaces and state-of-the-art algorithms to efficiently search large search spaces and prune unpromising trials for faster results. The parameters to be optimized using this technique include the number of layers, the number of units in the first layer, the backpropagation optimizer, and the activation function. 

Note that only the number of units in the first layer is tuned, not the number of units in every layer. For simplicity, the shape of the network is kept consistent across all the trials in the tuning exercise: each subsequent layer has half the number of units of the previous layer. This shape resembles a reverse pyramid, or the encoder part of an autoencoder. Fixing the shape limits the number of parameters to be trained, since the dataset is not very large.
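The halving rule can be written down directly. This is a small sketch of the layer-size calculation only, with a hypothetical function name; it does not build the actual TensorFlow model.

```python
def hidden_layer_units(first_layer_units, n_hidden_layers):
    """Return the unit count of each hidden layer, halving at every step."""
    return [max(1, first_layer_units // (2 ** i)) for i in range(n_hidden_layers)]


print(hidden_layer_units(256, 2))  # prints [256, 128]
print(hidden_layer_units(32, 3))   # prints [32, 16, 8]
```

With this rule, two tuned values (the first-layer width and the number of hidden layers) fully determine the network shape, which keeps the search space small.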

Next, the search space is defined for each hyper-parameter to be tuned. Since a very deep neural network is not desired, so as to limit the number of parameters, the number of hidden layers is limited to a maximum of three.

Similarly, the number of units in the first layer is limited to a maximum of 256 and a minimum of 32. For the optimizer, the choice is between Adam and Stochastic Gradient Descent (SGD). Lastly, the activation function is chosen from Linear, ReLU, and Sigmoid. A summary of the search spaces is given in the table below.

SNo.  Hyper-parameter                      Search Space            Distribution Type
1     Number of hidden layers              [1, 3]                  Integer range
2     Number of units in the first layer   [32, 256]               Integer range
3     Optimization technique               Adam, SGD               Categorical selection
4     Activation function                  Linear, ReLU, Sigmoid   Categorical selection

Table 1: Search spaces for each hyper-parameter to be tuned.
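To make the search space concrete, the sketch below encodes Table 1 as a Python dictionary and draws random trials from it. It deliberately uses plain uniform random sampling from the standard library as a stand-in; it is not Optuna and does not reproduce Optuna's samplers or pruning. The dictionary keys and function name are hypothetical.

```python
import random

# Search space from Table 1, encoded as (distribution type, ...) tuples.
SEARCH_SPACE = {
    "n_hidden_layers": ("int", 1, 3),
    "first_layer_units": ("int", 32, 256),
    "optimizer": ("categorical", ["adam", "sgd"]),
    "activation": ("categorical", ["linear", "relu", "sigmoid"]),
}


def sample_trial(rng):
    """Draw one hyper-parameter combination uniformly at random."""
    params = {}
    for name, spec in SEARCH_SPACE.items():
        if spec[0] == "int":
            _, low, high = spec
            params[name] = rng.randint(low, high)  # both bounds inclusive
        else:
            params[name] = rng.choice(spec[1])
    return params


rng = random.Random(0)  # fixed seed so repeated runs draw the same trials
for _ in range(3):
    print(sample_trial(rng))
```

In a real tuning run, each sampled combination would be used to build and train a model, and the validation loss would decide which combination is kept.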

Model explainability techniques for PUE optimization

Once the best set of hyper-parameters is identified from hyper-parameter tuning, the next focus is to infer relationships between PUE and the input variables using the model. Since the primary goal of this study is to improve efficiency, it is crucial to look for variables that can help bring down the PUE considerably. It is essential to know the type of correlation between each variable and PUE (i.e., positive or negative) and to estimate each variable's average impact on PUE.


The SHAP library is used in the analysis to estimate the average impact of each variable on the PUE calculated by the model. SHAP, which stands for SHapley Additive exPlanations, is an open-source Python library. Its theory is derived mainly from game theory: all input variables are treated as players in a game whose outcome is the model output, here the predicted PUE. In simpler terms, SHAP values can be thought of as being calculated by varying only one of the input variables while the rest are kept constant at their mean values.

This helps get a partial dependency plot of the output (PUE) against each input variable separately. SHAP values are then evaluated from this plot: a positive SHAP value implies that the output is greater than the expected output value, and a negative SHAP value implies the opposite. A partial dependency plot and the corresponding SHAP value plot are shown below.

Figure 4: Partial dependency plot of feature x0.

Figure 5: Corresponding SHAP values of x0.
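The game-theoretic idea behind SHAP can be illustrated with an exact Shapley value calculation on a toy model. This sketch does not use the SHAP library or the study's neural network: the linear "PUE model", its coefficients, and the input values are all hypothetical, and features outside a coalition are replaced by a baseline (mean) value, which is one common simplification.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features not in a coalition take their baseline (e.g. mean) value.
    The cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_pue_model(features):
    """Hypothetical linear model; coefficients are illustrative only."""
    outside_temp, supply_temp, chiller_capacity = features
    return 1.2 + 0.01 * outside_temp - 0.02 * supply_temp + 0.005 * chiller_capacity


x = [30.0, 8.0, 60.0]         # one hypothetical observation
baseline = [20.0, 10.0, 50.0]  # hypothetical feature means

phi = shapley_values(toy_pue_model, x, baseline)
for name, contribution in zip(["outside_temp", "supply_temp", "chiller_capacity"], phi):
    print(f"{name}: {contribution:+.3f}")

# Additivity: contributions sum to the gap between prediction and baseline output.
print(round(sum(phi), 6), round(toy_pue_model(x) - toy_pue_model(baseline), 6))
```

For a linear model like this one, each Shapley value reduces to the coefficient times the feature's distance from its baseline, so a positive value means the feature pushed the prediction above the expected output.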

Apart from these plots, a bar chart is used to show the mean SHAP value of each feature over the entire test set. This helps identify the features with the greatest impact on PUE, which are then examined further using dependency plots. These plots are displayed in the next sections along with other results.
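The ranking behind such a bar chart can be sketched as follows, assuming the SHAP values are available as one row per test sample and one column per feature, and that the mean of absolute values is used so that positive and negative contributions do not cancel. The function name, feature names, and numbers are hypothetical.

```python
def mean_absolute_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across samples."""
    n_samples = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    ranking = [(name, total / n_samples) for name, total in zip(feature_names, totals)]
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)


# Hypothetical SHAP values for three test samples and three features.
names = ["outside_temp", "supply_temp", "chiller_capacity"]
rows = [
    [0.10, -0.04, 0.05],
    [-0.08, 0.02, 0.01],
    [0.12, -0.03, -0.03],
]
for name, importance in mean_absolute_shap(rows, names):
    print(f"{name}: {importance:.3f}")
```

The feature at the top of this ranking is the natural first candidate for a closer look with a dependency plot.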

Results and discussion 

Many studies indicate that the average annual PUE of data centers reported by IT and data center managers ranges from 1.5 to 1.8. In this analysis, where PUE numbers for the data centers at WWT are calculated, a similar range of numbers was realized for each data center investigated. However, when comparing the PUE results for the data centers in the same building, we observe that the PUE of an individual data center can be up to 6% higher or 23% lower than the PUE of the entire building. This difference can be attributed to each data center's layout and load conditions, even when the data centers share a common cooling system.

After generating PUE numbers acceptable to our subject matter experts (SMEs), a machine learning model was trained to predict PUE using the internal parameters of the cooling system as input. For this analysis, a dataset with one year of measurements was used, totaling up to 30,000 data points. The data were divided into test and training sets, and log MSE was used as the comparison metric to select the best model from the hyper-parameter tuning experiment. The results from hyper-parameter tuning are the following:

  1. Number of neurons in the first layer: 256
  2. Number of layers: 2
  3. Activation function: ReLU
  4. Optimizer: Adam

Using a model trained with the above-mentioned hyper-parameters, a comparison plot of predicted and calculated PUE values in the test set was generated and is displayed in the figure below.

Figure 6: Comparison between predicted PUE values and calculated PUE values.

Using the same model, SHAP plots were used to determine the relationship and impact of input feature variables on the model output (i.e., PUE predictions). The figure below shows the summary plot, which gives the mean absolute impact of each feature on the model output.

Figure 7: SHAP summary plot showing the average impact of each feature on PUE.

Partial dependency plots were created for the feature variables that have a high impact on the PUE predictions, and these plots were used to formulate suggestions to reduce the PUE.

The recommendations are made on the basis that the SHAP value for each feature needs to be minimized. Each point in the scatter plots represents a data point and its corresponding SHAP value from our predictive model. The color represents a second feature that may have an interaction effect with the feature being plotted.

Figure 8: Outside Atm. Temp vs. SHAP value.

Figure 9: Water Supply Temp vs. SHAP value.

Figure 10: Chilled Water (CW) Flow vs. SHAP value.

Figure 11: ACC-1 Capacity vs. SHAP value.

Below are the suggestions formulated to reduce PUE by understanding these dependency plots.

  1. Higher average chiller capacity has higher SHAP values, indicating that decreasing chiller capacity would reduce PUE.
  2. Higher OATemp (outside atmospheric temperature) has higher SHAP values, suggesting that a lower ambient temperature outside would improve the system's efficiency. Although this cannot be controlled, it shows that the model correctly identifies outside temperature as an impactful feature variable.
  3. Higher water supply temperature gives negative SHAP values, which means that increasing the supply temperature setpoints would reduce the PUE.
  4. A lower chilled water flow rate at a lower tank temperature may also reduce the PUE of our data centers.

PUE visualization dashboard

After producing tangible suggestions with model explainability techniques, a dashboard was developed for engineers and subject matter experts at the data centers to increase accessibility and responsiveness. The dashboard aims to contain all relevant results and information.

Based on the type of information/methodology involved, our dashboard has two pages. First, the "Power Utilization Dashboard" page gives all relevant information regarding the PUE of the building and of each data center within the building. A line chart on this page also displays historical PUE values on a time axis. This is the default main page for operators, as it quickly shows the current PUE of any data center, which is the key deliverable of our PUE methodology.

The second page, the "Machine Learning Dashboard," shows the operator all the results from the model training experiments, which estimate PUE using a regression-based technique trained on the calculated PUE of the building. The results displayed on this page are produced after performing hyper-parameter tuning on the machine learning model. The first section on this page displays the predicted PUE vs. the actual PUE in a line plot, the optimized model size, and the best validation loss from all the hyper-parameter tuning experiments.

The following section on the dashboard covers model explainability and sensitivity analysis. It displays relevant SHAP plots that indicate the feature importance of the model and how each feature impacts the model output, PUE. This will help operators understand the model and derive suggestions to reduce their PUE.

              

Challenges of data and methods used

While working on this problem, the primary challenge was getting relevant, high-quality data that had been well managed and monitored. In this study, data from different sensors or sources were stored at various locations (i.e., in Splunk and on internal servers). While some of the data was reliable, well maintained, and could be used in the analysis, the rest was difficult to retrieve and understand.

Another major challenge was the reliability of the dataset used. Data points that should have given the same information but were stored at different locations were sometimes conflicting. For example, PDU data from Splunk provided different measures of energy used when compared to the power data from the installed utility meter stored on the internal servers. This study used the utility meter data as the ground truth for the analysis.

Another challenge was that the methodology developed was specific to the data center infrastructure and design under study. While the principles involved in developing that methodology are broad and well accepted within the industry, new research and formulations might be needed to reproduce this analysis at any other location with different infrastructure.

Future work

  1. Integrate drift monitoring and experiment tracking using MLOps tools.
  2. Expand modeling techniques to achieve better performance.
  3. Implement this solution for multiple data centers and develop an infrastructure-agnostic solution for high scalability.

Appendix/references