HCI Storage Services Trade-Offs Analyzed: Are They Worth It? - Part I

Welcome to our multi-part series on storage services in an HCI environment. If you have any questions about the information provided below, please reach out and we'll be happy to help clarify. Over the next few weeks, we'll dive into some testing results in an attempt to bring clarity to the impacts of turning on different storage services within an HCI environment.

Let's get the disclaimers out of the way.

Performance disclaimer

The metrics and results found in the following articles are not meant to show what is possible in terms of these systems' maximum performance, how one HCI platform performs relative to another, or how HCI technologies compare to a more traditional storage appliance. The information is meant to give the reader insights and cautions before attempting anything in a production environment. The results are also not meant to show "hero" numbers where the technology is set up to show sub-millisecond latency. This is a small configuration (more on that below) under heavy stress, pushing the upper boundaries of the configuration in order to show the implications of enabling storage services.

Purpose and problem statement

For every action, there is a reaction. On the surface, storage services appear great (and can be); however, for every service enabled, there are implications that should be understood and considered.

WWT has done extensive testing to understand the implications of turning on different storage services and their effects on an environment. We understand there are a lot of "what if" or "what about" scenarios that could and should be considered, and that could change the results positively or negatively. Datasets differ, hardware configurations differ, and how tests are performed varies. If you have such a scenario in mind, please reach out to your WWT account team and ask how we can help test it in our ATC.

Even though HCI has been around for several years with countless performance metrics released directly from vendors or independent third parties, we couldn't find anything that truly tried to explain the implications of turning on different storage services. This is our attempt to bring clarity to the subject.

The gear and methodology

For testing, we used an all-flash, five-node cluster running vSphere 7.0.2, with dual-socket Intel Platinum processors and 2 x 25Gbps NICs. The workload generator was Vdbench. We pushed the metrics into a Grafana instance for logging and for analysis after the tests were performed.

We tried to keep our methodology simple and repeatable.

Environmental specifics:

Ensured the environment was consistent with hardware, software and firmware configurations
Ensured the environment was healthy, with no hardware or software alerts, before and after each test
Using the default storage configuration (RAID1 / RF2), the environment was filled to approximately 50% of the total available storage (~13.2TB)
10 VM workers (workload generators) with eight VMDKs per VM

Testing specifics:

Ran the environment at a 'steady state' (more on this in the section below)
Failed a host by using the power off function in the IPMI interface
Monitored the start/end times of the 'data rebuild' process
Let the workloads run for approximately one hour after the data rebuild completed, then turned the failed host back on
Monitored the 'data rebalance' process
Exported and analyzed the data

Baselining

Baselining is one of the most important parts of the process. The baseline metrics are what we used as the constants throughout the tests. As mentioned above, we filled the environment to approximately 50% of the usable storage, which was around 13.2TB out of 25.6TB using the default out-of-the-box settings for storage configuration. This is the first constant.

The second constant is IOPs. It's important to note here that no advanced settings that could affect performance were changed. "IOPs" (Input/Output Operations Per Second) is not a fixed property of a system; the number depends on many factors, such as the read/write mix, the access pattern and the block size. The workload workers were configured to use a 75/25 read/write ratio, a 50/50 sequential/random split, and a block size distribution averaging 27,000, with a working set size of 10%.

To get to our baseline, we ran several curve tests with different thread counts. A curve test is, more or less, an attempt to find the maximum IO a system can handle with acceptable latency. Prior to starting any curve tests, we ran a continuous workload for approximately 45 minutes to ensure the system was not sitting idle, which can skew performance metrics.
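
As a rough illustration of what a curve test is looking for, the sketch below picks the highest-IOPs point whose latency stays within an acceptable ceiling. The `CurvePoint` type, the `pick_baseline` helper, the 5ms ceiling and every value in `sample_curve` are hypothetical placeholders made up for this example; they are not measurements from our testing and this is not how Vdbench itself reports results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CurvePoint:
    """One run of a curve test at a given thread count (hypothetical type)."""
    threads: int
    iops: float
    latency_ms: float


def pick_baseline(points: List[CurvePoint], max_latency_ms: float) -> Optional[CurvePoint]:
    """Return the highest-IOPs point with latency at or below the ceiling, or None."""
    acceptable = [p for p in points if p.latency_ms <= max_latency_ms]
    return max(acceptable, key=lambda p: p.iops, default=None)


# Placeholder values for illustration only; NOT measurements from this test.
sample_curve = [
    CurvePoint(threads=2, iops=90_000, latency_ms=1.5),
    CurvePoint(threads=4, iops=150_000, latency_ms=2.4),
    CurvePoint(threads=8, iops=200_000, latency_ms=4.0),
    CurvePoint(threads=16, iops=210_000, latency_ms=9.0),
]

best = pick_baseline(sample_curve, max_latency_ms=5.0)
print(best)
```

With the placeholder data, the 16-thread run delivers slightly more IOPs but at a latency well past the ceiling, so the 8-thread run is selected. Real curve data will look different, and the ceiling is a judgment call.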

Our results showed that eight threads gave the best IOPs-to-latency ratio. The system was able to consistently sustain a maximum of 214,947 IOPs with an average read latency of 4.02ms and an average write latency of 2.58ms. For a five-node cluster with minimal flash drives, these are impressive numbers. While we're on the topic, this is a reminder that this is not a performance test.
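
To put the baseline in perspective, the short calculation below splits the 214,947 IOPs by the 75/25 read/write ratio and derives an operation-weighted average latency. The throughput figure at the end rests on an assumption: that the average block size of 27,000 is expressed in bytes. Treat that last number as a rough estimate only.

```python
# Baseline figures taken from the curve test results above.
MAX_IOPS = 214_947
READ_FRACTION = 0.75          # 75/25 read/write ratio
READ_LATENCY_MS = 4.02
WRITE_LATENCY_MS = 2.58

# ASSUMPTION: the average block size of 27,000 is in bytes.
AVG_BLOCK_BYTES = 27_000

read_iops = MAX_IOPS * READ_FRACTION
write_iops = MAX_IOPS - read_iops

# Average latency across all operations, weighted by the read/write mix.
blended_latency_ms = (
    READ_FRACTION * READ_LATENCY_MS + (1 - READ_FRACTION) * WRITE_LATENCY_MS
)

# Approximate throughput in decimal gigabytes per second (depends on the assumption above).
throughput_gb_s = MAX_IOPS * AVG_BLOCK_BYTES / 1e9

print(f"read IOPs:  {read_iops:,.0f}")
print(f"write IOPs: {write_iops:,.0f}")
print(f"blended latency: {blended_latency_ms:.2f} ms")
print(f"approx. throughput: {throughput_gb_s:.2f} GB/s")
```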

Hypothetical scenario

With the results above in hand, we established a baseline and laid out the following hypothetical scenario with customer requirements:

Sustain 107,000 IOPs in a 75/25 R/W ratio (this is ~50% of the maximum IOPs attained in the curve test)
Storage latency within the 5ms range
Applications require 7TB of space (this is ~50% of the total usable space with no savings from storage services)
N+1
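
The requirements above can be written down as a few constants and checked against the baseline. This is a minimal sketch: the names, including the `within_latency_target` helper, are made up for illustration, and the N+1 line only counts the hosts left after a single failure. It says nothing about how the cluster performs in that state.

```python
# Figures from the baseline and the hypothetical customer requirements.
CLUSTER_NODES = 5
BASELINE_MAX_IOPS = 214_947
REQUIRED_IOPS = 107_000
READ_FRACTION = 0.75          # 75/25 R/W ratio
LATENCY_CEILING_MS = 5.0
SPARE_HOSTS = 1               # N+1

share_of_max = REQUIRED_IOPS / BASELINE_MAX_IOPS
required_read_iops = REQUIRED_IOPS * READ_FRACTION
required_write_iops = REQUIRED_IOPS - required_read_iops
hosts_after_failure = CLUSTER_NODES - SPARE_HOSTS


def within_latency_target(read_ms: float, write_ms: float) -> bool:
    """True if both read and write latency are at or under the ceiling."""
    return max(read_ms, write_ms) <= LATENCY_CEILING_MS


print(f"required IOPs as share of baseline max: {share_of_max:.1%}")
print(f"reads/writes per second: {required_read_iops:,.0f} / {required_write_iops:,.0f}")
print(f"hosts remaining after one failure: {hosts_after_failure}")
print(f"baseline latency within target: {within_latency_target(4.02, 2.58)}")
```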

The requirements above reflect the level of detail we typically receive from customers. As mentioned previously, many factors are left out when IOPs are specified, so as architects we have to make judgment calls to size environments properly. As we move through our testing and results, some of the recommendations are based on that judgment and our years of experience. Let's look at how this workload behaves in a stable environment using different storage services, and at the space savings associated with each.

Storage savings

Based on our configured workload, turning on different storage services showed the following space savings:

Figure: Space savings across different storage configurations

The numbers above show the expected behavior (which will vary based on a customer's specific data). As we turn on more storage services, we expect more space savings. The baseline is the default storage configuration (the least space-efficient), and RAID5 / Erasure Coding with deduplication and compression enabled sits at the top with a total of 4.7TB saved (35.8%).
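
As a quick sanity check on the top figure, the arithmetic below works backward from 4.7TB saved at 35.8% to the consumption it implies. The result lands close to the ~13.2TB the environment was filled to; the small gap is most likely rounding in the reported numbers, which is our reading rather than something measured.

```python
# Reported for RAID5 / Erasure Coding with deduplication and compression.
SAVED_TB = 4.7
SAVINGS_FRACTION = 0.358

# Consumption before savings that these two figures imply.
implied_baseline_tb = SAVED_TB / SAVINGS_FRACTION

# Space still consumed once the savings are applied.
remaining_tb = implied_baseline_tb - SAVED_TB

print(f"implied baseline consumption: {implied_baseline_tb:.2f} TB")
print(f"consumed after savings: {remaining_tb:.2f} TB")
```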

An important note on the space savings above: the only "guarantee" that a customer can get, no matter how dedupable or compressible the data is, comes from the conversion from RAID1 / RF2 to RAID5 / Erasure Coding.
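
The reason this saving does not depend on the data is that it comes purely from how redundancy is stored. The sketch below compares the capacity multiplier of a two-copy mirror with that of a parity layout. The 3 data + 1 parity layout is an assumption for illustration; the actual stripe geometry depends on the platform and cluster size, and `capacity_multiplier` is a hypothetical helper, not a vendor API.

```python
def capacity_multiplier(data_fragments: int, redundancy_fragments: int) -> float:
    """Raw capacity consumed per unit of logical data for a given layout."""
    return (data_fragments + redundancy_fragments) / data_fragments


# RAID1 / RF2: every block is stored twice.
mirror = capacity_multiplier(data_fragments=1, redundancy_fragments=1)

# ASSUMPTION: RAID5 / Erasure Coding laid out as 3 data + 1 parity.
raid5 = capacity_multiplier(data_fragments=3, redundancy_fragments=1)

# Fraction of raw capacity saved by converting, regardless of data content.
layout_saving = 1 - raid5 / mirror

print(f"mirror multiplier: {mirror:.2f}x")
print(f"RAID5 (3+1) multiplier: {raid5:.2f}x")
print(f"saving from conversion alone: {layout_saving:.1%}")
```

Under that assumption, the conversion alone trims roughly a third of the raw capacity consumed. Deduplication and compression add to this only as far as the data allows.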

IOPs and latency

Figure: IOPs steady state across different storage configurations

During normal operations, the production workload IOPs across the different storage configurations remained the same. The image above shows that the system can keep up with the active workload (a requirement) no matter which combination of storage services has been enabled. We do observe an increase in latency, especially for writes, and as we move to the right on the graph to RAID5 / Erasure Coding, we slightly violate the 5ms latency threshold.

On the surface, these metrics show no real concerns for customers. The storage services save a good amount of space and the IOPs and latency remain stable. Based on these metrics alone, it is reasonable to think that the system can accept additional workloads beyond what was scoped. Unfortunately, there is no such thing as a free lunch, and as we move across different scenarios in the next articles, this will become more apparent.

In the next article, we will compare the results of turning on the different storage services using the default storage configuration of RAID1 / RF2.