Welcome to the second article in the HCI storage services series. If this is the first article you've come across, we highly recommend (read: require) starting from the beginning of the series, as the previous article establishes the baseline for everything discussed below. In this week's edition, we look at RAID1/RF2 and the impact of turning on the different storage services. Let's dive in.

Steady state

In the last article, we provided a glimpse of this, but as a quick recap, below are the metrics showing how the system performs during a steady state for RAID1/RF2. 'Steady state' refers to an environment with no failures, where everything is nominal.

IOPs steady state across different storage configurations

The graph above shows that in a steady state, the environment is able to sustain the workload with no issues. This is expected behavior for all scenarios.

Definitions

To keep things as straightforward as possible, below are explanations of the different phases. Each phase is then detailed for every scenario that follows.

  • Steady state: All hosts are online and in working condition. Storage space savings are documented here for quick reference.
  • Host failed: This shows how the applications react immediately following a host failure but before any data rebuild operations begin.
  • Data rebuild: This is the state where the storage software begins rebuilding the data that has been deemed "lost" from the failed host. This action takes high priority, as it is assumed that slower applications are more desirable than a potential data loss event. In this specific environment, when one of the five nodes is offline, the cluster essentially loses 20% of its total possible IOPs. This section also includes an important metric: how long it took the system to rebuild the data.
  • Workload catchup: This is how the system reacts after the data rebuild phase. This section gives insight into how the applications would perform once the system stabilizes with a host down. In most cases, queued IO is catching up during this phase, which is why IOPs higher than the 107K baseline may be observed. This gives insight into how close to maximum performance the system is being pushed.
  • Data rebalance: Once the host has been repaired and brought back into production, the storage software needs to rebalance storage capacity and IOPs to utilize that host's resources. The repaired host's storage resources are not utilized immediately. The data rebalance is a low-priority back-end task, but it is still required to return the environment to its maximum potential for storage tasks, even if virtual machines are already running on the host and using its memory and CPU resources. This section also includes an important metric: how long it took the system to rebalance the data.

The conclusions below refer to the requirements laid out in Part I of the series and to the observed increases over the baseline numbers within the same phase (unless otherwise noted), with the default RAID1/RF2 storage configuration serving as the baseline.
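
To make those comparisons concrete, here is a minimal Python sketch of the arithmetic behind the figures in the sections below: the percentage change in IOPs and the absolute change in latency between a given phase and the steady-state baseline. The sample values are illustrative only, chosen to be roughly consistent with the numbers reported later in this article; they are not additional measured data.

    # A minimal sketch of the phase-vs-baseline arithmetic used throughout this
    # article. The sample values are illustrative, not additional measured data.

    def iops_change_pct(phase_iops: float, baseline_iops: float) -> float:
        """Percentage change in IOPs relative to the steady-state baseline."""
        return (phase_iops - baseline_iops) / baseline_iops * 100.0

    def latency_delta_ms(phase_ms: float, baseline_ms: float) -> float:
        """Absolute latency change (in ms) relative to the steady-state baseline."""
        return phase_ms - baseline_ms

    baseline_iops = 107_000   # steady-state workload baseline
    rebuild_iops = 90_000     # illustrative rebuild-phase value

    print(f"IOPs change: {iops_change_pct(rebuild_iops, baseline_iops):.1f}%")   # ~-15.9%
    print(f"Read latency delta: {latency_delta_ms(3.89, 1.27):+.2f} ms")         # +2.62 ms

A negative IOPs change corresponds to a drop from the baseline; latency deltas are quoted in this article as the amount of increase or decrease in milliseconds.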

RAID1/RF2 only

This section details how the environment reacts during a single host failure in the cluster. These metrics can also be considered when putting a host in maintenance mode with some slight differences depending on the actions taken.

Graph displays metrics during a single host failure 

The above graph has several metrics that need to be broken down and explained. 

  • Steady state – The performance of the system is expected behavior.
  • Failed host – The graph shows that the system can fully sustain the workload and is well within an acceptable range for workload latency. In a properly sized environment, this is expected behavior.
  • Data rebuild – Several things to note within this section (a quick worked calculation follows this list):
    • Total IOPs drop by 15.8% from steady state.
    • Read latency increases by 2.62ms over steady state, to 3.89ms.
    • Write latency increases by 4.92ms over steady state, to 6.09ms.
    • Time in this state: 73 minutes
  • Workload catchup – The data rebuild operation has been completed; however, the host is still down. The applications are catching up on IO during this phase. A small increase in read/write latency is observed over steady state. This is expected behavior.
  • Data rebalance – A few things to note within this section:
    • The IOPs show a 1.8% increase over steady state. This is expected behavior as the applications are still catching up from the data rebuild process.
    • Read and write latencies are well within an acceptable range.
    • Time in this state: 175 minutes
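
To put the data rebuild figures above in perspective, the same arithmetic can be run in reverse to recover the approximate values implied by the quoted deltas. This is a rough worked example based only on the numbers above, not additional measured data.

    # Rough check of the RAID1/RF2-only data rebuild figures quoted above.
    baseline_iops = 107_000

    # A 15.8% drop from the 107K baseline implies roughly this many IOPs during rebuild:
    rebuild_iops = baseline_iops * (1 - 0.158)   # ~90,100 IOPs

    # Working backward from the reported deltas gives the implied steady-state latencies:
    steady_read_ms = 3.89 - 2.62    # ~1.27 ms
    steady_write_ms = 6.09 - 4.92   # ~1.17 ms

    print(f"Approx. rebuild-phase IOPs: {rebuild_iops:,.0f}")
    print(f"Implied steady-state read latency:  {steady_read_ms:.2f} ms")
    print(f"Implied steady-state write latency: {steady_write_ms:.2f} ms")

Note that the 15.8% IOPs drop is in the same ballpark as the 20% of potential IOPs lost when one of the five nodes goes offline, as described in the definitions above.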

RAID1/RF2 with compression enabled

This section details how the environment reacts during a single host failure in the cluster when only the compression service has been enabled. These metrics can also be considered when putting a host in maintenance mode with some slight differences depending on the actions taken.

Graph displays metrics during a single host failure when compression has been enabled

The above graph has several metrics that need to be broken down and explained.

  • Steady state – The performance of the system is expected behavior. Enabling compression saved 1.1TB (8.4%) of used space.
  • Failed host – The graph shows that the system can fully sustain the workload and is well within an acceptable range for workload latency. In a properly sized environment, this is expected behavior.
  • Data rebuild – Several things to note within this section:
    • Total IOPs drop by 3.7% from baseline.
    • Read latency decreases by 0.47ms, to 3.33ms.
    • Write latency increases by 5.49ms, to 11.58ms.
    • Time in this state: 91 minutes
  • Workload catchup – The data rebuild operation has been completed; however, the host is still down. The applications are catching up on IO during this phase. A small increase in read/write latency is observed over steady state. This is expected behavior.
  • Data rebalance – A few things to note within this section:
    • The IOPs show a 9.5% increase over steady state. This is expected behavior as the applications are still catching up from the data rebuild process.
    • Read and write latencies are within an acceptable range.
    • Time in this state: 82 minutes

RAID1/RF2 with deduplication and compression enabled

This section details how the environment reacts during a single host failure in the cluster when deduplication and compression have been enabled. These metrics can also be considered when putting a host in maintenance mode with some slight differences depending on the actions taken.

Graph displays metrics during a single host failure when deduplication and compression have been enabled

The above graph has several metrics that need to be broken down and explained.

  • Steady state – The performance of the system is expected behavior. Enabling deduplication and compression saved 2.92TB (22.1%) of used space.
  • Failed host – The graph shows that the system can fully sustain the workload and is well within an acceptable range for workload latency. In a properly sized environment, this is expected behavior.
  • Data rebuild – Several things to note within this section:
    • Total IOPs drop by 20.36% from baseline.
    • Read latency decreases by 2.01ms, to 1.9ms.
    • Write latency increases by 11.69ms, to 17.77ms.
    • Time in this state: 104 minutes
  • Workload catchup – The system is more or less back to normal operations during this phase, with a slight increase in read/write latencies over steady state. Note that because there was a significant drop in IOPs for 104 minutes during the data rebuild phase, the workload catchup would have taken a long time to stabilize back to the 107K IOPs baseline (a rough estimate of that catchup time follows this list).
  • Data rebalance – A few things to note within this section:
    • The IOPs show a 17% increase over steady state. This is expected behavior as the applications are still catching up from the data rebuild process. The increase in IOPs over the workload catchup phase indicates that the newly recovered node started accepting storage capacity and IOPs.
    • Read and write latencies are well within an acceptable range.
    • Time in this state: Three 20-minute intervals spread out over four hours
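
As a rough illustration of why that catchup takes so long, the backlog can be estimated from the figures above. This sketch assumes the shortfall during the rebuild is fully queued rather than simply slowed, and it borrows the 17% figure from the rebalance phase as a stand-in for the catchup rate; both are assumptions for illustration only, not measurements.

    # Back-of-the-envelope estimate of the IO backlog built up during the
    # deduplication + compression data rebuild and how long it might take to drain.
    # Assumptions: the shortfall is fully queued, and the workload catches up at
    # roughly 17% above baseline (borrowed from the rebalance-phase observation).
    baseline_iops = 107_000
    rebuild_minutes = 104
    iops_drop = 0.2036          # reported drop during the rebuild phase
    catchup_surplus = 0.17      # assumed rate above baseline while catching up

    backlog_ios = baseline_iops * iops_drop * rebuild_minutes * 60
    drain_minutes = backlog_ios / (baseline_iops * catchup_surplus) / 60

    print(f"Estimated backlog: {backlog_ios / 1e6:.0f} million IOs")   # ~136 million
    print(f"Estimated time to drain: {drain_minutes:.0f} minutes")     # roughly two hours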

Notable observations of RAID1/RF2 storage services

The main caution from these tests is that, on the surface, enabling storage services appears to have no impact on steady state performance. Breaking down what happens during a failure or maintenance window, however, changes the outcomes.

In all cases, the write latency requirement is not met during the data rebuild process, and the severity increases as additional services are enabled. Once the process is complete, the systems quickly stabilize, meeting all requirements even as queued IO is being processed. As mentioned previously, the data rebuild process takes high priority to avoid a potential data unavailability/loss event.

Enabling only the compression service has a small impact on the environment. The trade-off of write latency rising to roughly 11ms for 91 minutes during a rebuild, in exchange for an 8.4% savings in used storage, is an overall net positive.

When enabling both deduplication and compression, a greater increase in latency is observed during the data rebuild phase; however, the systems do eventually stabilize and are able to sustain the workload appropriately. The trade-off of write latency rising to 17.77ms for 104 minutes during the data rebuild operation, in exchange for a 22% savings in storage space, needs consideration and understanding. In a pinch, enabling both of these services can help until additional resources can be acquired.
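
Pulling the headline numbers from the three scenarios together makes the trade-off easier to scan. The snippet below simply restates the figures reported earlier in this article; the implied used capacity is derived by dividing the space saved by the savings percentage, assuming that percentage is quoted against the pre-savings used capacity.

    # Headline figures for the three RAID1/RF2 scenarios, restated from the sections above.
    # Each entry: (name, space saved in TB, savings fraction, rebuild minutes, peak write ms).
    scenarios = [
        ("RF2 only",                  0.00, 0.000,  73,  6.09),
        ("RF2 + compression",         1.10, 0.084,  91, 11.58),
        ("RF2 + dedup + compression", 2.92, 0.221, 104, 17.77),
    ]

    for name, saved_tb, savings, rebuild_min, peak_write_ms in scenarios:
        implied = f"~{saved_tb / savings:.1f} TB used" if savings else "n/a"
        print(f"{name:27s} saved {saved_tb:.2f} TB ({savings:.1%}, {implied}), "
              f"rebuild {rebuild_min} min, peak write {peak_write_ms} ms")

The two derived capacities come out to roughly 13TB of used space in both cases, which is a useful sanity check that the absolute and percentage savings figures line up.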

As always, please reach out with questions. 

The next article will cover RAID5/Erasure Coding metrics in the same fashion.

Continue to part III of the series