In This Article

Welcome to the third article in the series on HCI storage services. If this is the first article you've stumbled upon, we highly recommend (almost require) reading from the beginning of the series (Part I), as the previous articles (Part II) provide the baseline and context for the information below.

In this week's edition, we look at RAID5/Erasure Coding and the impacts of turning on different storage services. The comparisons below are calculated using the baseline metrics and the respective phases established in the previous articles, which used the default storage configuration of RAID1/RF2. This is important to remember.

The image below is a reminder of the metrics this article will compare against.

Graph displays metrics during a single host failure - RAID1/RF2

Definitions of each of the phases detailed below were previously provided in Part II of the series. Refer back to the article if something is unclear.

Steady state

In the first article, we provided a glimpse of how the system performed for RAID1/RF2 and the different storage services during a steady state (nothing failed). Below are the metrics comparing the different RAID5/Erasure Coding steady states with the baseline included.

IOPs steady state across different storage configurations

The graph above shows that in a steady state, the environment is able to sustain the workload requirements defined in the first article with no perceived issues. There is a small increase in latency as the compression and deduplication services are enabled, and one configuration (RAID5/Erasure Coding with both storage services enabled) already violates the latency requirement.

RAID5/Erasure Coding only

This section details how the environment reacts during a single host failure. These metrics can also be considered when putting a host in maintenance mode with some slight differences depending on the actions taken.

Graph displays metrics during a single host failure - RAID5/Erasure Coding

The above graph has several metrics that need to be broken down and explained. As a reminder, the comparisons below are to the default storage configuration of RAID1/RF2 with no storage services enabled within the respective phases unless otherwise noted.

  • Steady state – Running RAID5/Erasure Coding has a very small impact on latencies during steady state, and the system saved ~4TB (30.6%) of used space. This is expected behavior.
    • Read latency increases by 0.05ms, to 1.32ms.
    • Write latency increases by 1.79ms, to 2.95ms.
  • Failed host – The graph shows that the system is unable to sustain a host failure without an impact on performance. The following metrics are observed:
    • IOPs drop by 16.07%.
    • Read latency increases by 4.04ms, to 5.47ms.
    • Write latency increases by 7.01ms, to 8.32ms.
  • Data rebuild – Several things to note within this section:
    • IOPs drop by 21.35%.
    • Read latency increases by 2.50ms, to 6.39ms.
    • Write latency increases by 5.24ms, to 11.33ms.
    • Time in this state: 66 minutes.
  • Workload catchup – Due to the large drop in IOPs during the data rebuild phase (21.35% below the baseline data rebuild phase and 33.78% below the baseline steady state phase), we observed an increase in write latency that will have an impact on application performance.
  • Data rebalance – A few things to note within this section:
    • The IOPs show a 6.86% increase over steady state. This is expected behavior as workloads shift back to the recovered host.
    • Read latencies are in an acceptable range.
    • Write latencies show an increase of 7.94ms, to 10.14ms, until the rebalance is complete.
    • Time in this state: 117 minutes.
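The latency bullets above pair a delta with the resulting absolute value ("increases by X ms to Y ms"). A minimal sketch of how those pairs are derived, using the RAID5/Erasure Coding steady-state figures from above; the baseline values are back-computed from the quoted deltas and are therefore approximate:

```python
# Sketch: how the "increases by X ms, to Y ms" pairs are derived.
# Observed values come from this article; the baseline values are
# back-computed from the quoted deltas (baseline = observed - delta).

def latency_note(baseline_ms: float, observed_ms: float) -> str:
    delta = observed_ms - baseline_ms
    return f"increases by {delta:.2f}ms, to {observed_ms:.2f}ms"

# RAID5/Erasure Coding steady state vs. the RAID1/RF2 baseline
print("Read latency", latency_note(1.27, 1.32))
print("Write latency", latency_note(1.16, 2.95))
```

The same arithmetic applies to every latency bullet in the sections that follow.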

RAID5/Erasure Coding with compression

This section details how the environment reacts when enabling compression during a single host failure. These metrics can also be considered when putting a host in maintenance mode with some slight differences depending on the actions taken.

Graph displays metrics during a single host failure when compression has been enabled

The above graph has several metrics that need to be broken down and explained. As a reminder, the comparisons below are to the default storage configuration of RAID1/RF2 with no storage services enabled within the respective phases unless otherwise noted.

  • Steady state – Running RAID5/Erasure Coding with only the compression service enabled has a small impact on latencies during steady state, and the system saved ~4.13TB (31.31%) of used space.
    • Read latency increases by 0.96ms, to 2.23ms.
    • Write latency increases by 3.44ms, to 4.61ms.
  • Failed host – The graph shows that the system is unable to sustain a host failure without an impact on performance. The following metrics are observed:
    • IOPs drop by 26.72%.
    • Read latency increases by 4.99ms, to 6.42ms.
    • Write latency increases by 8.55ms, to 9.86ms.
  • Data rebuild – Several things to note within this section:
    • IOPs drop by 12.75%.
    • Read latency increases by 2.65ms, to 6.42ms.
    • Write latency increases by 4.56ms, to 10.65ms.
    • Time in this state: 87 minutes.
  • Workload catchup – Due to the large drop in IOPs during the data rebuild phase (12.75% below the baseline data rebuild phase and 26.52% below the baseline steady state phase), we observed an increase in write latency that will have an impact on application performance. A very important metric observed in this stage is that the system is only able to sustain 105,679 IOPs, which is lower than the 107,676 IOPs defined as a customer requirement in the first article; this shortfall will continue until the failed resources are brought back online and accepting workload.
  • Data rebalance – A few things to note within this section:
    • The IOPs show a 9.83% increase over steady state. This is expected behavior as workloads shift back to the recovered host.
    • Read latencies are in an acceptable range at 3.74ms.
    • Write latencies show an increase of 6.45ms, to 8.65ms, until the rebalance is complete and the workload has caught up.
    • Time in this state: 101 minutes.

RAID5/Erasure Coding with deduplication and compression

This section details how the environment reacts when enabling both deduplication and compression storage services during a single host failure. These metrics can also be considered when putting a host in maintenance mode with some slight differences depending on the actions taken.

Graph displays metrics during a single host failure when deduplication and compression have been enabled

The above graph has several metrics that need to be broken down and explained. As a reminder, the comparisons below are to the default storage configuration of RAID1/RF2 with no storage services enabled within the respective phases unless otherwise noted.

  • Steady state – Running RAID5/Erasure Coding with both deduplication and compression services enabled has a small impact on latencies during steady state, and the system saved ~4.72TB (35.78%) of used space.
    • Read latency increases by 1.41ms, to 2.68ms.
    • Write latency increases by 4.16ms, to 5.33ms.
  • Failed host – The graph shows that the system is unable to sustain a host failure without an impact on performance. The following metrics are observed:
    • Total IOPs drop by 28.96%.
    • Read latency increases by 2.38ms, to 6.27ms.
    • Write latency increases by 8.99ms, to 10.30ms.
  • Data rebuild – Several things to note within this section:
    • Total IOPs drop by 23.50%.
    • Read latency increases by 2.38ms, to 6.27ms.
    • Write latency increases by 9.82ms, to 15.90ms.
    • Time in this state: 124 minutes.
  • Workload catchup – Due to the large drop in IOPs during the data rebuild phase (23.50% below the baseline data rebuild phase and 35.58% below the baseline steady state phase), we observed an increase in write latency that will have an impact on application performance. A very important metric observed in this stage is that the system is only able to sustain 101,682 IOPs, which is lower than the 107,676 IOPs required in the first article; this shortfall will continue until the failed resources are brought back online and accepting workloads.
  • Data rebalance – A few things to note within this section:
    • The IOPs show a 1.23% decrease relative to steady state. This is not expected behavior.
    • Read latencies are in an acceptable range at 4.58ms.
    • Write latencies show an increase of 7.18ms, to 9.38ms, until the rebalance is complete and the workload has caught up.
    • Time in this state: 67 minutes.
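The 'workload catchup' shortfalls quoted in the last two sections can be sanity-checked against the customer requirement from the first article. A minimal sketch, using the IOPs figures quoted above (the configuration labels are our shorthand):

```python
# Sketch: sustained IOPs during 'workload catchup' vs. the 107,676 IOPs
# customer requirement defined in the first article. Figures are the
# ones quoted in this article.

REQUIRED_IOPS = 107_676

sustained = {
    "RAID5/EC + compression": 105_679,
    "RAID5/EC + dedup + compression": 101_682,
}

for config, iops in sustained.items():
    shortfall = REQUIRED_IOPS - iops
    status = "meets requirement" if shortfall <= 0 else f"short by {shortfall:,} IOPs"
    print(f"{config}: {iops:,} IOPs -> {status}")
```

Both configurations come up short, which is why the shortfall persists until the failed resources are back online.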

Notable observations of RAID5/Erasure Coding storage services

There are a lot of data points to unpack in the observations above as they relate to the requirements and the scenario established in the first article. We focus on the major findings in this summary.

On the surface, transitioning from RAID1/RF2 to RAID5/Erasure Coding with any of the storage services enabled doesn't appear to have much of an impact during normal operations and can save upwards of 35% of used storage. We do, however, observe a latency requirement violation during steady state with both storage services enabled, which is a cause for concern. The storage space efficiency can be appealing as storage is consumed over time, but how the system performs during major events such as a host failure or putting a host in maintenance mode needs to be addressed.

RAID5/Erasure Coding with no storage services guarantees a roughly 30% storage saving over the default RAID1/RF2 configuration. This space-saving guarantee is inherent to RAID5/Erasure Coding regardless of how dedupable or compressible the data set is, which is the main reason this configuration can be appealing to customers.
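The guaranteed saving follows directly from the capacity overhead of each scheme: RF2 writes two full copies of the data (2.0x), while RAID5/Erasure Coding stores data plus parity. A minimal sketch, assuming a 3+1 stripe; the actual stripe geometry depends on the HCI platform and cluster size, which is one reason the measured ~30.6% lands slightly under the theoretical figure:

```python
# Sketch: capacity overhead of RAID1/RF2 vs. RAID5/Erasure Coding.
# The 3+1 stripe width is an assumption; actual geometry depends on
# the HCI platform and the number of hosts in the cluster.

def physical_tb(logical_tb: float, data_blocks: int, parity_blocks: int) -> float:
    """Physical capacity consumed to store `logical_tb` of logical data."""
    return logical_tb * (data_blocks + parity_blocks) / data_blocks

logical = 9.0  # TB of logical data (hypothetical)

rf2 = physical_tb(logical, 1, 1)    # RF2: two full copies -> 2.00x overhead
raid5 = physical_tb(logical, 3, 1)  # 3+1 EC stripe -> 1.33x overhead

saving = 1 - raid5 / rf2
print(f"RF2: {rf2:.1f} TB, RAID5/EC: {raid5:.1f} TB -> {saving:.1%} saving")
```

Unlike compression or deduplication, this ratio holds for any data set, since it is a property of the layout rather than of the data.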

Analyzing the data suggests that customers should strongly consider the impact of a host failure/maintenance event, as both read and write latencies increase throughout the different phases. More importantly, the write latency requirement is not met in any of the phases, a condition that lasts for over four hours. The data shows that the cluster would eventually stabilize and sustain the workload requirements during a failure/maintenance event. In this scenario, enabling RAID5/Erasure Coding only can be considered a net gain as long as the latencies are understood and the risk is accepted, with strong caution.
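As a rough check on the length of the degraded window, summing the quoted 'data rebuild' and 'data rebalance' durations already gives roughly three hours per configuration, before the 'failed host' and 'workload catchup' phases (whose durations are not quoted) are counted:

```python
# Sketch: time spent in the rebuild and rebalance phases, using the
# durations (in minutes) quoted in this article. The failed-host and
# workload-catchup phases add further time but are not quoted here.

phase_minutes = {
    "RAID5/EC only": {"rebuild": 66, "rebalance": 117},
    "RAID5/EC + compression": {"rebuild": 87, "rebalance": 101},
    "RAID5/EC + dedup + compression": {"rebuild": 124, "rebalance": 67},
}

for config, phases in phase_minutes.items():
    total = sum(phases.values())
    print(f"{config}: {total} minutes ({total / 60:.1f} hours)")
```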

Enabling only the compression service with RAID5/Erasure Coding produced some unexpected results with respect to performance. The configuration saved 31.31% storage, which is only a 1% increase over RAID5/Erasure Coding with no storage services enabled. The 'failed host' phase showed a larger drop in IOPs than with no storage services; however, the system performed slightly better during the 'data rebuild' phase, though that phase took 21 minutes longer. Both tests (RAID5/Erasure Coding and RAID5/Erasure Coding with compression enabled) were re-run with similar results. The most important result of this test is that during the 'workload catchup' phase, the system was unable to meet the baseline minimum requirement of 107,676 IOPs. This means a direct impact on application performance, via increased latency, will be experienced until the failed resources are brought back online and accepting workload. In this instance, we would not recommend enabling compression: a 1% increase in space efficiency is not worth an undersized and underperforming environment during a host failure or maintenance event.

Finally, enabling both compression and deduplication storage services produced the most negative results of all the tests performed. The configuration saved 35.78% storage, an increase of 5% over RAID5/Erasure Coding with no storage services enabled. During the 'data rebuild' stage, write latencies held consistently at 15.90ms (average); more importantly, during the 'workload catchup' and 'data rebalance' phases, the system was unable to meet the baseline minimum requirement of 107,676 IOPs. Applications will not only be severely impacted for the duration of the different phases but will also experience increased read and write latency for a long time while the system stabilizes back to 'steady state', even after the failed resources are back online and accepting workload. In this instance, we would not recommend enabling both storage services: a 5% increase in space efficiency is not worth a severely undersized and underperforming environment during a host failure or maintenance event.

The next article in the series will focus on overall observations and recommendations as HCI is deployed and sized in customer environments. As always, please reach out with questions. 

Continue to part IV of the series