HCI Storage Services Trade-Offs Analyzed: Are They Worth It? – Part IV

Welcome back! This is the last article in our series analyzing storage services for HCI environments. If this is the first article you've come across, we highly recommend reading the others, which provide the necessary context and explanations.

Part I – Part II – Part III

Throughout the series, the question we tried to answer has been: Is it worth enabling storage services in an HCI environment? As we mentioned before, every customer, data set and risk tolerance level is different, which affects both the results and the acceptable thresholds.

This article summarizes our findings and gives some general recommendations for customers and architects to consider before enabling or designing HCI environments with storage services.

The scenario

In Part I of the series, a baseline was established to test against. Several fictitious requirements were laid out to create a customer scenario, and tests were performed with different storage services enabled. The data and results gathered with these requirements were shown and explained in Part II and Part III. Analyzing the data and the results for this specific scenario and considering the observed impacts, the table below gives insights into which storage services should be considered.

RAID1/RF2: Recommended – This is the default configuration and will always be the most performant; however, this configuration will not save any storage space. The requirements were met during every phase except for the data rebuild phase, where the write latency averaged 6.09ms for 73 minutes. We consider this acceptable as the systems stabilized after this phase and met all requirements.
RAID1/RF2 with compression: Recommended – Enabling compression saved just above 8% of used storage space. The requirements were met during every phase except for the data rebuild phase, where the latency averaged 11.58ms for 91 minutes. We consider this acceptable as the systems stabilized after this phase and met all requirements.
RAID1/RF2 with deduplication and compression: Recommended with caution – Enabling both deduplication and compression services saved just above 22% storage space. The requirements were met during every phase except for the data rebuild phase, where latency averaged 17.77ms for 104 minutes. This latency is more than three times the requirement, which should be considered. We consider this somewhat acceptable as the systems stabilized immediately after the rebuild was completed, and the systems met all requirements.
RAID5/Erasure Coding: Caution – Enabling RAID5/Erasure Coding with no storage services saves a guaranteed 30% of storage space. As mentioned before, this is inherent to the storage technology. The requirements were not met during any phase outside of steady state. The read and write latencies were both above 5ms during the failed host and data rebuild phases. Write latency remained above the defined requirement through the data rebalance phase, where it ran at about twice the threshold. These observed degradations continued for over four hours.

While the requirements were not met for this scenario from a latency standpoint, enabling this storage configuration would not necessarily be a bad thing, depending on the usable storage situation. We would, however, caution a customer against moving in this direction. Ultimately, the systems would be able to sustain the workload in a host-down situation within an acceptable latency threshold, but adding more storage resources would be the preferred route.
RAID5/Erasure Coding with compression: Strong caution – Enabling compression saved just above 31% of used storage space. When compared to RAID5/Erasure Coding only results, during the data rebuild phase, the system performed marginally better with compression enabled than without. This test was performed several times with similar results. The requirements were not met during any phase outside of steady state. It is also worth noting that even within steady state, the systems were encroaching on the latency threshold.

The write latency during the workload catchup phase is two times the requirement. More important, the systems were unable to meet the minimum application IOPS requirement during this phase, which means the N+1 requirement was not met. The systems would only begin to stabilize after the resources were brought back online and started accepting IO. This leaves the environment in a degraded state (elevated write latencies) for a significant amount of time as the data rebalance phase completes and queued workload IO catches up. In this scenario and given the data points above, we would strongly caution this customer against going this route, as the minimal increase in storage efficiency over RAID5/Erasure Coding alone does not justify the inability of the systems to meet the minimum requirements.
RAID5/Erasure Coding with deduplication and compression: Not recommended – This configuration saves 35% of used storage space. In this scenario, we would not recommend the customer move forward with RAID5/Erasure Coding and both storage services enabled. From the beginning, in steady state, the write latency threshold was violated, albeit slightly. Much like the compression-only storage services with RAID5/Erasure Coding, the systems were unable to meet the minimum requirements with N+1. The data also shows that application responsiveness would continue to degrade in the form of elevated read and write latency for much longer than would be acceptable for a properly sized environment.
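
The trade-off in the table above comes down to simple arithmetic: how much space each configuration gives back versus how far rebuild latency overshoots the requirement. The sketch below works that out from the figures quoted in this article. It makes two assumptions: the latency requirement is taken to be 5ms (inferred from the RAID5/Erasure Coding row; the full requirements are defined in Part I), and every savings percentage is treated as relative to the same RAID1/RF2 baseline. The function names are illustrative only.

```python
# Assumption: 5 ms latency requirement, inferred from the text above.
LATENCY_THRESHOLD_MS = 5.0

# (configuration, space savings %, avg rebuild latency ms, rebuild minutes)
# Savings quoted as "just above N%" are entered as N.
# None means this article does not quote a single figure for that cell.
RESULTS = [
    ("RAID1/RF2", 0, 6.09, 73),
    ("RAID1/RF2 + compression", 8, 11.58, 91),
    ("RAID1/RF2 + dedup + compression", 22, 17.77, 104),
    ("RAID5/EC", 30, None, None),
    ("RAID5/EC + compression", 31, None, None),
    ("RAID5/EC + dedup + compression", 35, None, None),
]


def latency_ratio(latency_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return how many times the threshold the observed latency is."""
    return latency_ms / threshold_ms


def space_consumed_tb(baseline_tb, savings_pct):
    """Return the space consumed after applying savings to a baseline footprint."""
    return baseline_tb * (1 - savings_pct / 100)


def summarize(baseline_tb=100.0):
    """Build (name, TB consumed, latency ratio or None, rebuild minutes) rows."""
    rows = []
    for name, savings, latency, minutes in RESULTS:
        ratio = None if latency is None else round(latency_ratio(latency), 2)
        consumed = round(space_consumed_tb(baseline_tb, savings), 1)
        rows.append((name, consumed, ratio, minutes))
    return rows


if __name__ == "__main__":
    for name, consumed, ratio, minutes in summarize():
        overshoot = "n/a" if ratio is None else f"{ratio}x threshold for {minutes} min"
        print(f"{name}: {consumed} TB per 100 TB baseline, rebuild latency {overshoot}")
```

Read this way, deduplication plus compression on RAID1/RF2 frees 22 TB of every 100 TB at the cost of a rebuild phase that runs at roughly 3.5 times the assumed threshold for 104 minutes.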

The answer

In the first article, it was observed that during normal operating conditions, enabling any of these storage services would not have much of an impact on the workloads and that systems would be able to keep up with the demands of the applications.

The images below show results from the curve tests, which depict the sustained maximum performance of each storage service configuration on the same hardware.

Graph displays results from curve tests performed across different storage services

Table displays results from curve tests performed across different storage services

Important note: The images above are not a verdict on whether a customer should turn storage services on. Rather, they show that expected storage savings must be weighed against performance impact before enabling them. As an example, if our fictitious scenario only required 20,000 IOPS, the systems would have been able to easily keep up with any of the storage services enabled.
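
One way to apply this reasoning is to compare the application's required IOPS, plus some headroom, against the sustained maximum measured for each configuration. The sketch below is a minimal illustration of that check. The maximum IOPS values in it are hypothetical placeholders, not the curve-test results shown above, and the 20% headroom is an assumption; substitute the numbers measured on your own hardware.

```python
# HYPOTHETICAL placeholder values for illustration only.
# These are NOT the curve-test results from this series.
HYPOTHETICAL_MAX_IOPS = {
    "RAID1/RF2": 90_000,
    "RAID1/RF2 + compression": 80_000,
    "RAID1/RF2 + dedup + compression": 70_000,
    "RAID5/EC": 60_000,
    "RAID5/EC + compression": 55_000,
    "RAID5/EC + dedup + compression": 45_000,
}


def configs_with_headroom(max_iops_by_config, required_iops, headroom=0.2):
    """Return the configurations whose sustained maximum covers the
    requirement plus a safety margin (headroom=0.2 means 20% extra)."""
    needed = required_iops * (1 + headroom)
    return [
        name
        for name, max_iops in max_iops_by_config.items()
        if max_iops >= needed
    ]


if __name__ == "__main__":
    for required in (20_000, 60_000):
        viable = configs_with_headroom(HYPOTHETICAL_MAX_IOPS, required)
        print(f"{required} IOPS required: {len(viable)} viable configuration(s)")
        for name in viable:
            print(f"  {name}")
```

With a low requirement every configuration passes, and the decision can be made on storage savings alone. As the requirement approaches the hardware's limits, the list shrinks toward the configurations with fewer services enabled.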

Recommendations

To avoid unexpected experiences with your HCI environment (and quite honestly, any storage technology), below are some recommendations that should help avoid pitfalls.

Understanding your workload. Typical conversations around performance are brief, and the provided metric is singular: IOPS. Read/write ratios, block size, sequential vs. random access, and working set size are often overlooked but will absolutely impact how systems perform when brought into production. There are several tools out there that can help monitor workloads so you can get a better understanding of the requirements. Our services team is also able to help on this front.
Cost analysis. The performance impact of enabling some storage services to get more usable space should be weighed against adding more disks or nodes to the environment. In the long term, the operational cost of an environment can be lowered if there are fewer problems.
Flexibility. Not all HCI platforms offer flexibility with regard to turning storage services on or off, and that's okay. However, for those that do, the ability to turn these services on or off for specific workloads should be considered instead of a broad-stroke approach. Workloads requiring high performance can stay with higher-performing configurations, while large storage applications with lower performance needs (a file server, for example) can utilize something like RAID5/Erasure Coding. An HCI briefing or an HCI workshop can help answer some of these questions.
Testing. We keep reiterating the point that everyone's requirements are different. Prior to purchasing HCI, it is highly recommended to test the specific use case(s). This allows expectations to be set for the technologies being evaluated. WWT's ATC is a great resource for these efforts.
Baselining. Once the technology is deployed, regardless of the ability to tweak storage services, we recommend baselining the system to gauge its true potential. There are many tools, offered both by independent parties and directly by the vendor, that can help make the process painless. This process can also provide a reference point for future decisions about turning on storage services or expanding resources.
Training. Last but certainly not least, do not overlook training, especially for monitoring and management of the environment. Proper training will help alleviate accidental misconfigurations and operational missteps.
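
To make the first recommendation concrete, the characteristics listed under "Understanding your workload" can all be derived from a trace of individual IO requests. The sketch below is a minimal illustration, assuming a hypothetical trace format (operation, request size, byte offset) and a 4 KiB block granularity for the working set; real monitoring tools export their own formats, so the `IoSample` type and `profile` function are placeholders rather than any tool's API.

```python
from dataclasses import dataclass

BLOCK_BYTES = 4096  # Assumed granularity for counting the working set.


@dataclass
class IoSample:
    op: str          # "read" or "write"
    size_bytes: int  # request size
    offset: int      # starting byte offset on the volume


def profile(samples):
    """Summarize read ratio, average block size, sequential share and
    working set size for an ordered list of IO samples."""
    if not samples:
        raise ValueError("at least one sample is required")

    total = len(samples)
    reads = sum(1 for s in samples if s.op == "read")

    # A request counts as sequential when it starts where the previous one ended.
    sequential = sum(
        1
        for prev, cur in zip(samples, samples[1:])
        if cur.offset == prev.offset + prev.size_bytes
    )

    # Working set: distinct blocks touched by any request in the trace.
    touched = set()
    for s in samples:
        first = s.offset // BLOCK_BYTES
        last = (s.offset + s.size_bytes - 1) // BLOCK_BYTES
        touched.update(range(first, last + 1))

    return {
        "read_pct": 100 * reads / total,
        "avg_block_kib": sum(s.size_bytes for s in samples) / total / 1024,
        "sequential_pct": 100 * sequential / (total - 1) if total > 1 else 0.0,
        "working_set_kib": len(touched) * BLOCK_BYTES / 1024,
    }


if __name__ == "__main__":
    trace = [
        IoSample("read", 4096, 0),
        IoSample("read", 4096, 4096),
        IoSample("write", 8192, 8192),
        IoSample("read", 4096, 0),
    ]
    for key, value in profile(trace).items():
        print(f"{key}: {value:.2f}")
```

Two workloads with the same IOPS figure can produce very different profiles here, which is why a single IOPS number is not enough to size an environment or to predict the impact of a storage service.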

Thank you for following along. If you have any questions, please reach out to your WWT account team or post the question below. We hope this information was useful.
