As you'll come to know (if you don't already), we rely on the Advanced Technology Center (ATC) as a home for innovation, and HCI is no stranger to new tech and processes. Within the ATC, we've created several HCI instances where customers can perform customized on-demand testing and validation.

The HCI lab environment can be used as a one-stop shop to test any HCI need, whether that's through a proof of concept (POC), an on-demand lab or an HCI workshop.

HCI labs and POC uses

Plenty will differ across POCs for different customers, and rightfully so, especially as we dig deeper into performance requirements. However, much will be similar across most POCs, and those commonalities can be leveraged to speed up the decision-making process.

The most important part of a POC is defining the end goal: what will be considered a success. To do this, we have several tools at our disposal: workshops, scorecards, test plans and benchmarking tools such as VDBench for workload simulation.

Test plan

The core of a POC is developing an appropriate test plan. Below you'll find an outline of a typical test plan. Each vendor tested will have its own results recorded within the test plan for quick reference and comparison.

Most importantly, this is the foundation of the POC, and we can easily customize these tests to meet customer requirements as needed. Below is an example of a test plan table of contents.

Table of contents example

Test tools and output

Talk to any vendor, and they'll have a set of tools they would prefer us to use for testing. For example, Nutanix would love for us to use X-Ray, and VMware would love for us to use HCIBench. Unfortunately, there's a perception problem with this. 

We're not saying that vendors write tools to make competing platforms perform poorly, but why take even that slight risk? This is one of the main reasons our performance-generation tool of choice is VDBench.

VDBench is a command-line tool we use to generate disk I/O that simulates, as closely as possible, a customer's storage performance requirements. It lets us get very granular and prescriptive in how we perform these tests. Generating workloads is easy, but having the proper methodology behind them is equally important, and we want to address some of that here. Brace yourself, we're going into the weeds.
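
To make that concrete, here's a minimal sketch of a VDBench parameter file. The device path, thread count and I/O mix below are hypothetical placeholders, not a recommendation: a storage definition (sd) points at the device under test, a workload definition (wd) describes the I/O pattern, and a run definition (rd) controls rate and duration.

* Minimal VDBench parameter file (all values hypothetical)
* sd = storage definition: the device under test
sd=sd1,lun=/dev/sdb,openflags=o_direct,threads=16
* wd = workload definition: 4k blocks, 70 percent reads, fully random
wd=wd1,sd=sd1,xfersize=4k,rdpct=70,seekpct=100
* rd = run definition: run uncapped for 10 minutes, report every 5 seconds
rd=rd1,wd=wd1,iorate=max,elapsed=600,interval=5

A file like this is then passed to VDBench with ./vdbench -f <parmfile> -o <output_directory>.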

VDBench – The fill

Whenever performance testing is required, we strive to fill 40 to 60 percent of a system's total capacity. This concept is important. First and foremost, and this is especially true with spinning media, testing a system at 1 percent of capacity will look significantly different than testing it at 50 percent.

When looking at flash media, the first write to flash will always be the fastest. Every rewrite thereafter requires an erase operation before the write can complete. While the difference may be minimal, it is important to measure. The real performance capability of a system shows not when it is brand new, but after it has been running for some time with workloads reading and writing data.
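
As an illustration, a fill pass along these lines could look like the following sketch. The device, range and sizes are hypothetical; the range parameter limits I/O to a slice of the device to approximate a 50 percent fill.

* Hypothetical fill pass: large sequential writes across half the device
sd=sd1,lun=/dev/sdb,openflags=o_direct,threads=8,range=(0,50)
* rdpct=0 means 100 percent writes; seekpct=eof ends the run at end-of-media
wd=wd_fill,sd=sd1,xfersize=1m,rdpct=0,seekpct=eof
* elapsed is a ceiling; the run finishes when the addressable range is written
rd=rd_fill,wd=wd_fill,iorate=max,elapsed=36000,interval=30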

VDBench – Aging

We talked about the fill process above; to save time, large sequential block sizes are typically used for it. Generating tens or hundreds of terabytes of data is very time-consuming if we try to use a random set of block sizes.

This is where aging comes in to save the day. The aging process randomizes the block size to provide a more accurate and realistic testing environment.
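
A sketch of what an aging pass could look like, assuming the same hypothetical device as above; the xfersize distribution pairs each block size with the percentage of I/Os that should use it.

* Hypothetical aging pass: random overwrites with a mix of block sizes
sd=sd1,lun=/dev/sdb,openflags=o_direct,threads=16,range=(0,50)
* xfersize distribution: (size,pct,...) pairs; percentages total 100
wd=wd_age,sd=sd1,xfersize=(4k,25,8k,25,64k,30,256k,20),rdpct=0,seekpct=100
rd=rd_age,wd=wd_age,iorate=max,elapsed=7200,interval=30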

VDBench – The curve

Some may argue that the curve isn't the best representation of a platform's capabilities, calling the results "hero numbers" or "unrealistic workloads." Honestly, they're probably right, but there are certain things that can be deduced from running these workloads.

Below are two example outputs of a curve test. The curve test essentially drives the system as hard as it can, then stair-steps through 10 percent increments of that maximum, mapping latencies at each increment.

Curve Test example (4K block workload)

Curve Test example (large block workload)

Why is this important? Using the examples above (specifically the first image), it's easy to see that this particular HCI system handles 4K blocks very well. If we know the workload destined for this HCI system uses mostly smaller block sizes, that puts the mind at ease.

Conversely (looking at the second image), if we know that the workload being evaluated for the HCI platform has mostly large block sizes, perhaps a different solution should be used.

There are a lot of caveats behind these tests and they need to be addressed as part of a much bigger discussion, but we strive to ensure our customers walk out of POCs with a clear understanding of what these are and why they matter.
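
For reference, a curve run like the ones shown above can be requested with iorate=curve: VDBench first runs an uncontrolled pass to find the maximum rate, then replays the workload at each listed percentage of that maximum. Again, the device and workload mix here are hypothetical.

* Hypothetical curve run: 4k, 70 percent read, fully random workload
sd=sd1,lun=/dev/sdb,openflags=o_direct,threads=16,range=(0,50)
wd=wd_4k,sd=sd1,xfersize=4k,rdpct=70,seekpct=100
* iorate=curve: max pass first, then each curve point as a percent of max
rd=rd_curve,wd=wd_4k,iorate=curve,curve=(10,20,30,40,50,60,70,80,90,100),elapsed=300,interval=5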

Resiliency testing – The drive pull

Last, but certainly not least, testing resiliency and how systems react is often something our customers want to see. Some tests are pretty simple (pull the network, fail a node, etc.). However, there is one specific test that keeps coming up over and over again and generates a lot of discussion: the drive pull.

You might ask yourself: why does this generate any discussion at all? It seems normal; let's simulate a drive failure. To some extent that logic is right, but not entirely.

Think about a drive failing in the data center. It wasn't unexpectedly pulled from the system. There is a surprising amount of logic, testing and reporting built into these systems to detect and track drive reliability. If you're unfamiliar with these mechanisms, there's a great Wikipedia article on S.M.A.R.T. monitoring and testing. All of that to say: 95 times out of 100, a drive is removed from a system somewhat gracefully.

However, as mentioned in a previously written article, most data center outages are due to human error, and that makes this test worth understanding. If you don't know someone with firsthand experience, you've definitely heard horror stories of someone pulling the wrong cables, accidentally hitting the emergency shut-off button or pulling the wrong failed drive. Understanding how systems react, how quickly they recover and what impact on performance to expect is also important.

Take advantage of our HCI expertise

We oftentimes take for granted the things that appear easy on the surface, but when we dig under the covers, we find that far more thought has been put into them than we imagined.

That's where the expertise of our ATC team and experience testing different HCI solutions can help save a lot of time and future headaches. If you're evaluating HCI and want to perform a POC, reach out to your WWT account team and schedule time with our experts.

Ready to jump in? Schedule an HCI Workshop.
