ATC Insight • April 1, 2020 • 13 minute read

Resiliency Testing VxRail

One of our customers wanted to look at testing resiliency on a vSAN cluster. More specifically, they wanted to test the resiliency of a VxRail solution. For these tests we used an 8 node VxRail that was available inside the Advanced Technology Center (ATC). Below is a high-level design of the environment. The goal of this testing was to see how VxRail and vSAN would react to disruptive and non-disruptive tests.

ATC Insight

Resiliency Testing (VxRail)

This ATC Insight captures the results of several tests we normally execute around the resiliency of compute platforms; in this case we specifically tested VxRail. During the resiliency testing we ran a Vdbench workload of 18K IOps against the VxRail. Why 18K IOps? This came from a customer requirement of 450 IOps/TB of space on the total vSAN datastore, which was filled to 40% of overall capacity.
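
As a rough sanity check on the 18K figure, the customer's ratio can be turned around to see what capacity it implies. The sketch below is only arithmetic on the two numbers quoted above. The function name is hypothetical, and the article does not say whether the capacity basis was raw, usable, or consumed space, so treat the result as an implied figure rather than a measured one.

```python
# Figures quoted in the article.
TARGET_IOPS = 18_000   # Vdbench workload rate
IOPS_PER_TB = 450      # customer requirement


def implied_capacity_tb(target_iops: float, iops_per_tb: float) -> float:
    """Return the capacity (in TB) that a target IOps rate implies at a given IOps/TB ratio."""
    if iops_per_tb <= 0:
        raise ValueError("iops_per_tb must be positive")
    return target_iops / iops_per_tb


print(implied_capacity_tb(TARGET_IOPS, IOPS_PER_TB))  # 40.0
```

In other words, 18K IOps at 450 IOps/TB corresponds to a capacity basis of about 40 TB.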

**This resiliency testing with this VxRail solution occurred in December 2019 and carried into January of 2020**

See the excerpt below from the Vdbench config file if you want more detail on the actual workload:

**wd=wd1,sd=sd,xf=8k,rdpct=66,rhpct=22,skew=80 (8k transfer size, 66% read and 34% write, 22% read cache hits, 80% of the total I/O rate goes to this workload)
wd=wd2,sd=sd,xf=64k,rdpct=25,seekpct=0,skew=20 (64k transfer size, 25% read and 75% write, fully sequential I/O with no random seeks, 20% of the total I/O rate goes to this workload)**
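
To make the two workload definitions more concrete, here is a small sketch of how the 18K IOps target would divide between them. It assumes that skew is the percentage of the total I/O rate given to each workload, and that Vdbench's `8k` and `64k` mean 8 KiB and 64 KiB. The `Workload` class and `split_workloads` function are illustrative names, not part of Vdbench.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    xfer_kib: int     # transfer size in KiB (assumed: Vdbench "k" = 1024 bytes)
    read_pct: float   # rdpct
    skew_pct: float   # share of the total I/O rate


# Values copied from the two wd= lines in the article.
WORKLOADS = [
    Workload("wd1", xfer_kib=8, read_pct=66, skew_pct=80),
    Workload("wd2", xfer_kib=64, read_pct=25, skew_pct=20),
]


def split_workloads(total_iops: float, workloads: list[Workload]) -> list[dict]:
    """Split a total IOps target across workloads by skew and derive read/write IOps and bandwidth."""
    rows = []
    for w in workloads:
        iops = total_iops * w.skew_pct / 100
        read_iops = iops * w.read_pct / 100
        rows.append({
            "name": w.name,
            "iops": iops,
            "read_iops": read_iops,
            "write_iops": iops - read_iops,
            "mib_per_s": iops * w.xfer_kib / 1024,
        })
    return rows


for row in split_workloads(18_000, WORKLOADS):
    print(row)
```

Under those assumptions, wd1 accounts for 14,400 IOps (9,504 reads, 4,896 writes) at about 112.5 MiB/s, and wd2 accounts for 3,600 IOps (900 reads, 2,700 writes) at about 225 MiB/s. The smaller share of I/Os carries roughly twice the bandwidth because of its larger transfer size, and the blended mix works out to about 57.8% reads.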

We could have run this workload at a much higher rate, but held true to our customer's specific workload requirement. Additionally, this workload ran during each of the tests performed below. The workload was stopped when the cluster looked normal again. Then, the workload was restarted for each subsequent test.

Deploy and Clone VMs

One of the requirements was to clone and deploy seven VMs against the cluster while the workload was running. This test was orchestrated inside the vSphere client, and (as you can see below) is displayed via a Grafana output. This output shows the stats from vCenter as well as the running workload. During deployment, we saw bandwidth increase as clones were sent to other machines in the VxRail cluster. As expected, there was no impact to any operations.

Bandwidth progress as clones were sent to other machines in the VxRail cluster

Pull Capacity Drive for less than allotted repair time (under 5 minutes)

Below is a screenshot showing how we changed the object repair timer from the default 60 minutes to 5 minutes in the test plan. Why did we do this? First, we didn't want to wait 60 minutes. Second, we wanted to see what was going on under the covers. Specifically, we wanted to understand what vSAN does when a drive stops working and is removed from the running system for less than the allotted repair timer setting.
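
The role of the timer in the drive pull tests can be summarized with a deliberately simplified model. This is a sketch of the behavior observed in this article only, not of vSAN's internal logic; the `Outcome` enum and `outcome_after_absence` function are hypothetical names, and the 4 and 9 minute inputs are the approximate absence times from the two capacity drive tests.

```python
from enum import Enum


class Outcome(Enum):
    # Drive came back before the timer expired: only the changed data is re-synced.
    DELTA_RESYNC = "returned before timer expired"
    # Timer expired while the drive was still absent: vSAN starts re-syncing objects.
    REBUILD_STARTED = "timer expired while absent"


def outcome_after_absence(absent_minutes: float,
                          repair_delay_minutes: float = 60.0) -> Outcome:
    """Classify a drive absence against the object repair timer (default 60 minutes)."""
    if absent_minutes < repair_delay_minutes:
        return Outcome.DELTA_RESYNC
    return Outcome.REBUILD_STARTED


# Timer lowered to 5 minutes, as in the test plan.
print(outcome_after_absence(4, repair_delay_minutes=5).name)  # DELTA_RESYNC
print(outcome_after_absence(9, repair_delay_minutes=5).name)  # REBUILD_STARTED
```

With the default 60 minute timer, both absences would fall into the first case, which is why the timer was shortened for testing.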

Capacity Drive Test 1: Pull the drive, then re-insert it before the 5 minute object repair timer expired. This test was executed on node two.

As soon as the drive was pulled, vCenter alerted us (below):

Screen shot from VxRail alerting the same fault below:

My observations and thoughts in output (below):

The Grafana output below shows two latency spikes: one when the drive was pulled, and a higher one when the drive was re-inserted. When the drive was re-inserted, vSAN worked out what had changed during the roughly four minutes the drive was out of the system. vSAN then quickly began syncing the missing data, and slowed the process down once it determined that not much data was actually missing.

The Grafana output shows two latency spikes

Additional alarms from vCenter when the drive was pulled below:

vSAN Object Health Alerts and Disk Alerts from drive being pulled below (3 screenshots in VxRail):

Details of Drive pull from the iDRAC below:

My overall thoughts on the Capacity Drive Pull Test:

Below is the overall display of time from when the drive pull started to the time when vSAN finished the re-sync operation. The test performed exactly the way it should in my opinion. There was no data loss, no real effect on IOps, and no bandwidth constraints. All around, this was a successful test with good notifications from vCenter, VxRail, and iDRAC that told the same story.

Snapshot Test

In this test, we had to create a snapshot of a virtual machine for standard operations. This is a very simple test contained in resiliency testing but should be included every time. The snapshot completed successfully with no noticeable impact on the Vdbench IOps workload as displayed via VxRail output (below).

vMotion Test

This was specifically a compute vMotion test (a storage vMotion test cannot be completed on a VxRail cluster, as it only has the one vSAN datastore). No significant impact was noticed on the Vdbench workload during this compute vMotion test, and you can see the completed test as displayed via VxRail output (below).

Pull Capacity Drive for more than allotted repair time (over 5 minutes)

In this test, the capacity drive from node 4 was pulled for approximately nine minutes. As with the previous capacity drive test, both vCenter and VxRail Manager alerted us to the drive pull, as you can see below:

When the capacity drive was pulled from node 4, there was no effect on the cluster (other than a momentary spike in IOps), as depicted in the Grafana output below:

Once we hit the 5 minute repair timer, vSAN started to re-sync objects as described in the screen shot from the VxRail output below:

Additionally, from the iDRAC screen scrape we show the capacity drive being removed and reinserted into node 4 after approximately nine minutes below:

My observations and thoughts in output (below):

After 5 minutes, the object timer kicked in, and you can see the increase in both bandwidth and CPU (when it starts to copy data). Once the drive was re-installed, vSAN was able to throttle itself during the repair process. This repair process includes vSAN being able to see what had changed since the drive was last pulled, as well as being able to correct the object repair timer (as depicted below).

A great takeaway (and one of the most important themes from this test) is vSAN throttling. When vSAN throttling kicks in, it is basically protecting you from another problem of saturating the link. Without the throttling, the possibility does exist that the link could become saturated which could cause traffic issues and ultimately a loss of connectivity to critical application traffic. It is outstanding to see vSAN throttling work as designed in this test.
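
One way to picture why throttling matters is a toy bandwidth split on a single link. To be clear, this is not vSAN's actual algorithm, which this article does not describe. It is a simple stand-in in which application I/O is served first and re-sync traffic takes what is left, with a small guaranteed floor so the re-sync still makes progress. The function name, the 20% floor, and the demand figures are all hypothetical.

```python
def share_link(link_mbps: float, app_demand_mbps: float,
               resync_demand_mbps: float, resync_floor_pct: float = 20.0):
    """Return (app_mbps, resync_mbps) granted on a link of the given capacity.

    Illustrative policy only: if both demands fit, both are granted in full.
    Under contention, re-sync keeps a guaranteed floor and application
    traffic gets the rest; re-sync may then use any capacity still unused.
    """
    if app_demand_mbps + resync_demand_mbps <= link_mbps:
        return app_demand_mbps, resync_demand_mbps
    floor = min(resync_demand_mbps, link_mbps * resync_floor_pct / 100)
    app = min(app_demand_mbps, link_mbps - floor)
    resync = min(resync_demand_mbps, link_mbps - app)
    return app, resync


# A 25 Gb link with light application traffic and a re-sync that could fill the link.
print(share_link(25_000, 3_000, 40_000))   # (3000, 22000)
# The same link with heavy application traffic.
print(share_link(25_000, 24_000, 10_000))  # (20000.0, 5000.0)
```

In both cases the re-sync is capped so that the total never exceeds the link, which is the property the test above demonstrated: the repair proceeds without crowding out application traffic.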

Pull Cache Drive for less than allotted repair time (under 5 minutes)

As was the case with the Capacity Drive Pull Test, we changed the object repair timer from the default 60 minutes to 5 minutes in the test plan for the cache drive pull test. We didn't want to wait 60 minutes, and we wanted to see what happens under the covers. Again, we wanted to understand what vSAN does when a drive stops working and is removed from the running system for less than the allotted repair timer setting.

For this test, we pulled the cache drive on node 8. Just like in the capacity drive pull test, vCenter alerted us within 40 seconds that the cache drive was pulled. Then, VxRail manager also alerted us that the cache drive had been pulled as the output shows below:

VxRail manager alerting that the cache drive had been pulled

The cache drive pull did cause a small spike in latency as depicted from this Grafana output below:

Additionally, here is the iDRAC screen shot depicting the cache drive being pulled and re-inserted under 5 minutes below:

Here is the object repair timer starting to calculate time of repair (total resyncing ETA) from the VxRail screen scrape below:

Object repair timer calculating time of repair

My observations and thoughts in output (below):

The Grafana output (below) shows the object repair timer kicking off and starting to sync data from the cache drive. In the cache drive pull test vSAN performed as expected. vSAN handled the re-sync of data properly, and there was no evidence that our IOps workload was negatively impacted.

The same holds true when the cache drive is re-inserted after the allotted 5 minute repair timer has expired. Our testing reflected the same results as the capacity drive pull test presented earlier in this ATC Insight.

Object repair timer starting to sync data from cache drive

Pull host with HA enabled

The next test was to simulate a hardware failure of a host. We dropped host 3 for this test. Before we started this test, we verified that the environment was normalized from previous failure testing as noted on VxRail output below:

We had an IOps workload running at 18K in accordance with the customer test plan, with local host enabled on Vdbench to show the IOps dip when the host went offline. The Grafana output below shows this:

Below is the vCenter screen scrape which verified HA was turned on before starting the host failure test.

Once the host failure was executed, the Grafana output below shows the loss in IOps as the four worker VMs that resided on the host were lost:

Drop in IOps due to losing host 560x4 worker VMs
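
The size of that dip can be estimated with simple arithmetic. The sketch below assumes the "560x4" in the caption means roughly 560 IOps from each of the four worker VMs on the failed host; that reading is an interpretation, and the function name is hypothetical.

```python
TOTAL_IOPS = 18_000      # workload rate from the test plan
IOPS_PER_WORKER = 560    # assumed meaning of "560" in the caption
WORKERS_LOST = 4         # worker VMs that resided on host 3


def iops_after_host_loss(total_iops: int, workers_lost: int, iops_per_worker: int) -> int:
    """Estimate the remaining IOps once the workers on a failed host stop generating load."""
    return total_iops - workers_lost * iops_per_worker


print(WORKERS_LOST * IOPS_PER_WORKER)                                      # 2240
print(iops_after_host_loss(TOTAL_IOPS, WORKERS_LOST, IOPS_PER_WORKER))     # 15760
```

Under that assumption, the expected dip is about 2,240 IOps, leaving roughly 15,760 IOps until the worker VMs are restarted and generating load again.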

Verification in vCenter screen scrape (below) that host 3 was down:

Verification in vCenter screen scrape (below) that HA events of restarting VMs on another host are happening:

Verification HA events of restarting VMs on another host

The Grafana screen scrape shows how the object timer kicked in as expected (after 5 minutes) to move/sync data from the host being offline (below):

Object timer kicked in after 5 minutes to move/sync data from host being offline

Then, the VxRail screenshot below shows the re-syncing of objects in vSAN:

The Grafana screenshot below shows the gap in reporting when we had to restart the test so that the Vdbench worker VMs could generate load again and bring the workload back to 18K IOps.

The screenshot (below) shows the ESXi host booting back up and going through the process of syncing the data with vSAN:

ESXi host booting up and syncing data with vSAN

Then, the Grafana screenshot (below) shows the spike when the host rejoins the vSAN cluster and starts syncing data:

Spike when the host rejoins the vSAN cluster and starts syncing data

The VxRail screenshot (below) shows the object timer when the vSAN host is rejoined with the cluster:

Object timer when vSAN host rejoined with the cluster

Then, the Grafana screen scrape (below) shows what happened during the re-sync process. Displayed below is what happened when the host rejoined the cluster and bandwidth consumption was throttled:

The host rejoined the cluster and throttled the bandwidth consumption

This was the end of the HA Test. The VxRail screenshot (below) confirms everything is back to green after the re-sync is complete:

My overall observations and thoughts in writing this ATC Insight:

If you made it all the way to the bottom of this read, you may be asking yourself "Why does this matter to me?" This is a good question! This whole ATC Insight is intended to give you a glimpse of what vSAN (and specifically VxRail) can really do with "REAL" testing scenarios we have used to put VxRail through its paces. These tests give glimpses into what you can expect when bad things happen.

The overall testing effort around VxRail went really well. The specific testing around the High Availability (HA), vSAN, and vCenter performed as expected. This testing really showed me that Hyper-converged Infrastructure (HCI) has continued to grow in maturity.

This is just one testing effort out of hundreds of lab efforts we handle each year in the Advanced Technology Center (ATC). If you are looking to dive deep into how a product reacts with your workloads please reach out to your account team to schedule a Proof of Concept (POC) with the ATC Lab Services Team.

Test Plan/Test Case

VxRail Hardware Specifics

8 x VxRail P570F
384 GB of RAM Per Server
Intel(R) Xeon(R) Platinum 8160M CPU @ 2.10GHz x 2
2 x NVMe 745 GB drives for Cache - 6 x 1.75 TB SAS SSDs for Capacity
4 x 25 Gb ports (2 dedicated to vSAN only - 2 dedicated to vMotion, VM, and ESXi Mgmt traffic)
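
For context, the raw drive totals for the cluster follow directly from the list above. The sketch below only multiplies the listed figures; it says nothing about usable capacity after vSAN overhead or storage policy, which the article does not state, and the constant names are illustrative.

```python
NODES = 8
CAPACITY_DRIVES_PER_NODE = 6
CAPACITY_DRIVE_TB = 1.75
CACHE_DRIVES_PER_NODE = 2
CACHE_DRIVE_GB = 745

# Raw totals across the whole cluster.
raw_capacity_tb = NODES * CAPACITY_DRIVES_PER_NODE * CAPACITY_DRIVE_TB
raw_cache_gb = NODES * CACHE_DRIVES_PER_NODE * CACHE_DRIVE_GB

print(raw_capacity_tb)  # 84.0
print(raw_cache_gb)     # 11920
```

That is 84 TB of raw capacity-tier SSD and 11,920 GB of raw NVMe cache across the eight nodes.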

During the resiliency testing, the vSAN object timer was set to 5 minutes in order to speed up our tests (By default the object repair timer in vSAN is set for 60 minutes). The ATC Insight section shows a play-by-play of the results we gathered from our tests.

Resiliency Tests

Deploy Clones
Pull Capacity Drive put back into system before 5 minutes
Snapshot Test
vMotion Test
Pull Capacity Drive leave out for more than 5 minutes
Pull Cache Drive
Pull host with HA enabled

Test Tools

**Depiction of Graphite/Grafana being used in testing.**

Vdbench

Vdbench is an I/O workload generator for measuring storage performance and verifying the data integrity of direct-attached and network-connected storage. The software is known to run on several operating platforms. It is an open-source tool from Oracle. To learn more about Vdbench you can visit the wiki HERE.

Graphite/Grafana

A Graphical User Interface (or GUI) that we use in the Advanced Technology Center (or ATC) to visually depict the results data that we derive in our compute and storage lab efforts. To learn more about this product you can go HERE.