
VMware released vSAN 7.0 Update 2 earlier in March. If you haven't read through the long list of new features, you can find it on VMware's blog. A lot of great changes shipped as part of this version, and one of them is flying under the radar but deserves a spotlight: enhanced data durability during unplanned failures.

TLDR: When an unplanned failure happens on a vSAN cluster, vSAN now creates a temporary object that receives replica writes for the affected virtual machines, so the latest writes stay redundant.

On the surface, this may not look like a big deal, but as far as the safety of data is concerned, this is an important change. The behavior builds on a feature introduced in 7.0 U1 for hosts entering maintenance mode; as of 7.0 U2, it also applies to unplanned outages.

Disclaimer: This article assumes a basic understanding of VMware vSAN, which you can find with our VMware vSAN primer and Dell EMC VxRail primer.

Back in time – the old vSAN

Let's look at what used to happen with vSAN components in prior versions. Below is a 5-node vSAN cluster with 2 disk groups per host and an FTT=1 RAID-1 SPBM policy assigned to a virtual machine. This cluster is currently healthy.

Image 1: Healthy vSAN cluster

The question is, what happens when an unexpected host failure occurs that affects this particular VM?

Fun and misunderstood fact: Components have two failure states, Degraded and Absent. A Degraded component triggers a rebuild immediately, regardless of the vSAN repair timer (60 minutes by default). An Absent component waits for the configured timer to expire before rebuilding, on the assumption that the component is likely to come back online.
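To make the difference between the two states concrete, here is a minimal sketch of the timing rule described above. It is a toy model, not vSAN code: the names `ComponentState`, `DEFAULT_REPAIR_DELAY_MIN` and `minutes_until_rebuild` are made up for illustration, and the only value taken from the text is the 60-minute default timer.

```python
from enum import Enum


class ComponentState(Enum):
    """Hypothetical labels for the two failure states described above."""
    DEGRADED = "degraded"  # rebuild starts right away
    ABSENT = "absent"      # rebuild waits for the repair timer


# Default timer from the text; the real value is configurable per cluster.
DEFAULT_REPAIR_DELAY_MIN = 60


def minutes_until_rebuild(state, repair_delay_min=DEFAULT_REPAIR_DELAY_MIN):
    """Return how long this toy model waits before starting a rebuild."""
    if state is ComponentState.DEGRADED:
        # Degraded components ignore the timer entirely.
        return 0
    # Absent components wait in case the component comes back online.
    return repair_delay_min


if __name__ == "__main__":
    for state in ComponentState:
        print(state.name, "->", minutes_until_rebuild(state), "minutes")
```

An unexpected host failure like the one in this article falls under the Absent case, which is why the wait window matters for the rest of the discussion.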

In older versions of vSAN, a rebuild of the component would start after 60 minutes. However, new writes from the virtual machine during this wait and rebuild cycle would only be written to the one available component (Component 2).

Image 2: Failed host behavior

This is important to understand. If the failed host comes back online, vSAN decides whether to continue the rebuild or to synchronize the changes made since the failure. If something happens during this time to the host or disk group holding the remaining component (Component 2), customers can end up in a data loss scenario, because "Component 2" was the only location that new data was written to during the outage.

Image 3: Component 2 failure

The new vSAN – Enhanced data durability

As of 7.0 U2, vSAN handles unexpected outages differently: it creates a temporary component to keep new incoming writes redundant across the cluster. Taking the same scenario as above:

  • Image 1: Normal and healthy vSAN cluster
  • Image 2: Host goes offline

At this point, vSAN automatically creates a temporary object that allows the virtual machine's new writes to stay redundant.

Image 4: Temporary object created

As the first host in the cluster comes back online, vSAN will still calculate whether to continue the rebuild or synchronize changes since the failure. However, if something happens to "Component 2" during this window, vSAN 7.0 U2 now has a redundant copy of the new writes in the "Temporary Component," which it can merge into "Component 1."

Image 5: Data merge
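The difference between the old and new behavior can be sketched as a small simulation. This is a deliberately simplified toy model under stated assumptions, not how vSAN is implemented: writes are plain labels, the function name `simulate_outage` and its `durability_component` flag are hypothetical, and the model only asks which new writes survive if "Component 2" is lost before "Component 1" has been brought up to date.

```python
def simulate_outage(writes_during_outage, durability_component):
    """Toy model of an FTT=1 RAID-1 object while Component 1's host is offline.

    Returns the set of new writes that can still be recovered if
    Component 2 is lost before Component 1 has been resynchronized.
    """
    component_2 = set()
    # Only the newer behavior keeps a second copy of writes made in the outage.
    temporary = set() if durability_component else None

    for block in writes_during_outage:
        component_2.add(block)       # the surviving replica always gets the write
        if temporary is not None:
            temporary.add(block)     # second copy keeps the write redundant

    # Second failure: Component 2 is lost along with everything on it.
    component_2 = None

    # Component 1 returns; only writes held in the temporary component
    # are still available to merge into it.
    return set(temporary) if temporary is not None else set()


if __name__ == "__main__":
    new_writes = ["block-a", "block-b", "block-c"]
    print("old behavior:", sorted(simulate_outage(new_writes, False)))
    print("new behavior:", sorted(simulate_outage(new_writes, True)))
```

In this model the old behavior recovers none of the writes made during the outage, while the new behavior recovers all of them, which is the data loss window the temporary component is meant to close.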

The same behavior applies when hosts enter maintenance mode, where it was introduced in vSAN 7.0 U1, but it is critically important to have for unplanned outages as well. Customer data should come first, and every precaution should be taken to avoid data loss or unavailability. This is a great step forward for vSAN, and it shouldn't be overlooked when evaluating different platform behaviors.

If you're interested in learning more about vSAN or other HCI topics, reach out to your WWT account manager to schedule an HCI workshop or briefing. We also offer on-demand labs and proof of concepts.
