There has been a significant marketing push toward end-to-end NVMe for the past five or more years. What is end-to-end NVMe, you ask? It's when a host communicates with the array over an NVMe fabric protocol, the array uses NVMe internally, and data is stored on NVMe drives.

The marketing behind this setup says it provides incredible performance gains, the likes of which the world has never seen. Many arrays have been all-NVMe since the marketing push began, but the lagging factor in customer adoption has been the move to NVMe transport protocols in their storage fabrics. NVMe over Fibre Channel is supported at 16Gb/s FC and faster. NVMe over TCP, backed by some industry heavyweights, looks set to start the next era of Ethernet-based storage. In virtualized environments, one missing piece for completing this end-to-end transformation lives within the VM.

For a VM to access storage, it needs a storage controller. VMware has a few flavors of SCSI controllers, and most people these days will be familiar with the LSI Logic SAS controller and the paravirtual SCSI (pvscsi) controller. As of vSphere 6.5, a guest NVMe controller is also available, supporting NVMe's huge queue count and queue depth. When we think about this end-to-end NVMe path, does the controller type make a difference?

We have compared NVMe over TCP against various protocols previously. From our Oracle testing, there was an open question about why datastore-based NVMe over TCP had such a bad showing. In that test, even with two NVMe over TCP virtual HBAs, performance was not much better than virtual iSCSI; upon further reflection, this went against logic. My theory was that VMware's NVMe over TCP implementation was inefficient or the guests' use of a SCSI controller required translation, reducing performance. This unknown cause bothered me, so I set about creating a test environment to probe this deeper.

For this test, we'll use a 32Gb Fibre Channel fabric and NVMe over Fibre Channel. The only thing we'll change is the guests' storage controller, comparing the VMware paravirtual SCSI controller against the NVMe controller. The array, fabric protocol, and volumes presented will remain the same. Here's a vCenter screenshot of where I changed the storage controller setting for a guest's hard disk.

VM hardware settings storage controller dialog
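For Linux guests such as the RHEL VMs used here, one quick sanity check after changing the controller is the disk's device name: disks behind the virtual NVMe controller appear as nvmeXnY, while disks behind a SCSI controller (pvscsi or LSI Logic) appear as sdX. The following is a minimal Python sketch of that check. The function names are illustrative and not part of any VMware or Red Hat tooling.

```python
import os
import re

# Linux naming conventions: NVMe namespaces are nvme<controller>n<namespace>,
# SCSI disks are sd<letters>. Anything else (dm-*, sr*, loop*) is "other".
NVME_RE = re.compile(r"^nvme\d+n\d+$")
SCSI_RE = re.compile(r"^sd[a-z]+$")


def classify_disk(name: str) -> str:
    """Return 'nvme', 'scsi', or 'other' based on the block device name."""
    if NVME_RE.match(name):
        return "nvme"
    if SCSI_RE.match(name):
        return "scsi"
    return "other"


def group_by_controller(names):
    """Group block device names by the controller family they imply."""
    groups = {"nvme": [], "scsi": [], "other": []}
    for name in names:
        groups[classify_disk(name)].append(name)
    return groups


if __name__ == "__main__":
    # /sys/block lists whole-disk block devices on Linux; skip it elsewhere.
    names = sorted(os.listdir("/sys/block")) if os.path.isdir("/sys/block") else []
    print(group_by_controller(names))
```

Run inside a guest, this prints which disks landed on which controller family, which makes it easy to spot a vmdk that was left behind on pvscsi.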

Environment setup

Logically, the environment looks like this:

Test environment setup diagram

The hosts and guests are built out like:

  • ESXi hosts
    • 2x Dell R7525.
      • 256GB RAM each.
      • 2x AMD Epyc 7543 CPUs (32C/64T, 2.8-3.7GHz).
      • Dual port 32/64Gb FC HBAs.
      • Dual port 25Gb Ethernet NIC.
    • ESXi 7.0u3.
    • Multipathing via ESXi HPP.
  • Guests
    • 4x RHEL 9.2 VMs with 32GB RAM and 16 vCPU each.
      • 2 VMs running per physical host.
      • Vdbench workload generation software.
  • Storage
    • 80x 100GB volumes total, with 40 going to each ESXi host.
    • 20x 79GB vmdks per guest; the capacity was chosen to keep the system from raising capacity alarms. In any case, there's a 1:1 mapping of vmdks to array volumes.
    • Thin provisioning was used; however, all blocks were filled before testing.
  • Workload 
    • Vdbench, 8 threads per VM worker.
      • 4KB - 100% read, 100% write, 50% read.
      • 16KB - 100% read, 100% write, 50% read.
      • 32KB - 100% read, 100% write, 50% read.
      • 64KB - 100% read, 100% write, 50% read.
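The layout and workload matrix above can be cross-checked with a few lines of Python. This is only a sketch built from the figures in the list; the constant and function names are my own and are not Vdbench parameters.

```python
from itertools import product

# Figures taken from the environment list above.
HOSTS = 2
VMS_PER_HOST = 2
VMDKS_PER_VM = 20
TOTAL_VOLUMES = 80
VOLUMES_PER_HOST = 40

BLOCK_SIZES_KB = [4, 16, 32, 64]
READ_PERCENTAGES = [100, 0, 50]  # 100% read, 100% write, 50% read


def vmdk_count() -> int:
    """Total vmdks across all guests; should equal the array volume count (1:1)."""
    return HOSTS * VMS_PER_HOST * VMDKS_PER_VM


def workload_matrix():
    """Every (block size in KB, read percentage) combination that was run."""
    return list(product(BLOCK_SIZES_KB, READ_PERCENTAGES))


if __name__ == "__main__":
    print("vmdks:", vmdk_count(), "volumes:", TOTAL_VOLUMES)
    for block_kb, read_pct in workload_matrix():
        print(f"{block_kb}KB, {read_pct}% read")
```

The numbers line up: 2 hosts x 2 VMs x 20 vmdks gives 80 vmdks for 80 volumes, and 4 block sizes x 3 read/write mixes gives 12 distinct workloads.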

Each workload was run for five minutes, or 300 one-second samples. Except where otherwise noted, because enterprise workloads are a mishmash of block sizes, all 3,600 samples per controller type (12 workloads at 300 samples each) were averaged into a single number. The last note before we move to the results regards latency. Latency numbers are not shown here because, at a fixed number of outstanding I/Os, IOPS is a function of latency. When reviewing the results, the latency chart matched the IOPS chart.
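To make the averaging and the latency remark concrete, here is a small Python sketch. It pools every one-second sample into one mean, and it uses Little's law (throughput equals concurrency divided by time per I/O) to show why IOPS and latency move together at a fixed thread count. The function names and any numbers passed to them are illustrative assumptions, not values from the test.

```python
from statistics import mean

SAMPLES_PER_RUN = 300  # five minutes of one-second samples (from the text)
WORKLOADS = 12         # 4 block sizes x 3 read/write mixes (from the text)


def pooled_average(runs):
    """Average every one-second IOPS sample from every run into one number."""
    samples = [sample for run in runs for sample in run]
    return mean(samples)


def iops_from_latency(outstanding_ios: int, latency_ms: float) -> float:
    """Little's law: IOPS = outstanding I/Os / latency, with latency in ms."""
    return outstanding_ios * 1000 / latency_ms


if __name__ == "__main__":
    print("samples per controller type:", SAMPLES_PER_RUN * WORKLOADS)
    # Hypothetical example: 8 outstanding I/Os at 0.5 ms each.
    print("IOPS at 8 outstanding, 0.5 ms:", iops_from_latency(8, 0.5))
```

One consequence of pooling is that small-block runs, which produce far more IOPS, pull the single average toward their results more than the large-block runs do.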

Test results

Guest storage controller average IOPS comparison

I don't think anyone will complain about a free 42K IOPS. We can break those down into reads and writes to try and determine where the advantage is. 

100% read IOPS comparison
100% write IOPS comparison

Across the spectrum of customers' storage array workloads, things skew toward reads, typically between 60/40 and 80/20 read/write. As you see in the graphs above, there is an almost 78K IOPS difference in reads, or a roughly 20% improvement. With writes, a more modest 7.5% performance improvement is still nothing to sneeze at, recalling this is an average across all block sizes. We get the following chart if we zero in on block size differences in reads.
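For readers who want to sanity-check percentages like these against chart values, the arithmetic is sketched below in Python. The helper names are my own. The implied baseline is only a back-of-the-envelope inference from the rounded "almost 78K" and "roughly 20%" figures, not a measured result.

```python
def pct_improvement(baseline_iops: float, new_iops: float) -> float:
    """Percentage gain of new_iops over baseline_iops."""
    return 100 * (new_iops - baseline_iops) / baseline_iops


def implied_baseline(delta_iops: float, improvement_pct: float) -> float:
    """Baseline IOPS implied by an absolute gain and a percentage gain."""
    return delta_iops * 100 / improvement_pct


if __name__ == "__main__":
    # Rounded figures from the text: ~78K IOPS gain, ~20% improvement in reads.
    baseline = implied_baseline(78_000, 20)
    print("implied pvscsi read baseline (approximate):", baseline)
    print("check:", pct_improvement(baseline, baseline + 78_000), "%")
```

With those rounded inputs, the pvscsi read average works out to something on the order of 390K IOPS, which is a useful scale to keep in mind when reading the per-block-size charts.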

100% read block size IOPS comparison

At smaller block sizes, the number of I/Os needed to move the same amount of data goes up, meaning more work for the processors to do. While 64KB shows no improvement, the other block sizes show a performance improvement of between 14% and 25%. If we switch to 100% write workloads, do we see the same performance disparity across the board?

100% write block size IOPS comparison

Small 4KB writes heavily favor the NVMe controller, but for the other block sizes the balance shifts, with the SCSI controller performing between 4% and 14% better.

Final thoughts

On the storage side, Storage vMotion can move VMs and their vmdks to a datastore on a new storage protocol. However, moving from pvscsi to the NVMe controller is an offline activity from a storage perspective. You can hot-add the NVMe controller, but moving any of the vmdks to it means detaching them from pvscsi and reattaching them to the new NVMe controller, which takes them offline. When and how this is implemented depends on the environment. A logical volume manager running in the guest could also be used to live-migrate the data from SCSI to NVMe, but this is time-consuming, tedious, and doesn't scale well.

Implementation details aside, this testing shows that any performance improvement will be workload-dependent. Regardless, most applications, from VDI to general-purpose and OLTP, should benefit from the newer NVMe guest storage controller. The general advice I've used in my time in the storage world is: it's a shared service and everyone needs to be good customers, so I'd look to match my protocols wherever possible.