Speed vs. legacy: NVMe and SCSI storage fabrics compared
The networked storage world has historically been simple: you either need block or file, determined by the upstream application requirements, and from there, the storage team probably has some standardized offerings. File storage is typically some version of SMB or NFS, all over IP. You can choose between iSCSI or Fibre Channel if you need block storage. What both block transports have in common is the storage communication protocol; they both use the same SCSI, just encapsulated differently.
Sure, there are other networked storage protocols, like ATA-over-ethernet, and IP-over-FC was a thing, but block storage and SCSI went hand-in-hand for the longest time. The advent of solid-state media necessitated something more efficient with improved queueing to drive SSDs to their potential. The reasons behind this have been previously covered.
That something better is the NVMe protocol, which has expanded from within the system itself to being networked via IP or Fibre Channel. NVMe over IP is delivered via RDMA with iWARP or RoCE, or without RDMA through NVMe over TCP.
Across WWT's customer base, FC has a large footprint. However, some of these customers have expressed interest in standardizing IP as the storage fabric in the future. The drivers behind this are the adoption of 25Gb/s networks and NVMe over IP, namely TCP. They have been curious about the marketing claims behind NVMe over TCP and whether it is all it's cracked up to be.
With all the previous experimentation culminating in this article, let's see if we can answer that question.
The setup for this environment consists of the following:
- 2x Dell R7525 servers
  - 2x 32-core AMD Epyc 7543 CPUs, each (32C/64T, 2.8-3.7GHz)
  - 256GB RAM
  - Dual port 32Gb FC HBA
  - Dual port 25Gb Ethernet adapter
  - ESXi 7.0u3
- 2x Dell S5248F-ON ethernet switches
- 2x Cisco MDS-9710 Fibre Channel switches
  - 48-port 32Gb/s line cards used with 32Gb/s SFPs
- 1x Dell PowerStore 5200T
  - Configured as block-optimized
  - 18x 7.68TB NVMe drives
  - 2x 4-port 32Gb/s FC (IOM1)
  - 2x 4-port 25Gb/s IP (mezz)
  - 2x 2-port 100Gb/s IP (IOM0) - not used for this testing
The goal was to test storage protocols only; the servers, VMs and fabrics remained the same, with only the active protocols changing. To that end, IP and FC traffic were isolated to the S5248F-ON switches and the MDS-9710s, respectively. Neither the hosts nor the array were configured to route NVMe traffic beyond their dedicated subnets.
For IP traffic within ESX, the goal was to utilize the two uplinks similarly to FC multipathing, so two VLANs were used to create A/B traffic. To support dual software NVMe-over-TCP initiators (why two?), one vmkernel port was created per VLAN. Finally, vDS policies were used to set one physical uplink as primary per VLAN: VLAN 1 used uplink port one as primary and uplink port two as standby, while VLAN 2 used port two as primary and port one as standby. It looks something like this:
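In code terms, the mirror-image active/standby policy can be modeled roughly as below. This is a minimal sketch for illustration only: the class, the function and the `uplink1`/`uplink2` names are hypothetical and are not vSphere API objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TeamingPolicy:
    """Active/standby uplink order for one VLAN (hypothetical model)."""
    vlan: str
    active: str
    standby: str

    def uplink_in_use(self, failed: frozenset = frozenset()) -> Optional[str]:
        """Return the uplink carrying traffic, given a set of failed uplinks."""
        for uplink in (self.active, self.standby):
            if uplink not in failed:
                return uplink
        return None  # both uplinks down: no path on this VLAN


# Each VLAN prefers a different physical port, which creates the A/B split.
POLICIES = [
    TeamingPolicy(vlan="VLAN 1", active="uplink1", standby="uplink2"),
    TeamingPolicy(vlan="VLAN 2", active="uplink2", standby="uplink1"),
]


def paths(failed=frozenset()):
    """Map each VLAN to the uplink it would use with the given failures."""
    return {p.vlan: p.uplink_in_use(frozenset(failed)) for p in POLICIES}


if __name__ == "__main__":
    print(paths())             # healthy: one VLAN per uplink
    print(paths({"uplink1"}))  # uplink1 down: both VLANs share uplink2
```

In the healthy state each VLAN has a physical port to itself, much like separate FC fabrics. After a single uplink failure, both VLANs keep running on the surviving port.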
You have probably noticed we have a speed mismatch between FC and IP. From the last performance test, where we ran 2x10Gb/s IP against 2x16Gb/s FC, we saw 27% less performance despite the 37.5% bandwidth difference.
This time, with 25Gb/s IP versus 32Gb/s FC, it's a 22% speed mismatch in FC's favor. All things being equal, we'd expect only a 15% difference this time around. Unfortunately, that SLOB test and this one use different IO profiles, so not all things are equal. The important takeaway is that we have a known speed mismatch of 22%. Any performance difference of less than 22% is a win for the IP protocol.
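The arithmetic behind those percentages can be checked with a short sketch. The function names are mine, and scaling the earlier 27% gap linearly with the bandwidth gap is an assumption about how the roughly 15% expectation was derived, not something the earlier test established.

```python
def bandwidth_deficit_pct(slow_gbps: float, fast_gbps: float) -> float:
    """Percent by which the slower link trails the faster one."""
    return (fast_gbps - slow_gbps) / fast_gbps * 100


def scaled_expectation_pct(prev_perf_gap: float, prev_bw_gap: float,
                           new_bw_gap: float) -> float:
    """Scale a previously observed performance gap linearly with the
    bandwidth gap (a simplifying assumption)."""
    return prev_perf_gap / prev_bw_gap * new_bw_gap


previous = bandwidth_deficit_pct(10, 16)  # last test: 10Gb/s IP vs 16Gb/s FC
current = bandwidth_deficit_pct(25, 32)   # this test: 25Gb/s IP vs 32Gb/s FC
expected = scaled_expectation_pct(27, previous, current)

if __name__ == "__main__":
    print(f"previous bandwidth gap: {previous:.3f}%")  # 37.500%
    print(f"current bandwidth gap:  {current:.3f}%")   # 21.875%
    print(f"scaled expectation:     {expected:.2f}%")  # 15.75%
```

The current gap rounds to the 22% used throughout this article, and the scaled expectation lands near the 15% figure quoted above.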
Four virtual machines were used, all running Red Hat Enterprise Linux 9.2 with the latest patches at the time of testing. These were spread with two VMs per physical host and were configured as follows:
- 16 vCPU, each
- 32GB RAM, each
- 512GB local drive as sda, running from NVMe/TCP on an array not used for performance testing
For storage, the array was filled to 65% so that we make the array work to arrange blocks rather than giving it a completely blank slate. Those fill volumes were put aside, and twenty new volumes were allocated to each guest. When using datastores, there is a 1:1 mapping between datastores and drives allocated to the guests.
Last, vdbench was used to generate workload. It ran with eight threads for 4KB, 16KB, 32KB and 64KB block sizes at 100% read, 100% write and 50% read/50% write. A ten-second warmup allowed IO to stabilize, and each iteration was run for five minutes. Unless otherwise noted, all of a protocol's runs were averaged together to produce a single per-protocol number.
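To make the size of that test matrix concrete, here is a small sketch that enumerates the combinations and the averaging step. It is not a vdbench parameter file; the function names are hypothetical, and treating the ten-second warmup as additional to the five-minute run is my assumption.

```python
from itertools import product
from statistics import mean

BLOCK_SIZES_KB = (4, 16, 32, 64)
READ_PCTS = (100, 0, 50)  # 100% read, 100% write, 50/50
THREADS = 8
WARMUP_S = 10             # assumed to be on top of the run time
RUN_S = 5 * 60


def build_matrix():
    """Every block size paired with every read/write ratio."""
    return [
        {"block_kb": bs, "read_pct": rd, "threads": THREADS}
        for bs, rd in product(BLOCK_SIZES_KB, READ_PCTS)
    ]


def wall_clock_minutes(matrix):
    """Total time for one pass through the matrix for a single protocol."""
    return len(matrix) * (WARMUP_S + RUN_S) / 60


def protocol_average(iops_by_run):
    """Collapse all runs for one protocol into a single number."""
    return mean(iops_by_run.values())


if __name__ == "__main__":
    matrix = build_matrix()
    print(len(matrix), "combinations per protocol")
    print(wall_clock_minutes(matrix), "minutes per protocol pass")
```

Four block sizes at three read/write ratios give twelve data points per protocol, which is the set the per-protocol averages are built from.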
Tests were run as follows:
- Fibre Channel
  - SCSI datastores to ESX, guests using pvscsi controllers
  - NVMe datastores to ESX, guests using NVMe controllers
- iSCSI direct to the guests using the Linux software initiators
- NVMe over TCP
  - Direct to the guests using Linux software initiators
  - NVMe datastores using dual ESX software initiators
A final note for readers of my previous articles: these results are not directly comparable with my past testing. We're using a new array this time (PowerStore 5200 versus the previous PowerStore 9000) with different core counts and speeds, memory amounts and drive counts.
FC-SCSI vs. FC-NVMe
First, we'll compare two datastore-based FC protocols, both utilizing the native protocol guest controllers (paravirtual SCSI for FC-SCSI, NVMe controller for FC-NVMe). FC-SCSI is widely deployed and well-understood, thus serving as our baseline for comparing FC-NVMe.
Looking at the average of all block sizes side-by-side, 12,000 additional IOPS from FC-NVMe is nothing to write home about. Let's look into some of the individual block sizes and read/write ratios and see if anything stands out.
While there are some minor differences, the performance difference between FC-SCSI and FC-NVMe is minimal. The 100% write scenarios, particularly at 4KB, show SCSI performing noticeably better. Before you walk out on these mediocre results, the following chart is a little more interesting.
Looking at CPU utilization within the guests, the NVMe controller is a lot more efficient. For roughly the same number of IOPS, I need 26-38% fewer VM compute resources to accomplish the same tasks.
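One way to express that kind of efficiency claim is CPU cost per thousand IOPS. The sketch below shows the normalization only. The input values are made-up placeholders, not measurements from this testing, and the function names are hypothetical.

```python
def cpu_per_kiops(cpu_pct: float, iops: float) -> float:
    """Guest CPU percent consumed per 1,000 IOPS."""
    return cpu_pct / (iops / 1000)


def cpu_savings_pct(baseline: float, candidate: float) -> float:
    """Percent reduction in CPU cost of the candidate versus the baseline."""
    return (baseline - candidate) / baseline * 100


# PLACEHOLDER inputs for illustration only, NOT results from this article.
pvscsi_cost = cpu_per_kiops(cpu_pct=40.0, iops=200_000)
nvme_cost = cpu_per_kiops(cpu_pct=28.0, iops=200_000)
savings = cpu_savings_pct(pvscsi_cost, nvme_cost)

if __name__ == "__main__":
    print(f"placeholder savings: {savings:.0f}%")
```

Normalizing by IOPS matters when the two controllers do not land on exactly the same throughput, since raw CPU percentages alone would then understate or overstate the difference.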
Ethernet is ubiquitous in data centers. For many, it carries client/server data, and that's it. The rise of 10Gb/s ethernet saw more widespread adoption of ethernet-based storage, both block and higher-performance file. As mentioned at the top, with customers refreshing to 25 and 100Gb/s spine/leaf ethernet fabrics, some are looking to completely displace their Fibre Channel fabrics in favor of IP storage protocols.
The test configuration consists of:
- 2x 25Gb/s IP ports
- 2x Dell 5248 25Gb/s switches
- 4x dedicated 25Gb/s ports - PowerStore system bond ports were not used for testing
- 2x dedicated VLANs to create A/B traffic.
- vSphere Distributed Switch policies were used to pin a VLAN to an uplink/vmkernel port
- 3x virtual ethernet adaptors, MTU of 9000
- VLAN A - this comes into play with the guest-based initiators
- VLAN B - this comes into play with the guest-based initiators
- For NVMe over TCP, the ANA iopolicy was set to round-robin. This defaults to numa, which does not equally utilize the ethernet adaptors.
This configuration was chosen to create traffic segmentation, similar to a best practice shared-nothing FC setup.
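For the in-guest Linux initiators, the iopolicy mentioned in the list above is exposed through sysfs under `/sys/class/nvme-subsystem/`. The following is a minimal sketch of checking and changing it, assuming native NVMe multipathing is enabled in the guest. The helper names are mine, writing requires root, and this only changes the running value.

```python
from pathlib import Path

SYSFS_ROOT = Path("/sys/class/nvme-subsystem")


def read_iopolicies(root: Path = SYSFS_ROOT) -> dict:
    """Map each NVMe subsystem name to its current multipath iopolicy."""
    return {
        f.parent.name: f.read_text().strip()
        for f in sorted(root.glob("nvme-subsys*/iopolicy"))
    }


def set_iopolicy(policy: str = "round-robin", root: Path = SYSFS_ROOT) -> list:
    """Write the policy to every subsystem that differs from it.

    Returns the names of the subsystems that were changed.
    """
    changed = []
    for f in sorted(root.glob("nvme-subsys*/iopolicy")):
        if f.read_text().strip() != policy:
            f.write_text(policy)
            changed.append(f.parent.name)
    return changed


if __name__ == "__main__":
    # Read-only by default; call set_iopolicy() explicitly to change anything.
    print(read_iopolicies())
```

The `root` parameter exists so the helpers can be pointed at a scratch directory for a dry run before touching a real guest.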
For our first test with ethernet-based storage protocols, we'll start with Fibre Channel's younger cousin, iSCSI. As with FC-SCSI, this is a known quantity amongst our customer base, which we'll use as the baseline for the next-gen protocols. We again ran two configurations: guest-based initiator and ESXi software initiator. For both initiators, vSphere monitoring was utilized during testing to verify traffic balancing across both uplink ports.
As an average of all block sizes and read/write ratios, the guest initiator performs 17.6% better. I'd also note that the iSCSI datastore number is 24% lower than the FC-SCSI datastore number above. Breaking those numbers down into some of their components, we get the following:
The datastore results were so underwhelming at 4KB that I ran the test three times to verify my results. You'll notice a meaningful disparity in performance between the two, with the guest initiator being significantly more efficient at small block sizes. As the write percentage and block size increase, datastores start to shine.
NVMe over TCP
Like iSCSI, there are two ways to get NVMe over TCP volumes to a VMware guest. Virtual adaptors can be added to the ESX server configuration, or software initiators are available for Linux virtual machines. Does one perform better than the other? First, we'll look at the two NVMe over TCP initiators side-by-side before stacking those up against iSCSI.
Within vSphere, each ESXi server has two virtual NVMe over TCP adaptors, with one for each VLAN. As noted in the test setup above, virtual distributed switch policies were used to equally utilize my resources, pinning one VLAN to each adaptor. Also, in the case of datastores, guest NVMe controllers were used to match the guest protocol with the underlying storage protocol.
When using the in-guest initiators, the A/B traffic was created with the two dedicated vNICs, mapping to the A/B VLANs in ESXi.
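As a rough sketch of what pinning one connection to each fabric can look like from inside a Linux guest, the snippet below builds `nvme connect` command lines with nvme-cli's `--host-traddr` option. Every address and the NQN are placeholders (documentation IP ranges), not values from this environment, and the commands are only assembled here, never executed.

```python
def nvme_connect_cmd(target_ip, host_ip, subsys_nqn, port=4420):
    """Build an nvme-cli connect command bound to one local address."""
    return [
        "nvme", "connect",
        "--transport=tcp",
        f"--traddr={target_ip}",     # array port on this fabric
        f"--trsvcid={port}",         # 4420 is the usual NVMe/TCP I/O port
        f"--host-traddr={host_ip}",  # guest vNIC address on the same fabric
        f"--nqn={subsys_nqn}",
    ]


# PLACEHOLDER values for illustration only.
FABRICS = {
    "A": {"host_ip": "192.0.2.11", "target_ip": "192.0.2.101"},
    "B": {"host_ip": "198.51.100.11", "target_ip": "198.51.100.101"},
}
NQN = "nqn.2014-08.org.example:placeholder-subsystem"

COMMANDS = {
    name: nvme_connect_cmd(f["target_ip"], f["host_ip"], NQN)
    for name, f in FABRICS.items()
}

if __name__ == "__main__":
    for name, cmd in COMMANDS.items():
        print(f"fabric {name}:", " ".join(cmd))
```

With one connection per vNIC, the kernel sees two paths to the same subsystem, and the multipath iopolicy decides how IO is spread across them.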
You can see these are basically the same number or close enough that it doesn't really matter. However, just like before, we may see something interesting if we look at the block size extremes and read/write data points.
Here, we see the two mostly performing the same. 4KB 100% read has datastores outperforming the Linux initiator by eight percent, but most results are within a few percentage points of each other. We haven't covered any mid-block sizes yet, but the relationships flip there: the guest initiators seem to handle these better. Even so, a peak difference of 8% in one block size and read percentage combo out of twelve doesn't move the needle much for me.
iSCSI versus NVMe over TCP
You have seen the numbers already, but let's look at the two side by side.
With this chart, we can see a definite advantage to updating your datastores to NVMe over TCP.
Tying It Together
For our customers wondering if IP can replace FC, let's examine some final comparisons.
This paints an interesting picture. Earlier, we set the bar at the 22 percent link-speed mismatch between FC and IP and wondered how much of it would show up in the results. What we actually see is that NVMe over TCP datastores are only 3% behind FC-SCSI datastores. Even comparing FC-NVMe to NVMe-over-IP, we see only a 6% difference!
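Putting the gaps quoted in this article next to the link-speed threshold makes the comparison explicit. This is just a sketch using the rounded percentages from the text, with a hypothetical `verdict` helper.

```python
LINK_GAP_PCT = (32 - 25) / 32 * 100  # 25Gb/s IP vs 32Gb/s FC

# Rounded gaps as quoted in the article text.
OBSERVED_GAPS_PCT = {
    "NVMe over TCP datastores vs FC-SCSI datastores": 3,
    "NVMe over IP vs FC-NVMe": 6,
    "iSCSI datastores vs FC-SCSI datastores": 24,
}


def verdict(gap_pct: float, threshold: float = LINK_GAP_PCT) -> str:
    """Compare an observed performance gap with the link-speed gap."""
    if gap_pct < threshold:
        return "smaller than the link-speed gap"
    return "larger than the link-speed gap"


if __name__ == "__main__":
    for name, gap in OBSERVED_GAPS_PCT.items():
        print(f"{name}: {gap}% is {verdict(gap)}")
```

By this yardstick both NVMe comparisons come in well under the link-speed gap, while iSCSI datastores fall just outside it.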
As stated in the section above, if you're using iSCSI datastores it makes sense to migrate to NVMe over TCP from a performance perspective. However, the picture looks to be roughly the same between NVMe over TCP and either flavor of Fibre Channel.
If you are a current Fibre Channel user and are happy, keep using it. The protocol is still under active development, with the 128GFC specification complete and a horizon to 256GFC in the future. You could move to FC-NVMe and a more lightweight guest storage controller, continue using a purpose-built storage network with dedicated shared-nothing A/B fabrics and save some CPU cycles. We've tested these two configurations recently and there can be a measurable benefit, as with all things in IT, depending on the workload.
For customers looking to utilize a single data center data fabric that also works in public cloud providers, Ethernet offers that today. We've discussed the performance benefits of NVMe over TCP. While the major storage OEMs and other industry heavyweights are putting effort into growing its adoption, it isn't supported everywhere yet.
Linux and VMware fully support NVMe over TCP; these are significant parts of the enterprise operating system base, but others like AIX and Windows aren't small either. Neither offers a native NVMe over TCP initiator just yet, meaning they will continue to rely on iSCSI or Fibre Channel for now.
Our customers evaluate the myriad technology offerings on the market and choose a blend that fits their environmental needs: cost, manageability and performance. This crucial decision-making process is where WWT's technical teams and I step in, offering best-in-class service and expertise to guide our customers to make informed, confident choices.
Our dedicated assistance ensures that you find the most suitable and efficient storage fabric options tailored to your specific needs. Let us empower you to make the optimal decisions for your storage fabric needs, ensuring seamless integration and peak performance.
Feel free to contact me directly or your WWT representative if you want to dive deeper into your storage fabric options.