The lifecycle of products follows a steady pattern: release, update, repeat. There are always new products arriving, and those products usually get updates. In the tech space, these products get release numbers to mark the iterations. If the change happens to the right of the decimal point (the x in 4.x), it's a minor update; if it happens to the left (the x in x.2), it's a major release. Sometimes, the public launch of a product iteration is so significant that it gets a name to go with it.

Today, I'll introduce you to the new era of Dell PowerFlex, version 5.0 and named Ultra. 

Dell PowerFlex Overview

If you're unfamiliar, PowerFlex is Dell's software-defined block storage system capable of massive scale-out and commensurate performance. The features that our customers find valuable are:

  • Node-based architecture: No more migrations. To me as a storage person, this is big. Migrations take time, a lot of time. Dell PowerFlex allows you to bring new nodes into the cluster, rebalance and then tell the system to remove the old ones. It will non-disruptively perform all the data movement without the hosts seeing so much as a path drop.
  • Public cloud: Currently, it's the only Dell block storage platform with a public cloud option. Additionally, utilizing a PowerFlex-native feature, Fault Sets, it can ensure data is mirrored across availability zones.
  • IP-based block storage: We're seeing an increasing adoption of Ethernet as the block storage fabric, displacing the venerable Fibre Channel; PowerFlex is IP-only.
  • Performance: While I don't think our customers all need millions of IOPS and 0.4ms latency, PowerFlex has the mechanisms in place to deliver them if needed. Perhaps more importantly, the platform's ability to ensure equal utilization of available resources, including the nodes' Ethernet uplinks, is impressive. Furthermore, the client (SDC) natively multipaths across the client's available ports and, unlike Fibre Channel or iSCSI, the SDC connects to all of the storage nodes, reaching out directly to the node containing the needed block (see the sketch after this list).
  • Numerous hypervisor integrations: Dell PowerFlex has some tight integrations with Nutanix, ESX, and Azure Local. Also, because it's a flexible block storage platform, other hypervisors like Proxmox/KVM work as well, albeit without any tighter product integrations.
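
To make that last point about direct client-to-node I/O concrete, here's a minimal sketch in Python. It is purely illustrative and not PowerFlex's actual layout logic: the idea is simply that a client which knows the cluster layout can compute which node owns a given chunk and send the I/O straight there, spreading work across every node and uplink.

```python
# Conceptual sketch (not PowerFlex source): a client that knows the cluster's
# layout can send each block I/O straight to the owning node, instead of
# funneling everything through a single target.

CHUNK_SIZE = 1024 * 1024  # assume a 1 MiB distribution granularity for illustration

class Client:
    def __init__(self, nodes):
        self.nodes = nodes  # storage node addresses, e.g. ["node-a", "node-b", ...]

    def owner_of(self, volume_id: str, offset: int) -> str:
        """Map a (volume, offset) pair to the node that holds that chunk."""
        chunk = offset // CHUNK_SIZE
        # A real system uses a cluster-maintained layout map; a hash is enough
        # here to show that placement is deterministic and spread across nodes.
        return self.nodes[hash((volume_id, chunk)) % len(self.nodes)]

    def write(self, volume_id: str, offset: int, data: bytes) -> None:
        node = self.owner_of(volume_id, offset)
        print(f"send {len(data)} bytes for {volume_id}@{offset} directly to {node}")

client = Client(["node-a", "node-b", "node-c", "node-d"])
client.write("vol1", 0, b"x" * 4096)
client.write("vol1", 5 * CHUNK_SIZE, b"x" * 4096)
```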

Dell PowerFlex was designed to be lightweight, from the storage node software (SDS) to the metadata manager (MDM) to the client (SDC). It used the most lightweight method of protecting data within the cluster: mirroring. Like all things in engineering, trade-offs were made. PowerFlex traded overall storage efficiency for speed, and it was very successful. Since it debuted more than thirteen years ago, though, the amount of processing power available in CPUs has grown tremendously. A common single-socket system at the time had 4, 6 or 8 cores, compared to the current 24, 32, 64 and up to 128 cores, if needed (note: 192-core packages are also possible, but those use density-optimized cores). Furthermore, memory bandwidth and capacity have increased, and PCI Express bandwidth has doubled twice in that time frame.

Dell's engineers started working on a significant re-imagining of the underlying services several years ago. In talking with them during the beta, they said Dell PowerFlex 5.0 Ultra is the culmination of five years of work. I'll get into the internals further on, but suffice to say, it's impressive. 

The release of PowerFlex 5.0 Ultra is not the end of the journey, but rather the start of the platform's new path. The rework takes advantage of the full capabilities of modern hardware, from core counts to x86 instruction sets, and readies PowerFlex for bigger things. Let's dive into some of the components of PowerFlex 5.0 Ultra:

Efficiency

We'll go into the other new pieces, but let's talk about efficiency up front. With the launch of PowerFlex 5.0 Ultra, erasure coding (EC) is now a pool option. Erasure coding is nothing new; it was invented in the 1960s and is common in file and object platforms, but it is typically not applied to high-performance, low-latency block systems. Erasure coding is a forward error correction scheme that allows data to be recovered if a component chunk is missing or damaged in transit.

Does that sound a bit familiar? RAID operates on the same principle, only it does it locally within the system itself. The number of EC parity chunks is also flexible. A 4+2 EC scheme means I have four data chunks and two parity chunks, and can thus lose two components at the same time without data loss. You could also do 4+4 and, while it's less efficient, you could lose four individual chunks simultaneously and keep running. Within the Dell stable, Dell ObjectScale and Dell PowerScale both utilize EC to protect data, but those are, you guessed it, file and object systems.
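
As a toy illustration of the forward error correction idea (my own example, not PowerFlex's EC implementation), here's single-parity recovery in Python. One XOR parity chunk can rebuild any single lost data chunk; real k+m schemes such as 4+2 or 8+2 use Reed-Solomon-style math to produce m independent parity chunks and survive m simultaneous losses.

```python
# Toy illustration of the forward-error-correction idea behind erasure coding.
# A single XOR parity chunk (RAID-5 style) can rebuild any one missing data
# chunk; schemes like 4+2 or 8+2 generalize this with Reed-Solomon-style math.

def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # four data chunks (the "4" in 4+1)
parity = xor_chunks(data)                      # one parity chunk

# Lose chunk 2, then rebuild it from the survivors plus parity.
survivors = [data[0], data[1], data[3], parity]
rebuilt = xor_chunks(survivors)
assert rebuilt == data[2]
print("rebuilt:", rebuilt)
```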

Dell is releasing two schemes to start, 2+2 and 8+2. On a raw-to-usable basis, 2+2 is about as efficient as the older mirrored medium granularity (MG) pools and roughly an 11% improvement over fine granularity (FG) pools. What it gains over both is dual redundancy, which we'll talk about in a bit, and unlike MG, it supports data reduction (compression only today). So even 2+2 ends up with better effective utilization than the old pool types: more usable capacity than FG, plus the compression that MG never had.

The 8+2 option requires more nodes to start, but makes up for it with more than double the efficiency of FG pools and still better redundancy. To put a finer point on it, if I have 100TB of physical capacity in a rack, here's how the two old and two new pool types compare.

Pool Type   Raw Capacity   Usable Capacity   Compression Rate*   Effective Capacity   Effective-to-Raw Ratio
MG          100TB          45TB              n/a                 45TB                 45%
FG          100TB          33TB              2:1                 66TB                 66%
EC-2+2      100TB          ~44TB             2:1                 88TB                 88%
EC-8+2      100TB          ~70TB             2:1                 140TB                140%

* Assumption, just for discussion purposes. Actual rate will vary based on workload.
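
If it helps to see the table's arithmetic spelled out, here's a quick calculation. The 2:1 compression ratio is the discussion assumption from the table, and the small overhead factor I apply to the EC rows (to land near the ~44TB and ~70TB usable figures) is my own rough stand-in for metadata and spare capacity, not a published Dell number.

```python
# Recomputing the comparison table above. The 2:1 compression ratio is the
# article's discussion assumption; the 0.88 overhead factor on the EC rows is
# my own approximation for metadata/spare capacity, not a Dell-published value.

RAW_TB = 100

pools = {
    # name: (usable TB out of 100 TB raw, compression ratio)
    "MG":     (45, 1.0),                          # mirrored, no data reduction
    "FG":     (33, 2.0),
    "EC-2+2": (RAW_TB * (2 / 4) * 0.88, 2.0),     # 50% data fraction minus overhead
    "EC-8+2": (RAW_TB * (8 / 10) * 0.88, 2.0),    # 80% data fraction minus overhead
}

for name, (usable, ratio) in pools.items():
    effective = usable * ratio
    print(f"{name:7s} usable ~{usable:5.1f}TB  effective ~{effective:6.1f}TB  "
          f"({effective / RAW_TB:.0%} of raw)")
```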

The net result is that you'll need less hardware and rack space, and fewer switch ports, with EC for the same amount of capacity. Also, since EC brings dual parity, overall redundancy is improved. During normal operations, the cluster can now sustain two concurrent node failures and keep running without a data-unavailability event:

  • With FG or MG, the cluster could lose only one node before redundancy was lost. Protected Maintenance Mode, where the system creates a second copy of a node's data before taking it offline for an upgrade, maintains full redundancy during the upgrade, but it seriously extends the length of an upgrade on larger clusters; serially copying 50TB takes a while. That problem is also solved now, since an EC cluster can survive two simultaneous failures, or one failure while a code upgrade is underway.

To finish this section, let's discuss the two EC storage schemes available today. My suggestion, and I think Dell would agree, is that 8+2's economics are better, so customers should prefer 8+2 where possible. If you don't need the capacity, you can acquire partially populated nodes to reduce the cost. With the performance capabilities of modern drives, a node with even five drives performs admirably. 2+2 still improves over the previous pool types, but should ideally be reserved for test/dev/QA or the smallest requirements.

Snapshots

As discussed elsewhere, the threat landscape our customers are dealing with is far scarier than it was fifteen years ago. 

In addition to the tornadoes we planned for then, the real extinction event nowadays is a cyberattack. Today's customers are building recovery strategies that include, but are not limited to (I hope), numerous daily snapshots on the primary storage array. If I take a snapshot every ten minutes and keep those for five days, that adds up to 720 snapshots; 4.x allowed for 126 snaps per source, or 21 hours' worth. Dell PowerFlex 5.0 Ultra increases that number to 1022, an eightfold increase. This brings it in line with Dell's current data services champ, Dell PowerMax.
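
For anyone who wants to play with the schedule math, here's the calculation from the paragraph above as a few lines of Python:

```python
# Snapshot schedule math: how many snapshots a schedule requires, and how much
# history a given per-source limit covers.

def snaps_needed(interval_min, retention_days):
    return retention_days * 24 * 60 // interval_min

def coverage_hours(limit, interval_min):
    return limit * interval_min / 60

print(snaps_needed(10, 5))        # 720 snapshots for 10-minute snaps kept 5 days
print(coverage_hours(126, 10))    # 21.0 hours of history under the 4.x limit
print(coverage_hours(1022, 10))   # ~170 hours (just over 7 days) under 5.0 Ultra
```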

Write acceleration

Compression/fine-granularity pools brought about a requirement for persistent memory in the FG pool nodes. It allowed Dell PowerFlex to decouple the expensive work of compressing an incoming block from the IO flow itself, improving performance. When a host performed an FG pool write, it was mirrored into PMEM in that block's primary and secondary SDS, after which it was acknowledged to the client/SDC. After that, it could make its way through the compression engine on its way to the physical disk. 

PMEM reduces drive wear and speeds up write acknowledgement, but requiring it also ruled out data reduction in public cloud deployments, where PMEM isn't available. With Dell PowerFlex 5.0 Ultra, PMEM is no longer a hard requirement for compression. You'll still have it, and appreciate the performance boost, in your own data centers, but dropping the requirement improves the efficiency of PowerFlex for Public Cloud clusters because they can now compress.
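
Here's a small conceptual sketch of that decoupling, written from the description above rather than from Dell's code: acknowledge the host once the block is persisted in two nodes' write buffers (the role PMEM plays), and run compression afterwards, off the acknowledgement path.

```python
# Conceptual sketch of write acceleration (not Dell's implementation):
# acknowledge as soon as the block lands in two nodes' persistent write
# buffers, and do the expensive compression asynchronously.

import queue, threading, zlib

class StorageNode:
    def __init__(self, name):
        self.name = name
        self.write_buffer = {}          # stands in for PMEM / a persistent log
        self.disk = {}
        self.destage_q = queue.Queue()
        threading.Thread(target=self._destage, daemon=True).start()

    def buffer_write(self, key, data):
        self.write_buffer[key] = data   # fast, persistent landing zone
        self.destage_q.put(key)

    def _destage(self):
        while True:                     # background path: compress, then hit the drives
            key = self.destage_q.get()
            self.disk[key] = zlib.compress(self.write_buffer.pop(key))

def host_write(primary, secondary, key, data):
    primary.buffer_write(key, data)     # mirror into both nodes' buffers...
    secondary.buffer_write(key, data)
    return "ACK"                        # ...then acknowledge; compression happens later

a, b = StorageNode("sds-1"), StorageNode("sds-2")
print(host_write(a, b, ("vol1", 0), b"hello" * 100))
```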

Underlying architecture

To accomplish the changes above, Dell engineers significantly redesigned the storage node services, previously called SDS. The SDS design made sense when PowerFlex was originally created as an extremely lightweight storage system, and it let the storage nodes be more-or-less independent. As more services were required and SDS interdependencies crept in, the SDS started to show its age and became a scaling bottleneck.

The storage nodes now house two services: PowerFlex Data Server (PDS) and Device Gateway Target (DGWT). This disaggregation separates the processing path (PDS), which handles caching, snapshots, data reduction, and layout, from physical device management (DGWT), which performs device I/O to the disks in a given node.

I mentioned node interdependencies previously. In 4.x and before, the storage nodes acted like independent islands, each responsible for processing its own write IOs. For example, a write IO comes from an SDC to persistent memory in the primary SDS. It's then mirrored into persistent memory in that block's secondary SDS before host acknowledgement happens. Each of those nodes was then responsible for independently compressing those blocks on their way to the drives.

As workloads grew, this duplication of work became a target for Dell engineering to optimize. And as CPU architecture changed, with more cores and memory channels reshaping NUMA layouts, the SDS had a hard time adapting and became a choke point (a choke point that still cranks out 200K+ IOPS per node, which is not insignificant). The DGWT and PDS fix that and set things up for the future. Now, the PDS on the primary node does the initial processing and compression before handing off to the DGWTs, both locally and on the partner nodes, for persistence.
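
To illustrate how I read the new split (a conceptual sketch, not Dell's implementation), here's the same write handled once by a data-path service and fanned out to device-target services for persistence:

```python
# Illustrative sketch of the PDS/DGWT split as described above: one data-path
# service compresses and lays out the block once, then hands the result to the
# device-target service on each node that must persist a piece, instead of
# every node repeating the work independently.

import zlib

class DeviceTarget:                      # stands in for DGWT: owns the drives
    def __init__(self, name):
        self.name, self.drives = name, {}

    def persist(self, key, chunk):
        self.drives[key] = chunk
        print(f"{self.name}: wrote {len(chunk)} bytes for {key}")

class DataPath:                          # stands in for PDS: processing only
    def __init__(self, targets):
        self.targets = targets

    def write(self, key, data):
        compressed = zlib.compress(data)             # compress once, up front
        chunks = self._layout(compressed)            # placeholder layout step
        for target, chunk in zip(self.targets, chunks):
            target.persist(key, chunk)               # fan out to local + partner DGWTs

    def _layout(self, blob):
        # Placeholder: evenly slice the blob; a real EC layout adds parity chunks too.
        step = -(-len(blob) // len(self.targets))
        return [blob[i:i + step] for i in range(0, len(blob), step)]

targets = [DeviceTarget(f"dgwt-{i}") for i in range(4)]
DataPath(targets).write(("vol1", 0), b"example data " * 200)
```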

During the beta, we force-terminated PDS and DGWT independently and observed the results. 

  • Pulling the rug out from under a node's PDS didn't cause any disruption, since clients simply rerouted to surviving nodes.
  • When a DGWT is terminated, the drives are gone from the cluster and a rebuild begins.

Remember, with erasure coding, two concurrent DGWTs can go away, and the cluster will keep running, albeit with no redundancy, until the rebuild is complete.

Interface

Last, but certainly not least, let's talk about interacting with the system, starting with the HTML5 interface. I have an appreciation for the work that goes into putting together something coherent; it's hugely challenging to categorize menus and make them intuitive. I broke down the recent PowerFlex interface versions like this:

  • Dell PowerFlex 3.x and before: A collection of independent components that form like a loosely coupled Voltron to serve up storage.
  • Dell PowerFlex 4.x: A single management platform that manages infrastructure and happens to give out volumes. I may be alone in this, but it felt very infrastructure-forward.
  • Dell PowerFlex 5.x: A single management platform that provisions storage and happens to manage infrastructure.

The move from 3.x to 4.x unified the management plane and really helped the look, feel and operability of the system, and the 5.0 Ultra version of PowerFlex Manager further refines things. I really like the changes Dell made to PowerFlex Manager in 5.0 because, to me, it puts the storage system first. It maintains the same overall look and feel as Dell's other storage systems. The interface delivers clear visibility into the cluster configuration, sends alerts and continues to make provisioning a snap.

Another component of the PowerFlex interface is its API. For a platform originally designed without a pointy-clicky interface, of course it has a good API. There are now significantly more metrics you can query to determine exactly what the components of a system are up to. These metrics can be pulled into a custom Grafana dashboard, or consumed with the VMs available from Dell.
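
As a rough sketch of what feeding those metrics into a dashboard might look like, the snippet below just polls a REST endpoint and prints what comes back. The management address, endpoint path, and credentials are placeholders of my own, not the documented PowerFlex API; consult the actual API reference for the real resource names and query parameters.

```python
# Minimal metrics-polling sketch for a custom dashboard. The endpoint path and
# credentials are placeholders for illustration; check the PowerFlex REST API
# documentation for the real resources.

import requests

MGMT = "https://powerflex.example.local"          # hypothetical manager address

session = requests.Session()
session.verify = False                            # lab only; use real certs in production
session.auth = ("monitor_user", "s3cret")         # placeholder credentials

def poll_metrics():
    # Hypothetical endpoint standing in for whatever metrics resource you query.
    resp = session.get(f"{MGMT}/api/metrics/cluster", timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for name, value in poll_metrics().items():
        print(name, value)
```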

WWT's experience with Dell PowerFlex 5.0 Ultra

As touched on above, WWT was part of the Dell PowerFlex 5.0 Ultra beta program and got to see the system earlier this year. Even at that stage, the level of polish was good, and we experienced no issues in our testing. We put the system through its paces, testing the new snapshot limit and the performance with and without the write cache, among other things. In short, everything worked well. The interface, first and foremost, welcomes traditional storage folks in ways I don't think PowerFlex Manager 4.x did. I've already mentioned testing the PDS and DGWT, which, to me, are significant portions of what's new.

What needs improvement? Really, the improvements I suggested were around simple workflow items and style choices. The core system worked great, which is admirable when considering the long-ish runway to launch. I tend to approach betas first from the perspective of a novice user. I look for intuitive workflows, consistent styling, and proper spelling and punctuation. My intent is to make our customers' day-to-day use of the platform better. 

The other suggestion was around snapshots. The new system supports over 1000 snapshots, and these can be managed by snapshot policies. Unfortunately, only 60 of those snapshots can come from the snapshot policies. The rest must be created manually or with some custom automated process like a script or API call. Given the system can clearly create 1022 snaps, the fix probably lives within PowerFlex Manager. It may not be a huge deal for you, and I'm told an update to this made it into the development backlog. 
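
For the snapshots the policies can't cover, the "script or API call" could be as simple as something like the sketch below, run from cron or another scheduler. The endpoint and payload are placeholders I made up to show the shape of the automation, not the documented PowerFlex snapshot API.

```python
# Rough automation sketch: create a snapshot of a source volume on a schedule.
# The endpoint and payload are placeholders, not the real PowerFlex API.

import datetime
import requests

MGMT = "https://powerflex.example.local"          # hypothetical manager address

def snapshot_volume(session: requests.Session, volume_id: str) -> None:
    name = f"auto-{datetime.datetime.now(datetime.timezone.utc):%Y%m%d-%H%M}"
    # Placeholder endpoint/payload; check the API reference for the real call.
    resp = session.post(
        f"{MGMT}/api/volumes/{volume_id}/snapshots",
        json={"name": name},
        timeout=10,
    )
    resp.raise_for_status()
    print(f"created snapshot {name} of {volume_id}")

if __name__ == "__main__":
    s = requests.Session()
    s.auth = ("snap_user", "s3cret")              # placeholder credentials
    snapshot_volume(s, "vol-0001")
```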

Conclusion

Dell PowerFlex 5.0 Ultra delivers a technically compelling evolution of the PowerFlex platform. From turn-of-the-crank improvements like the interface, to bigger things like the number of snapshots, to massive things like erasure coding, there is a lot here. It's the starting point for the next phase of the system, one that sets it up to grow into the future. Currently, it's for greenfield deployments only; features like remote replication (SDR) and in-array migration are among the upcoming capabilities the code will grow into. Since its initial announcement at Dell Tech World, this release has generated a lot of excitement from the customers I've talked with. I've witnessed leaders suggest initial testing and certification of this release as soon as possible because of its improved capabilities.
