NetApp Makes Transitioning to Software-Defined Storage Seamless With SolidFire
In this article, we discuss our experience going through the early access program for NetApp's new eSDS (Enterprise Software-Defined Storage) based on elementOS and detail some of our findings.
One of the most rewarding and exciting aspects of our work is evaluating and helping shape products before they hit the market. Here at WWT, we happen to have an excellent relationship with NetApp and take part in many early access programs (EAP). Whether it's a new release of ONTAP, a yet unreleased hardware platform or a new software offering, we get the chance to get our hands on products before they become generally available (GA) several times per year.
About a year ago, I was sitting in a briefing with a few of my colleagues when NetApp's SolidFire group came in to present the roadmap for elementOS. As the briefing went on, we learned about all the new features soon to be released from a manageability, performance, application integration and hardware standpoint. Then, toward the end of the briefing, NetApp shared some exciting news: they were hard at work decoupling the elementOS software from the hardware they provided as the SolidFire appliance.
The difficulty of running software on just any hardware platform (especially without running a hypervisor to abstract the hardware layer) is often underappreciated. The tightly coupled and highly tested combination of hardware and software in a storage appliance gives customers a very predictable experience, from availability to performance. Because elementOS provides features like guaranteed performance through QoS policies, performance validation on third-party hardware becomes even more critical to delivering feature parity to the end customer.
Fast forward six months, and the SolidFire team at NetApp reached out to us with a prescriptive bill of materials, asking if we wanted to take part in the EAP and provide feedback on our first experience with what they dubbed SolidFire eSDS (Enterprise Software-Defined Storage). This was going to be interesting.
Testing out SolidFire eSDS
This being 2020, our first hurdle came from simply getting servers to run the software. The bill of materials NetApp provided to us was very prescriptive, and none of the components were outlandish. Still, when we ordered our hardware from HPE, there was a backlog on (of all things) the DL360 chassis, which unfortunately set us back almost seven weeks. At this point only one hardware configuration is supported for eSDS. It is based on the HPE DL360 Gen10, using 9 x 3.8TB NVMe drives, two much smaller drives as boot devices, a write-intensive NVMe x4 device for de-staging, 25Gbps networking and a sizeable amount of RAM.
At this point it's essential to discuss the shared responsibility model that is intrinsically linked to any software-defined storage (SDS) offering. When purchasing a storage appliance, the vendor or OEM is responsible for everything up to the system's configuration. This means whether it's a hard drive failure or a software bug, you call the OEM, and it becomes their responsibility to meet the Service Level Agreement (SLA) of your support contract. Easy!
When it comes to SDS, things are different. You end up sourcing your own hardware, buying licenses and support for the underlying operating system, maintaining the system's firmware, deciding how you will implement the network configuration, choosing which cables to use, and so on. Many of the responsibilities shift to the customer in an SDS model.
NetApp does a good job outlining the responsibilities of each party in this model and provides a good depth of best practices throughout the documentation to help customers make good decisions when it comes to configuring the components for which they are responsible.
The initial install experience requires a minimum level of familiarity with Linux. The documentation provides an adequate level of detail about configuring the base operating system, from partition layout to which packages to install, while providing some leeway.
The network configuration is very prescriptive throughout the documentation, but NetApp makes it clear that customers can implement it in ways that differ from the documentation, as long as they assume responsibility for the network configuration, routing, interface aggregation, etc. Anyone who follows the instructions in the deployment guide will end up with one LACP link aggregate for the management network and another for the data (iSCSI) network on each node, which provides great redundancy and performance.
Once the network configuration is completed, the next step is simply to deploy the elementOS package. NetApp focused on providing customers an experience that would facilitate large-scale deployments by including a deployment process that leverages Ansible. The Ansible role provided simply fetches the package, installs any required/missing dependencies and does the configuration of the devices that will be used as storage.
It's very easy to see how one could modify the playbook to include their own network configuration as part of the deployment to encompass every post-operating system (OS) installation task as part of a single playbook. This, combined with a kickstart server for initial OS install, could enable customers to deploy hundreds of nodes daily if they needed to.
Beyond the fairly simple installation process, the cluster creation and elementOS configuration are exactly the same as they would be when deploying an appliance. Once the initial setup was completed, we proceeded to run through a battery of tests with our 4-node cluster, including volume creation, QoS policy creation/application and testing, account creation, node addition and removal, disk removal/replacement and finally, performance testing (which I presume anybody reading this will be most interested in).
The one test that required a simple behavioral modification on the part of the operator was the disk removal and replacement. As we have established earlier, the shared responsibility model places the burden of the hardware support solely on the customer. This meant that in the case of the HPE DL360, in order to do a drive removal (per manufacturer's instructions), the stop traffic and request removal button needs to be pressed. This button is situated on the drive removal latch and can be seen in the image below as a lit green square (if you get a better picture that lit green square will look strangely like three spinning hard disk platters).
And now on to what 90 percent of readers (who haven't closed this browser tab yet) are waiting for: our performance testing. To say the least, we were very pleasantly surprised by the results.
Since we had performed a number of tests with SolidFire nodes in the past using vdbench, we chose to run similar tests in order to compare the results. There are always some differences in the environment that will skew things, but our environment was configured so that we had a 4-node elementOS eSDS cluster using bonded (LACP) 25Gbps interfaces for the iSCSI network. Our drivers had the exact same configuration (as there were 4 more nodes for a second eSDS cluster) and were sitting on the same network switches: Cisco Nexus 9Ks. This gave each driver and each node of the System Under Test (SUT) 50Gbps of bonded bandwidth. To maximize the potential parallelism, we provisioned 16 volumes (for even distribution amongst 4 nodes) on the SUT and used 32 threads per driver along with an increased queue depth of 256.
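To make the numbers in that setup concrete, here is a minimal Python sketch of the arithmetic. It assumes two 25Gbps interfaces per LACP bond (which is what yields 50Gbps), and the round-robin split is only an illustration of an even 16-over-4 distribution, not a description of how elementOS actually places volumes. The function names are hypothetical.

```python
# Figures taken from the test setup described above.
LINK_GBPS = 25        # speed of each bonded interface
LINKS_PER_BOND = 2    # assumption: two interfaces per LACP bond (2 x 25 = 50)
NODES = 4             # nodes in the eSDS cluster under test
VOLUMES = 16          # volumes provisioned on the SUT


def bond_bandwidth_gbps(link_gbps, links):
    """Aggregate bandwidth of a bond, ignoring protocol overhead."""
    return link_gbps * links


def volumes_per_node(volumes, nodes):
    """Spread volumes round-robin and return the volume count per node."""
    counts = [0] * nodes
    for volume_index in range(volumes):
        counts[volume_index % nodes] += 1
    return counts


bond = bond_bandwidth_gbps(LINK_GBPS, LINKS_PER_BOND)
print(f"Bonded bandwidth per host: {bond} Gbps (~{bond / 8:.2f} GB/s)")
print(f"Volumes per node: {volumes_per_node(VOLUMES, NODES)}")
```

Keep in mind that the bonded figure is an upper bound: LACP generally hashes each flow onto a single member link, so one iSCSI session will not exceed 25Gbps on its own. Spreading the load over many volumes and threads is what lets the traffic use both links.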
The results were eye-opening. At nominal IO rates, we saw an I/O response time (IORT) well below 1ms. In fact, when we kept the IO rate below the node rating (100K IOPS per node), we observed stable latencies around 0.45ms at 100 percent read. As part of the testing, we varied the read vs. write ratio from 100 percent read to 100 percent write in increments of 10 percent for block sizes of 4k, 8k, 16k, 32k and 128k.
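For a sense of how many combinations that sweep covers, the following Python sketch enumerates the test matrix from the ratios and block sizes listed above. It is not our vdbench parameter file, and the dictionary keys and function name are made up for illustration.

```python
from itertools import product

# Values taken from the description of the sweep above.
BLOCK_SIZES = ["4k", "8k", "16k", "32k", "128k"]
READ_PERCENTAGES = list(range(100, -1, -10))  # 100, 90, ..., 10, 0


def build_test_matrix(block_sizes, read_percentages):
    """Return one entry per (block size, read percentage) combination."""
    return [
        {"block_size": size, "read_pct": read, "write_pct": 100 - read}
        for size, read in product(block_sizes, read_percentages)
    ]


matrix = build_test_matrix(BLOCK_SIZES, READ_PERCENTAGES)
print(f"{len(matrix)} workload combinations")  # 5 block sizes x 11 ratios = 55
print(matrix[0])
```

Eleven read/write ratios across five block sizes gives 55 distinct workloads, which is why we are summarizing the results rather than posting every graph.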
Without posting a litany of graphs, what I will say is that we managed to exceed the rated performance of the nodes by up to 40 percent in some cases (100 percent read). With the largest IO sizes, we were able to generate throughput upwards of 1GB/s per node, and while keeping the IO rate under the node rated performance, we were able to get extremely consistent latency throughout all our tests. Even more impressive was our ability to deliver those kinds of results using pre-GA code.
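As a rough back-of-the-envelope check on those figures, the Python sketch below converts between the numbers quoted above. It assumes "128k" means 128 KiB and that 1GB/s means 10^9 bytes per second; the helper names are hypothetical, and the output is simple arithmetic, not measured data.

```python
RATED_IOPS_PER_NODE = 100_000  # node rating quoted above
NODES = 4                      # nodes in the cluster under test


def iops_with_headroom(rated_iops, percent_over):
    """IOPS reached when exceeding the rating by a given percentage."""
    return rated_iops * (100 + percent_over) // 100


def iops_for_throughput(gb_per_s, block_kib):
    """IOPS required to sustain a throughput at a given block size.

    Assumes 1 GB = 10**9 bytes and block sizes expressed in KiB.
    """
    return gb_per_s * 1e9 / (block_kib * 1024)


peak_per_node = iops_with_headroom(RATED_IOPS_PER_NODE, 40)
print(f"40 percent over rating: {peak_per_node:,} IOPS per node")
print(f"Across {NODES} nodes: {peak_per_node * NODES:,} IOPS")
print(f"1 GB/s at 128k needs about {iops_for_throughput(1, 128):,.0f} IOPS")
```

The last line is the useful one: at large block sizes a node reaches 1GB/s with only a few thousand IOPS, far below its IOPS rating, so throughput rather than IO rate becomes the figure to watch.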
The future of software-defined storage
All in all, this has been a very interesting experience. For a long time, customers looking for a software-defined storage solution running on commodity hardware had very few options to choose from. NetApp has managed to broaden those options by delivering a very solid product with eSDS.
Although eSDS is currently limited to one hardware platform, we can clearly see a not-so-distant future where two to three hardware manufacturers have certified configurations for running eSDS and customers can choose the flavor of the day.
As a parting thought, I would say the most important consideration for customers wanting to deploy software-defined storage is the burden of support they will be taking on. It is very important to understand that many responsibilities that traditionally lie in the OEM's court with storage appliances are transferred to the customer's, in exchange for a lower cost of acquisition.
If you would like more information about the testing we performed, feel free to reach out to your local WWT account team and they will get you in touch with our primary storage practice for a full debrief.
A huge thank you goes out to the team that supported these efforts, both on the WWT and the NetApp side. This effort would not have been possible without the help of Eric Becker, Rob Walters, Daniela Howard, Krishna Nallamothu, Jill Caughterty, Brad Katz and many others.