Ask me anything based on real-world experience with NetApp HCI
NetApp announced the NetApp HCI product lineup in June 2017, followed by the General Availability release in October. As a top global NetApp Partner, WWT has been testing, prodding and poking an early release system in our Advanced Technology Center (ATC) for the past several months.
Why does NetApp call it HCI when it's not really HCI?
I guess we’ll get right to it then. Is NetApp HCI technically hyper-converged infrastructure (HCI)? No, no it isn’t. Per the non-existent official HCI rule book that I wasn’t able to find, for something to be HCI it has to provide compute and storage resources from a single shared node. NetApp HCI doesn’t do this today since it uses separate nodes for compute vs storage. So technically, no, it isn’t HCI.
[Pro-tip for the reader: this is a good time to pause while you 5-step your way to acceptance. Wait for it… wait for it…]
“An HCI by any other name would smell as sweet” – William Shakespeare, if he were alive and worked in IT.
OK, now that we’re all past the misnomer, we can concentrate on things that matter. Here are some advantages that NetApp HCI has over “technically HCI” solutions:
- We can scale compute and storage independently. This is a big deal, especially for data center workloads that are lumpy. It’s also very useful for IT organizations that don’t have a well functioning crystal ball to predict future growth, acquisitions, etc.
- We can mix and match different storage nodes without affecting anything at the compute layer. Element OS, the powerful SDS layer that NetApp HCI is built on, fully supports mixing different node generations and models within a single SDS cluster. Think of how useful this is when it comes time to expand a cluster’s performance or storage capacity one node at a time.
- Pre-existing compute resources, virtualized or not, can leverage the storage tier within NetApp HCI. This greatly simplifies migrations into a NetApp HCI environment and allows customers to leverage existing investments. No mass migrations or lift and shift required.
- NetApp HCI eliminates the HCI “tax” for storage resources. Why license an OS/hypervisor for CPU cores that only provide SDS resources? With “technically HCI” solutions all cores have to be licensed, even if they spend all their time handling storage tasks. With NetApp HCI, only the compute nodes require OS/hypervisor licensing.
- By separating compute and storage nodes, we solve a whole bunch of CPU and memory contention issues between storage and compute demands. It’s one of the reasons NetApp HCI can guarantee storage performance regardless of what’s going on at the compute layer. They go even farther and combine this with some industry-leading QoS capabilities that you can read about here.
So TL;DR – It’s not really HCI, but in many ways it’s better.
What are my configuration options?
NetApp is keeping it simple:
- There are two node types: compute and storage.
- Each node type comes in three sizes: small, medium, and large.
- The storage nodes are rated for a guaranteed number of IOPS: 50K, 50K, and 100K for the small, medium, and large configs, respectively.
- Storage nodes come with two 10/25Gb ports. The SFP+ modules are not included, but both 10Gb and 25Gb modules are supported.
- Compute nodes come with four 10/25Gb ports. The SFP+ modules are not included, but both 10Gb and 25Gb modules are supported.
- All nodes come with two in-band 1GbE RJ45 ports and one optional out-of-band 1GbE RJ45 port.
- Network switches are not included with the solution, but any basic 10/25Gb switch that understands VLANs and LAGs will suffice.
How does it scale?
As mentioned above, the compute and storage nodes scale independently so it will be very common to have different quantities of each in the same system.
- Compute clusters scale from a minimum of two nodes to a maximum of 32. Larger clusters, multiple clusters, and/or multiple vCenter instances will work, but the simplified deployment software doesn’t support those scenarios out of the box.
- Storage nodes scale from a minimum of four nodes to a maximum of 40 nodes. NetApp hasn’t formally tested NetApp HCI past 40 storage nodes, but the underlying Element OS software has been proven at that scale and beyond with NetApp’s SolidFire systems, so there’s no inherent architectural or software limitation preventing this if there is market demand.
- Scaling can be performed one node at a time and chassis can be pre-deployed with empty slots to enable very quick/easy scaling on-demand.
If you’re closely following the math, you may have noticed that the smallest system possible is a six-node system, made up of four storage nodes and two compute nodes. A six-node HCI system isn’t exactly a great fit for most ROBO-type scenarios, which is why NetApp isn’t targeting them with this solution. Rather, because of the scalability, reliability, and feature set baked into Element OS, they’re targeting core enterprise data center and application development/hosting environments.
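For readers who like to see the math written down, here is a minimal Python sketch of the sizing rules above. It assumes the limits quoted in this post (2 to 32 compute nodes, 4 to 40 storage nodes) and the per-node guaranteed IOPS ratings of 50K, 50K, and 100K. The function and constant names are made up for illustration and are not part of any NetApp tool, and adding up the per-node guarantees to get a cluster total is our simplifying assumption.

```python
# Illustrative only: limits and ratings are the figures quoted in this post;
# the names below are made up for the example and are not a NetApp API.
GUARANTEED_IOPS = {"small": 50_000, "medium": 50_000, "large": 100_000}
COMPUTE_LIMITS = (2, 32)   # min/max compute nodes handled by the deployment software
STORAGE_LIMITS = (4, 40)   # min/max storage nodes formally tested


def check_config(compute_nodes, storage_sizes):
    """Return (problems, total_iops) for a proposed system.

    compute_nodes: number of compute nodes
    storage_sizes: list of storage node sizes, e.g. ["small", "large"]
    """
    problems = []
    lo, hi = COMPUTE_LIMITS
    if not lo <= compute_nodes <= hi:
        problems.append(f"compute nodes must be {lo}-{hi}, got {compute_nodes}")
    lo, hi = STORAGE_LIMITS
    if not lo <= len(storage_sizes) <= hi:
        problems.append(f"storage nodes must be {lo}-{hi}, got {len(storage_sizes)}")
    # Assumption: guaranteed IOPS add up linearly across storage nodes.
    total_iops = sum(GUARANTEED_IOPS[size] for size in storage_sizes)
    return problems, total_iops


if __name__ == "__main__":
    # Smallest possible system: two compute nodes plus four small storage nodes.
    print(check_config(2, ["small"] * 4))
    # Mixed storage node sizes are allowed in one cluster.
    print(check_config(3, ["small", "small", "medium", "large", "large"]))
```

An empty problem list means the proposed node counts fall inside the ranges described above; anything outside them may still work, but not through the simplified deployment path.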
How easy is it to install?
NetApp has a very simple and easy to use installer they’ve named NDE, or NetApp Deployment Engine. It comes pre-installed on the system and is accessible via a web browser once the system boots for the first time. It asks for basic inputs like credentials, hostnames, IP ranges, etc., and it does sanity checking along the way to make sure everything will work out in the end. It’s honestly so simple and easy to use that it’s kind of boring, and that’s a very good thing. A lot of thought and work went into the product to make that happen, so kudos to the folks at NetApp for getting it done. The only gaps in NDE we saw in our pre-release testing were a requirement for a DHCP server (which was resolved before GA to also support static IP addresses) and no support for tagged VLANs (which can be manually changed after NDE finishes if necessary). We are told that support for tagged VLANs is coming in a future release.
How does it perform?
To be direct: as advertised in the data sheets.
Since the storage nodes are separate from the compute nodes, there is no variability based on compute workload, which allows NetApp to guarantee the number of IOPS that each node can service. When combined with Element OS’s unique QoS capabilities, performance is not only as advertised but also predictable as nodes are added.
IOPS and his evil twin, Latency
How does it handle failure?
When we test for failure handling, or what we often refer to as RAS (reliability, availability, and serviceability), we generally look to answer two questions:
- Does a given failure result in a loss of availability, or even worse, a loss of data?
- What happens to performance during and after a failure?
These are always interesting (and sometimes fun in that “epic fail” kind of way) tests to perform because they usually involve physically pulling an individual component, making sure things continue to work, and then replacing the component and watching how things recover. Throughout the process we run a steady-state “kitchen sink” workload on the system so we can analyze the results when we’re done testing. It should be noted that we are intentionally limiting the workload demand during these tests to a relatively low level. We’re not trying to test max performance, but rather, how the system reacts to RAS testing under a steady and manageable workload.
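To give a feel for the kind of after-the-fact analysis described above, here is a rough Python sketch that splits latency samples into before, during, and after a failure window and summarizes each phase. This is not the tooling we used in the ATC; the function name is hypothetical and the sample values are made up purely to show the shape of the calculation.

```python
from datetime import time
from statistics import mean


def summarize_phases(samples, pulled_at, replaced_at):
    """Split (time, latency_ms) samples into before/during/after a failure window.

    Returns {phase: (mean_latency, max_latency)} for each phase that has samples.
    """
    phases = {"before": [], "during": [], "after": []}
    for stamp, latency_ms in samples:
        if stamp < pulled_at:
            phases["before"].append(latency_ms)
        elif stamp < replaced_at:
            phases["during"].append(latency_ms)
        else:
            phases["after"].append(latency_ms)
    return {name: (mean(vals), max(vals)) for name, vals in phases.items() if vals}


if __name__ == "__main__":
    # Made-up latency samples, not measurements from our testing.
    samples = [
        (time(10, 20), 1.0), (time(10, 22), 3.0),
        (time(10, 25), 2.0), (time(10, 30), 2.0),
        (time(10, 34), 4.0), (time(10, 36), 2.0),
    ]
    print(summarize_phases(samples, time(10, 24), time(10, 34)))
```

Comparing the "during" and "after" numbers against the "before" baseline is a simple way to answer the second question: what happens to performance during and after a failure.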
10Gb NIC Cable Pulled From a Compute Node
- We pulled the NIC at 10:24 and replaced it at 10:34.
- We received an alert in vCenter that a NIC had failed.
- As you can see, there was no increase in latency at all during the pull, and only a minor blip in latency when replacing the NIC. This likely has more to do with VMware NIC teaming than anything else, and is very similar to what we see on other HCI solutions.
- No special intervention was required after the cable was plugged back in.
10Gb NIC Cable Pulled From a Storage Node
- We pulled the NIC at 10:49 and replaced it at 10:53.
- An alert showed up in the NetApp vCenter Plugin.
- We were very impressed by how little of an effect this had on the system.
- No special intervention was required after the cable was plugged back in.
Single Drive Pulled From a Storage Node
- We pulled a drive from a single storage node at 11:22 and replaced it at 11:48.
- An alert showed up in the NetApp vCenter Plugin.
- We saw a temporary blip in latency when the drive was pulled, followed by slightly increased latency while the drive rebuild took place. The overall rebuild appears to have taken only about 15 minutes (while under load), which is quite good given the small size of the system. We know from previous tests that stand-alone Element OS systems with more nodes rebuild even faster, since it’s an everything-to-everything rebuild methodology.
- Upon re-inserting the drive, we had to manually add the drive back into the system via the NetApp vCenter Plugin. The drive showed up as a new/empty drive and once re-added it immediately went back into the overall storage capacity of the cluster.
- We didn’t see any latency changes after re-adding the drive. Element OS doesn’t need to move any data back to the replaced drive. Rather, it will begin to leverage the drive as new IO operations occur.
Simultaneous double drive pull (that time we intentionally broke an HCI solution)
Sometimes, after trying a bunch of stuff and failing to make a system fail, we go all out and do something that we know will cause a failure, just to see how well the system puts itself back together.
This is fine. I’m okay with the events that are unfolding currently.
In the case of NetApp HCI, we decided to pull two drives simultaneously to create such a situation. To understand why it creates problems, a little background is required.
Element OS uses a proprietary “Helix” data protection scheme to protect all data written to a given cluster. There’s a lot to it, but at a very basic level, Helix data protection makes two copies of every chunk of data that is written to the system and makes sure those two copies are stored on different nodes. This way any drive in any node, or even an entire node, can fail without causing any loss of data or loss of access to data. As soon as a drive or node fails, the speedy everything-to-everything rebuild process kicks off (as in the previous test) to re-create the second copies of data that were lost when the drive or node failed. It works great, and statistically it’s on par with running RAID-6 since the rebuild times are so short.
Back to our test. When we pulled two drives simultaneously, some percentage of the data blocks within the system (the ones that were unfortunate enough to have their two copies of data on the two drives we pulled) went missing. Not all data blocks, just some of them, which is precisely the kind of thing that makes shared VMFS file systems react and fail in fun and interesting ways. Unrealistic, yes, but we were determined to make the system fail one way or another, so we did.
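To make the "some, but not all, blocks went missing" point concrete, here is a toy Python simulation of a two-copy scheme. It is only a sketch: the random placement, the function names, and the four-node, six-drives-per-node layout are assumptions for illustration, not how Element OS actually places data. The only rule borrowed from the description above is that the two copies of a block always live on different nodes.

```python
import random


def place_blocks(num_blocks, nodes, drives_per_node, seed=0):
    """Assign each block two copies on drives in two different nodes.

    Random placement is an assumption made for illustration; it is not
    how Element OS actually chooses where copies live.
    """
    rng = random.Random(seed)
    placement = []
    for _ in range(num_blocks):
        node_a, node_b = rng.sample(range(nodes), 2)  # two distinct nodes
        copy_a = (node_a, rng.randrange(drives_per_node))
        copy_b = (node_b, rng.randrange(drives_per_node))
        placement.append((copy_a, copy_b))
    return placement


def unavailable_blocks(placement, pulled_drives):
    """Count blocks whose two copies both sit on pulled drives."""
    pulled = set(pulled_drives)
    return sum(1 for a, b in placement if a in pulled and b in pulled)


if __name__ == "__main__":
    # Hypothetical layout: 4 storage nodes with 6 drives each.
    placement = place_blocks(100_000, nodes=4, drives_per_node=6)
    # Drives are identified as (node, drive) pairs.
    print("one drive pulled:       ", unavailable_blocks(placement, [(0, 0)]))
    print("two drives, two nodes:  ", unavailable_blocks(placement, [(0, 0), (1, 0)]))
```

In this toy model a single drive pull never makes a block unavailable, while pulling one drive from each of two different nodes takes out a small fraction of blocks, which is the same flavor of partial data unavailability that upset VMFS in our test.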
The result: big HCI system fall down go boom.
Storage IOPS. Workload generator capped at 15K IOPS. Not a max IOPS test.
- Well, that worked. We pulled both drives at 14:23 and put them back in at 14:28.
- From about 14:23, when we pulled both drives, to 14:32, we witnessed the slow and gradual meltdown of the system.
- It was interesting that IO latency didn’t spike until VMFS truly became unhappy. This makes sense: writes were still working as expected (they were being written to the remaining drives in the cluster), and the reads that were accessing the missing data never returned, so they didn’t show up in the latency results at all.
- When we put the two drives back in, we didn’t need to re-add them to the cluster like we did when we only pulled one drive. Element OS was aware of the situation, and rather than marking both drives as dead and attempting to rebuild from non-existent second copies, it simply paused and waited for the drives to come back.
- Since we put both drives back in at the same time, Element OS didn’t initiate a rebuild at all – it just took off again like nothing ever happened. At that moment, the storage layer was ready to go again. Critically, there was no data loss, only data unavailability.
- The ESXi hosts required reboots and/or datastore rescans after the drives were replaced to get their wits back. We saw some of the workload generators take off around the 15:32 mark on the charts, with all of them coming back around 15:47.
- Interestingly, despite the latency spike when VMFS finally gave up around 14:41, latency was consistent. Consistent across a test that was designed to cause the system to fail.
It should be noted that the failure of the system in this scenario should not in any way be seen as a weakness or fault of the system. Statistically, simultaneous total drive failures are extremely rare. This was a lab-simulated test to force a failure, not an indication of a real-world concern. That being said, if you do happen to have a chaos monkey running loose in your data center, NetApp HCI’s resiliency may come in handy.
In the end, we were generally satisfied, if not impressed, by the results of our testing. Even though NetApp HCI is a 1.0 product, it’s clear its underlying Element OS is a seasoned data center product. The two-tier compute and storage architecture allows for some significant advantages over classic “technically HCI” solutions, at the expense of being able to scale down to meet small ROBO needs. The performance and RAS testing we performed in the ATC showed that the system performed and was reliable as marketed by NetApp.