NetApp Insight took place back in October 2019, and if you attended the event, you probably came back thinking that this year was all about Keystone, NetApp's new consumption model. You wouldn't be wrong, but one very interesting announcement that Keystone overshadowed was the introduction of NetApp's All-SAN Array, or ASA.
NetApp has its roots in the file services world, where it brought a large number of innovative features and technologies to the storage market. Over the years, NetApp pioneered a number of features that helped it grow its market share, such as snapshots, cloned copies and data deduplication, to name only a few. Later, NetApp realized that by focusing solely on file storage, it was forgoing an ever-increasing part of the market: SAN storage.
NetApp and SAN storage
It's no secret that while NetApp introduced SAN support in ONTAP, its share of SAN workloads didn't exactly jump overnight. Initially NetApp struggled to make a dent in the SAN space, and it was not until its huge investment in all-flash appliances and performance optimization, starting with ONTAP 9, that the needle moved significantly. Since then, NetApp's share of storage sold for SAN workloads has been steadily increasing.
Even with outstanding SAN performance and scalability, one challenge NetApp still faced was providing symmetric access to logical unit numbers (LUNs) through multiple controllers. This requirement comes up very frequently with customers looking for storage for mission critical applications, and it makes sense. What symmetric LUN access provides its users can be summarized under three topics:
- higher availability;
- consistent latency/performance during failover; and
- higher potential throughput.
NetApp's introduction of the ASA configuration is meant to address all three of those issues.
Working with LUNs
Historically, Data ONTAP provided hosts access to LUNs in an asymmetric fashion. Because a LUN was "owned" by a single controller in the cluster, all paths to the LUN through the owning controller were advertised as active/optimized to the clients accessing the LUN, while all paths going through controllers that did not have direct access to the LUN were advertised as active/non-optimized. This resulted in all the traffic for a given LUN going through a single controller, which meant that in the event your active/optimized path became unavailable, the host would encounter an APD (all paths down) event.
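To make the owner-based path advertisement concrete, here is a minimal Python sketch. It is only an illustration of the behavior described above, not ONTAP code: the Path class, the advertise_paths function, and the controller and fabric names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    controller: str  # controller whose front-end port this path terminates on
    fabric: str      # SAN fabric the path runs through


def advertise_paths(paths, lun_owner):
    """Label each path the way an owner-based (asymmetric) array would."""
    return {
        p: "active/optimized" if p.controller == lun_owner else "active/non-optimized"
        for p in paths
    }


# Hypothetical layout: two controllers, each reachable through two fabrics.
paths = [Path(c, f) for c in ("ctrl-a", "ctrl-b") for f in ("fabric-1", "fabric-2")]
states = advertise_paths(paths, lun_owner="ctrl-a")

for path, state in states.items():
    print(f"{path.controller} via {path.fabric}: {state}")

# Only the paths through the owning controller are preferred for I/O.
preferred = [p for p, s in states.items() if s == "active/optimized"]
print(f"{len(preferred)} of {len(paths)} paths are preferred for I/O")
```

In this toy layout, only the two paths through the owning controller are preferred, which is why all traffic for the LUN lands on one controller.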
If you think of the architecture of ONTAP and its multi-protocol support, this makes sense. You could not have a volume (the container for a LUN) with a filesystem, a permissioning scheme, etc., active on two or more nodes at a time. With the ASA, file protocols are disabled and all filesystem ownership and management is relegated to the host, so the controller now only deals with blocks on a block device. The sequencing of those blocks is done by the filesystem owner: the host.
From an availability standpoint, having all paths active and optimized at the same time means that the hosts do not need to worry about finding a new path should their active/optimized path suddenly become unavailable. The hosts are already aware of, and using, a number of other paths that provide access to the block device. Also, because the LUN is not "owned" by a single controller at a time, it does not need to fail over to a partner node/controller in order to become available again. This effectively means that unless a catastrophic event such as a dual fabric failure or a dual power source failure happens, availability should be maintained at 100 percent.
When it comes to consistent latency during link failure, you can probably already see how this change in architecture supports consistent latency through a fabric or storage controller failure. The key to achieving consistent latency during a failure is having the appropriate configuration on the host side to quickly detect and reject a path that becomes unavailable, and to stop sending I/O requests down that path. NetApp offers a number of tools to help ensure the appropriate configuration is in place to get the best possible experience.
- ActiveIQ OneCollect is a tool from NetApp that creates an inventory of your environment, including hosts, switches and storage, and validates it against the IMT (Interoperability Matrix Tool).
- ActiveIQ Config Advisor focuses on the configuration of the storage controller to ensure that all best practices are met and all limits are respected.
- ActiveIQ AutoSupport is the last catch-all. AutoSupport, when configured, collects the configuration data on a weekly basis, runs checks similar to those of Config Advisor, and alerts you to any misconfiguration on the storage side.
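The host-side behavior described above, detecting a dead path and no longer sending I/O down it, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real multipath driver is implemented: RoundRobinSelector, send_io, and the path names are hypothetical, and a real host would rely on timeouts and error codes, not an is_up callback.

```python
class RoundRobinSelector:
    """Toy host-side path selector: rotates I/O over paths still marked healthy."""

    def __init__(self, paths):
        self.healthy = list(paths)
        self._next = 0

    def mark_failed(self, path):
        # Reject the path so no further I/O is sent down it.
        if path in self.healthy:
            self.healthy.remove(path)

    def pick(self):
        if not self.healthy:
            raise RuntimeError("all paths down")
        path = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return path


def send_io(selector, is_up, count):
    """Send `count` I/Os; on an error, drop the path and retry on another one."""
    used = []
    for _ in range(count):
        while True:
            path = selector.pick()
            if is_up(path):
                used.append(path)
                break
            selector.mark_failed(path)
    return used


# Hypothetical scenario: controller "a" fails, taking both of its paths with it.
all_paths = ["a-f1", "a-f2", "b-f1", "b-f2"]
down = {"a-f1", "a-f2"}
selector = RoundRobinSelector(all_paths)
used = send_io(selector, lambda p: p not in down, count=6)

print("paths still healthy:", selector.healthy)
print("paths that carried I/O:", sorted(set(used)))
```

Each failed path costs exactly one retried I/O in this model. In practice, how long that detection takes is what the host configuration controls, which is why validating that configuration matters.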
Higher throughput with ASA
When it comes to higher throughput, two factors in the architecture of the ASA come into play. The first is the number of active paths between the client and the storage. Obviously, having simultaneous access to throughput from two different fabrics and from front-end ports on multiple controllers provides a much greater throughput potential.
The second factor is having two controllers working at the same time to service I/O to the block device. This reduces the probability of running into a bottleneck caused by the processor of the storage controller.
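As a rough illustration of the first factor, the following Python sketch adds up front-end link bandwidth for the paths a host actively drives. The 32 Gb/s port speed, the two-controller, two-fabric layout, and the driven_bandwidth_gbps helper are assumptions chosen for the example. The result is only an upper bound on link capacity, not a performance figure.

```python
def driven_bandwidth_gbps(paths, active_controllers):
    """Sum the link speed of every path that terminates on an active controller.

    paths: iterable of (controller, fabric, link_speed_gbps) tuples.
    """
    return sum(speed for ctrl, _fabric, speed in paths if ctrl in active_controllers)


LINK_GBPS = 32  # assumed port speed, for illustration only

# Hypothetical layout: two controllers, each reachable through two fabrics.
paths = [(c, f, LINK_GBPS) for c in ("ctrl-a", "ctrl-b") for f in ("fabric-1", "fabric-2")]

# Owner-based access: the host drives only the paths through the owning controller.
owner_only = driven_bandwidth_gbps(paths, {"ctrl-a"})

# Symmetric access: every path is active and optimized.
all_active = driven_bandwidth_gbps(paths, {"ctrl-a", "ctrl-b"})

print(f"owner-only paths: {owner_only} Gb/s, all paths active: {all_active} Gb/s")
```

With these assumed numbers, driving all four paths doubles the available link bandwidth for a single LUN compared with driving only the owning controller's two paths.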
I was told many times growing up, "you can't have both the butter and the money for the butter" (which probably betrays my French origins), but I think a more common English idiom is "you can't have your cake and eat it too." NetApp has truly delivered a platform that can now be considered for any workload including, yes, mission critical workloads, meaning we no longer have to choose between the butter and the money (or cake).
But that comes at a cost. Some limitations have been implemented in the ASA offering that are important to keep in mind.
- 50 percent processor utilization: If you want your workload to behave the same in the event of a failure, you will need to keep your storage controller processor utilization below 50 percent at all times, but that's typical of any symmetric architecture.
- 2-node maximum: With today's AFF, NetApp supports up to 12 nodes in a single SAN cluster. That is not the case with the ASA. The documentation mentions an increased node count for future releases many times, but as of ONTAP 9.7, only two-node configurations are supported.
- FCP and iSCSI only: There's no NVMe support quite yet (again, something that gets mentioned many times in the documentation as a future deliverable).
- Space reclamation (T10 hole-punching) should be disabled: This is a "should be disabled," not a "has to be disabled." The recommendation is to turn space reclamation off for the sake of latency consistency. Space reclamation is a very intensive process for the controller processor and could result in higher latencies for extended periods of time.
- Snapshot policies should be disabled: A recommendation from NetApp is to use an external manager for snapshots such as SnapCenter to provide application consistent snapshots.
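The 50 percent guideline in the list above follows from simple arithmetic: if one of two controllers fails, the survivor has to absorb both workloads. The Python sketch below works through that reasoning. It assumes that load translates linearly into processor utilization, which is a simplification, and the function names are hypothetical.

```python
def survivor_utilization(node_a_pct, node_b_pct):
    """Utilization of the surviving controller once it absorbs its partner's load.

    Assumes load maps linearly to processor utilization (a simplification).
    """
    return node_a_pct + node_b_pct


def survives_failover(node_a_pct, node_b_pct, ceiling_pct=100):
    """True if a single controller could carry the combined load."""
    return survivor_utilization(node_a_pct, node_b_pct) <= ceiling_pct


for a, b in [(45, 45), (50, 50), (70, 70)]:
    combined = survivor_utilization(a, b)
    verdict = "fits" if survives_failover(a, b) else "saturates the survivor"
    print(f"{a}% + {b}% -> {combined}% after failover: {verdict}")
```

Under this model, two controllers running at 70 percent each would leave the survivor with more work than it can process, so latency would no longer stay consistent during a failure.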
Clearly, with a deliverable as valuable as the ASA platform, we simply cannot leave it at that. Keep an eye out for our next article about the ASA (you can do this by following our Primary Storage topic), where we will actually run tests in the ATC to compare the impact of path failover on the regular AFF platform and on the ASA platform.
If you have any special requests when it comes to testing, please contact us about adding your test cases to our test plan!