In this ATC Insight

Summary

We worked inside the Advanced Technology Center (or ATC) directly with NetApp architects, who had a chance to bless the setup and help with the baseline performance testing. Then, as a trusted advisor to our customer, we (the ATC Lab Services team) put the array through its paces in the Execution phase of the POC.

ATC Insight

The series of tests that were performed on the NetApp Solidfire storage platform was broken into 3 major sections:

  1. Baseline Performance Testing
    Gathered baseline performance data by running five VDBench jobs in 30-minute iterations, then recorded and captured the results (a sketch of the job definitions follows this list).
     
  2. Functionality Testing
    Gathered the required evidence on the management features and functionality of the solution.
     
  3. Resiliency Testing
    Lastly, tested the resiliency of the platform by introducing hardware failures while replication was taking place and the same VDBench job was running in the background.
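
For reference, below is a minimal VDBench parameter-file sketch that approximates the five baseline job profiles used in the performance testing. The device path, thread count, and reporting interval are hypothetical placeholders; the actual files enumerated the volumes presented to each VM.

    * Minimal sketch of the five baseline profiles (hypothetical LUN path and thread count)
    sd=sd1,lun=/dev/sdb,openflags=o_direct,threads=32

    * One workload definition per block size / read-write mix
    wd=wd_4k_70r,sd=sd*,xfersize=4k,rdpct=70
    wd=wd_8k_30r,sd=sd*,xfersize=8k,rdpct=30
    wd=wd_8k_90r,sd=sd*,xfersize=8k,rdpct=90
    wd=wd_32k_70r,sd=sd*,xfersize=32k,rdpct=70
    wd=wd_1m_70r,sd=sd*,xfersize=1m,rdpct=70

    * Each run drives maximum IOPs for 30 minutes, reporting every 5 seconds
    rd=rd_4k_70r,wd=wd_4k_70r,iorate=max,elapsed=1800,interval=5
    rd=rd_8k_30r,wd=wd_8k_30r,iorate=max,elapsed=1800,interval=5
    rd=rd_8k_90r,wd=wd_8k_90r,iorate=max,elapsed=1800,interval=5
    rd=rd_32k_70r,wd=wd_32k_70r,iorate=max,elapsed=1800,interval=5
    rd=rd_1m_70r,wd=wd_1m_70r,iorate=max,elapsed=1800,interval=5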

Last impressions and thoughts are listed in the conclusion at the end of this ATC Insight.

Hardware and software consisted of the following components:

  • 4x Dell PowerEdge R640 servers - used as the compute resource
  • Cisco 9K-C93180 switches - for 25Gb iSCSI storage traffic
  • 8x NetApp H610S-1 storage nodes
  • 1x NetApp management node, deployed as an OVA

**Testing on the SDS solution was performed in October and November 2020.

A high-level design of the physical environment is depicted below:

 

 

Performance Testing

 

The NetApp Solidfire solution was filled to 50% capacity at the request of the client.  The array had the same number of volumes/LUNs carved out and presented for performance testing to each VM in each cluster; this is a standard we use for all VDBench testing, and a sketch of the resulting per-VM storage definitions is shown below.
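
As an illustration of that standard layout, each VM's VDBench parameter file defines the same set of storage definitions. The sketch below is hypothetical (device paths, thread count, and the number of volumes are placeholders, not the actual POC layout):

    * Identical storage-definition block used on every VM (hypothetical paths)
    sd=default,openflags=o_direct,threads=32
    sd=sd1,lun=/dev/sdb
    sd=sd2,lun=/dev/sdc
    sd=sd3,lun=/dev/sdd
    sd=sd4,lun=/dev/sde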

The client provided us with the performance testing requirements, which are documented below:

Max IOPs, 4K Block, 70% Read, 30% Write

 

Average Latency: 4.10 ms
Average IOPs: 498.56 K
Average Throughput: 2.042 GB/s

 

NetApp Solidfire Max IOPs, 4K block, 70% Read, 30% Write
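
As a rough sanity check (assuming the averages were taken over the same interval), throughput should be approximately IOPs multiplied by block size:

    498,560 IOPs x 4,096 bytes ≈ 2.04 GB/s

which lines up with the reported 2.042 GB/s, and the same relationship roughly holds for the other profiles below.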

Max IOPs, 8K Block, 30% Read, 70% Write

 

Average Latency: 4.68 ms
Average IOPs: 436.48 K
Average Throughput: 3.576 GB/s

 

NetApp Solidfire Max IOPs, 8K block, 30% Read, 70% Write

 

Max IOPs, 8K Block, 90% Read, 10% Write

 

Average Latency: 4.95 ms
Average IOPs: 414.12 K
Average Throughput: 3.533 GB/s

 

NetApp Solidfire Max IOPs, 8K block, 90% Read, 10% Write

 

Max IOPs, 32K Block, 70% Read, 30% Write

 

Average Latency: 10.07 ms
Average IOPs: 203.14 K
Average Throughput: 6.66 GB/s

 

NetApp Solidfire Max IOPs, 32K block, 70% Read, 30% Write

 

Max IOPs, 1MB Block, 70% Read, 30% Write

 

Average Latency: 278.67 ms
Average IOPs: 7.38 K
Average Throughput: 7.74 GB/s

 

NetApp Solidfire Max IOPs, 1MB block, 70% Read, 30% Write

 

Functionality Testing

 

The client had asked us to go through different aspects of functionality for each solution.  This was a combination of QoS testing, performance monitoring, efficiency reporting, snapshot capabilities, and overall general management of the solution.  For this ATC Insight, we focused on QoS testing (where supported), performance monitoring, and snapshot capabilities.  If there is further interest in the other tests we performed, please feel free to reach out to the author of this ATC Insight.

 

QoS Testing

 

This test was completed by running two separate VDBench jobs.  For this instance, the QoS policy was set to 200K IOPs, so the first job was kicked off at 25K IOPs with a 4K block at 50% read and 50% write.  That job ran for a period of time before we kicked off a second job running at 200K IOPs with a 4K block at 50% read and 50% write; this would push the traffic over the 200K IOP threshold and should cause the QoS policy to take effect.
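
A sketch of the two job definitions is below.  The device paths and run lengths are hypothetical placeholders; the actual jobs ran against the volumes presented to the VDBench cluster.

    * qos_job1.parm - steady 25K IOPs, 4K block, 50/50 read-write
    sd=sd1,lun=/dev/sdb,openflags=o_direct
    wd=wd1,sd=sd1,xfersize=4k,rdpct=50
    rd=rd1,wd=wd1,iorate=25000,elapsed=3600,interval=5

    * qos_job2.parm - started later from a second VDBench instance, 200K IOPs, same profile
    sd=sd1,lun=/dev/sdc,openflags=o_direct
    wd=wd1,sd=sd1,xfersize=4k,rdpct=50
    rd=rd1,wd=wd1,iorate=200000,elapsed=3600,interval=5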

Shows the policy of 200K IOPs being applied to all 64 volumes for the VDBench cluster
Shows the first job run at a steady state
Shows the additional 200K IOP job being kicked off and the QoS policy working as expected

 

Live Performance Monitoring

 

The client required that the solution alert on an abnormality in health when an incident occurred.  For this test, we ran a VDBench workload of 25K IOPs, 4K block, 50% read and 50% write.  After the workload had run for a period of time, an SFP was pulled from one of the nodes.  While we could see the impact in the VDBench output, there was no notification from the NetApp interface, though the event could be seen in the logs.

Shows the impact spike from an SFP pull
Shows no impact alert from the removed SFP

 

Snapshot Capabilities

 

This test revolved around a future feature request from the client.  Teams were starting to look at the benefits of snapshots in their environment, and the client wanted to see what functionality could be garnered from each solution.  The ask from the client was as follows:

  1. Manually create a snapshot
  2. Create a snap of the previous snapshot
  3. Present a snapshot to a host and actively use it
  4. Create a consistency group snapshot
  5. Create snapshot schedule with retention rules

In this test we were not able to take a snapshot of a snapshot without using the CLI, while the request from the client was for everything to be done through the GUI.  Also, if different retention periods are needed, a unique schedule must be created for every retention period, as a retention period can only be set at schedule creation.

Shows in the interface where to go to perform a snapshot
Shows the options available when taking a snapshot
View of snapshot
View of the volume and the snapshot both mounted on the same host
Shows the snapshot as being editable
Shows the ability to take a group snap
Shows the group snap

 

Resiliency Testing

 

The final round of testing was based on hardware resiliency.  For each of the tests, a baseline VDBench job ran 50K IOPs, 4K block, 50% read and 50% write in the background (a sketch of this job follows the list below).  The tests consisted of the following:

  • Power leg pull from one node
  • Connectivity leg pull from one node
  • Remove a node from the cluster for a period of 5 minutes
  • Drive removal from one node; if no impact, pull a drive from a subsequent node until impact
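
A sketch of the background job used during these tests is below (the device path and run length are hypothetical placeholders); a short reporting interval makes the impact of each failure injection easier to see in the output.

    * Background job for the resiliency tests: 50K IOPs, 4K block, 50/50 read-write
    sd=sd1,lun=/dev/sdb,openflags=o_direct
    wd=wd_resil,sd=sd1,xfersize=4k,rdpct=50
    * Long run time to span the failure injections; 1-second interval to catch impact spikes
    rd=rd_resil,wd=wd_resil,iorate=50000,elapsed=7200,interval=1
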
Impact from pulling a single leg of power from a node
Impact from power being plugged back in

**We found it a bit odd that a power pull on only one leg would cause an impact; NetApp agreed and took the action to look into it and document the findings.

Impact from pulling a leg of connectivity
Impact from pulling a single node from the cluster
Impact from a single drive pull
Impact of drive pulls until failure; it took six drives pulled, one drive per node with a 20-second delay between pulls

 

Test Tools

VDBench

VDBench is an I/O workload generator for measuring storage performance and verifying the data integrity of direct-attached and network-connected storage. The software is known to run on several operating platforms.  It is an open-source tool from Oracle. Visit VDBench Wiki for more information.
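
As an example of how the jobs described above are typically launched (file and directory names here are hypothetical), VDBench reads a parameter file passed with -f and writes its HTML reports to the directory given with -o:

    ./vdbench -f baseline_4k_70r.parm -o output/baseline_4k_70r

The interval-level results written to the output directory (for example, summary.html and flatfile.html) can then be charted in a tool like Grafana.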

Graphite/Grafana

A Graphical User Interface (or GUI) that we use in the Advanced Technology Center (or ATC) to visually depict the results data that we derive in our compute and storage lab efforts. Visit Grafana for more information.

Last Impressions and Thoughts

  • NetApp does a great job of making a straightforward solution with easy setup and management.
  • The improvements being made to the software layer will help it garner an advantage over other players in the market.
  • Would like to see Fibre Channel as a connectivity option.
