In this ATC Insight

Summary

We worked inside the Advanced Technology Center (or ATC) directly with NetApp architects, who had a chance to bless the setup and help with the baseline performance testing. Then, as a trusted advisor to our customer, we (the ATC Lab Services team) put the array through its paces in the Execution phase of the POC.

ATC Insight

The series of tests that were performed on the NetApp Solidfire storage platform was broken into 3 major sections:

  1. Baseline Performance Testing
    Gathered baseline performance data by running five VDBench jobs in 30-minute iterations, then recorded and captured the results (a sketch of the job definitions follows this list).
     
  2. Functionality Testing
    Gathered the required evidence on the management features and functionality of the solution.
     
  3. Resiliency Testing
    Lastly, tested the resiliency of the platform by introducing hardware failures while replication was taking place and the same VDBench job was running in the background.
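
For reference, below is a minimal VDBench parameter-file sketch that approximates the five baseline job profiles used in the performance testing. The device path, thread count, and reporting interval are hypothetical placeholders; the actual files enumerated the volumes presented to each VM.

    * Minimal sketch of the five baseline profiles (hypothetical LUN path and thread count)
    sd=sd1,lun=/dev/sdb,openflags=o_direct,threads=32

    * One workload definition per block size / read-write mix
    wd=wd_4k_70r,sd=sd*,xfersize=4k,rdpct=70
    wd=wd_8k_30r,sd=sd*,xfersize=8k,rdpct=30
    wd=wd_8k_90r,sd=sd*,xfersize=8k,rdpct=90
    wd=wd_32k_70r,sd=sd*,xfersize=32k,rdpct=70
    wd=wd_1m_70r,sd=sd*,xfersize=1m,rdpct=70

    * Each run drives maximum IOPs for 30 minutes, reporting every 5 seconds
    rd=rd_4k_70r,wd=wd_4k_70r,iorate=max,elapsed=1800,interval=5
    rd=rd_8k_30r,wd=wd_8k_30r,iorate=max,elapsed=1800,interval=5
    rd=rd_8k_90r,wd=wd_8k_90r,iorate=max,elapsed=1800,interval=5
    rd=rd_32k_70r,wd=wd_32k_70r,iorate=max,elapsed=1800,interval=5
    rd=rd_1m_70r,wd=wd_1m_70r,iorate=max,elapsed=1800,interval=5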

Last impressions and thoughts are listed in the conclusion at the end of this ATC Insight.

Hardware and software consisted of the following components:

  • 4x Dell PowerEdge R640 servers - used as the compute resource
  • Cisco 9K-C93180 switches - for 25Gb iSCSI storage traffic
  • 8x NetApp H610S-1 storage nodes
  • 1x NetApp management node, deployed as an OVA

**Testing on the SDS solution was performed in October and November 2020.

A high-level design of the physical environment is depicted below:

 

 

Performance Testing

 

The NetApp Solidfire solution was filled to 50% capacity at the request of the client.  The array had the same number of volumes/LUNs carved out and presented for performance testing to each VM in each cluster; this is a standard we use for all VDBench testing, and a sketch of the resulting per-VM storage definitions is shown below.
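
As an illustration of that standard layout, each VM's VDBench parameter file defines the same set of storage definitions. The sketch below is hypothetical (device paths, thread count, and the number of volumes are placeholders, not the actual POC layout):

    * Identical storage-definition block used on every VM (hypothetical paths)
    sd=default,openflags=o_direct,threads=32
    sd=sd1,lun=/dev/sdb
    sd=sd2,lun=/dev/sdc
    sd=sd3,lun=/dev/sdd
    sd=sd4,lun=/dev/sde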

The client provided us with the performance testing requirements, which are documented below:

Max IOPs, 4K Block, 70% Read, 30% Write

 

Average Latency: 4.10 ms
Average IOPs: 498.56 K
Average Throughput: 2.042 GB/s

 

NetApp Solidfire Max IOPs, 4K block, 70% Read, 30% Write
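
As a rough sanity check (assuming the averages were taken over the same interval), throughput should be approximately IOPs multiplied by block size:

    498,560 IOPs x 4,096 bytes ≈ 2.04 GB/s

which lines up with the reported 2.042 GB/s, and the same relationship roughly holds for the other profiles below.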

Max IOPs, 8K Block, 30% Read, 70% Write

 

Average Latency: 4.68 ms
Average IOPs: 436.48 K
Average Throughput: 3.576 GB/s

 

NetApp Solidfire Max IOPs, 8K block, 30% Read, 70% Write

 

Max IOPs, 8K Block, 90% Read, 10% Write

 

Average Latency: 4.95 ms
Average IOPs: 414.12 K
Average Throughput: 3.533 GB/s

 

NetApp Solidfire Max IOPs, 8K block, 90% Read, 10% Write

 

Max IOPs, 32K Block, 70% Read, 30% Write

 

Average Latency: 10.07 ms
Average IOPs: 203.14 K
Average Throughput: 6.66 GB/s

 

NetApp Solidfire Max IOPs, 32K block, 70% Read, 30% Write

 

Max IOPs, 1MB Block, 70% Read, 30% Write

 

Average Latency: 278.67 ms
Average IOPs: 7.38 K
Average Throughput: 7.74 GB/s

 

NetApp Solidfire Max IOPs, 1MB block, 70% Read, 30% Write

 

Functionality Testing

 

The client had asked us to go through different aspects of functionality for each solution.  This was a combination of QoS testing, performance monitoring, efficiency reporting, snapshot capabilities, and overall general management of the solution.  For this ATC Insight, we focused on QoS testing (where supported), performance monitoring, and snapshot capabilities.  If there is further interest in the other tests we performed, please feel free to reach out to the author of this ATC Insight.

 

QoS Testing

 

This test was completed by running two separate VDBench jobs.  For this instance, the QoS policy was set to 200K IOPs, so the first job was kicked off at 25K IOPs with a 4K block at 50% read and 50% write.  That job ran for a period of time before we kicked off a second job running at 200K IOPs with a 4K block at 50% read and 50% write; this would push the traffic over the 200K IOP threshold and should cause the QoS policy to take effect.
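
A sketch of the two job definitions is below.  The device paths and run lengths are hypothetical placeholders; the actual jobs ran against the volumes presented to the VDBench cluster.

    * qos_job1.parm - steady 25K IOPs, 4K block, 50/50 read-write
    sd=sd1,lun=/dev/sdb,openflags=o_direct
    wd=wd1,sd=sd1,xfersize=4k,rdpct=50
    rd=rd1,wd=wd1,iorate=25000,elapsed=3600,interval=5

    * qos_job2.parm - started later from a second VDBench instance, 200K IOPs, same profile
    sd=sd1,lun=/dev/sdc,openflags=o_direct
    wd=wd1,sd=sd1,xfersize=4k,rdpct=50
    rd=rd1,wd=wd1,iorate=200000,elapsed=3600,interval=5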

Shows the policy of 200K IOPs being applied to all 64 volumes for the VDBench cluster
Shows the first job run at a steady state
Shows the additional 200K IOP job being kicked off and the QoS policy working as expected

 

Live Performance Monitoring

 

The client required that the solution alert on an abnormality in health when an incident occurred.  For this test, we ran a VDBench workload of 25K IOPs, 4K block, 50% read and 50% write.  After the workload had run for a period of time, an SFP was pulled from one of the nodes.  While we could see the impact in the VDBench output, there was no notification from the NetApp interface, though the event could be seen in the logs.

Shows the impact spike from an SFP pull
Shows no impact alert from the removed SFP

 

Snapshot Capabilities

 

This test revolved around a future feature request from the client.  Teams were starting to look at the benefits of snapshots in their environment, and the client wanted to see what functionality could be garnered from each solution.  The ask from the client was as follows:

  1. Manually create a snapshot
  2. Create a snap of the previous snapshot
  3. Present a snapshot to a host and actively use it
  4. Create a consistency group snapshot
  5. Create snapshot schedule with retention rules

In this test we were not able to take a snapshot of a snapshot without using the CLI, while the request from the client was for everything to be done through the GUI.  Also, if different retention periods are needed, a unique schedule must be created for every retention period, as a retention period can only be set at schedule creation.

Shows in the interface where to go to perform a snapshot
Shows the options available when taking a snapshot
View of snapshot
View of the volume and the snapshot both mounted on the same host
Shows the snapshot as being editable
Shows the ability to take a group snap
Shows the group snap

 

Resiliency Testing

 

The final round of testing was based on hardware resiliency.  For each of the tests, a baseline VDBench job ran 50K IOPs, 4K block, 50% read and 50% write in the background (a sketch of this job follows the list below).  The tests consisted of the following:

  • Power leg pull from one node
  • Connectivity leg pull from one node
  • Remove a node from the cluster for a period of 5 minutes
  • Drive removal from one node; if no impact, pull a drive from a subsequent node until impact
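
A sketch of the background job used during these tests is below (the device path and run length are hypothetical placeholders); a short reporting interval makes the impact of each failure injection easier to see in the output.

    * Background job for the resiliency tests: 50K IOPs, 4K block, 50/50 read-write
    sd=sd1,lun=/dev/sdb,openflags=o_direct
    wd=wd_resil,sd=sd1,xfersize=4k,rdpct=50
    * Long run time to span the failure injections; 1-second interval to catch impact spikes
    rd=rd_resil,wd=wd_resil,iorate=50000,elapsed=7200,interval=1
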
Impact from pulling a single leg of power from a node
Impact from power being plugged back in

**We found it a bit odd that a power pull on only one leg would cause an impact; NetApp agreed and took the action to look into it and document the findings.

Impact from pulling a leg of connectivity
Impact from pulling a single node from the cluster
Impact from a single drive pull
Impact of drive pulls until failure; it took six drives pulled, one drive per node with a 20-second delay between pulls

 

Test Tools

VDBench

VDBench is an I/O workload generator for measuring storage performance and verifying the data integrity of direct-attached and network-connected storage. The software is known to run on several operating platforms.  It is an open-source tool from Oracle. Visit VDBench Wiki for more information.
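
As an example of how the jobs described above are typically launched (file and directory names here are hypothetical), VDBench reads a parameter file passed with -f and writes its HTML reports to the directory given with -o:

    ./vdbench -f baseline_4k_70r.parm -o output/baseline_4k_70r

The interval-level results written to the output directory (for example, summary.html and flatfile.html) can then be charted in a tool like Grafana.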

Graphite/Grafana

A Graphical User Interface (or GUI) that we use in the Advanced Technology Center (or ATC) to visually depict the results data that we derive in our compute and storage lab efforts. Visit Grafana for more information.

Last Impressions and Thoughts

  • NetApp does a great job of making a straightforward solution with easy setup and management.
  • The improvements being made to the software layer will help it garner an advantage over other players in the market.
  • Would like to see Fibre Channel as a connectivity option.
