ATC Insight

A series of tests was performed on the HPE Nimble Storage AF60 platform.  The testing was broken into three major sections:

  1. Baseline Performance Testing
    Gathered baseline performance data by running five VDBench jobs, each for a 30-minute iteration, and recorded the results.
     
  2. Replication Testing
    Tested replication while a specific VDBench job was running in the background and recorded the results.
     
  3. Resiliency Testing
    Lastly, tested the resiliency of the platform by introducing hardware failures while replication was taking place and the same VDBench job was running in the background.

Hardware and software consisted of the following components:

  • Cisco UCS 5108 with B200 M4 blades - used for the source and destination host configuration
  • Cisco 6248 switches - used for 10Gb connectivity to hosts
  • Brocade G620 switches - used for source FC connectivity
  • Cisco Nexus 5548 switches - used for replication traffic between the storage arrays
  • Cisco MDS 9148 switches - used for destination FC connectivity
  • HPE Nimble Storage AF60 arrays - one pair (source and destination)
  • Nimble software version 5.1.4.0

Note: Testing on this HPE Nimble Storage AF60 was performed in December 2019 and January 2020.

 

Baseline Performance Testing

The HPE Nimble AF60 array was filled to 70% capacity at the request of our client.  Each VM in each cluster had the same number of volumes/LUNs carved out and presented for performance testing, which is the standard we use for all VDBench testing.

The command VM was the only VM used for migration and had a total of six volumes mapped.  The first volume was filled with 200GB of data before being replicated to the target array.  Once that initial replication completed, we kicked off the jobs that generated changed data on the other replicating volumes.  This gave us a baseline of what replication looked like without performance jobs running alongside it.
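
As an illustration of the fill step only, a sketch like the one below could generate roughly 200GB of sequential data on the first volume with a VDBench run capped by maxdata.  The LUN path and file name are hypothetical placeholders, and this is not necessarily how the data was staged in the lab.

    # Hypothetical sketch: write roughly 200GB of sequential data to the first
    # volume so there is something to replicate. The LUN path is a placeholder.
    fill_params = "\n".join([
        "sd=sd1,lun=/dev/nimble_cmd_vol1,openflags=o_direct",
        "wd=fill,sd=sd1,xfersize=256k,rdpct=0,seekpct=0",             # sequential writes
        "rd=fill_run,wd=fill,iorate=max,elapsed=86400,maxdata=200g",  # stop after ~200GB written
    ])
    with open("vdbench_fill_200g.params", "w") as handle:
        handle.write(fill_params + "\n")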

To save time, we ran the performance testing before the client came on-site and captured baseline results for the following five workload profiles on the HPE Nimble AF60 array (a parameter-file sketch follows the list):

  • Nimble 75K IOPS, 64K block, 100% Sequential, 100% Write
  • Nimble 75K IOPS, 64K block, 100% Random, 50% Read/Write
  • Nimble 150K IOPS, 64K block, 100% Random, 50% Read/Write
  • Nimble 75K IOPS, 8K block, 100% Random, 50% Read/Write
  • Nimble 150K IOPS, 8K block, 100% Random, 50% Read/Write
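
For illustration, here is a minimal sketch of how these profiles map onto VDBench parameter files.  The LUN path and output file names are hypothetical placeholders (the real jobs were spread across multiple volumes and VMs); the point is simply how IOPS, block size, randomness, and read/write mix translate to VDBench's iorate, xfersize, seekpct, and rdpct parameters.

    # Sketch only: maps the five baseline workload profiles onto VDBench
    # parameters. The LUN path and output file names are hypothetical.
    profiles = [
        # (name,               iorate,  xfersize, seekpct, rdpct)
        ("64k_seq_write",       75000,  "64k",    0,       0),   # 100% sequential, 100% write
        ("64k_rand_rw50",       75000,  "64k",    100,     50),  # 100% random, 50/50 read/write
        ("64k_rand_rw50_150k",  150000, "64k",    100,     50),
        ("8k_rand_rw50",        75000,  "8k",     100,     50),
        ("8k_rand_rw50_150k",   150000, "8k",     100,     50),
    ]

    for name, iorate, xfersize, seekpct, rdpct in profiles:
        params = "\n".join([
            # Storage definition: one raw LUN presented to the VM (placeholder path).
            "sd=sd1,lun=/dev/nimble_lun1,openflags=o_direct",
            # Workload definition: block size, read percentage, randomness (seekpct=100 = random).
            f"wd=wd1,sd=sd1,xfersize={xfersize},rdpct={rdpct},seekpct={seekpct}",
            # Run definition: fixed IOPS target, 30-minute run, report every 5 seconds.
            f"rd=run1,wd=wd1,iorate={iorate},elapsed=1800,interval=5",
        ]) + "\n"
        with open(f"vdbench_{name}.params", "w") as handle:
            handle.write(params)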

Performance testing resulted in the following findings for each test run:

 

Performance testing findings

 

Replication Testing


After the performance testing was completed, we moved on to the replication testing, which held the most weight for our client.  The customer used these replication results to evaluate the replication standards they have in place today in their production data centers.

A base workload was used for all replication testing: a VDBench job running 150K IOPS, 8K block, 100% random, 50% read/write.  The 200GB of data was replicated first; once that replication completed, the VDBench job was kicked off for a run time of 30 minutes.  The following results were observed for the HPE Nimble Storage AF60 array:

Results for HPE Nimble Storage AF60 Array


Resiliency Testing

Once the replication testing was completed, the final round of testing with our client commenced.  This testing focused specifically on resiliency.

For the resiliency testing, there were a few key elements being tested:

  • First, a controller failure simulation
  • Second, a Fibre Channel switch going down
  • Finally, a drive being pulled from the array 

Additional Notes:

For the controller failure simulation, we pulled one of the controllers out of the HPE Nimble Storage AF60 and left it out for a 10-minute period to better simulate a real failure.

For the drive pull, the first iteration showed no impact, due to updated technology in how the system detects a missing drive: if the pull and replacement happened too quickly, the array behaved as if the drive had never been removed.  This led to a change in the test, leaving the drive out for a few minutes at a time.

While resiliency testing was under way, we had a baseline job running that consisted of 150K IOPS, 8K block, 100% random, 50% read/write.  This was paired with replication enabled while the tests were performed.  The sequence for each test consisted of the following steps (a timing sketch follows the list):

  1. Start VDBench job of 150K IOPS, 8K block, 100% random, 50% read/write
  2. Let run for 5 minutes
  3. Enable replication between the source and destination array
  4. Let run for 5 minutes
  5. Perform the decided resiliency test
  6. Wait for array to equalize once test was complete
  7. Stop replication
  8. Stop VDBench job

This sequence was run on the HPE Nimble Storage AF60 array for each of the three resiliency tests.  It was the most time-consuming testing we did with our client.

Here are the testing results of resiliency testing on the HPE Nimble Storage AF60 Array:

Controller Failure

At 1:56pm controller A was removed from the chassis resulting in controller failure. At 2:06pm controller A was reinserted into the chassis. 

The array recovered as expected with no loss of service to the workload. IO was impacted for approximately 30 seconds after the controller failure. Controller B became primary instantly with a slight increase in latency. Controller A came back online as the standby controller.

Fail single Fibre Channel switch

Workload with replication started at 2:36pm. All power was removed from a single Brocade switch at 2:42pm. We observed a nominal impact to latency. The Brocade switch recovered at 2:45pm.

Single drive failure

Workload with replication started at 2:16pm. A single drive was failed at 4:20pm by removing it from the disk array enclosure, causing a disk rebuild within the array. The disk rebuild lasted approximately 8 minutes. Latency during this period averaged 11.54ms read and 6.40ms write.

Test Tools

Depiction of Grafana being used as a front end to VDBench

VDBench

VDBench is an I/O workload generator for measuring storage performance and verifying the data integrity of direct-attached and network-connected storage.  The software runs on several operating platforms and is an open-source tool from Oracle.  To learn more about VDBench, you can visit the wiki HERE.
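
As a minimal, hypothetical example of using the tool in both of the ways described above, the sketch below runs VDBench once as a plain workload generator and once with its data-validation mode enabled via the -v flag; the parameter file and output directory names are placeholders.

    import subprocess

    # Hypothetical example: -f points at a parameter file, -o sets the output
    # directory, and -v enables VDBench's data-validation mode.
    PARAMS = "vdbench_8k_rand_rw50.params"

    # Plain performance run.
    subprocess.run(["./vdbench", "-f", PARAMS, "-o", "output/perf_run"], check=True)

    # Same workload with data validation, so written data is checked when re-read.
    subprocess.run(["./vdbench", "-f", PARAMS, "-v", "-o", "output/validate_run"], check=True)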

Graphite/Grafana

Graphite is a time-series metrics store, and Grafana is the graphical user interface (GUI) that we use in the Advanced Technology Center (ATC) to visually depict the results data we derive in our compute and storage lab efforts.  To learn more about these products, you can go HERE.
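
To give a sense of how results data can reach these dashboards, here is a minimal sketch that pushes a couple of metrics to Graphite over its plaintext protocol (one "metric.path value timestamp" line per sample on TCP port 2003), which Grafana can then graph.  The host name, metric paths, and values are hypothetical, and this is not the exact pipeline used in the ATC.

    import socket
    import time

    # Minimal sketch of Graphite's plaintext protocol: "<metric.path> <value> <timestamp>\n"
    # sent over TCP port 2003. Host, metric names, and values are hypothetical.
    GRAPHITE_HOST = "graphite.lab.example"
    GRAPHITE_PORT = 2003

    def send_metrics(samples):
        now = int(time.time())
        payload = "".join(f"{path} {value} {now}\n" for path, value in samples)
        with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as conn:
            conn.sendall(payload.encode("ascii"))

    # Example: one reporting interval's IOPS and average response time.
    send_metrics([
        ("lab.nimble_af60.vdbench.iops", 150000),
        ("lab.nimble_af60.vdbench.resp_ms", 0.45),
    ])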

Specific Nimble AF60 Hardware and Software for Lab Testing:

  • Nimble software version 5.1.4.0
     
  • Source site:
    AF60-2QF-184T-2 - AF60, 2X10GBASET, QUAD 16GBFC (QTY. 2 PAIR), QTY 2 X 92TB FLASH PACK
    AFS3-138240-2 - ALL FLASH EXP SHELF FOR ALL FLASH CONTROLLERS, QTY 1 X 46TB FLASH PACK, QTY 1 X 92TB FLASH PACK
     
  • DR site:
    AF60-2QF-184T-2 - AF60, 2X10GBASET, QUAD 16GBFC (QTY. 2 PAIR), QTY 2 X 92TB FLASH PACK
    AFS3-138240-2 - ALL FLASH EXP SHELF FOR ALL FLASH CONTROLLERS, QTY 1 X 46TB FLASH PACK, QTY 1 X 92TB FLASH PACK

Supporting Lab Environment for Testing:

  • Cisco UCS 5108 with B200 M4 blades - used for the source and destination host configuration
  • Cisco 6248 switches - used for 10Gb connectivity to hosts
  • Brocade G620 switches - used for source FC connectivity
  • Cisco Nexus 5548 switches - used for replication traffic between the storage arrays
  • Cisco MDS 9148 switches - used for destination FC connectivity

Diagram Depiction of the Lab in the ATC:

 

 
