
Overview 

In mid-2021, WWT completed a high-performance object storage benchmarking effort comparing the all-flash solutions of four separate OEMs. The OEMs/products included Pure FlashBlade, NetApp SGF, DELL EXF900 and Vast Data.

Here, I cover details on the testing methodology used and how the results were presented. While WWT may have more recent test examples, the basic object storage performance testing methodology has remained consistent, the only exception being ongoing improvements in the test jig itself.

The test jig

Hardware 

Part of the requirements for the testing effort was to expand the throughput (bandwidth) capabilities of the existing workload servers running the benchmarking tool in the lab. Previously, we had 12 servers (half HPE and half DELL models) capable of approximately 80 GB/s in total.

Since some of the arrays being tested were capable of much higher throughput than 80 GB/s, we added 16 more servers (for a total of 28) of mixed types, including 12 Cisco C-series servers and some additional DELL servers, increasing the maximum throughput to approximately 164 GB/s. We also purchased four new Cisco model 3232 switches to support all the I/O connectivity that would be required, including 100Gb NICs.

Lastly, we made an investment in cables to interconnect everything. Rarely do we consider cabling to be a critical-path item, but in this case, we had to purchase about $14K worth of 100Gb-capable cables (which proved challenging during the height of the pandemic).

Software 

The biggest software requirement pertained to the open source version of the object storage benchmarking tool, COSbench. We used a specialized version of release 1.4 that provides a "dense," non-compressible data pattern on writes (PUTs).

COSbench has been around for a long time (>10 years), and there are other object benchmarking tools out there, but we prefer COSbench because it's open source, wasn't created by any specific object storage OEM and does the job nicely for current requirements. In some cases, we also provided virtual machines for OEM-specific advanced reporting and analytics capabilities.
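To illustrate why the "dense" pattern matters, here is a minimal sketch (not the modified COSbench code itself, just an illustration) of generating an incompressible PUT payload, so that arrays with inline compression or deduplication cannot shrink the data and inflate their apparent throughput:

```python
import os
import zlib

def make_dense_payload(size_bytes: int) -> bytes:
    """Return a buffer of random bytes.

    Random data has essentially no redundancy, so inline compression or
    deduplication on the array cannot reduce it and skew the results.
    """
    return os.urandom(size_bytes)

if __name__ == "__main__":
    payload = make_dense_payload(100 * 1024)  # 100KB small-object size
    compressed = zlib.compress(payload, 9)
    # Ratio stays close to 1.0, confirming the payload is effectively incompressible.
    print(f"compressed/original ratio: {len(compressed) / len(payload):.3f}")
```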

Fig 1. The WWT object test bed for the 2021 high performance object testing.

Apples to apples testing

WWT's testing goals are always to give the most accurate view of how products compare. That can be challenging when dealing with completely different OEM architectures. For instance, one OEM may have "unified nodes," where the compute and storage are located in the same chassis, while another may have separate compute nodes as well as storage nodes or expansion trays that allow for independent scale-out. One OEM may be using the latest multi-core CPUs, while another may be running CPU versions a generation or two behind. Some solutions may leverage the latest SSD technology and NVMe interconnectivity, while others may leverage a totally different SSD and interconnect strategy.

These differences do not preclude one OEM from providing a full rack of equipment to be tested while another provides only a half-rack. The only requirement that WWT placed on the OEMs providing gear for this particular effort was that they supply a reasonable production configuration and not simply a minimum "starter" config. In general, this meant at least 8 nodes.

At the end of the day, all these differences between OEMs can be reduced to two normalization metrics, as detailed below.

Normalization metric #1 (performance per rack-unit)

The first way WWT normalizes the information is to calculate a performance-per-rack-unit metric by taking the performance results for each solution and dividing them by the overall rack space the solution requires. Note that this includes all specialized gear that must come from the OEM to deploy the solution. This can be a little subjective in some cases, especially around network interconnectivity: if gear such as aggregation switches is required to come from the OEM, it is included in the overall configuration rack space; if the customer can provide their own standardized TOR aggregation switches, it is not.

This is where density of the solution is beneficial as well. For example, one solution's nodes each required a one-RU chassis for the "server" component and a two-RU chassis for the SSD "storage" component. While that sounds reasonable, most of the competing solutions had storage and compute integrated into a single two-RU chassis. This physical design put the 3RU-per-node OEM at an immediate disadvantage in the "normalized" performance metrics.
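As a concrete illustration of metric #1, the sketch below (with made-up example numbers, not actual test results) divides a solution's measured throughput by the rack units its OEM-supplied configuration consumes, including any gear that must come from the OEM:

```python
def perf_per_rack_unit(throughput_gbps: float, node_count: int,
                       ru_per_node: int, extra_oem_ru: int = 0) -> float:
    """Normalize throughput (GB/s) by the total rack units the solution occupies.

    extra_oem_ru covers gear that must come from the OEM (e.g., required
    aggregation switches); customer-supplied TOR switches are excluded.
    """
    total_ru = node_count * ru_per_node + extra_oem_ru
    return throughput_gbps / total_ru

# Hypothetical comparison: an integrated 2RU-per-node design vs. a 3RU-per-node design.
print(perf_per_rack_unit(60.0, node_count=8, ru_per_node=2))  # ~3.75 GB/s per RU
print(perf_per_rack_unit(60.0, node_count=8, ru_per_node=3))  # ~2.5  GB/s per RU
```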

Normalization metric #2 (price per performance/rack unit)

The second normalization metric is a cost-to-performance metric. WWT doesn't perform this level of analysis, since customer quotes can vary between proposed solutions and the test team is rarely involved with pricing, but it is relatively easy for the customer to perform this analysis once the competing OEM quotes are submitted.

It is quite possible, even likely, that the best-performing solution is considerably more expensive than the second- or third-place solution tested, yet after a full evaluation of all the products, the performance of the less expensive solution proves quite satisfactory (good enough!) for the desired use case.
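A customer receiving quotes could extend the same normalization to cost. The sketch below uses purely hypothetical prices and throughput numbers (not actual OEM figures) to show how a "good enough" solution can win on dollars per GB/s:

```python
def dollars_per_gbps(quote_usd: float, throughput_gbps: float) -> float:
    """Cost-to-performance metric: lower is better."""
    return quote_usd / throughput_gbps

# Hypothetical quotes and measured throughput.
solutions = {"OEM-A": (1_400_000, 95.0), "OEM-B": (900_000, 70.0)}
for name, (price, gbps) in solutions.items():
    print(f"{name}: ${dollars_per_gbps(price, gbps):,.0f} per GB/s")
# OEM-B is slower in absolute terms but cheaper per unit of performance.
```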

The test suite 

A test suite, in this case, is defined as a group of benchmarking tests that are run as a logical set, with the results documented. A full test suite consists of the results from all the different test types, such as write (PUT), read (GET), mixed (PUT and GET) and DELETE.

There are several product testing phases performed before the "final" published documentation set can be created. These phases include "Test Prep," "OEM Validation," "Final Testing," creation of the "OEM Final Report" and, lastly, the "OEM Comparison Test Report." More details on each phase and how the testing is performed are described below.

Test prep 

The "Prep" phase occurs immediately after the OEM turns the array over to WWT after installation and includes filling the array to half capacity with dense (uncompressible) test data also called the "Fill Test." In the case of a particularly small capacity array, the fill test may be skipped. 

Test prep also includes the creation of all the test buckets that will be used during testing, along with pre-populating the data that will be needed for the GET tests.

COSbench test parameters

COSbench can be a somewhat daunting test tool because of the powerful features and flexibility it is designed to have. It seems we learn some new nuance each time we do testing.

WWT has learned that there are some very important considerations when running certain tests, such as preventing object over-writes during PUT tests. In the OEM final report, we include a sample of the base configuration parameters for each test run (see Figure 2 below).

One of the benefits of having WWT assist in the testing process is that we have experience in ensuring that the tests being run accurately reflect the testing you want to perform. Once standardized, the same test parameters are used across all the different OEMs tested.

Fig. 2:  Example of COSbench test parameters for a PUT sub-test step for a 100KB object size.
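For readers unfamiliar with the tool, the sketch below generates a COSbench-style workstage for a single 100KB PUT step. The endpoint-independent values shown (worker count, runtime, container and object ranges) are placeholders, and the exact selector syntax should be verified against the COSBench user guide; the key ideas are a fixed worker (thread) count per step and object names scoped so that PUTs do not over-write existing objects.

```python
def put_workstage(workers: int, runtime_s: int, obj_kb: int,
                  first_obj: int, last_obj: int) -> str:
    """Emit a COSbench-style <workstage> for one PUT step.

    Writing into a dedicated object-ID range (first_obj..last_obj) is one way
    to keep a PUT test from over-writing objects created by earlier steps.
    Selector syntax (r(), c(), etc.) is per the COSBench user guide.
    """
    return f"""
  <workstage name="put-{obj_kb}KB-{workers}w">
    <work name="put" workers="{workers}" runtime="{runtime_s}" rampup="60">
      <operation type="write" ratio="100"
        config="cprefix=wwt-put;containers=r(1,28);objects=r({first_obj},{last_obj});sizes=c({obj_kb})KB" />
    </work>
  </workstage>"""

# A hypothetical lightest PUT step: 28 workers, 12-minute step, 100KB objects.
print(put_workstage(workers=28, runtime_s=720, obj_kb=100,
                    first_obj=1, last_obj=1_000_000))
```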

Types of workloads

We used three different object sizes of 100KB, 10MB and 1GB (small, medium and large) to test separate pure PUT and pure GET workloads. We also tested a non-concurrent mixed workload consisting of PUTs, GETs and DELETEs, but only with the medium object size.

For the mixed workload, the mixture of PUTs to GETs was tested using five different ratios, including 80/20, 60/40, 50/50, 40/60 and 20/80, while maintaining a consistent 10% DELETE component.

As mentioned, the actual object sizes used were 100KB, 10MB, and 1GB, but with the introduction of AI/ML workloads, there is a good possibility that we will decrease the small object size to 50KB or smaller for future tests.
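Putting the workload types together, a small sketch of the resulting test matrix (object sizes and ratios taken from the text above; the naming convention is ours) looks like this:

```python
from itertools import product

object_sizes = {"small": "100KB", "medium": "10MB", "large": "1GB"}

# Pure PUT and pure GET workloads run at every object size.
pure_tests = [f"{op} @ {size}"
              for op, size in product(("PUT", "GET"), object_sizes.values())]

# The mixed workload runs only at the medium size, with a constant 10% DELETE share.
mixed_ratios = [(80, 20), (60, 40), (50, 50), (40, 60), (20, 80)]
mixed_tests = [f"MIXED {put}/{get} +10% DELETE @ 10MB" for put, get in mixed_ratios]

for test in pure_tests + mixed_tests:
    print(test)
```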

Stepping of the workload

In order to find the performance "sweet spot" for each set of tests, the workload was stepped up from what could be considered a light workload, e.g., 28 threads (one thread per load-generation server), up to a very heavy workload of 3884 threads for GET testing and 1344 threads for PUT testing.

During OEM validation testing, each "step" (nine in total) is run for 12 minutes; during the final test phase, each step is run for 17 minutes. The overall times include a minute for stabilization, or what used to be called "spin-up," and a minute for cool-down, or what used to be called "spin-down" (obviously antiquated terms for SSDs).

The resulting reported data is based on a 10- or 15-minute measured step, respectively. An example of this thread stepping for a test run (for PUTs) would be incrementing steps of 28, 56, 112, 224, 448, 672, 896, 1120 and 1344 threads (also sometimes referred to by us as users).
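The nine-step PUT ladder described above can be sketched as follows (step counts and per-step times copied from the text; the GET ladder would end at its own, higher maximum):

```python
# Nine PUT steps, one thread ("user") per load-generation server at the lightest step.
PUT_STEPS = [28, 56, 112, 224, 448, 672, 896, 1120, 1344]

# Per-step runtimes include one minute of stabilization and one minute of cool-down.
STEP_MINUTES = {"oem_validation": 12, "final": 17}

for phase, minutes in STEP_MINUTES.items():
    measured = minutes - 2  # reported data comes from the 10- or 15-minute measured window
    print(f"{phase}: {len(PUT_STEPS)} steps x {minutes} min "
          f"({measured} min measured each), threads = {PUT_STEPS}")
```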

Determining the response time knee of the performance curve

Once a full test pass for a particular object size and workload is performed, the results (objects/second and average response time) are captured and then graphed to visually show the "knee of the response time curve." This is the point where additional thread increments start to increase the response time disproportionately.

The knee RT is generally ~2X the response time of the lightest workload. While it may be possible to drive more objects per second by adding additional load, it's being done at the expense of response time, which could greatly impact the application's performance. 
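In practice we graph the data and eyeball the knee, but the rule of thumb can be expressed as a rough heuristic. The sketch below uses hypothetical numbers and flags the first step whose average response time exceeds roughly twice that of the lightest step:

```python
def find_knee(threads: list[int], avg_rt_ms: list[float], factor: float = 2.0):
    """Return (threads, rt) at the first step where response time exceeds
    `factor` times the lightest-load response time, i.e., the approximate knee."""
    baseline = avg_rt_ms[0]
    for t, rt in zip(threads, avg_rt_ms):
        if rt >= factor * baseline:
            return t, rt
    return threads[-1], avg_rt_ms[-1]  # no knee reached within the tested ladder

# Hypothetical PUT results: RT roughly doubles around the 672-thread step.
steps = [28, 56, 112, 224, 448, 672, 896, 1120, 1344]
rt_ms = [18, 19, 21, 24, 30, 38, 55, 80, 120]
print(find_knee(steps, rt_ms))  # -> (672, 38)
```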

Below you can see an example of the knee of the response time curve and the correlated objects per second transferred. In the example below, the knee is circled in red on the response time axis. You might notice that there are actually two separate test runs graphed (two RT plots [orange and yellow] and two TPS plots [blue and gray]).

When reporting results, we always run at least two passes to show consistency between tests and to ensure we are meeting our test deviation goal of under 5 percent. In some cases, when arrays are greatly stressed (especially by workloads that cause high CPU utilization), you will see deviation increase at the higher workloads and the two test plots diverge. If the plots diverge even under low or moderate workloads, we work with the OEM to understand where the issues may be and report our findings in the OEM final report.

Fig 3. Knee of the response time example for one of the OEMs (knee is circled in red).
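The under-5-percent deviation goal between passes mentioned above can be checked with simple arithmetic. The sketch below uses hypothetical per-step objects-per-second results for two passes of the same test:

```python
def pass_deviation(pass1: list[float], pass2: list[float]) -> list[float]:
    """Percent deviation between two test passes at each workload step."""
    return [abs(a - b) / ((a + b) / 2) * 100 for a, b in zip(pass1, pass2)]

# Hypothetical objects/second per step for two passes of the same test.
p1 = [5200, 9800, 17500, 28000, 33500]
p2 = [5150, 9900, 17200, 26500, 30100]
for step, dev in enumerate(pass_deviation(p1, p2), start=1):
    flag = "OK" if dev < 5.0 else "DIVERGING"
    print(f"step {step}: {dev:.1f}% {flag}")
# As in the narrative, the deviation grows at the heavier steps.
```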

Capturing test runs

When we run a set of tests, we capture all of the COSbench data and include it in the OEM's report created during the testing process. This data is useful for determining how stably the array performs and the amount of "jitter" introduced over the course of the tests.

Figure 4 shows an example of two different steps of the same workload: one at 56 threads and one at 448 threads (this example comes from a prior benchmarking effort, not from the high-performance object testing).

You can see that the graphing (sampled every 10 seconds) is quite consistent during the low-load testing at 56 threads (left-hand graph), but the right-hand plot is very inconsistent as the workload was stepped up to a heavy 448 threads. This is indicative of the array struggling to keep up.

Fig 4. Test step consistency. 

The COSbench summary data for each test is also captured and is used to create the knee-of-the-response-time graphs. Figure 5 is an example of what that summary data looks like directly from COSbench as included in the OEM reports.

Fig 5. Native COSbench output summary report; each line is one of the nine test steps.
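To turn the native summary into the knee graphs, we pull out the throughput and average response-time values for each step. A hedged sketch is below; the column names ("Throughput", "Avg-ResTime") are assumptions that can vary by COSBench version, so verify them against your own output files.

```python
import csv

def load_steps(summary_csv: str,
               tput_col: str = "Throughput",
               rt_col: str = "Avg-ResTime") -> list[tuple[float, float]]:
    """Read a COSbench-style summary CSV where each row is one test step.

    Column names are assumptions; adjust them to match your COSBench version.
    Returns (throughput, avg response time) pairs, ready to plot the RT knee.
    """
    with open(summary_csv, newline="") as fh:
        return [(float(row[tput_col]), float(row[rt_col]))
                for row in csv.DictReader(fh)]

# Usage (hypothetical file name from a COSbench archive directory):
# steps = load_steps("w128-put-100KB.csv")
```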

Validating metrics with array analytics

As part of the test verification, the COSbench results (which can be considered server-side) are compared to what the array metrics indicate. This helps validate not only that the results are accurate, but also that external issues (e.g., the network) are not impacting the data.

Fig 6 shows an example of correlating a PUT test run using the array-side metrics. You can make out the nine levels as the test steps up the load, and you can see the consistency (well-defined plateaus) start to deteriorate at the higher loads.

Fig 6. Example of using array side metrics to validate COSbench data.
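The cross-check itself is simple arithmetic. The sketch below (hypothetical numbers) compares the aggregate bandwidth COSbench reports from the load-generation servers against what the array-side analytics report for the same window, flagging gaps that might point at the network or another external factor:

```python
def bandwidth_gap_pct(cosbench_gbps: float, array_gbps: float) -> float:
    """Relative gap between server-side (COSbench) and array-side bandwidth."""
    return abs(cosbench_gbps - array_gbps) / array_gbps * 100

# Hypothetical readings for one PUT step.
gap = bandwidth_gap_pct(cosbench_gbps=41.8, array_gbps=43.5)
print(f"gap: {gap:.1f}%")  # a large gap would prompt a look at the network path
```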

OEM validation

As mentioned above, the first set of test runs is called "OEM Validation" testing; multiple passes are run and then graphed, with the results shared with the OEM's technical team.

Our engineers go over the data with the OEM's technical resources to determine whether the results are expected and to discuss any anomalies seen in the testing. If there are "expected" or "known" anomalies, the information is called out in the final OEM report that is shared with both the OEM and the customer.

If an unexpected anomaly is found, WWT works with the OEM to identify the issue. If possible, we correct it, even to the extent of testing ad-hoc patches. However, only a limited amount of time can be dedicated to OEM-side corrective action(s) before testing has to continue and the results are reported "as is."

Final testing

Final testing occurs after OEM validation and produces the extended test results (15-minute measured steps vs. 10-minute steps, or approximately three hours per test) that are presented in the final report.

These test results are also shared with the OEM prior to including them in the final summary. Completing a single OEM's full test suite includes at least two passes per test and takes approximately 4-5 days of round-the-clock testing. In practice, provided no major issues are encountered, the whole testing process takes 2-3 weeks per OEM, including creating the OEM final report documentation.

Individual OEM reports

After completing the testing for one OEM and before moving on to the next, all the test data, performance information and lessons learned are put together in a separate, detailed, stand-alone report, which is again shared with the OEM for verification.

While the OEM does have input into how we present the data, as well as how WWT presents any issues encountered, WWT reserves the right to word and present the data in what we believe is a fair and accurate report. This is usually not a point of contention between the OEMs and the WWT technical team, in that the OEM's technical team is aware of our detailed findings and has had a chance to either fix any issues or explain the reasons for an array's behavior.

Each OEM tested has this stand-alone report, which goes into much more technical depth than the final OEM comparison report (the latter reads more like an executive summary). In some cases, depending on the scope of the testing and time allocated, the OEM final report will describe the impact of invoking various storage services, such as encryption, compression and different protection schemes (EC vs. replicas), and/or how the product scaled when additional nodes were added.

Figure 7 shows an example excerpt of the table of contents for an individual OEM report where these additional tests were performed.

Fig. 7: Example of the Table of Contents for an individual OEM report, representing approximately 160 pages.

OEM final summary and comparisons report

Once testing across all OEMs has been completed, a final summary and comparison report is generated that compares the normalized performance across all OEMs in graphed output. The report also outlines lessons learned for each OEM, including any highlights and observations from testing.

This summary is intentionally kept concise and is not shared with the participant OEMs. However, a final meeting with each of the OEMs is conducted, where a sanitized output of how they placed in each test suite compared to their peers is presented. In other words, they do not know specifically which of the other OEMs placed better or worse than their own performance.  

Fig. 9: The final report for the high-performance object arrays tested as part of this POC.

Determining a clear-cut winner

It's usually not as simple as being able to say that OEM "X" is the winner because there are a lot of other factors at play, such as desired functionality and price. For example, we can come up with four separate use cases or scenarios where each of the four participants would have been a better choice than the other three. There was, however, one OEM that did rise to the top as the general use case performance winner. 

Want to know who it was? Find that out and more by reaching out to WWT. Let's collaborate to customize your next proof of concept testing and gather reliable performance data you can evaluate before you make that next big storage investment.  
