High Performance File Systems for AI/ML
With the emergence of AI/ML solutions in the enterprise, many of our customers are finding that traditional storage systems and filesystems may not meet the requirements of these new AI/ML workloads.
Initially, NVIDIA partnered with traditional storage vendors to create reference architectures for AI/ML. These reference architectures were relatively small, consisting of two to four DGX-1 systems and one or two storage systems. They made perfect sense at the time, enabling most enterprises to easily start their AI/ML journey and dip their toes into the pool without too many barriers.
With the introduction of the more powerful DGX-2 and, subsequently, DGX A100 systems, customers wanting to scale their workloads found that traditional filesystems and storage systems did not scale or perform well for some AI/ML workloads.
The HPC world learned this lesson many years ago and did not use traditional filesystems or storage systems. The primary filesystems in large HPC environments are parallel filesystems such as GPFS, Lustre and others. Unfortunately, utilizing these HPC technologies was very difficult and complex: there was a skill and knowledge gap, with a steep learning curve to close it.
This is where vendors such as DDN, IBM, and HPE (Cray acquisition) can help. They created an "appliance" or storage system that hid a lot of the complexity of GPFS and Lustre.
DDN has been in the HPC marketplace for many years and does a great job of creating appliances that hide the complexity of the Lustre filesystem. DDN created an AI/ML-specific platform called A3I. This appliance can be as small as a few TB in 2 RU and scale to tens of PB. DDN was one of the first NVIDIA DGX SuperPOD reference architectures and also one of the first to support NVIDIA's GPUDirect Storage feature. We had a chance to work with the DDN A3I platform in our ATC.
IBM and HPE (Cray) have been market leaders in HPC. IBM offers GPFS in its Spectrum Scale products. IBM Spectrum Scale is now part of the NVIDIA DGX SuperPOD reference architecture and supports the GPUDirect Storage feature.
HPE offers both GPFS and Lustre solutions. HPE offers its own AI/ML reference architecture outside of the NVIDIA DGX ecosystem.
These HPC storage appliances do a good job of hiding a lot of the complexity, but they do not eliminate it completely. The data fabric can be complex to install and configure (see below), and there is also the complexity of configuring the clients. For example, Lustre exposes many client-side configuration options that one has to be aware of.
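To give a flavor of that client-side complexity, here is a minimal sketch of mounting a Lustre filesystem and touching a few of its client tunables. The hostnames, filesystem name and values are hypothetical examples, not recommendations; the actual tuning depends heavily on the workload and the vendor's guidance.

```shell
# Mount a Lustre filesystem over an InfiniBand LNet (o2ib);
# "mgs01" and "aifs" are hypothetical MGS host and filesystem names.
mount -t lustre mgs01@o2ib:/aifs /mnt/aifs

# A few of the many client-side tunables (values are illustrative):
lctl set_param osc.*.max_rpcs_in_flight=16     # RPC concurrency per OST
lctl set_param osc.*.max_dirty_mb=512          # dirty write cache per OST
lctl set_param llite.*.max_read_ahead_mb=1024  # client read-ahead window
```

Each of these knobs (and dozens more) can materially change throughput for a given IO pattern, which is exactly the kind of expertise gap the appliances only partially close.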
AI/ML solutions have become very popular; it seems like every enterprise has strategic initiatives to leverage them to achieve business objectives. As usual, new technologies create new opportunities, and those opportunities gave birth to new players such as WEKA, VAST, PANASAS, Pavilion, NGD and Fungible.
We had a chance to test and work with WEKA, VAST and PANASAS in our ATC.
WEKA and PANASAS position themselves as modern platforms for AI/ML. Both utilize their own parallel filesystems, created to serve AI/ML workloads better than traditional parallel filesystems such as GPFS or Lustre. WEKA states that traditional parallel filesystems were designed for the large-IO, high-throughput patterns of traditional HPC workloads and are more than a couple of decades old. Traditional parallel filesystems also rely on a central metadata server, which can become a bottleneck. WEKA was designed specifically for AI/ML workloads, which can be a mix of small and large IOs, and is built to handle both well. WEKA also distributes its metadata, which eliminates the metadata server bottleneck.
WEKA is an SDS (software-defined storage) solution, so one can use any of the qualified server platforms. HPE offers one of these qualified reference architectures using DL servers, and the WEKA system in our ATC used HPE DL325 servers, each with 19 NVMe drives in a 1 RU form factor. It is a dense and very performant solution, but not cheap. WEKA does have the capability to add object storage as a second tier to reduce the cost per TB. SDS offers the flexibility to select the server platform, but that flexibility requires one to implement and integrate the system. Even though WEKA's documentation was very useful, we found this to be pretty complex and would strongly recommend using WEKA implementation services.
Hitachi also recently started to offer WEKA as part of its HCSF (Hitachi Content Software for File) system. This product bundles the servers, networking and object storage as a solution that comes preinstalled and configured in a rack, which makes implementation much easier and faster.
VAST uses NFS as its main access protocol but has enhanced it with new technologies. VAST utilizes NVMe-oF with Intel Optane persistent memory, QLC SSDs and fast 100/200 Gb fabrics. Along with these hardware enhancements, VAST extended the NFS protocol with its multipath NFS over RDMA/TCP. According to VAST, this enhancement allowed them to achieve up to 80x the performance (160 GB/s) of regular NFS.
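For context, a stock Linux NFS-over-RDMA mount is already a one-liner; VAST's patched client layers its multipath behavior on top of that. The server address and export path below are hypothetical, and the multipath options belong to VAST's client rather than stock NFS, so treat this as a sketch.

```shell
# Stock Linux NFSv3 mount over RDMA (standard NFS/RDMA port 20049);
# server IP and export path are hypothetical.
mount -t nfs -o vers=3,proto=rdma,port=20049 10.0.0.10:/vast /mnt/vast

# VAST's patched NFS client adds options beyond stock NFS to spread
# traffic across many server IPs (its multipath feature); consult
# VAST's documentation for the exact mount options and supported kernels.
```

The point is that the base protocol is familiar NFS; the performance claims come from the RDMA transport and the multipath client, not from a new wire protocol.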
The VAST system we have in the ATC is a 3x3 system: three server chassis, three drive chassis and two Mellanox SN2700 switches. With this system connected to a single DGX A100, we were able to achieve over 100 GB/s.
WEKA, VAST and PANASAS all offer regular NFS client access, but to get optimal performance, they recommend their proprietary clients. The installation, configuration and optimization of these clients can be complex. WEKA and VAST were among the first NVIDIA DGX SuperPOD reference architectures and among the first to support NVIDIA GPUDirect Storage; the GPUDirect Storage feature requires that the proprietary client be used.
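A quick way to see whether a given client stack is actually GPUDirect Storage-ready is the `gdscheck` utility that ships with the CUDA toolkit. The install path varies by CUDA version, so the path below is an example.

```shell
# gdscheck reports, per filesystem/driver (Lustre, WekaFS, NFS, etc.),
# whether GPUDirect Storage is supported and properly configured on
# this host. Path shown is typical but varies by CUDA version.
/usr/local/cuda/gds/tools/gdscheck -p
```

Running this on the DGX after installing the vendor's proprietary client is a useful sanity check before assuming IO is actually bypassing the CPU bounce buffer.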
Along with new filesystems, new AI/ML solutions require faster networking. Many of the NVIDIA DGX SuperPOD reference architectures use InfiniBand and/or 100/200 Gb Ethernet. In traditional HPC environments, InfiniBand has been the de facto standard interconnect for many years. Unfortunately, most enterprises are not familiar with InfiniBand. They would feel more comfortable with Ethernet, but 100/200 Gb Ethernet is different from the regular Ethernet in the enterprise, especially when running RoCE. Configuring the new data fabric is another area of complexity and skills gaps.
We noticed that most NVIDIA SuperPOD reference architectures used InfiniBand. In our lab, we used InfiniBand with the WEKA cluster. Our WEKA cluster was small, so we only needed a single InfiniBand switch, and its configuration was not too difficult. WEKA had pretty good documentation, so even with no prior InfiniBand experience, we were able to get it to work. I think we may have run into difficulties with a larger cluster that required multiple InfiniBand switches.
During our testing period, NVIDIA added RoCE as a supported configuration for GPUDirect Storage, but we were not able to find any reference implementation of it. We worked with VAST to implement GPUDirect Storage over RoCE. We eventually got it to work, but it took many hours to get it working correctly. The biggest hurdle was implementing the data fabric; the complexity came from having multiple switches in our configuration: the two SN2700 switches that came with the VAST system and one SN3700 that we had with our DGX A100 systems. The connections and correct fabric configuration were much more complex than anticipated.
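To illustrate where the hours went, the host side alone needs a handful of Mellanox OFED commands per NIC, and the same lossless-Ethernet settings then have to match on every switch in the fabric. The device and interface names below are examples, not our exact configuration.

```shell
# Host-side RoCE preparation on a ConnectX NIC (Mellanox OFED tools;
# device and interface names are illustrative).
cma_roce_mode -d mlx5_0 -p 1 -m 2           # force RoCE v2 for RDMA-CM
mlnx_qos -i enp225s0f0 --trust dscp         # classify traffic by DSCP
mlnx_qos -i enp225s0f0 --pfc 0,0,0,1,0,0,0,0  # lossless priority 3 via PFC

# The matching PFC/ECN configuration must then be applied consistently
# on every switch port in the path -- with three switches from two
# different bundles, this is where most of our time went.
```

None of these steps is individually hard; the difficulty is that a single mismatched priority or ECN setting anywhere in the fabric silently degrades RDMA performance.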
We hope this article provided some insights into what to pay attention to when evaluating AI/ML solutions and the infrastructure needed to support them. The infrastructure component is only one part of the overall AI/ML solution. There are a myriad of choices, and customers often utilize public cloud solutions. We offer a great introductory briefing in the area of AI/ML infrastructure.
Want to go deeper? We also have a great group of consultants that can get you started. Please engage your WWT account manager to start your journey.