From the manufacturing floor to research laboratories, modern digital transformation is being driven by advancements in artificial intelligence (AI), machine learning (ML) and deep learning (DL). It’s an exciting time for organizations looking to turn data into actionable insights that accelerate desired outcomes.
To maximize AI/ML/DL technology, organizations must first ensure they have a solid foundation in the people, processes and technology of large-scale data analytics and high-performance computing (HPC).
This means investing in the right infrastructure to capture, manage and analyze valuable data at scale, and often in real time. Successful organizations will gain a competitive advantage through an ability to efficiently identify patterns and trends that can improve decision making and unlock new opportunities.
But the journey to success is not easy.
Historically, the monolithic HPC systems available to organizations have been unable to keep pace with the growing demand for AI use cases involving deep learning — the training of increasingly large neural networks with data.
Until recently, due to the limitations of last-generation HPC systems, even companies well-versed in data science and HPC were unable to take advantage of deep learning's many benefits. They would encounter scaling issues when they tackled challenges that were too data-, compute- or graphics-intensive. Such issues manifested as slow data processing or limited compute power, both of which prolonged the time it took to solve business challenges.
This article explores how a combination of next-gen HPC hardware and integration expertise can help organizations overcome scaling issues to solve complicated problems — all without succumbing to the urge to throw additional hardware at what is, in essence, a computing power problem.
- On the infrastructure side: HPE’s powerful Apollo 6500 Gen10 System delivers the computing capacity needed to help organizations of any size realize the full benefits of AI/ML/DL.
- On the business side: WWT’s HPC experts can design, test and integrate HPC solutions into your environment through our Advanced Technology Center (ATC). We are well-versed in evaluating and optimizing data flows and workflows across the full spectrum of configurations and issues that arise in today's ever-growing data centers.
Read on to learn more about HPC and how WWT can help you harness the scalable performance of AI/ML/DL to solve business problems faster.
What is HPC?
High-performance computing (HPC) is a rapidly growing field that generally refers to the use of next-gen hardware and software to address extreme-scale computational, data and visualization problems.
At its core, HPC is the ability to process data and perform complex calculations at high speeds. For perspective, a laptop with a 3GHz processor can perform around three billion calculations per second. While that’s much faster than any human could achieve, it pales in comparison to HPC solutions, which can perform quadrillions (or more) of calculations per second.
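To make that gap concrete, here's a quick back-of-the-envelope comparison in Python. The figures are the rough, illustrative numbers above, not benchmarks:

```python
# Illustrative comparison (rough assumptions, not measured benchmarks):
# a ~3 GHz laptop vs. a petascale HPC system.
laptop_ops_per_sec = 3e9   # ~3 billion calculations per second
hpc_ops_per_sec = 1e15     # 1 petaFLOP/s = one quadrillion per second

speedup = hpc_ops_per_sec / laptop_ops_per_sec
print(f"A 1 PFLOP/s system is roughly {speedup:,.0f}x faster than the laptop")
# → A 1 PFLOP/s system is roughly 333,333x faster than the laptop
```

And petascale is merely the entry point; the leading systems on the TOP500 list run hundreds of times faster still.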
HPC requires a highly architected ecosystem of tightly coupled, performance-optimized servers, accelerators, memory, interconnects and storage.
The ecosystem of HPC use cases spans industrial and commercial users, hardware and software vendors, academia, national laboratories, HPC centers and other R&D institutions.
Supercomputers are possibly the best-known example of HPC. They go far beyond the typical scale of enterprise systems and of clusters composed of many smaller computers and processors. Supercomputers and HPC superclusters aggregate a massive amount of resources, at scale, into one agile system capable of breaking through the computational and timeframe barriers of modeling, simulating or analyzing problems that are numerically, data and graphically intensive.
The world's fastest reported HPC/supercomputing systems are registered on the TOP500 list. Supercomputer Fugaku, a system based on Fujitsu’s custom Arm A64FX processor, ranks first on the list. Installed at the RIKEN Center for Computational Science (R-CCS) in Kobe, Japan, it was co-developed in close partnership by RIKEN and Fujitsu and uses Fujitsu’s Tofu D interconnect to transfer data between nodes. Fugaku’s HPL benchmark score of 442 Pflop/s is roughly three times that of the most performant HPC system in the U.S., named Summit.
HPC/supercomputing architectures typically follow a scale-out model, where additional hardware can be added and configured as the need for more compute power arises.
Capability vs. Capacity
In the traditional HPC world, there are generally two recognized types of scale-out computing: capability computing and capacity computing. Both use predictive data analytics to project future activity, actions and market trends, and to expose unknown problems in the data.
- Capability Computing typically marshals all available computing power across the entire system to solve a single problem as quickly as possible. A good example is the U.S. Advanced Simulation and Computing (ASC) Program, which “supports the Department of Energy's National Nuclear Security Administration (NNSA) Defense Programs by developing simulation capabilities and deploying computing platforms to analyze and predict the performance, safety and reliability of nuclear weapons and to certify their functionality in the absence of nuclear testing.”
- Capacity computing typically designates subsegments of the supercomputing system’s horsepower and resources to efficiently tackle several large problems, or many smaller problems, concurrently. For example, capacity computing is used to forecast multiple weather models for oncoming weather events, such as hurricanes.
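The distinction between the two modes can be sketched in a few lines of Python. This is a toy illustration with made-up node counts and job names, not a real scheduler API:

```python
# Toy sketch of the two scale-out computing modes (hypothetical numbers).
TOTAL_NODES = 1000  # assumed size of the system

def capability_schedule(job):
    """Capability computing: dedicate every node to one problem."""
    return {job: TOTAL_NODES}

def capacity_schedule(jobs):
    """Capacity computing: partition the system across concurrent problems."""
    share = TOTAL_NODES // len(jobs)
    return {job: share for job in jobs}

print(capability_schedule("stockpile-simulation"))
# → {'stockpile-simulation': 1000}
print(capacity_schedule(["hurricane-A", "hurricane-B", "hurricane-C", "hurricane-D"]))
# → {'hurricane-A': 250, 'hurricane-B': 250, 'hurricane-C': 250, 'hurricane-D': 250}
```

Real workload managers make far more nuanced placement decisions, but the trade-off is the same: one problem at maximum speed versus many problems at maximum throughput.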
Without an expertly architected ecosystem of HPC servers, accelerators, memory, interconnects and storage, either computing type would take months or even years to solve the problems it's tasked with due to slow data processing.
The Apollo 6500 Gen10 System
Enter HPE’s Apollo 6500 Gen10 System, available now to customers and partners in WWT's ATC. Soon, the ATC will also have an Apollo 6500 Gen10 Plus System — an even greater powerhouse featuring NVIDIA's new A100 Tensor Core GPUs.
Billed as an enterprise platform for accelerated computing, the Apollo 6500 was built to address the most crucial step of training deep learning computer models capable of learning, reasoning and determining the best course of action in real time.
In layman's terms, the Apollo 6500 can help organizations supercharge their HPC capabilities to accelerate outcomes such as shortening product time-to-market and condensing research timelines. Other use cases suitable for the Apollo 6500’s deep learning capabilities include:
- Identifying vehicles, pedestrians and landmarks for autonomous vehicles
- Monitoring oil field drilling rigs to prevent disasters
- Recognizing images
- Recognizing speech and translating
- Processing natural languages
- Designing drugs
- Performing bioinformatics analyses
Accelerated performance with flexibility
How does the Apollo 6500 Gen10 Plus System deliver its promised scalable performance, flexibility and resilience? The backbone of the 6500 is HPE's ProLiant Gen10 server, which leverages NVIDIA’s A100 GPUs to tackle the most complex HPC and AI workloads. These GPUs are responsible for delivering the superior performance required by massive data- and pixel-intensive workflows, generating up to 312 TFLOPS of TF32 compute power at peak performance.
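Assuming a fully populated eight-GPU chassis (an assumption for illustration; actual configurations vary), the aggregate peak arithmetic looks like this:

```python
# Rough peak-throughput arithmetic (illustrative; 312 TFLOPS is NVIDIA's
# published peak TF32 figure for the A100 with structured sparsity).
tf32_per_gpu_tflops = 312
gpus_per_chassis = 8   # assumption: fully populated chassis

peak_tflops = tf32_per_gpu_tflops * gpus_per_chassis
print(f"Aggregate peak TF32 throughput: {peak_tflops} TFLOPS "
      f"({peak_tflops / 1000} PFLOPS)")
# → Aggregate peak TF32 throughput: 2496 TFLOPS (2.496 PFLOPS)
```

Peak figures are theoretical ceilings; sustained throughput depends on the workload, interconnect topology and data pipeline feeding the GPUs.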
On top of the GPUs, the Apollo 6500 platform features PCIe and NVLink GPU interconnects that give organizations flexibility in tackling a wide variety of workflow requirements. Organizations can use the interconnects to toggle between the two types of scale-out computing — capability and capacity computing — as their workflows demand.
Not only do HPE’s Gen10 systems deliver incredible performance, HPE also bills them as the world's most secure industry-standard servers. That’s because HPE’s Silicon Root of Trust anchors firmware protection, malware detection and firmware recovery in the silicon itself, ensuring that Gen10 servers will not boot with compromised firmware. And to make the customer experience as seamless as possible, HPE also offers iLO server management software and GreenLake cloud services.
For an overview of the platform’s specifications, see the following table:
WWT’s HPC Expertise
The power of the Apollo 6500 System is impressive. And perhaps a bit daunting given its range of AI/ML/DL capabilities. Organizations wanting to get the most out of this intriguing HPC advancement will need expert guidance to properly integrate it into their environment.
Luckily, WWT has a growing team of 35 HPC Advocates whose job is to help organizations assess and optimize data flows within their data centers. This team leverages prior industry experience at HPC vendors like DDN, Cray, IBM and Lenovo to integrate HPC solutions for our customers. Our active projects touch many industries, including advanced manufacturing, healthcare, and federal, state and local government (e.g., DoD, IC, law enforcement, NIH, NOAA, NASA).
Our HPC experts perform the critical integration work in our Advanced Technology Center (ATC), a truly unique place. The ATC is a B2B Platform and collaborative innovation ecosystem built on a collection of physical and virtual labs. It fast-tracks the ability to design, build, educate and deploy innovative technology products, integrated architectural solutions and digital business outcomes. Partners and customers can leverage more than a half-billion dollars of equipment to solve the many types of interdisciplinary problems they see in their own environments.
Because HPC can be a difficult concept to wrap your head around, let’s take a quick look at how WWT's HPC experts used the Apollo 6500 to accelerate workflows in the real world.
Use Case: Folding@home
Based at Washington University in St. Louis (WashU), Folding@home is a distributed computing project operating since its founding at Stanford University in 2000.
By downloading Folding@home’s free protein dynamics software, citizen scientists worldwide sign up to donate their spare CPU/GPU resources to a good cause. The software leverages the spare processing power of personal devices, like PCs and gaming systems, to perform the data-intensive simulations needed to unlock how proteins fold, misfold and interact with other cellular proteins that cause disease. Each new software download represents another computing node in Folding@home’s growing network of distributed, raw supercomputing power.
Folding@home issued a call for help in late February 2020, announcing it was focusing its crowdsourced research on SARS-CoV-2 and COVID-19. Their goal, shared by many in the open science community, was to better understand the viral infection's protein dynamics to accelerate the research and development of potential treatments and therapies.
Since the announcement, Folding@home has reported that their number of volunteer researchers has skyrocketed from 30,000 to nearly a quarter-million, and counting.
World’s most powerful supercomputer
Data-intensive research like protein dynamics simulation requires computational horsepower that typically necessitates one or more massive supercomputers. It takes an incredible amount of processing capacity to simulate how proteins self-assemble, or fold, in fractions of a second. Multiply that process by billions or trillions of possible folding combinations and you understand why supercomputers are the tool of choice for these researchers.
Folding@home leverages its decentralized network of everyday machines to impressive advantage. Through brute force trial and error, each device running its software simulates a protein fold at the molecular level until it returns a result.
What makes Folding@home’s distributed network unique is the sheer amount of computing power it has been able to harness: a world-record 2.4 exaFLOPS. For reference, an exaFLOP is one quintillion (10^18) floating-point operations per second.
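To put that number in perspective, here is a rough comparison against the 3 GHz laptop mentioned earlier. This is illustrative arithmetic only, not a claim about real device contributions:

```python
# Putting 2.4 exaFLOPS in perspective (illustrative arithmetic).
network_ops_per_sec = 2.4e18   # 2.4 exaFLOPS
laptop_ops_per_sec = 3e9       # ~3 GHz laptop, ~3 billion ops per second

equivalent_laptops = network_ops_per_sec / laptop_ops_per_sec
print(f"Roughly equivalent to {equivalent_laptops:,.0f} laptops at full tilt")
# → Roughly equivalent to 800,000,000 laptops at full tilt
```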
How did Folding@home achieve this amount of raw compute power? In short, by tapping the collective goodwill of hundreds of thousands of citizen scientists and organizations around the world, including WWT. We are proud that our ATC — specifically, our HPE Apollo 6500 Gen10 System — has made a meaningful contribution to the cause.
Apollo 6500 in action
Our ATC engineers, architects and HPC experts understood that the latent compute power of the ATC's next-gen hardware could be a significant boost to Folding@home’s goal of generating the massive amount of distributed compute power needed to accelerate its research.
To maximize our contribution, WWT quickly unified our various department-based folding teams and consolidated all spare HPC resources in the ATC. These resources included the following servers, GPU, CPU cores and memory:
Once we installed Folding@home’s software, thus connecting the ATC’s latent compute power to the other nodes in the distributed network, our single Apollo 6500 server immediately began outpacing the next largest cluster of resources by a factor of two to three times.
Properly integrating the HPE Apollo 6500 for such an intense workflow was not easy. WWT’s integration and HPC expertise was key to ensuring the eight GPUs and 40 CPUs were completely and properly utilized by Folding@home’s COVID-19 protein simulations.
The sustainable performance and reliability generated from this use case should speak volumes to anyone who comes to the ATC in need of accelerating a business or research workflow.
HPE’s impressive Apollo 6500 Gen10 systems deliver the computing capacity needed for organizations of all sizes to realize the full benefits of AI/ML/DL. With guidance from WWT’s HPC experts, who offer the ability to test and validate HPC solutions in the ATC prior to integration, you are just one step away from demystifying AI/ML/DL and gaining market share.
For those interested in learning more about the HPC architecture available to customers and partners in our ATC ecosystem — including the Apollo 6500 — Earl J. Dodd is who you want to connect with.
How would your organization use the Apollo 6500 to optimize and accelerate data-intensive workflows? Drop us a note in the comments below.