Congratulations on surviving SC19. I hope you enjoyed learning, networking and collaborating with your peers at the world’s largest HPC and supercomputing conference. I know I enjoyed the events, technical sessions and fellowship immensely. I especially hope you found the conversations around the convergence of HPC and AI to be valuable.
The big winners and news at SC19
SC is the world's largest international conference and exhibition for high-performance computing, networking, storage and analysis. In addition to hosting technical meetings, the conference includes exhibitions by industry and research institutions.
Whether or not you had the chance to attend this year's conference, I'm confident these conversations will pique the interest of any organization. The theme of SC19 was: The future is now. HPC is now.
How does one summarize the impossible? You don’t. Still, here’s a rundown of the big-ticket items I found at SC19 (in no particular order):
- The TOP500 List required >1.14 PFLOP/s to qualify
- Cray is now an HPE company, and I hope for a smoother integration than that of SGI
- Cray-Fujitsu A64FX ARM-based processor partnership for the Exascale era
- Aurora A21 technology solution pre-announcement by Raja Koduri of Intel
- HPC and AI convergence is real (the new HPL-AI benchmark was officially launched)
- Processor diversity continues to be a major focal point (Intel, AMD, ARM, and accelerators)
- Intersect360 Research and Hyperion Research updates on the HPC + AI marketplace
- Liquid cooling innovations (pumps, pipes, sinks, exchangers, and partial to full immersion) for edge to exascale systems
- NVMe (Tier 0) memory solutions (and M&A activity to come)
- HPC services offerings and providers are blooming (we are a graying industry)
- RoCE (Ethernet) comes on strong – farewell OPA – and another proprietary interconnect
Let's recap just a few of the events I participated in as part of the conference, with more insight around my big ideas above.
HP-CAST 33 at SC19
I participated in the HP-CAST (High Performance Consortium for Advanced Scientific and Technical computing) 33 event this year, leading the charge for WWT. For those new to high-performance computing (HPC), let me give you a quick insight into the power of HPC.
High-performance computing helps you unlock your data… and then define it. HPC aggregates computing power to deliver higher performance than typical enterprise computing, creating the computational power and storage capacity to solve the world’s most complex problems. HPC now extends beyond the domains of science, engineering and research: coupled with AI and big data, it has become business computing’s go-to architecture and infrastructure for global competitiveness and agility.
As for HP-CAST, I soaked in the expert guidance from HPE on the essential development and support issues for HPC and AI systems. As stated by HPE, "the High Performance Consortium for Advanced Scientific and Technical (HP-CAST) computing works to increase the capabilities of innovative Hewlett Packard Enterprise (HPE) solutions for HPC and AI."
The HP-CAST meetings included corporate briefings and presentations by executives, technical staff and HPE partners (under NDA), and discussions around customer issues related to high-performance computing and AI/big data.
Dell EMC HPC Users Group at SC19
I also represented our HPC team in the Dell EMC HPC Users Group conference event this year.
Dell EMC is expanding the boundaries of what’s possible by advancing, democratizing and optimizing high performance computing (HPC), enabling more organizations to leverage its power to drive innovation and discovery with solutions for HPC and AI.
While at the conference, I met with Dell EMC users and company experts who shared their expertise, insights, observations, suggestions and experiences to improve current HPC solutions and to influence future technology capabilities. Working together with Dell EMC, WWT can help advance the HPC industry in the design, delivery and deployment of converged HPC, big data and AI.
Intel HPC Developers Conference at SC19
Before the start of SC19, I attended Intel’s HPC Developers Conference. Intel’s theme this year was “connecting with HPC + AI experts to accelerate your innovations with Intel.”
And the 2019 Developers Conference did just that.
The highlight of the event came from the keynote by Raja Koduri, SVP, Chief Architect, and General Manager of Architecture for Intel. Raja spoke frankly about the challenges facing accelerated computing systems at Intel and pre-announced the Aurora A21 system for the next era of HPC.
Aurora is a planned state-of-the-art exascale supercomputer designed by Intel and Cray for the US Department of Energy's Argonne Leadership Computing Facility (ALCF). The system is expected to become the first supercomputer in the United States to break the exaFLOPS barrier. Raja described a fountain of architecture capabilities and products coming from Intel to meet the demands placed on Aurora A21.
Intel Fireside Chat with WWT & Dell EMC
It was a pleasure to present at one of Intel’s Fireside Chats alongside Adnan Khaleel of Dell EMC, moderated by Jeff Watters from Intel.
With a theme of “FPGA Accelerators Plus Persistent Memory Balances the HPC Powerhouse,” we brought together our industry-leading experts on the convergence of HPC with AI and big data. A brief synopsis of some of the main ideas from our chat is outlined below.
Power & performance at scale
Jeff (Intel): We see the demands on data centers continuing to grow in one of two ways. One, compute demand is growing faster than CPU performance, driving up floor space and power needs. Two, accelerators are driving higher power density than many data centers can handle, leading to empty space in racks. What do you recommend to customers looking to prepare their facilities for the coming decade?
Earl (WWT): Power, water, and weight. The data center of the future (and most retrofits today) includes high-voltage direct current (HVDC) solutions and additional water for cooling (components, nodes and racks). The increase in accelerators and attached processing options, coupled with the increased cooling needed for dense, hot components, requires additional scrutiny of floor-loading capacity. Power efficiency can be coupled with HVDC solutions, which are becoming a de facto standard for HPC and HPDA.
Adnan (Dell EMC): Power and water are two of the primary considerations we ask our customers to investigate, especially as they start planning the next generation of data centers. We [Dell EMC] have a couple of dense compute platforms for HPC that offer water cooling today, but not many of our customers have the DC infrastructure to take advantage of that, and we see many instances where we’ve got semi-populated racks because of per-rack power limitations. CPUs and GPUs are only going to get more power-hungry, and very soon we’ll be seeing processors close to the 300W TDP. At those power levels, water cooling will pretty much be a must.
Focus on the workflow, not the workloads
Jeff: We talk a lot about the complexity of workflows, including multiple algorithms and the concerns of driving heterogeneous data centers to run one workflow. How do you reimagine an entire HPC workflow to revolutionize the way a business works? What does it mean to focus beyond just the HPC workload?
Earl: Workflow optimization is the next evolution in the convergence of HPC and HPDA (AI/ML/DL and big data) solutions. The key with a shared infrastructure is to enable efficient use of computing resources based on workflow (not workloads), demand and policies that securely and seamlessly extend the converged infrastructure to network edge locations.
Adnan: Ultimately, we see it as part of the overall convergence of various workloads. We’ve been talking about this for several years now, but it has finally reached mass adoption in the last few years. I often speak to customers who want HCI and want to do everything on it, mostly because they don’t have to stand up separate clusters dedicated to each workload. And when you share a cluster designed with flexibility in mind, you also minimize data movement, which can save a lot of time across the entire workflow.
Convergence of HPC & HPDA (AI/ML/DL)
Jeff: You bring up a good point regarding convergence. At Intel, we’ve been talking about the convergence of HPC, ML/DL, and data analytics for quite some time. Is this message realistic or just marketing fluff? Do you each see this convergence happening at data centers, and if so, how do you see HPC infrastructure supporting and globally optimizing for the growing demands of artificial intelligence (AI) and big data analytics processing needs?
Earl: To be 100% realistic, it's happening today. The convergence is real. As large-scale machine learning and streaming start to play a more significant role in an enterprise or agency, these big data systems need more computational capabilities like HPC. Tools and frameworks are making streaming and machine learning algorithms more powerful, which also demands the horsepower of scalable HPC.
Adnan: It's real indeed, and that’s one of the reasons I alluded to the complex workflows we see in production today. AI is driving that convergence today, but a few years ago we could have argued that big data was doing the same. The more mature tools today do allow for better mixed workloads, like Intel's Magpie, for instance. And that’s been the primary hurdle to this convergence thus far: tool support. SLURM and YARN still don’t play nicely with each other, but it’s getting better.
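To make the scheduler point concrete, here is a rough sketch of what single-scheduler convergence can look like in practice: one SLURM batch script that runs a traditional MPI simulation step and then an AI training step on the same allocation. The executable names, paths and resource counts below are hypothetical placeholders, not a specific vendor recipe.

```shell
#!/bin/bash
#SBATCH --job-name=sim-then-train   # one workflow under one scheduler
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --time=02:00:00

# Step 1: classic HPC workload (MPI simulation).
# "./simulate" and its flags are placeholders for a real solver.
srun --ntasks=32 ./simulate --output /scratch/$SLURM_JOB_ID/results.h5

# Step 2: AI workload over the simulation output, on the same cluster,
# so the intermediate data never leaves the shared file system.
srun --ntasks=4 python train.py --data /scratch/$SLURM_JOB_ID/results.h5
```

Because both steps run under the same resource manager, there is no second cluster to stand up and no bulk copy between a YARN-managed data platform and the HPC file system, which is exactly the data-movement saving described above.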
Offloading more work
Jeff: I hear a lot about offloading work, but between AI accelerators, multiple GPU architectures and FPGAs, many people don’t know where to start. Each of these adds a lot of cost to a compute node for a handful of workloads that can take advantage of them. How do you see these technologies helping to improve insight and discovery?
Earl: Another back-to-the-future area for HPC is the migration to offload engines, where more computation, metadata management, data processing and network streams can be handled on embedded or attached devices. The more we keep the operating system (OS) out of the way of our processing, while moving computational and data management work onto offload engines, the higher hybrid systems can scale.
Adnan: My approach has generally been that you don’t need any accelerators, FPGAs, etc. if you're just getting started. I work with several customers who are looking at the best of breed, say in AI, and they get all caught up in the technology. I still think that the modern Xeons today offer very decent performance, and guess what, you already have them in your DC to start with. In some instances, when customers are much further along, accelerators do offer advantages, but it’s still an uphill battle, as the talent and skill to program and utilize them efficiently is not as mature as it is with processors. But again, the tool support is getting better. Ultimately there must be a business driver that dictates which approach you go with.
Large memory nodes using SMP and persistent memory
Jeff: Where do you see new persistent memory architectures fitting into an HPC data center?
Earl: We see NVM being used to improve uptime and productivity and to lower total cost of ownership (TCO). One can bring data and metadata back into the node to help reduce overall energy consumption and data movement.
Adnan: One of the primary use cases for large SMPs with persistent memory technology like Optane is accelerating databases, and specifically graph databases, which are memory and processor hogs. Graph problems are also notoriously tricky to partition, so the more of the graph that stays in the same node’s memory, the better; the latency of going across nodes can kill performance. We also see NVMe technologies applied to accelerating parallel file system performance (e.g., Cambridge, which won the IO500 in June, used them to improve BeeGFS performance, and we’ve got internal testing on improving Lustre with a flash tier). Data, and the need to shuttle large quantities of it very quickly, is going to force us to rethink how we think of memory and storage hierarchies. Besides, we’ve already seen a massive shift from tape archives to data lakes, so the need for storage performance is not going down any time soon.
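The appeal behind these persistent memory use cases is byte-addressable access to data that survives the process, with no database round trip or serialize/deserialize step. The following is an illustration only: it uses an ordinary memory-mapped file as a stand-in for a DAX-mounted Optane device, and the file name and layout are hypothetical.

```python
import mmap

path = "graph.bin"  # stand-in for a file on a DAX-mounted pmem device

# Create and size a small backing file.
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

# Map it into the address space and update an entry in place,
# which is what keeps hot graph data "in the same node's memory".
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)
    mm[0:4] = (42).to_bytes(4, "little")
    mm.flush()  # on real pmem this corresponds to a cache-line flush
    mm.close()

# Reopen and read back: the value persisted across the unmap.
with open(path, "rb") as f:
    value = int.from_bytes(f.read(4), "little")
print(value)  # 42
```

The same load/store-style access pattern is what makes persistent memory attractive for pointer-heavy structures such as graph adjacency data, where traversals are dominated by small random reads.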
HPC and WWT
In summary, high performance computing has taken a data-intensive turn in recent years; coupled with the convergence of AI and big data, this requires a fundamental rethink of architecture balance. One size does not fit all, and WWT is at the forefront of bringing balanced converged solutions to its customers by partnering with leading HPC vendors like Intel and Dell EMC.
Our work to create more balanced systems focuses on the entire workflow, or value stack, of AI-converged, data-intensive HPC architectures. The key is not to veer far from common sense; of course, implementation is something of a challenge and an art that we at WWT understand.
The learning has just begun.
AI and HPC are increasingly intertwined as leadership computing projects around the world prioritize the converged use of modeling, simulation, AI and big data. WWT has deepened its partnership with Dell EMC, Intel Corporation, HPE and others around the convergence of HPC, AI, big data, and industrial IoT solutions. Advanced infrastructure and use cases in the Advanced Technology Center (ATC) highlight the fundamental values and ROI opportunities for businesses integrating HPC and AI into their workflows.
Our partnerships with leading OEMs and their HPC portfolios (high-density servers, high performance storage and industry-leading software) are helping businesses meet the cost, performance and responsiveness demanded by big data analytics and high performance computing.