In this blog

SC23 was held in Denver from November 12 to 17 this year. Key themes included CDI, power and cooling innovation, and accelerated computing developments. Here are some of our key takeaways:

Composable disaggregated infrastructure

Composable disaggregated infrastructure (CDI) adoption is growing because it lowers the cost of high-performance architecture (HPA), primarily through more efficient node configurations, better utilization of accelerators, more efficient power and cooling, more straightforward upgrades and expansions, and reduced hardware waste.

Power requirements continue to increase as well. For example, Oak Ridge National Laboratory (ORNL) now requires 56.5 kW to 66 kW per rack, well above the 10 kW to 30 kW per rack commonly available in most data centers. Moreover, multi-tenant environments (serving many models and users rather than relying on one large language model) are becoming the norm.

Partners highlighting CDI at SC23 included Liqid, who showcased their GPU on-demand and IO accelerators, and GigaIO Networks, who featured their GigaPOD and SuperNODE solutions. You can find Liqid in our Advanced Technology Center (ATC).

Also of note was Compute Express Link (CXL), an open industry-standard interconnect protocol that's increasingly prevalent across CDI solutions. In fact, CXL has helped establish CDI as a key component of many next-generation HPC/HPA solutions.

Power and cooling 

The ever-increasing need for power and cooling to support AI solutions was a clear focus at SC23. Power demand continues to grow exponentially with accelerated computing: combining CPU and GPU power to support AI/ML development and production workloads generates substantially more heat per rack.

Non-traditional cooling technologies are also needed, a universal challenge across enterprise customers, public cloud hyperscalers, and managed-services and as-a-service providers.

Companies demoing HPA cooling solutions included Motivair, CoolIT Systems and Cooler Master. Solutions demonstrated included liquid heat exchangers, liquid immersion, and traditional air cooling.

Overall, power and cooling were common themes throughout many AI solution presentations.

Accelerated computing technology developments  

The "need for speed" theme was supported by industry veterans and newcomers alike. Below are selected summaries of key announcements at and after SC23:

  • NVIDIA: Announced the new H200 GPU. The H200 uses the same chip as the H100 but increases HBM memory from 80GB to 141GB (moving from HBM3 to faster HBM3e).
  • Intel: Showcased the Data Center GPU Max Series, Xeon CPU Max Series and 4th Gen Xeon Scalable processors, along with the Aurora supercomputer at Argonne National Laboratory, built with HPE (166 racks housing 63,744 Max Series GPUs). Intel also announced Emerald Rapids, targeting AI speech recognition among other AI use cases, and shared details on the Gaudi 3 processor for AI inferencing, stating it will be the last standalone GPU; the plan is to consolidate CPU + GPU into one chip, another common theme at SC23.
  • Microsoft: Announced Maia 100 AI chip, a custom chip designed for AI training and inference in the Azure Cloud.
  • AMD: Displayed its MI300A and MI300X AI chips at the conference, including a motherboard carrying eight MI300X chips.
  • Qualcomm: Announced a family of AI accelerators called the Cloud AI 100 jointly with HPE. Intended for generative AI and LLMs.
  • Tachyum: The Tachyum Prodigy integrates CPUs, GPUs and Tachyum Processing Units (TPUs, edge inference engines) on a single chip. This consolidation of CPU + GPU saves space, reduces power consumption and begins to merge CPU and GPU solutions into a single platform.
  • Inspire Semiconductor: InspireSemi Thunderbird is touted as a "supercomputer-cluster-on-a-chip" based on open-source RISC-V CPU instruction set architecture (ISA) for an "all CPU" programming model.

Additional innovations  

Quantum Village: Demonstrations included Quandela's photonic quantum computers and IonQ's trapped-ion quantum computing systems. While at times considered a niche solution, quantum computing continues to gain investment and interest due to its ability to solve specific optimization problems (e.g., AI optimizations) faster and with less energy than alternative approaches.

MLOps:

  • Run.ai: Demonstrated their AI Workflow Management tool suite targeted for data science, MLOps, and DevOps teams.
  • Cnvrg.io: Intel's full-stack AI operating system. The team provided an overview of their MLOps solution, featuring project and dataset management.
  • Rescale: Announced an HPC cloud MLOps solution, the Rescale Cloud File System, which simplifies dataset management in the cloud through automation and policy-based control of high-performance cloud-based storage.

Conclusion 

A quick summary of our conference highlights:

  • Accelerated and composable computing technology developments dominated the conference.
  • A focus on the use of HPC for AI and machine learning (ML).
  • The development of new technologies for exascale computing, such as quantum computing and data center cooling.

WWT continually evaluates new and emerging technologies for incorporation into our ATC ecosystem based on customer demand and solution development interest.  

For AI and HPA inquiries, please reach out to our team and follow High-Performance Architecture on wwt.com.

  • Technical Solution Advisors: Ryan Avery, Earl J. Dodd, Phillip Hendrickson, Derrick Monahan, Justin Van Shaik
  • Business Development Managers: Bobby Baker, Kurt Fultz