1. Introduction to the inference pivot

The AI industry is moving beyond model training into the transactional reality of inference-driven value. Training consumes months of GPU time on complex architectures to produce a file of parameter weights. Training architectures such as SuperPODs are built for massive, high-speed interconnectivity so that every GPU can operate in lockstep; inference does not need that level of interconnectivity, and an East-West fabric is usually unnecessary. By building specifically for the nature of inference, we can deliver equivalent compute power at a lower cost.

2. Sizing the model and the memory

Key factors in sizing for inference include:

  • Responsiveness (Time To First Token): The speed at which the first token is generated.
  • Concurrency: The number of simultaneous requests the system can handle.
  • Prompt Size: The maximum prompt length supported.
  • Throughput: The number of tokens generated per second.

A general rule for LLMs at FP8 precision (one byte per parameter) is that GPU memory in gigabytes should match the number of billions of model parameters, plus a 20% overhead for caching and other activities. For example, a Llama 3.1 model with 70 billion parameters at FP8 requires at least 84GB of memory, which fits within a single H200's 141GB. To maximize throughput, the model may be distributed across all eight H200s in a DGX.

Most models operate at FP8 precision but can also use FP4. FP4 requires half the memory of FP8, FP16 requires double, and FP32 requires quadruple. The precision level required depends on the data being inferred.

Memory needs increase with caches and optimizations; the latest guidelines suggest a 100% overhead.
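
As a rough check, the sizing rule can be expressed as a short calculation. The sketch below is illustrative only: the bytes-per-parameter values follow the FP4/FP8/FP16/FP32 ratios above, and the overhead is a parameter so that the 20% and 100% guidelines can be compared.

    # Rough GPU-memory sizing for LLM inference, following the rule of thumb above.
    # Illustrative only; validate against your model card and serving stack.
    BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

    def required_memory_gb(params_billions: float, precision: str = "fp8",
                           overhead: float = 0.20) -> float:
        # Weights footprint plus a fractional overhead for caches and runtime buffers.
        weights_gb = params_billions * BYTES_PER_PARAM[precision]
        return weights_gb * (1.0 + overhead)

    print(required_memory_gb(70, "fp8", 0.20))   # 84.0  (the Llama 3.1 70B example above)
    print(required_memory_gb(70, "fp8", 1.00))   # 140.0 (with the conservative 100% overhead)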

3. The mechanics of parallelism

When a model exceeds the High Bandwidth Memory (HBM) capacity of a single GPU, we use parallelism to distribute computations across multiple GPUs. We categorize these methods by the location of inter-GPU communication.

  • Tensor Parallelism: This involves splitting individual layers within a single node. It relies on high-bandwidth intra-node links, such as NVLink.
  • Pipeline Parallelism: This method assigns different model layers to different nodes. Layers are processed sequentially: one node computes its layer and passes the result to the next.
  • 2D Parallelism: This combines the previous two methods. Tensor Parallelism is used for intra-node scaling within a single box using NVLink, while Pipeline Parallelism handles inter-node scaling across multiple boxes (a minimal sizing sketch follows this list).
  • Data Parallelism: This maximizes HBM utilization by handling multiple requests in a single pass. We balance batch size with the degree of parallelism to ensure the KV cache utilizes the available memory effectively.
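
To make the split concrete, here is a minimal sketch, a hypothetical helper rather than any framework's API, that fills the NVLink domain with Tensor Parallelism first and adds Pipeline Parallelism across nodes only when the model does not fit in one box.

    import math

    def plan_2d_parallelism(model_gb: float, hbm_per_gpu_gb: float,
                            gpus_per_node: int = 8) -> tuple[int, int]:
        # Return (tensor_parallel_degree, pipeline_parallel_degree) for a 2D layout:
        # split layers inside the node first, add nodes only when one box is not enough.
        gpus_needed = math.ceil(model_gb / hbm_per_gpu_gb)
        tp = min(gpus_needed, gpus_per_node)
        pp = math.ceil(gpus_needed / gpus_per_node)
        return tp, pp

    # A hypothetical 280GB model on 141GB H200s: TP=2 inside one DGX, no second node needed.
    print(plan_2d_parallelism(280, 141))   # (2, 1)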

4. GPU selection and precision 

CPUs still have a strategic place in the stack for linear statistical methods and sentiment analysis. However, generative AI is driven by GPUs. 

The GH200 is a unique choice because it provides a 624 GB unified memory pool. By combining 144 GB of HBM3e GPU memory with 480 GB of LPDDR5X CPU memory, the GH200 can infer large models. 

Note the differences in capability by GPU at the precision your task requires, and the differences in memory, both for a single GPU and for the 8-way DGX.

Data Center GPU Performance (TFLOPS, 8-way system) and Memory

                                    H100      H200      B200      B300
  HBM (per GPU)                     80GB      141GB     192GB     288GB
  HBM (8-way)                       640GB     1,128GB   1,536GB   2,304GB
  FP64 (scientific computing)       540       540       296       10
  TF32 (high-precision models*)     8,000     8,000     540       296
  FP16 (learning and inference)     16,000    16,000    36,500    36,000
  FP8 (mixed learning/inference)    32,000    32,000    72,000    72,000
  FP4 (low-precision inference)     N/A       N/A       144,000   144,000

  * High-precision models include molecular simulation and financial modeling.
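
The memory figures in the table can be turned into a quick fit check. The sketch below is illustrative; the HBM values come from the table, and the 20% overhead follows the sizing rule in section 2.

    # Which GPUs can hold a given model, as a single GPU or as an 8-way system?
    # HBM figures (GB) are taken from the table above; the check itself is illustrative.
    HBM_GB = {"H100": 80, "H200": 141, "B200": 192, "B300": 288}
    BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

    def fits(params_billions: float, precision: str, overhead: float = 0.20) -> None:
        need_gb = params_billions * BYTES_PER_PARAM[precision] * (1.0 + overhead)
        for gpu, hbm in HBM_GB.items():
            single = "yes" if need_gb <= hbm else "no"
            eight_way = "yes" if need_gb <= 8 * hbm else "no"
            print(f"{gpu}: single GPU {single}, 8-way {eight_way}")

    # A 70B model at FP8 fits on a single H200/B200/B300 but needs multiple H100s.
    fits(70, "fp8")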

Here are examples of workloads that depend on subtle statistical differences and need higher precision:

  • Accumulated minor numerical errors over long attention chains can destabilize training, demanding higher precision.
  • Fine-tuning with reinforcement learning (e.g., RLHF) can be numerically unstable and benefit from high precision for stable policy updates.
  • Combining image, speech, and text inputs (e.g., autonomous driving) may need more stable gradient flows across modalities.

Image and video

The required image and video processing level and type may drive a different GPU selection. The RTX PRO 6000 and L40S include RT (Ray Tracing) cores and engines for encoding and decoding, as well as input ports. Most vision and video models do not use RT cores except as part of a pipeline.

HBM

Memory is critical to both model size and performance, as GPUs are generally memory-constrained. For example, the H200 has the same FP16 FLOPS as the H100, yet it generates over 70% more tokens thanks to its larger, faster memory. The GH200 is unique in providing a single memory pool combining CPU and GPU memory, which makes it an excellent choice for inference on larger models.
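
A roofline-style estimate shows why: for a single stream without batching, every generated token reads the full set of weights from HBM, so decode speed is bounded by memory bandwidth divided by model size. The bandwidth figures below are assumptions for illustration; real-world gains are larger still because extra capacity also allows bigger batches and KV caches.

    # Upper bound on single-stream decode throughput: HBM bandwidth / model size.
    # Bandwidth figures (GB/s) are illustrative assumptions; check vendor specs for your SKU.
    def decode_tokens_per_sec_upper_bound(hbm_bandwidth_gb_s: float, model_gb: float) -> float:
        return hbm_bandwidth_gb_s / model_gb

    MODEL_GB = 70  # 70B parameters at FP8, one byte per parameter

    for gpu, bandwidth in {"H100": 3350, "H200": 4800}.items():
        bound = decode_tokens_per_sec_upper_bound(bandwidth, MODEL_GB)
        print(f"{gpu}: ~{bound:.0f} tokens/s upper bound per stream")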

5. Reference architecture: Stripping the east-west fabric

The standard Inference POD architecture is built for efficiency rather than the massive synchronization required by training. We eliminate the expensive East-West fabric because inference is typically self-contained within a single GPU/server. We focus instead on North-South networking for model delivery and user requests. 

For text-based tasks, 25 GbE is sufficient. For image-heavy or future-proofed deployments, 100 GbE is the standard. We can use Direct Attach Copper (DAC) within the rack as it is the most cost-effective, most reliable, and lowest-latency interconnect available. 
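
A back-of-envelope check shows why 25 GbE is ample for text; the concurrency, throughput, and token-size figures below are illustrative assumptions.

    # Token egress for a busy text-only inference POD versus a 25 GbE link.
    # All figures are illustrative assumptions.
    concurrent_streams = 5000
    tokens_per_sec_per_stream = 50
    bytes_per_token = 4            # rough average for English text

    egress_gbps = concurrent_streams * tokens_per_sec_per_stream * bytes_per_token * 8 / 1e9
    print(f"{egress_gbps:.3f} Gb/s of generated-token traffic")   # ~0.008 Gb/s vs a 25 Gb/s link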

The server type also needs to be evaluated. If the model's memory requirements fit on a single GPU, a PCIe server model may be used. A model that needs more memory can infer across multiple GPUs connected by NVLink, as in the 8-way DGX H200 or a 2-way HGX H200 with NVL2. 

6. Storage for random IOPS

Inference storage requirements are the opposite of training requirements. While training requires sequential throughput for large checkpoints, inference demands high random IOPS and low latency to serve concurrent requests. The storage stack has five specific layers.

  1. Local Filesystem: Local NVMe is used to pin model weights. Models are pulled from the registry and prewarmed to ensure the GPU stays fed.
  2. Shared Distribution: We use NFS systems like NetApp or Dell PowerScale for model versioning across the cluster. This layer is for distribution, not per-request reads.
  3. Object Storage: S3-compatible registries serve as the source of truth for model versioning and rollbacks.
  4. State Stores: These functionally replace traditional shared systems. For low-latency retrieval, we enable the GPU to bypass the CPU when accessing data from stores such as Weka or Milvus.
  5. Distributed State Stores: In-memory databases such as Redis can maintain session state so that if a GPU, server, or location drops, the user can be picked up with full context (a minimal sketch follows this list).
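
As a minimal sketch of that last layer, assuming a reachable Redis instance (the host name, key format, and TTL are illustrative), any inference server can checkpoint and resume a session's context:

    # Distributed state store for session context (layer 5 above), using Redis.
    # Assumes a reachable Redis instance; host, key format, and TTL are illustrative.
    import json
    import redis

    r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

    def save_session(session_id: str, context: dict, ttl_seconds: int = 3600) -> None:
        # Any inference server can write the running conversation state.
        r.set(f"session:{session_id}", json.dumps(context), ex=ttl_seconds)

    def load_session(session_id: str) -> dict | None:
        # If the original GPU, server, or location drops, another one resumes with full context.
        raw = r.get(f"session:{session_id}")
        return json.loads(raw) if raw else None

    save_session("abc123", {"messages": ["Hello"], "model": "llama-3.1-70b"})
    print(load_session("abc123"))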

7. Determining the edge

"The Edge" refers to everything from a local sensor to a regional data room. Regional network latency is typically 30ms, while human response time is closer to 200ms. For most LLM applications, network latency is negligible compared to the model's processing time. Data Center Inference is less expensive and more flexible than edge-based inference. We recommend edge deployment in these specific scenarios:

  • Sub-millisecond latency requirements for factory floor controllers or real-time robotics.
  • When only inference results need to be sent upstream.
  • Data volume that exceeds available bandwidth, such as high-resolution satellite or medical imagery.
  • Situations requiring standalone operation during network dropouts to prevent total system failure.

8. Data center and uptime strategy

Training and inference have different data center priorities. Training can take place at Tier 1 facilities, with a focus on CAPEX amortization. If a training run goes down, we lose time but not a real-time customer interaction. Inference requires Tier 3/4 facilities with a very high uptime. Because inference value is realized in real-time user interactions, downtime is a direct service failure. While a training run is sensitive to a single GPU failure, inference architectures are more resilient. A single GPU failure in an inference POD affects only a small subset of requests or a single server.

9. Summary

Reliable and cost-effective inference requires selecting appropriate GPUs and an architecture tuned for inference. To obtain the lowest cost per token, WWT can guide you through selecting models, optimizing their inference, and designing the data center solution. We stand ready to assist you in achieving your goal.