1. Introduction to the inference pivot

The AI industry is moving beyond model training into the transactional reality of inference-driven value. Training consumes months of GPU time on complex architectures to produce a file of parameter weights. Training architectures such as SuperPODs are built for massive, high-speed interconnectivity so that every GPU can operate in lockstep; inference does not need that level of interconnectivity, and an East-West fabric is usually unnecessary. By building specifically for the nature of inference, we can deliver equivalent compute power at a lower cost.

2. Sizing the model and the memory

Key factors in sizing for inference include:

  • Responsiveness (Time To First Token): The speed at which the first token is generated.
  • Concurrency: The number of simultaneous requests the system can handle.
  • Prompt Size: The maximum prompt length supported.
  • Throughput: The number of tokens generated per second.

A general rule for LLMs at FP8 precision (one byte per parameter) is that GPU memory in gigabytes should match the number of billions of model parameters, plus a 20% overhead for caching and other activities. For example, a Llama 3.1 model with 70 billion parameters at FP8 requires at least 84GB of memory, which fits within a single H200's 141GB. To maximize throughput, the model may be distributed across all eight H200s in a DGX.

Most models operate at FP8 precision but can also use FP4. FP4 requires half the memory of FP8, FP16 requires double, and FP32 requires quadruple. The precision level required depends on the data being inferred.

Memory needs increase with caches and optimizations; the latest guidelines suggest a 100% overhead.
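
As a rough check, the sizing rule can be expressed as a short calculation. The sketch below is illustrative only: the bytes-per-parameter values follow the FP4/FP8/FP16/FP32 ratios above, and the overhead is a parameter so that the 20% and 100% guidelines can be compared.

    # Rough GPU-memory sizing for LLM inference, following the rule of thumb above.
    # Illustrative only; validate against your model card and serving stack.
    BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

    def required_memory_gb(params_billions: float, precision: str = "fp8",
                           overhead: float = 0.20) -> float:
        # Weights footprint plus a fractional overhead for caches and runtime buffers.
        weights_gb = params_billions * BYTES_PER_PARAM[precision]
        return weights_gb * (1.0 + overhead)

    print(required_memory_gb(70, "fp8", 0.20))   # 84.0  (the Llama 3.1 70B example above)
    print(required_memory_gb(70, "fp8", 1.00))   # 140.0 (with the conservative 100% overhead)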

3. The mechanics of parallelism

When a model exceeds the High Bandwidth Memory (HBM) capacity of a single GPU, we use parallelism to distribute computations across multiple GPUs. We categorize these methods by the location of inter-GPU communication.

  • Tensor Parallelism: This involves splitting individual layers within a single node. It relies on high-bandwidth intra-node links, such as NVLink.
  • Pipeline Parallelism: This method assigns different model layers to different nodes. Layers are processed sequentially: one node computes its layer and passes the result to the next.
  • 2D Parallelism: This combines the previous two methods. Tensor Parallelism is used for intra-node scaling within a single box using NVLink, while Pipeline Parallelism handles inter-node scaling across multiple boxes (a minimal sizing sketch follows this list).
  • Data Parallelism: This maximizes HBM utilization by handling multiple requests in a single pass. We balance batch size with the degree of parallelism to ensure the KV cache utilizes the available memory effectively.
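
To make the split concrete, here is a minimal sketch, a hypothetical helper rather than any framework's API, that fills the NVLink domain with Tensor Parallelism first and adds Pipeline Parallelism across nodes only when the model does not fit in one box.

    import math

    def plan_2d_parallelism(model_gb: float, hbm_per_gpu_gb: float,
                            gpus_per_node: int = 8) -> tuple[int, int]:
        # Return (tensor_parallel_degree, pipeline_parallel_degree) for a 2D layout:
        # split layers inside the node first, add nodes only when one box is not enough.
        gpus_needed = math.ceil(model_gb / hbm_per_gpu_gb)
        tp = min(gpus_needed, gpus_per_node)
        pp = math.ceil(gpus_needed / gpus_per_node)
        return tp, pp

    # A hypothetical 280GB model on 141GB H200s: TP=2 inside one DGX, no second node needed.
    print(plan_2d_parallelism(280, 141))   # (2, 1)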

4. GPU selection and precision 

CPUs still have a strategic place in the stack for linear statistical methods and sentiment analysis. However, generative AI is driven by GPUs. 

The GH200 is a unique choice because it provides a 624 GB unified memory pool. By combining 144 GB of HBM3e GPU memory with 480 GB of LPDDR5X CPU memory, the GH200 can infer large models. 

Note the differences in capability by GPU at the precision your task requires, and the differences in memory, both for a single GPU and for the 8-way DGX.

Data Center GPU Performance (TFLOPS, 8-way system) and Memory

                                    H100      H200      B200      B300
  HBM (per GPU)                     80GB      141GB     192GB     288GB
  HBM (8-way)                       640GB     1,128GB   1,536GB   2,304GB
  FP64 (scientific computing)       540       540       296       10
  TF32 (high-precision models*)     8,000     8,000     540       296
  FP16 (learning and inference)     16,000    16,000    36,500    36,000
  FP8 (mixed learning/inference)    32,000    32,000    72,000    72,000
  FP4 (low-precision inference)     N/A       N/A       144,000   144,000

  * High-precision models include molecular simulation and financial modeling.
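
The memory figures in the table can be turned into a quick fit check. The sketch below is illustrative; the HBM values come from the table, and the 20% overhead follows the sizing rule in section 2.

    # Which GPUs can hold a given model, as a single GPU or as an 8-way system?
    # HBM figures (GB) are taken from the table above; the check itself is illustrative.
    HBM_GB = {"H100": 80, "H200": 141, "B200": 192, "B300": 288}
    BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

    def fits(params_billions: float, precision: str, overhead: float = 0.20) -> None:
        need_gb = params_billions * BYTES_PER_PARAM[precision] * (1.0 + overhead)
        for gpu, hbm in HBM_GB.items():
            single = "yes" if need_gb <= hbm else "no"
            eight_way = "yes" if need_gb <= 8 * hbm else "no"
            print(f"{gpu}: single GPU {single}, 8-way {eight_way}")

    # A 70B model at FP8 fits on a single H200/B200/B300 but needs multiple H100s.
    fits(70, "fp8")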

Here are examples of workloads that depend on subtle statistical differences and need higher precision:

  • Accumulated minor numerical errors over long attention chains can destabilize training, demanding higher precision.
  • Fine-tuning with reinforcement learning (e.g., RLHF) can be numerically unstable and benefit from high precision for stable policy updates.
  • Combining image, speech, and text inputs (e.g., autonomous driving) may need more stable gradient flows across modalities.

Image and video

The required image and video processing level and type may drive a different GPU selection. The RTX PRO 6000 and L40S include RT (Ray Tracing) cores and engines for encoding and decoding, as well as input ports. Most vision and video models do not use RT cores except as part of a pipeline.

HBM

Memory is critical to both model size and performance, as GPUs are generally memory-constrained. For example, the H200 has the same FP16 FLOPS as the H100, yet it generates over 70% more tokens thanks to its larger, faster memory. The GH200 is unique in providing a single memory pool combining CPU and GPU memory, which makes it an excellent choice for inference on larger models.
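
A roofline-style estimate shows why: for a single stream without batching, every generated token reads the full set of weights from HBM, so decode speed is bounded by memory bandwidth divided by model size. The bandwidth figures below are assumptions for illustration; real-world gains are larger still because extra capacity also allows bigger batches and KV caches.

    # Upper bound on single-stream decode throughput: HBM bandwidth / model size.
    # Bandwidth figures (GB/s) are illustrative assumptions; check vendor specs for your SKU.
    def decode_tokens_per_sec_upper_bound(hbm_bandwidth_gb_s: float, model_gb: float) -> float:
        return hbm_bandwidth_gb_s / model_gb

    MODEL_GB = 70  # 70B parameters at FP8, one byte per parameter

    for gpu, bandwidth in {"H100": 3350, "H200": 4800}.items():
        bound = decode_tokens_per_sec_upper_bound(bandwidth, MODEL_GB)
        print(f"{gpu}: ~{bound:.0f} tokens/s upper bound per stream")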

5. Reference architecture: Stripping the east-west fabric

The standard Inference POD architecture is built for efficiency rather than the massive synchronization required by training. We eliminate the expensive East-West fabric because inference is typically self-contained within a single GPU/server. We focus instead on North-South networking for model delivery and user requests. 

For text-based tasks, 25 GbE is sufficient. For image-heavy or future-proofed deployments, 100 GbE is the standard. We can use Direct Attach Copper (DAC) within the rack as it is the most cost-effective, most reliable, and lowest-latency interconnect available. 
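
A back-of-envelope check shows why 25 GbE is ample for text; the concurrency, throughput, and token-size figures below are illustrative assumptions.

    # Token egress for a busy text-only inference POD versus a 25 GbE link.
    # All figures are illustrative assumptions.
    concurrent_streams = 5000
    tokens_per_sec_per_stream = 50
    bytes_per_token = 4            # rough average for English text

    egress_gbps = concurrent_streams * tokens_per_sec_per_stream * bytes_per_token * 8 / 1e9
    print(f"{egress_gbps:.3f} Gb/s of generated-token traffic")   # ~0.008 Gb/s vs a 25 Gb/s link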

The server type also needs to be evaluated. If the model's memory requirements fit on a single GPU, a PCIe server model may be used. A model that needs more memory can infer across multiple GPUs connected by NVLink, as in the 8-way DGX H200 or a 2-way HGX H200 with NVL2. 

6. Storage for random IOPS

Inference storage requirements are the opposite of training requirements. While training requires sequential throughput for large checkpoints, inference demands high random IOPS and low latency to serve concurrent requests. The storage stack has five specific layers.

  1. Local Filesystem: Local NVMe is used to pin model weights. Models are pulled from the registry and prewarmed to ensure the GPU stays fed.
  2. Shared Distribution: We use NFS systems like NetApp or Dell PowerScale for model versioning across the cluster. This layer is for distribution, not per-request reads.
  3. Object Storage: S3-compatible registries serve as the source of truth for model versioning and rollbacks.
  4. State Stores: These functionally replace traditional shared systems. For low-latency retrieval, we enable the GPU to bypass the CPU when accessing data from stores such as Weka or Milvus.
  5. Distributed State Stores: In-memory databases such as Redis can maintain session state so that if a GPU, server, or location drops, the user can be picked up with full context (a minimal sketch follows this list).
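
As a minimal sketch of that last layer, assuming a reachable Redis instance (the host name, key format, and TTL are illustrative), any inference server can checkpoint and resume a session's context:

    # Distributed state store for session context (layer 5 above), using Redis.
    # Assumes a reachable Redis instance; host, key format, and TTL are illustrative.
    import json
    import redis

    r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

    def save_session(session_id: str, context: dict, ttl_seconds: int = 3600) -> None:
        # Any inference server can write the running conversation state.
        r.set(f"session:{session_id}", json.dumps(context), ex=ttl_seconds)

    def load_session(session_id: str) -> dict | None:
        # If the original GPU, server, or location drops, another one resumes with full context.
        raw = r.get(f"session:{session_id}")
        return json.loads(raw) if raw else None

    save_session("abc123", {"messages": ["Hello"], "model": "llama-3.1-70b"})
    print(load_session("abc123"))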

7. Determining the edge

"The Edge" refers to everything from a local sensor to a regional data room. Regional network latency is typically 30ms, while human response time is closer to 200ms. For most LLM applications, network latency is negligible compared to the model's processing time. Data Center Inference is less expensive and more flexible than edge-based inference. We recommend edge deployment in these specific scenarios:

  • Sub-millisecond latency requirements for factory floor controllers or real-time robotics.
  • When only inference results need to be sent upstream.
  • Data volume that exceeds available bandwidth, such as high-resolution satellite or medical imagery.
  • Situations requiring standalone operation during network dropouts to prevent total system failure.

8. Data center and uptime strategy

Training and inference have different data center priorities. Training can take place at Tier 1 facilities, with a focus on CAPEX amortization. If a training run goes down, we lose time but not a real-time customer interaction. Inference requires Tier 3/4 facilities with a very high uptime. Because inference value is realized in real-time user interactions, downtime is a direct service failure. While a training run is sensitive to a single GPU failure, inference architectures are more resilient. A single GPU failure in an inference POD affects only a small subset of requests or a single server.

9. Summary

Reliable and cost-effective inference requires selecting appropriate GPUs and an architecture tuned for inference. To obtain the lowest cost per token, WWT can guide you through selecting models, optimizing their inference, and designing the data center solution. We stand ready to assist you in achieving your goal.