Right-sizing AI Infrastructure in the Token Economy
In this blog
- What exactly is inference, and why is it now becoming more resource-intensive?
- Towards the token economy
- High inference demand is a sign of success, but it comes at a price
- Building the right-sized infrastructure to navigate the token economy
- Physical AI will transform industries worth $50 trillion
Earlier this year at NVIDIA GTC 2025, Jensen Huang highlighted the immense infrastructure challenges industries face as they embrace and scale AI. These challenges now extend beyond training large AI models and are driven by the demands of inference at scale and agentic AI.
In this blog, I want to outline this challenge and discuss how businesses can redefine their infrastructure strategy to capture the opportunity and prepare for what comes next.
What exactly is inference, and why is it now becoming more resource-intensive?
AI models operate in two distinct phases: training and inference. Training involves learning from large datasets. It is an offline, resource-intensive process. Inference, by contrast, involves using the trained model in real-world applications to generate predictions or answers in real time.
A simple analogy is preparing for an exam. Training is the period when you study and build your knowledge; inference is the moment you sit the exam and apply that knowledge to answer questions. A single inference request is far less resource-intensive than a training run, though it comes with much stricter latency requirements.
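To make the distinction concrete, here is a minimal sketch in Python that uses a small scikit-learn classifier as a stand-in for a much larger AI model. The synthetic data and timings are purely illustrative, but the pattern is the same: a slow, compute-heavy training step followed by fast, latency-sensitive predictions.

```python
# Training versus inference, illustrated with a small scikit-learn model.
# The data, model and timings are illustrative only.
import time
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))          # synthetic "training data"
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic labels

# Training: offline, compute-intensive, done once (or periodically).
start = time.perf_counter()
model = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=50)
model.fit(X, y)
print(f"training took {time.perf_counter() - start:.2f}s")

# Inference: online, latency-sensitive, repeated for every user request.
query = rng.normal(size=(1, 50))           # one incoming "request"
start = time.perf_counter()
prediction = model.predict(query)
print(f"inference took {(time.perf_counter() - start) * 1000:.2f}ms")
```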
Recent advances in large language models have introduced a new level of capability: reasoning. Rather than jumping straight to an answer, reasoning models can go through intermediate steps, such as breaking down problems, evaluating alternatives and checking for consistency. This is enabled by prompting strategies like Chain of Thought, which guide the model to reason step by step.
Reasoning requires a significant increase in tokens generated during inference, as the model produces more verbose and structured outputs. This not only impacts latency and compute costs during inference, but it also has implications for how training is conducted.
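To see why, consider a rough illustration. The prompts and hard-coded responses below are invented for the example, and token counts are approximated by word counts, but they show how step-by-step reasoning inflates the number of tokens a model has to generate for the same question.

```python
# Illustrative comparison of a direct prompt versus a chain-of-thought (CoT)
# prompt. The responses are hard-coded examples, not real model output, and
# token counts are approximated by word count.
direct_prompt = "What is 17% of 240? Answer with the number only."
direct_response = "40.8"

cot_prompt = (
    "What is 17% of 240? Think step by step, showing your working, "
    "then state the final answer."
)
cot_response = (
    "First, 10% of 240 is 24. "
    "Next, 5% of 240 is half of that, which is 12. "
    "Then, 2% of 240 is 4.8. "
    "Adding them: 24 + 12 + 4.8 = 40.8. "
    "The final answer is 40.8."
)

def rough_tokens(text: str) -> int:
    """Very rough token estimate; real tokenizers differ, but the ratio holds."""
    return len(text.split())

print("direct:", rough_tokens(direct_prompt + " " + direct_response), "tokens (approx.)")
print("CoT:   ", rough_tokens(cot_prompt + " " + cot_response), "tokens (approx.)")
```

Even in this tiny example, the reasoning-style exchange generates several times more tokens than the direct one, and the gap grows quickly with multi-step problems and agentic workflows.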
Towards the token economy
Jensen framed the token as the fundamental unit of AI, a shift that reflects both the redefinition of traditional computing and the ongoing transition to accelerated computing. While different AI domains may use terminology other than tokens, the core idea is the same: AI systems do not process text, images or DNA sequences directly. These inputs must first be encoded, i.e., translated into a representation the model can work with.
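As a concrete illustration of that encoding step, the short sketch below uses OpenAI's open-source tiktoken library (one tokenizer among many, chosen here purely for illustration) to turn a sentence into token IDs and back.

```python
# Encoding text into tokens and back, using the tiktoken library as one
# example of a tokenizer (different models use different schemes).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several GPT models

text = "AI factories generate tokens, the fundamental unit of AI."
token_ids = enc.encode(text)

print(token_ids)                  # a list of integer token IDs
print(len(token_ids), "tokens")   # the unit that inference cost is measured in
print(enc.decode(token_ids))      # decoding recovers the original text
```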
Computing faces a substantial shift in role, moving from retrieving files to generating tokens. AI factories now generate the tokens that become the basis for music, words, videos and chemicals. Demand for tokens is at an all-time high, with agentic AI and reasoning models generating many more of them per query.
Since providers typically charge per token, there is a direct relationship between the tokens a system generates and the revenue it earns: a provider's revenue per second can be approximated from its token throughput. Of course, the value of a token is not measured by volume alone. The quality of the output each token contributes to, in both relevance and usefulness, directly affects its economic and strategic value.
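As a back-of-envelope illustration, and using entirely hypothetical throughput and pricing figures, the relationship looks like this:

```python
# Back-of-envelope revenue estimate for a token-serving provider.
# All figures below are hypothetical, for illustration only.
tokens_per_second = 50_000          # aggregate output tokens/s across the fleet
price_per_million_tokens = 2.00     # USD charged per one million output tokens

revenue_per_second = tokens_per_second * price_per_million_tokens / 1_000_000
revenue_per_day = revenue_per_second * 86_400

print(f"revenue: ${revenue_per_second:.2f}/s, ~${revenue_per_day:,.0f}/day")
```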
We're entering an economy that is fundamentally driven by the generation and consumption of AI tokens. To succeed in it, organizations must progress along their AI adoption journey, selecting and right-sizing infrastructure that can support these demanding new workloads.
High inference demand is a sign of success, but it comes at a price
The performance of AI models typically improves as you scale up model parameters and training data volume. These two combine to drive up the demand for compute power, unless you trade off speed by extending training time. Better performance means better results, be it more accurate responses, smarter reasoning or more creative outputs.
Every time performance is pushed further, bigger models are required, often demanding multiples more compute. GPT-4, for example, is estimated to have around 5.8 times as many parameters as GPT-3, with a correspondingly larger compute requirement.
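One way to get a feel for this scaling is the commonly cited approximation that training a dense transformer costs roughly 6 × parameters × training tokens in FLOPs (an assumption brought in here, not a figure from the post). The sketch below applies it to GPT-3's published figures and to a hypothetical larger model; the exact numbers matter less than how quickly the totals grow.

```python
# A widely used rule of thumb for dense transformer training:
#   training FLOPs ≈ 6 × parameters × training tokens
# The larger model's figures are hypothetical, purely to show how quickly
# compute demand grows as models scale.
def training_flops(parameters: float, training_tokens: float) -> float:
    return 6 * parameters * training_tokens

small = training_flops(parameters=175e9, training_tokens=300e9)    # GPT-3's published figures
large = training_flops(parameters=1.0e12, training_tokens=10e12)   # hypothetical larger model

print(f"smaller model: {small:.2e} FLOPs")
print(f"larger model:  {large:.2e} FLOPs ({large / small:.0f}x more compute)")
```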
AI companies focused on model training must continue pushing the boundaries of compute performance to maintain a competitive time to market. Organizations relying solely on pre-trained models cannot afford to be complacent either.
Two key factors are driving continued pressure on AI infrastructure, even at inference runtime:
- Reasoning-capable models tend to generate far more tokens per query, due to longer prompts, intermediate steps and chain-of-thought reasoning. This can increase token throughput requirements by an order of magnitude or more compared with earlier models.
- AI-enabled applications put their own pressure on infrastructure. Higher user engagement and rising expectations mean inference workloads scale aggressively, sometimes even more rapidly than training demands.
Building the right-sized infrastructure to navigate the token economy
One of the major announcements at the event was NVIDIA Dynamo, a new open-source software platform designed to manage and orchestrate AI inference workloads running across an AI factory, much as virtualization vendors orchestrate enterprise applications in traditional data centers. It accelerates and scales AI reasoning models within AI factories by handling the complexity of AI inference workloads. In effect, it is intended to be the operating system of the AI factory, helping to industrialize and scale complex AI tasks.
NVIDIA Dynamo doubles inference performance when running Llama models on NVIDIA Hopper™-based systems*. On NVIDIA GB200 NVL72 systems running the DeepSeek-R1 model, Dynamo boosts the number of tokens generated per GPU by over 30 times compared with Hopper**.
AI infrastructure demands evolve very quickly, and what worked yesterday may not work today. From the diversity of model architectures to emerging inference techniques and rapid advancements in software, variables are in constant flux.
That's why right-sizing infrastructure is a strategic priority, not just a technical one. When infrastructure is aligned to the real-world requirements of AI workloads, organizations can unlock efficiency, reduce operational costs and accelerate time to value.
At WWT, we help customers to right-size their AI infrastructure by tailoring solutions that match specific application needs. This involves selecting the right processors, accelerators and storage for particular AI models, depending on whether they prioritize latency, throughput, reasoning complexity or any combination.
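As a simplified illustration of what right-sizing means in practice, here is a back-of-envelope capacity estimate. Every figure is hypothetical and would normally come from benchmarking your own models and hardware, but the structure of the calculation is the point: demand in tokens per second, divided by measured throughput per accelerator, plus headroom.

```python
# Back-of-envelope sizing: how many accelerators are needed to meet a target
# inference load? All workload and throughput figures are hypothetical.
import math

peak_requests_per_second = 200       # expected peak demand
avg_output_tokens_per_request = 900  # reasoning-style responses are long
tokens_per_second_per_gpu = 6_000    # measured throughput for the chosen model/GPU
headroom = 1.3                       # 30% buffer for spikes and failover

required_tokens_per_second = peak_requests_per_second * avg_output_tokens_per_request
gpus_needed = required_tokens_per_second * headroom / tokens_per_second_per_gpu

print(f"target throughput: {required_tokens_per_second:,} tokens/s")
print(f"estimated GPUs:    {math.ceil(gpus_needed)}")
```

The tokens-per-second-per-GPU figure is exactly the kind of number that has to be measured rather than assumed, which is where a proving-ground environment comes in.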
With the AI Proving Ground, we offer a controlled environment where organizations can test and validate different AI solutions and configurations before they make full-scale investments. This ensures that businesses can buy what they do need, not what they may need. It also means AI systems are optimized for real business outcomes.
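As one example of the kind of measurement such validation produces, the sketch below streams a single request from an OpenAI-compatible inference endpoint and records time-to-first-token and output throughput. The endpoint URL, API key and model name are placeholders for a deployment under test, and counting one token per streamed chunk is only an approximation.

```python
# Sketch of a simple validation run against an OpenAI-compatible inference
# endpoint: measure time-to-first-token and output tokens/second for one prompt.
import time
from openai import OpenAI

# Placeholders: point these at your own deployment under test.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
output_tokens = 0

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",   # placeholder model name
    messages=[{"role": "user", "content": "Summarize the benefits of right-sizing AI infrastructure."}],
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries a small piece of the response.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        output_tokens += 1   # roughly one token per chunk

elapsed = time.perf_counter() - start
if first_token_at is not None:
    print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
print(f"output throughput:   {output_tokens / max(elapsed, 1e-9):.1f} tokens/s")
```

Repeating this kind of run across candidate models, accelerators and configurations is what turns a sizing estimate into a defensible purchasing decision.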
In the token economy, staying up-to-date and running the most efficient models and infrastructure is critical. With regular validation and tuning, it's possible to improve operational efficiency and protect long-term AI investments.
Physical AI will transform industries worth $50 trillion
The impact of AI, especially agentic AI and reasoning models, will be unprecedented. As Jensen stated, a powerful example of what's ahead is the development of physical AI in robotics, which could redefine industrial productivity in the years to come.
But while models and applications grab the spotlight, infrastructure is the critical foundation that enables them to scale and succeed. Evolving infrastructure is just as essential as innovative models. As a business, you need to be ready for where AI is going in the near future.
To quote Heraclitus, "The only constant in life is change". In the dynamic and fast-moving field of AI, nothing could be more true.
What will Jensen announce at GTC Europe in Paris? If you're attending, visit the WWT stand to discuss it with our team.
* https://www.theregister.com/2025/03/23/nvidia_dynamo/