AI at Scale on Kubernetes: Why Platform Discipline Determines Accelerator Efficiency
AI efficiency on Kubernetes is no longer constrained by raw accelerator availability but by platform discipline: how enterprises design scheduling, placement, observability, data locality and governance to maximize accelerator yield.
Intended audience: Enterprise infrastructure, platform engineering, cloud and AI leaders responsible for scaling AI workloads on Kubernetes.
Executive summary
AI is changing the economics of Kubernetes. In the cloud-native era, platform teams could tolerate small inefficiencies in scheduling, observability or storage design because the cost of waste was measured in vCPUs and memory. In the AI era, the unit of waste is fundamentally different: a stranded accelerator, a delayed training run, an inference service missing latency targets, or an expensive cluster that appears busy without producing proportional business value.
Kubernetes is already the common operating layer for modern infrastructure. The vast majority of container users run it in production, and roughly two-thirds of organizations hosting generative AI models rely on it for some or all inference workloads. But adoption alone does not create an efficient AI platform.
The strategic issue is straightforward: Default Kubernetes behavior is necessary but insufficient for AI at scale. The platform can expose accelerators through device plugins, and newer capabilities such as dynamic resource allocation move it closer to claim-based, shareable, attribute-aware device management. Efficient AI operations, however, depend on higher-order disciplines: workload admission, topology-aware placement, accelerator sharing policy, end-to-end observability, locality-aware data architecture, declarative lifecycle management and deliberate multi-tenancy design.
The implication for enterprise leaders is that AI platform maturity is no longer measured by whether Kubernetes can run training or inference. It is measured by accelerator yield — how consistently the platform converts scarce, high-cost hardware into useful model work at the right quality, throughput, and latency.
Organizations that make that transition will treat Kubernetes less as a generic container orchestrator and more as an AI operating model, one that unifies queue-based scheduling, advanced batch orchestration, model-serving frameworks, accelerator lifecycle tooling and standardized telemetry into a single platform discipline.
For most enterprises, the first moves should be:
- Separate training and inference into distinct node pools, priorities and operating policies.
- Introduce admission control and queue-based scheduling for large AI jobs.
- Instrument accelerator, model and pipeline telemetry together rather than as separate silos.
- Standardize cluster, accelerator and security policy through declarative automation and version-controlled configuration.
AI is breaking traditional Kubernetes assumptions
Kubernetes was built to orchestrate containers. AI forces it to orchestrate scarcity. The distinction matters. AI workloads are not simply larger versions of stateless microservices; they combine expensive accelerators, bursty parallelism, model-specific performance characteristics, multi-stage data pipelines and, increasingly, heterogeneous hardware.
Kubernetes includes stable support for managing accelerators through vendor device plugins, and clusters with different accelerator types can be targeted using labels and selectors. But those primitives only expose hardware. They do not solve the economic problem of placing the right workload on the right accelerator with the right sharing, priority and data locality policy.
Training jobs want large, coordinated allocations and can tolerate queuing if it improves overall throughput. Inference services want predictable latency, isolation and rapid response to traffic changes. Batch preprocessing wants throughput and data proximity. Model evaluation and fine-tuning often need fractional or shared accelerator access rather than whole-device allocation. A platform that treats these as equivalent "pods with resource requests" will eventually optimize for none of them. In practice, the default Kubernetes scheduler lacks job-level admission control and gang scheduling, making it insufficient for complex AI workloads.
AI efficiency on Kubernetes is primarily a platform-discipline problem, not a procurement problem. More accelerators do not fix weak admission policies, poor placement, storage bottlenecks or shallow observability. They only make those flaws more expensive.
The problem: Inefficiency at scale
Accelerator underutilization is a platform symptom, not an edge case
Accelerator underutilization is not anecdotal. Multiple peer-reviewed studies report average utilization between roughly 25 and 50% in production ML clusters. Weng et al. explore this problem directly in "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent" (USENIX ATC '23), showing why the gap persists: Accelerators are not just another bin-packing dimension; they are contiguous allocation units that become stranded when CPU, memory, topology or fragmentation constraints make the remaining capacity unusable for incoming work. In the paper's evaluation of a production-scale cluster with more than 6,200 accelerators, the cluster became effectively full despite 500 accelerators remaining unallocated. Fragmentation and resource mismatch had stranded usable capacity at scale.
That finding should reframe how platform leaders think about cluster utilization. Traditional infrastructure dashboards may show nodes as healthy and allocated. AI economics demand a different question: How much of the accelerator fleet is producing useful work for the workloads that matter most? In AI infrastructure, a partially used accelerator is not free headroom. It is compounding technical debt.
Operational blind spots make waste harder to correct
Most platform monitoring stacks were designed around CPU, memory, node health and network saturation. AI workloads demand a deeper layer of visibility. Vendor-provided accelerator exporters can surface device-level metrics through standard monitoring endpoints, and open telemetry frameworks offer vendor-neutral collection of traces, metrics and logs. Those are essential building blocks, but they are not sufficient on their own. AI platforms need to correlate accelerator telemetry with queue time, batch size behavior, model latency, throughput, pipeline stage duration, cache hit rates and downstream business SLOs.
The challenge becomes sharper once accelerator sharing enters the picture. When time-slicing is enabled through the Kubernetes device plugin, current accelerator metrics exporters cannot associate device metrics to individual containers. That means a team may know an accelerator is busy without knowing whether the right tenant or model is receiving the right share of performance. In practice, this is where AI observability breaks down: Infrastructure teams can see resource consumption, and ML teams can see model outcomes, but neither group can trace waste across the full execution path.
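To make the tradeoff concrete, the sketch below shows the general shape of a time-slicing configuration for NVIDIA's Kubernetes device plugin. Field names can vary by plugin version, and the ConfigMap name and key are illustrative; `replicas: 4` advertises each physical GPU as four schedulable devices, with no memory or fault isolation between the workloads that share it.

```yaml
# Illustrative time-slicing config for the NVIDIA k8s-device-plugin.
# Each physical GPU is advertised as 4 allocatable nvidia.com/gpu devices;
# sharing is interleaved, with no memory or fault isolation between tenants.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # illustrative name
  namespace: kube-system
data:
  default: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

Because the exporter cannot attribute per-container usage under this mode, platforms that enable it should pair it with queue- or tenant-level accounting rather than relying on device metrics alone.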
Workload placement has become a first-class efficiency problem
Workload placement is no longer a secondary optimization. It is a core determinant of job completion time, throughput and accelerator return on investment. Kubernetes includes a Topology Manager precisely because CPU, memory and device allocations are not independent for latency-sensitive or accelerator-heavy workloads. With pod scope and a single-NUMA-node policy, the platform can align an entire pod to a common NUMA boundary and reduce inter-NUMA overhead. Advanced batch schedulers extend this model further with gang scheduling, bin-packing policies, heterogeneous device scheduling, and network topology-aware placement.
Research reinforces why this matters. Rajasekaran et al. demonstrated in "CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters" (USENIX NSDI '24) that incorporating network-aware placement improved average and tail job completion time by up to 1.6x and 2.5x, respectively, compared to schedulers that treated accelerator count as the only scarce resource. The wrong interconnect, the wrong node pairing, or the wrong communication path can erase the value of otherwise adequate accelerator capacity. For distributed training, platform efficiency is as much about communication topology as raw accelerator count.
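For reference, pod-scoped, single-NUMA-node alignment is enabled through kubelet configuration; the static CPU manager policy is required for CPU pinning to participate in alignment. A minimal sketch:

```yaml
# Kubelet configuration aligning an entire pod's CPU, memory and device
# allocations to a single NUMA node. Pods that cannot be aligned are
# rejected with a TopologyAffinityError rather than placed poorly.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
cpuManagerPolicy: static   # required so exclusive CPU assignment participates
```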
Workload orchestration and scheduling
The most important shift is from pod scheduling to workload admission. Queue-based admission controllers manage quotas and decide when a job should wait, when it should be admitted to start, and when it should be preempted. They support fair sharing and resource fungibility across heterogeneous environments. Advanced batch schedulers complement this model with gang scheduling, bin-packing, quota-driven queue management and topology-aware scheduling for high-performance workloads. When these admission and scheduling layers integrate with framework-specific operators for distributed computing, the result is a path toward consistent policy across batch processing, training and serving.
The practical goal is simple: Do not let partial allocations burn scarce accelerators while the rest of the workload waits. Training jobs should not launch on incomplete resource sets when the result is prolonged runtime and stranded capacity. Inference services should not compete blindly with training jobs for nodes designed around different latency and memory assumptions. Queue-based orchestration is not an administrative layer added on top of Kubernetes; it is the mechanism that turns accelerator scarcity into predictable policy.
Accelerator sharing policy is part of the same pillar. Hardware partitioning divides supported accelerators into isolated instances with dedicated compute and memory resources. Time-slicing oversubscribes accelerators and interleaves workloads but provides no memory or fault isolation. Multi-process execution enables cooperative concurrent access from multiple processes. These are not interchangeable tools. Hardware partitioning is appropriate when predictable isolation matters. Time-slicing fits bursty or lightweight shared usage where strict isolation is less important. Multi-process execution suits compatible cooperative workloads.
Mature platforms choose among them intentionally rather than exposing them as ad hoc options to application teams.
Longer term, Dynamic Resource Allocation is especially important because it reflects how AI teams actually reason about hardware. Rather than requesting blind device counts per container, it introduces device classes, resource claims, sharing semantics and fine-grained filtering on device attributes. That is a much closer fit for heterogeneous accelerator fleets where "one accelerator" may mean very different memory, topology or cost characteristics.
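As a sketch of that claim-based model, the manifests below follow the v1beta1 DRA API shape (Kubernetes 1.32 era). The DRA API group and field names have shifted across alpha and beta releases, so treat this as illustrative; `gpu.example.com` is a placeholder driver name.

```yaml
# Illustrative DRA sketch: an admin-published device class, a claim
# template filtered to that class, and a pod consuming the claim.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: training-gpu
spec:
  selectors:
    - cel:
        expression: device.driver == "gpu.example.com"   # placeholder driver
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: training-gpu-template
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: training-gpu
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: training-gpu-template
  containers:
    - name: trainer
      image: registry.example.com/ml/train:latest   # placeholder image
      resources:
        claims:
          - name: gpu   # claim-based, not count-based, device access
```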
Observability for AI workloads
AI observability has to bridge infrastructure and model behavior. Accelerator-level telemetry is necessary for utilization, memory use, thermals and health. OpenTelemetry is necessary for traces, metrics and logs across application and platform boundaries. But the real maturity test is correlation. A disciplined AI platform should be able to answer questions such as: Which queue policy reduced wait time for a given training class? Which model version is driving accelerator saturation without improving throughput? Which pipeline stage is starving the accelerator? Which tenant is consuming shared accelerator time without meeting service targets?
This is where many organizations still operate with blind spots. They can see that a node is hot, or that inference latency is rising, but not whether the root cause is scheduler backoff, cache misses, data-loading stalls, oversubscription behavior or a mismatch between the model and the accelerator profile. AI platforms need a joined data model for infrastructure telemetry and model telemetry. Without it, optimization becomes guesswork.
Data and storage architecture
An accelerator waiting on data is still a wasted accelerator. That is why data architecture is inseparable from accelerator efficiency. Kubernetes local volumes are aware of node constraints through persistent-volume node affinity, and Topology Manager can align compute and device placement for latency-sensitive or high-throughput applications. Those capabilities point to the broader design principle: Storage and placement should be built for locality, not treated as a neutral backplane.
For enterprise AI, the right pattern is typically hierarchical. Durable object storage acts as the system of record. High-throughput shared storage supports common datasets, checkpoints and artifact exchange. Node-local or rack-local caching shortens the hot path for frequent reads, embeddings, feature shards or preprocessed training data. The design question is not only where data is stored, but how quickly the platform can move the right data to the right accelerator without cross-cluster or cross-region drag.
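As a minimal sketch of the hot-path tier, a node-local NVMe cache can be exposed through a local PersistentVolume whose node affinity keeps consumers on the node that holds the data. Node name, path and storage class below are placeholders.

```yaml
# Illustrative node-local cache volume. The nodeAffinity clause ensures
# any pod bound to this volume is scheduled onto the node that owns the disk.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nvme-cache-gpu-node-1
spec:
  capacity:
    storage: 2Ti
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme        # placeholder class name
  local:
    path: /mnt/nvme/cache             # placeholder mount path
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["gpu-node-1"]  # placeholder node name
```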
Cluster architecture and resource design
AI clusters should be intentionally heterogeneous and policy-aware. Kubernetes explicitly recommends labels and selectors when clusters contain different accelerator types, and taints with tolerations work with node affinity to keep inappropriate workloads off inappropriate nodes. Those primitives are basic, but they remain foundational. A disciplined AI platform uses them to create distinct resource domains: training pools, inference pools, memory-rich pools, cost-optimized pools and specialized pools for particular accelerator or interconnect profiles.
This matters because different AI workloads want different hardware economics. Distributed training may justify premium nodes with strong interconnects and high memory bandwidth. Latency-sensitive inference may prefer smaller, predictable slices with isolation. Development and experimentation may benefit from shared or oversubscribed accelerators. Dynamic Resource Allocation strengthens this design further by allowing administrators to publish device classes tuned for different performance and cost profiles.
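A minimal sketch of the pool pattern, assuming training nodes are labeled and tainted (the `workload-class` key, node name and image are illustrative): the taint keeps general workloads off the premium nodes, and the toleration plus selector steer training pods onto them.

```yaml
# Node prepared out of band, e.g.:
#   kubectl label node gpu-node-1 workload-class=training
#   kubectl taint node gpu-node-1 workload-class=training:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  nodeSelector:
    workload-class: training          # target the training pool only
  tolerations:
    - key: workload-class             # permit scheduling despite the taint
      operator: Equal
      value: training
      effect: NoSchedule
  containers:
    - name: trainer
      image: registry.example.com/ml/train:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8
```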
Platform automation and lifecycle management
At scale, manual accelerator operations are a source of drift, inconsistency and avoidable risk. Argo CD is a declarative GitOps continuous delivery tool for Kubernetes that continuously compares live state with desired state in Git. Flux provides a composable GitOps toolkit for building continuous delivery on Kubernetes. Node autoscaling adds or removes nodes based on pending work, and accelerator lifecycle operators automate the driver, container toolkit, device plugin, feature discovery and monitoring stack required to provision accelerator nodes consistently.
That combination is more strategic than it may first appear. AI platforms often fail through configuration entropy: One node pool drifts on driver version, another lacks proper labeling, a third exposes a different sharing policy, and observability changes are applied inconsistently across environments. GitOps, immutable patterns and automated node lifecycle management are what convert a promising accelerator cluster into a repeatable enterprise platform.
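As a sketch, an Argo CD Application can pin a cluster's accelerator stack to a version-controlled path so drift is detected and reverted automatically. Repository URL, path and namespace below are placeholders.

```yaml
# Illustrative Argo CD Application keeping the GPU operator stack in sync
# with Git. selfHeal reverts manual drift; prune removes deleted resources.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/clusters.git  # placeholder repo
    targetRevision: main
    path: gpu-operator/overlays/prod                        # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```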
Security and multi-tenancy
Not every AI platform should be a shared cluster, and not every namespace is a security boundary. Kubernetes namespaces isolate groups of resources within a cluster, but nodes, storage classes and persistent volumes remain cluster-scoped. Kubernetes documentation also distinguishes between soft and hard multi-tenancy, with hard multi-tenancy implying materially stronger isolation where tenants do not trust each other. Role-based access control remains the core mechanism for regulating access to cluster resources, while confidential-computing approaches such as Confidential Containers are emerging to protect sensitive cloud-native workloads with hardware-backed techniques.
For AI platforms, this has direct operational consequences. Sensitive training data, proprietary models and regulated inference services may require stronger separation than namespaces alone provide. Shared accelerator efficiency is valuable, but it should not override trust-boundary design. Platform discipline includes knowing when consolidation is appropriate and when isolation is the more efficient choice at the business level.
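Within a soft multi-tenancy design, namespace-scoped quotas are one concrete control: a ResourceQuota can cap extended resources such as accelerators per tenant. Note that for extended resources, Kubernetes accepts only `requests.`-prefixed quota items; the namespace and value below are illustrative.

```yaml
# Illustrative per-tenant accelerator quota. Pods in team-a that would
# push total GPU requests above 8 are rejected at admission time.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                 # placeholder tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"
```

This enforces fair sharing inside a trusted cluster; it does not create a hard trust boundary, which is why regulated or mutually distrusting tenants may still warrant separate node pools or clusters.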
The cost of getting it wrong
The cost of poor platform discipline is measurable. Goodput, a metric that combines system throughput with statistical efficiency to measure useful training progress per unit of compute, is the basis for the scheduling work by Qiao et al. in "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (USENIX OSDI '21). By optimizing for Goodput rather than raw throughput, Pollux reduced average job completion time by 37 to 50% relative to state-of-the-art schedulers and delivered up to 25% lower training cost in cloud environments.
Network-aware placement, as shown in the CASSINI work (USENIX NSDI '24), improved average and tail completion times by up to 1.6x and 2.5x, respectively. The fragmentation-aware scheduling research discussed earlier reduced unallocated accelerators by up to 49%, recovering 290 additional accelerators in a large production cluster.
These are not marginal gains. They illustrate a broader truth: Platform discipline is not operational hygiene. It is an efficiency multiplier. When enterprises neglect scheduling, topology, locality or observability, the resulting waste does not always show up as an outage. It shows up as longer queue times, slower experimentation, poorer accelerator ROI, over-provisioned clusters and growing skepticism about AI infrastructure spend.
An underutilized accelerator cluster is one of the most expensive forms of technical debt in modern infrastructure.
Why this matters to the business
When organizations operationalize Kubernetes for AI with discipline, four business outcomes follow:
First, efficiency improves: Accelerator fleets deliver more useful work per dollar because fewer devices are stranded and more workloads match the hardware they actually need.
Second, speed improves: Better admission policies, smarter placement and stronger observability shorten experiment cycles and reduce the delay between model idea and production evidence.
Third, scale improves: Multi-team AI programs become sustainable because the platform can govern priority, fairness, isolation and resource classes rather than relying on manual exceptions.
Fourth, governance improves: Cost, risk and service quality become observable and auditable rather than incidental. The success metric shifts from generic cluster utilization toward accelerator yield, queue wait time, job completion time, tail latency and cost-per-useful-output.
WWT point of view
The enterprise challenge is no longer proving that Kubernetes can host AI. The challenge is engineering a platform that makes AI efficient, governable and scalable under real workload pressure. That requires validated design across compute, storage, networking, orchestration, observability and security, not just one more accelerator node pool.
WWT helps clients design, build and deploy end-to-end AI/ML platforms on Kubernetes that treat the disciplines discussed in this paper as integrated system requirements rather than independent optimizations. From queue-based scheduling and topology-aware placement to accelerator lifecycle management and GitOps-driven consistency, the platform engineering challenge is inherently cross-domain.
WWT's Advanced Technology Center and AI Proving Ground provide a production-grade environment where clients can validate these designs using real workloads before committing to large-scale investment. But the deeper value is in the engineering itself: translating accelerator yield, observability, multi-tenancy and storage locality from architectural principles into repeatable, enterprise-grade platforms.
The transition from cluster management to AI platform engineering spans infrastructure, software, operations and business outcomes simultaneously. That intersection is where this work lives.
Conclusion and decision guidance for enterprise leaders
Kubernetes is rapidly becoming the control plane for AI, but Kubernetes itself is not the differentiator. The differentiator is platform discipline. Enterprises that succeed will be the ones that treat scheduling, accelerator sharing, topology, observability, data movement, automation and trust boundaries as first-class design concerns rather than secondary optimizations.
The winning operating model is not "a Kubernetes cluster with accelerators." It is a purpose-built AI platform that aligns workload intent, hardware profile, data path and business priority. In that model, the goal is no longer just to keep clusters running. It is to maximize accelerator yield and convert scarce infrastructure into sustained business output. Organizations that make that shift will not only lower cost; they will increase the pace, reliability and strategic value of AI across the enterprise.
Five key leadership decisions
For leaders responsible for infrastructure, platform engineering and AI enablement, the decisions that matter most fall into five areas:
1. Treat accelerator yield as a first‑class success metric
Leaders should move away from generic cluster utilization metrics and instead govern platforms around accelerator yield — how consistently scarce hardware is converted into useful training and inference work at the required quality, throughput and latency. This shift changes how success is measured and how tradeoffs are evaluated. Platforms that appear "busy" but strand accelerators, extend queue times or degrade model performance are failing economically, even if traditional dashboards look healthy.
Decision implication: Success metrics should prioritize queue wait time, job completion time, tail latency and cost per useful output, not aggregate node utilization.
2. Separate workload intent before optimizing infrastructure
Training, inference, preprocessing and experimentation have fundamentally different performance, latency and sharing requirements. Leaders should resist treating all AI workloads as interchangeable "pods with resource requests" and instead ensure that workload intent is explicit and enforced through admission control, queueing and placement policy.
Decision implication: Investments in scheduling, admission control and workload classification will often deliver higher returns than adding more accelerators to an undifferentiated cluster.
3. Invest in platform discipline before expanding hardware
The evidence in this paper shows that poor scheduling, weak placement policy, shallow observability and misaligned data locality amplify waste as clusters grow. Adding accelerators without addressing these disciplines increases cost faster than capability.
Decision implication: Hardware expansion decisions should be gated on demonstrated platform maturity, including workload admission, topology‑aware placement, accelerator sharing policy and end‑to‑end observability.
4. Make observability a governance tool, not a troubleshooting aid
AI platforms require correlated visibility across accelerators, schedulers, pipelines and models. Leaders should expect their platforms to answer not only what is busy, but why, for whom and to what business effect.
Decision implication: Observability investments should be evaluated on their ability to connect infrastructure behavior to model performance and business SLOs, not on raw metric volume.
5. Design multi‑tenancy and security intentionally
Shared accelerator efficiency is valuable, but it should not override trust boundaries, regulatory requirements or workload isolation needs. Leaders should decide explicitly when consolidation is appropriate and when isolation is the more efficient business choice.
Decision implication: Platform design should offer deliberate choices between shared and isolated execution models, rather than forcing all workloads into a single tenancy pattern.
Overall, enterprises that succeed with AI on Kubernetes will be those that treat the platform as an operating model, not a container runtime. The near‑term advantage comes not from owning more accelerators, but from governing how they are admitted, placed, shared, observed and aligned to business priority.
References
[1] Weng, Q., Yang, L., Yu, Y., Wang, W., Tang, X., Yang, G., and Zhang, L. "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent." In Proceedings of the 2023 USENIX Annual Technical Conference (ATC '23). https://www.usenix.org/conference/atc23/presentation/weng
[2] Rajasekaran, S., Ghobadi, M., and Akella, A. "CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters." In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI '24). https://www.usenix.org/conference/nsdi24/presentation/rajasekaran
[3] Qiao, A., Choe, S.K., Subramanya, S.J., Neiswanger, W., Ho, Q., Zhang, H., Ganger, G.R., and Xing, E.P. "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning." In Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI '21). https://www.usenix.org/conference/osdi21/presentation/qiao
[4] Kubernetes Project. "Control Topology Management Policies on a Node." Kubernetes Documentation. https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/
This report may not be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means, including, but not limited to, electronic, mechanical, photocopying, recording, or otherwise, without the prior express written permission of WWT Research.
This report is compiled from surveys WWT Research conducts with clients and internal experts; conversations and engagements with current and prospective clients, partners and original equipment manufacturers (OEMs); and knowledge acquired through lab work in the Advanced Technology Center and real-world client project experience. WWT provides this report "AS-IS" and disclaims all warranties as to the accuracy, completeness or adequacy of the information.