A hands-on evaluation of Nebius GPUaaS

The shift happening right now

Not long ago, training or fine-tuning your own large language model required one of two things: a massive capital investment in on-premises GPU infrastructure, or a long-term commitment to a hyperscaler cloud contract. For most organizations, neither option was realistic on the timeline that actual business problems demand. The result was that serious, custom AI work — the kind where you own the model, the data and the outcomes — was effectively out of reach for a large portion of the market.

That is changing. GPU-as-a-Service (GPUaaS) platforms are making it possible for organizations to access world-class hardware for a defined period, work on a specific problem and walk away without a multi-year infrastructure commitment. The barrier to building and fine-tuning your own models is dropping fast, and for organizations paying attention, that changes the calculus entirely.

WWT recently partnered with Nebius to bring this kind of capability to our customers. I was given access to a Nebius GPU environment to evaluate it firsthand — with minimal guidance — and form an honest opinion about what it can do and what it actually feels like to use.

My perspective

My background is in data science, with a B.S. in Computer Science and an M.S. in Cybersecurity. Over the last five years, I've worked in data science roles, and I currently sit at the intersection of technical depth and customer-facing solutions work as a Pre-Sales Data Scientist and Technical Solutions Architect at WWT. I spend a lot of time translating between what customers need to accomplish and what the technology can actually deliver.

I approached this evaluation the way a capable practitioner on a customer's team would: with curiosity, a clear set of questions and no assumption that the platform would hold my hand.

Getting started: Easier than expected

The environment I was provisioned was a single-node machine running Ubuntu 24.04 with CUDA pre-installed and ready to use. The hardware was serious: 8× NVIDIA B200 GPUs with approximately 183 GB of memory per card, 1.7 TiB of system RAM, 160 vCPUs and a full NVLink mesh connecting every GPU pair — a high-bandwidth interconnect that becomes important when you start running workloads across multiple GPUs simultaneously.

Getting connected required nothing more than opening VS Code, pointing the Remote SSH extension at the machine's IP address, and providing my SSH private key. That was it. No proprietary client, no complex networking configuration, no lengthy onboarding process. For anyone who works in VS Code day to day, the experience is immediate and familiar. I was writing and executing code within minutes of first access.
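
As a rough illustration (not the exact commands I ran), a quick check along these lines, assuming PyTorch with CUDA support is already installed on the node, confirms the GPUs and their peer links are visible before any real work starts:

```python
# Minimal sanity check, assuming PyTorch with CUDA support is installed on the node.
import torch

assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"

n = torch.cuda.device_count()
print(f"Visible GPUs: {n}")

for i in range(n):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")

# Confirm every GPU pair can reach each other directly (the NVLink mesh),
# which is what makes multi-GPU gradient synchronization fast.
for i in range(n):
    peers = [j for j in range(n) if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"  GPU {i} has peer access to GPUs {peers}")
```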

The only practical planning consideration is that provisioning requires scheduling in advance. At this level of hardware, availability must be coordinated, and customers should factor that lead time into their project timelines. That is less a limitation than the nature of reserving dedicated hardware at this scale, but it is worth knowing upfront.

What I tested and how

With the environment live, I built a benchmark harness designed to answer practical questions about the hardware's capabilities. I trained decoder-only GPT-style language models — the same architectural family as many of today's leading models — across a range of sizes and GPU configurations, using synthetic data to focus on hardware performance rather than data-pipeline variability.
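
To make the setup concrete, here is a simplified sketch of the idea rather than the actual harness; the sizes, names and hyperparameters are placeholders. A small decoder-only model is fed random token IDs so that what gets measured is the hardware, not a data pipeline:

```python
# Illustrative only: a scaled-down stand-in for the models in the benchmark,
# not the actual harness. Sizes, names and hyperparameters are placeholders.
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, D_MODEL, N_LAYERS, N_HEADS = 32_000, 1024, 1024, 12, 16

class TinyGPT(nn.Module):
    """Decoder-only (GPT-style) language model built from standard PyTorch layers."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D_MODEL)
        self.pos_emb = nn.Embedding(SEQ_LEN, D_MODEL)
        block = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB, bias=False)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask: each position may only attend to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(idx.size(1)).to(idx.device)
        return self.lm_head(self.blocks(x, mask=mask))

# Synthetic data: random token IDs stand in for a real corpus, so the numbers
# reflect the hardware rather than tokenization or I/O.
batch = torch.randint(0, VOCAB, (8, SEQ_LEN), device="cuda")
logits = TinyGPT().cuda()(batch)
print(logits.shape)  # torch.Size([8, 1024, 32000])
```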

The benchmark covered three model sizes (approximately 350M, 1.3B and 3B parameters), GPU counts of 1, 2, 4 and 8, and used bf16 precision with PyTorch's Distributed Data Parallel framework. The key metrics I tracked were throughput in tokens per second, GPU memory consumption and scaling efficiency — a measure of how effectively adding more GPUs translates into proportional performance gains.
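
A rough sketch of how such a measurement can be wired up, again illustrative rather than the actual harness, looks like the following: each process trains the model under PyTorch DDP with bf16 autocast and reports its own tokens per second and peak memory. It reuses the TinyGPT definition from the sketch above and would be launched with torchrun:

```python
# Illustrative sketch of the measurement loop, reusing TinyGPT, VOCAB and SEQ_LEN
# from the snippet above. Launched with, e.g.:
#   torchrun --nproc_per_node=8 benchmark.py
import os
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = DDP(TinyGPT().cuda(), device_ids=[rank])
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = torch.nn.CrossEntropyLoss()

BATCH, STEPS = 8, 50
torch.cuda.synchronize()
start = time.time()

for _ in range(STEPS):
    idx = torch.randint(0, VOCAB, (BATCH, SEQ_LEN), device="cuda")
    with torch.autocast("cuda", dtype=torch.bfloat16):   # bf16 mixed precision
        logits = model(idx)
        # Next-token prediction: targets are the inputs shifted by one position.
        loss = loss_fn(logits[:, :-1].reshape(-1, VOCAB), idx[:, 1:].reshape(-1))
    loss.backward()   # DDP synchronizes gradients across GPUs here
    opt.step()
    opt.zero_grad()

torch.cuda.synchronize()
elapsed = time.time() - start

local_tps = STEPS * BATCH * SEQ_LEN / elapsed
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"rank {rank}: {local_tps:,.0f} tok/s, peak {peak_gb:.1f} GB")
# Aggregate throughput for the run is the sum of tok/s across all ranks.

dist.destroy_process_group()
```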

The results

The hardware performed exceptionally well across every configuration tested.

Model   GPUs   Throughput (tok/s)   Peak Mem / GPU   Scaling Efficiency
350M    1      161,733              19.7 GB          100% (baseline)
350M    8      1,195,762            19.7 GB          92.4%
1.3B    1      63,747               47.2 GB          100% (baseline)
1.3B    8      460,004              47.2 GB          90.2%
3B      1      28,379               70.6 GB          100% (baseline)
3B      8      197,868              70.6 GB          87.1%

To understand why these numbers matter: when you scale from 1 GPU to 8, you never get a perfectly proportional gain in throughput. GPUs have to communicate with each other during training — synchronizing gradients after each pass through the data — and that communication takes time. In practice, scaling efficiency above 80% on 8 GPUs is considered strong. Achieving 92% on the 350M model and 87% on the 3B model is an excellent result, driven by the NVLink interconnect, which enables fast, low-latency communication between GPUs on the same node.
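
Concretely, the efficiency column is simply the measured multi-GPU throughput divided by what perfect linear scaling from the single-GPU baseline would predict:

```python
# Scaling efficiency = multi-GPU throughput / (single-GPU throughput * GPU count),
# using the throughput figures from the table above.
def scaling_efficiency(single_gpu_tps, multi_gpu_tps, n_gpus):
    return multi_gpu_tps / (single_gpu_tps * n_gpus)

print(f"{scaling_efficiency(161_733, 1_195_762, 8):.1%}")  # 350M model -> 92.4%
print(f"{scaling_efficiency(63_747, 460_004, 8):.1%}")     # 1.3B model -> 90.2%
```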

The memory numbers tell an equally important story. Even the largest model I tested consumed only about 39% of a single B200's 183 GB capacity. There is significant headroom remaining: a 7B-parameter model would still fit within a single card, which means this hardware is well suited for workloads considerably larger than what I tested here.

Why this matters for our clients

In my pre-sales conversations, one of the most consistent themes I encounter is the gap between what an organization wants to explore with AI and the infrastructure reality they're working within. Two scenarios come up repeatedly:

Teams that want to build or fine-tune their own models but don't have the hardware. Whether the goal is training a domain-specific model from scratch or fine-tuning an existing foundation model on proprietary data, serious GPU compute has historically been a bottleneck. A data science team shouldn't have to wait for a capital procurement cycle or compete for shared cluster resources to do meaningful model work. GPUaaS removes that constraint entirely.

Regulated industries navigating compliance timelines. This is particularly relevant in financial services, where introducing new tools and frameworks into a production environment can take months or longer due to compliance and governance requirements. Organizations in this space often need to evaluate several candidate approaches — different model architectures, training libraries, fine-tuning strategies — before committing to the one that will go through the full compliance process. Doing that R&D in a provisioned GPUaaS environment means teams can move quickly, experiment broadly and arrive at a well-informed decision before any production commitment is made. The compute environment doesn't have to go through compliance — only the solution that comes out of it does.

This second use case, in particular, is where I see GPUaaS providing genuine strategic value, not just convenience.

Overall assessment

I went into this evaluation with no strong expectations and came away genuinely impressed — not just by the hardware, but by how little friction stood between me and actually using it. The B200 GPUs delivered excellent throughput, strong multi-GPU scaling and substantial memory headroom. The access experience was clean and immediately compatible with standard developer tooling. For someone coming in as a data scientist rather than a GPU infrastructure specialist, the platform felt approachable from the first minute.

The broader point is this: platforms like Nebius are part of a meaningful shift in who can do serious AI work. Fine-tuning your own model on your own data, benchmarking architectures before a major commitment, stress-testing infrastructure before a compliance review — these are no longer activities that require owning a data center. They require the right partner, the right platform and a clear problem to solve.

WWT is well-positioned to help customers evaluate whether GPUaaS fits their workload, timeline and goals — and this evaluation gave me a high degree of confidence in Nebius as the platform we bring to those conversations.

 

This evaluation was conducted on a single 8-GPU Nebius node. Multi-node cluster performance, long-run stability testing and larger model configurations represent natural next steps for a more exhaustive assessment.

 
