Summary

Inference succeeds or fails on tail latency. You can have plenty of GPU capacity and still miss service-level objectives when storage spikes, replicas scale simultaneously, or shared systems get noisy. This article lays out the storage behavior observed in real inference services, a reference storage stack that keeps P95 and P99 latencies stable under bursty traffic, and recommendations for pushing inference to the edge.

What "well-working inference" requires

Inference storage is about predictability, not peak bandwidth. The hot path needs low tail latency, low jitter, and high availability, because user-facing inference rarely fails loudly. It fails by getting a little slower, then slower again, until your P99 looks like a different product than your P50.

The I/O patterns are straightforward, and that is why the failures are so frustrating. You will see model reads when weights or compiled engines are loaded during cold starts and scaling events. You will also see cache-heavy behavior, where the same artifacts and indexes are read repeatedly. Operational writes, such as logs and telemetry, are necessary but should be treated as background work. Finally, inference has state: session and key-value stores, conversation state, agentic state, context, and feature retrieval. Those reads and writes can become a hidden tax on every request if you place them poorly.

If you want a quick "what do we measure" view, focus on tail latency and consistency. P95 and P99 latency, jitter, and availability tell you whether storage is helping you or quietly setting traps. When inference breaks, the root cause is usually one of three patterns: storage latency spikes, simultaneous replica loading, or noisy-neighbor effects on shared storage.
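To make the "what do we measure" view concrete, here is a minimal sketch of nearest-rank percentile computation over per-request latency samples. The function name, the sample values, and the single spike in them are illustrative, not from the article; in production you would feed this from your metrics pipeline rather than an in-memory list.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100]. Returns the smallest
    sample value at or above the p-th rank of the sorted data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-indexed rank
    return ordered[rank - 1]

# Ten request latencies in milliseconds; one storage spike hides in them.
latencies_ms = [12, 14, 13, 15, 11, 210, 12, 13, 14, 12]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

Note how the P50 stays healthy while the P99 exposes the spike: that gap, not the average, is what tells you storage is quietly setting traps.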

Storage elements and where they belong

A practical storage design for inference uses multiple layers because a single storage system rarely meets all needs without trade-offs. The guiding idea is to keep the request path local and stable, and to keep shared systems off the hot path unless you have a hard requirement.

For the hot path, local filesystems on local NVMe are the default for a reason. ext4 or XFS on local NVMe is a common baseline. The goal is to keep pinned model weights and compiled engines close to the GPUs, so cold starts and scale-outs do not stall on a remote dependency. If you have hot indexes, keep them local as well.
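The "keep artifacts close to the GPUs" idea can be sketched as a pin-on-first-use step at startup: check local NVMe first, and only copy from the shared tier on a miss, publishing atomically so a concurrent reader never sees a partial file. The function name and paths are hypothetical; the temporary directories below merely stand in for a shared tier and a local NVMe mount.

```python
import os
import shutil
import tempfile

def ensure_local(artifact, shared_dir, local_dir):
    """Copy a model artifact to local storage if absent; return the local path.
    The copy goes through a temp file and an atomic rename so readers
    never observe a half-written artifact."""
    local_path = os.path.join(local_dir, artifact)
    if not os.path.exists(local_path):
        src = os.path.join(shared_dir, artifact)
        tmp = local_path + ".tmp"
        shutil.copyfile(src, tmp)
        os.replace(tmp, local_path)  # atomic publish on POSIX filesystems
    return local_path

# Demo with temporary directories standing in for the shared tier and local NVMe.
shared = tempfile.mkdtemp()
local = tempfile.mkdtemp()
with open(os.path.join(shared, "model.bin"), "wb") as f:
    f.write(b"weights")
path = ensure_local("model.bin", shared, local)
```

The second and later calls are cheap no-ops, which is exactly the behavior you want during a scale-out: only the first replica on a node pays the copy cost.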

Shared filesystems are still useful, but they should usually sit off the request path. They are a good fit for logs, traces, and artifact distribution, especially for staging. NFSv4 appliances are often used for distribution workflows. CephFS can support shared POSIX access where you genuinely need it, but serving performance generally improves when you add local caching and treat the shared layer as a source of truth rather than a per-request dependency. A common pattern is ZFS locally with replication, paired with a shared system for distribution and coordination.

State stores deserve their own line item because they behave differently from artifact distribution. Distributed cache and key-value systems, such as Redis, are often used for sessions and hot features. Vector database indexes for retrieval-augmented generation (RAG) often benefit from memory- or NVMe-optimized layouts, which is another reason local NVMe ends up doing much of the heavy lifting.
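The semantics that make session stores different from artifact distribution are mainly expiry and fast hits. As a minimal stand-in for a Redis-style key-value store with TTLs (the class name and the injectable clock are illustrative, not a real client API), the sketch below models the put/get-with-expiry behavior:

```python
import time

class SessionStore:
    """Minimal TTL key-value store modeling session-state semantics,
    a stand-in for Redis-style SET-with-EXPIRE. Illustrative only."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable for testing
        self._data = {}

    def put(self, key, value):
        # Each write refreshes the expiry deadline.
        self._data[key] = (value, self.clock() + self.ttl_s)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires = entry
        if self.clock() >= expires:
            del self._data[key]  # lazily evict expired entries
            return default
        return value

# Demo with a fake clock so expiry is deterministic.
now = [0.0]
store = SessionStore(ttl_s=30, clock=lambda: now[0])
store.put("session:abc", {"turns": 3})
hit = store.get("session:abc")
now[0] = 31.0
miss = store.get("session:abc", default=None)
```

In a real deployment the store would be a shared cache cluster, but the contract is the same: hits must be fast and misses must be cheap, because this lookup sits on every request.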

Here is a reference inference storage stack that matches those realities:

  1. Key-value cache layer for session state and hot features.
  2. Local NVMe for pinned weights and compiled engines, plus hot indexes.
  3. Object store for the model registry, versions, rollback, and audit.
  4. Asynchronous pipeline for logs and telemetry; never block serving on these writes.
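The asynchronous telemetry layer in that stack has one non-negotiable property: the request path enqueues and moves on. One way to sketch that (class and parameter names are illustrative) is a bounded queue drained by a background thread, which sheds telemetry rather than blocking requests when the queue is full:

```python
import queue
import threading

class TelemetrySink:
    """Fire-and-forget log pipeline: bounded queue plus a background
    writer thread. On overflow it drops events and counts the drops,
    so the serving path never waits on a slow log destination."""

    def __init__(self, writer, maxsize=1000):
        self.q = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self.writer = writer  # e.g. ships to a central collector
        threading.Thread(target=self._drain, daemon=True).start()

    def emit(self, event):
        try:
            self.q.put_nowait(event)  # never block serving
        except queue.Full:
            self.dropped += 1         # shed telemetry, not requests

    def _drain(self):
        while True:
            event = self.q.get()
            self.writer(event)
            self.q.task_done()

# Demo: collect events in memory instead of shipping them anywhere.
events = []
sink = TelemetrySink(events.append)
for i in range(3):
    sink.emit({"req": i})
sink.q.join()  # wait for the background writer to catch up
```

The design choice worth noting is the drop counter: losing a few log lines under pressure is a recoverable incident, while queueing user requests behind observability writes is not.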

This is not about adding layers for style. It is about isolating failure domains. When a shared system is slow or busy, you want the request path to keep running on local assets and cached state, not queue behind someone else's batch job.

Storage for inference at the edge

Edge inference is mostly a latency and resiliency decision, not a storage novelty. If your application cannot tolerate a wide-area network round-trip, or if you need service continuity when connectivity is intermittent, placing inference closer to the user can be the right call. Storage becomes more constrained at the edge, so the layering matters even more.

At the edge, local NVMe still anchors the hot path. You want weights or compiled engines pinned locally, because pulling large artifacts across constrained links during a cold start is a slow-motion outage. Session state and hot features should stay in a local cache or key-value store where possible, with careful decisions about what must be globally consistent versus what can be eventually consistent.
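The local-first read pattern described above can be sketched as a cache-then-remote lookup that degrades gracefully when the upstream link is down. The function and parameter names are hypothetical; the point is the ordering: local hit, remote fallback, and a deliberate degraded answer instead of a blocked request.

```python
def read_feature(key, local_cache, fetch_remote):
    """Local-first read for edge nodes: serve from the local cache,
    fall back to the remote store only on a miss, and populate the
    cache so the next request stays local. fetch_remote may raise
    ConnectionError when connectivity is intermittent."""
    if key in local_cache:
        return local_cache[key]
    try:
        value = fetch_remote(key)
    except ConnectionError:
        return None  # degrade gracefully rather than block on the WAN
    local_cache[key] = value
    return value

# Demo: count remote fetches to show the second read stays local.
calls = []
def fetch_remote(key):
    calls.append(key)
    return f"value-for-{key}"

cache = {}
first = read_feature("user:42", cache, fetch_remote)
second = read_feature("user:42", cache, fetch_remote)

def fetch_down(key):
    raise ConnectionError("upstream link down")

degraded = read_feature("user:7", cache, fetch_down)
```

What counts as an acceptable degraded answer (a stale value, a default, or an error) is exactly the globally-consistent-versus-eventually-consistent decision the paragraph above calls out.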

The shared layers typically shift "up" to a central site. Model registry and version history fit well in an object store in the core or cloud, with a controlled distribution process to edge nodes. Logs and telemetry should flow asynchronously back to centralized systems, because blocking on observability writes is an easy way to turn a network hiccup into user-visible latency.

A useful mental check is to ask which events you need to survive gracefully: a burst of traffic, a rolling deployment, or a temporary loss of upstream connectivity. The storage placement that supports those events is usually the placement that keeps your inference service boring in production, which is the highest compliment you can pay an architecture.

Author: Borys Harmaty, WWT employee. #WWTLife. Views are my own and not official WWT communication.