Before we get into the "why Apstra" question, let's start where all network engineers want to begin: the high-level design. Below is the logical design of our network. We have switches providing speeds from 10 GbE to 200 GbE, with plans to expand to 400 GbE over the next few months.

Figure: High-level design (HLD) of the AIPG Apstra fabric

As you can see, this network is the backbone of several core services that we offer from the AI Proving Ground, including LLMaaS, GPUaaS, VMware Management Cluster, Kubernetes Management Cluster, Storage Arrays and additional one-off servers. This allows us the flexibility to manage all these services under a single, unified network fabric.

One thing that isn't shown on this HLD, but is the true heart of Apstra and where all of your configuration is done, is the Apstra VM. This VM typically sits outside the network you are deploying, and it is where the magic of Apstra really happens: you define your Blueprints, Racks, Networks, and so on there, and Apstra pushes the resulting configuration to your switches.

Hardware decisions and compatibility

Our current architecture for this network is MLAG. As you can see from the diagram above, all switches are paired; what the diagram doesn't show is that each pair runs as a vPC/MLAG pair. So why MLAG/vPC?

The biggest reason was compatibility. Since the switches we wanted to use were split across Cisco, Dell, and Arista, Juniper recommended MLAG, as it is a standard feature across all of those switch models.

We had originally started our journey with an ESI (EVPN multihoming) model but encountered a few issues that prompted us to revert to MLAG. The first was that none of the Dell 64-port switches could be used at the leaf layer due to ASIC incompatibility, regardless of whether we ran MLAG or ESI. The primary reason we transitioned from ESI to MLAG, however, was an issue with FRRouting (FRR) in the Dell-branded SONiC OS that caused intermittent ping drops, mostly in our VMware ESXi cluster. As a result, we moved everything to MLAG on the same Dell-branded SONiC OS, and we haven't had an issue since.

So why did we end up going with Dell? There were several reasons, but mainly, the switches were readily available and supported everything we needed for a stable backbone. We are using the 5248s for 10/25 GbE and the 5232s for our 100 GbE connections. Dell's SONiC distribution page was also easy to navigate to find the proper bits for the switch OS installs. For our spines, we ultimately used a 64-port switch, the Z9664F-ON, as it was approved for spine connections and allowed us to future-proof against any new switches requiring speeds of up to 400 GbE.

Why Apstra fits the AI Proving Ground's needs

The main reasons we chose Apstra

  • Vendor-agnostic, intent-based orchestration that is fully abstracted from the device OS and vendor.
  • Scalable across small to large deployments, with support for various fabric designs and topology types.
  • Continuous validation, analytics, and unified visibility, simplifying day-to-day operations and troubleshooting.
  • Fabric-wide snapshot and rollback (Intent Time Voyager) is a robust safety net against misconfiguration or faulty changes.
  • Strong backing, ecosystem support, and multivendor hardware compatibility reduce vendor lock-in and simplify support.
  • Given AIPG's front-end requirements from early POCs to potentially large-scale deployments, Apstra delivers a robust, flexible, and future-proof foundation.

One of the significant strengths of Apstra for our AIPG is its truly vendor-agnostic, intent-based approach. Unlike many other orchestrators that rely on pushing static "configlets" (snippets of configuration), Apstra doesn't treat the network as a collection of devices to be individually configured. Instead, you declare intent, describing how the network should behave at a high level, and Apstra calculates and applies the correct, vendor-appropriate configuration for each device. That abstraction works across vendors and operating systems, so you're not tied to a specific OEM.

Because Apstra has a deep understanding of the underlying device OS and native configuration model, it can produce consistent, correct configurations without relying on fragile post-facto evaluations of user-supplied configuration snippets. This gives true abstraction regardless of switch make or model, simplifying management and reducing the risk of misconfiguration.

Scalability & flexibility

Another compelling reason we chose Apstra is its scalability. Since it's not locked to a single vendor's hardware, you can treat network switches as fungible, choosing devices based on cost, performance, and availability rather than vendor allegiance. Apstra's architecture supports both small and very large data center fabrics, including complex 3-stage or 5-stage Clos designs with EVPN-VXLAN overlays, and it can scale to support hundreds of thousands of connected servers.

That flexibility is particularly valuable for AIPG's front-end environment: some of our proofs of concept (POCs) may require OEM full-stack solutions, while others may emphasize cost, performance, or rapid deployment. Apstra lets us adapt accordingly without rethinking the orchestration layer.

Robust management, monitoring and troubleshooting

Apstra isn't limited to initial configuration. It offers continuous validation and "closed-loop" assurance: it constantly checks that what is actually running matches the intent you defined, raising alerts if there's any drift, whether caused by manual changes, device failures or operational mistakes. 

When combined with integrated telemetry, flow-data analysis, and root-cause identification, this significantly reduces the mean time to resolution (MTTR) for incidents. Apstra becomes a single pane of glass for the entire data center network, regardless of the vendor mix. 

Intent "Time Machine" configuration roll-back with snapshots

One feature that distinguishes Apstra from many alternatives is Intent Time Voyager, which enables you to snapshot the entire network state, including configuration, intent definitions, and validation/telemetry state, and roll back or forward across snapshots with just a few clicks. This isn't just per-device, but fabric-wide, encompassing all vendor types in use. That means if you deploy a complex change (e.g., a new EVPN-VXLAN overlay, a routing design change, or connectivity modifications) and something goes wrong, you can revert the whole fabric to a known-good state almost instantly. It's like having a "time machine" for your network, which is enormously valuable for risk-prone changes or multi-vendor environments.

Production-grade support

Lastly, and not insignificantly, there is the fantastic support from Juniper for this product. They are familiar with the various OEM switch vendors and will work with you to ensure your fabric runs as intended.

Automation of the network add in Apstra

Managing the full lifecycle of a heterogeneous data center fabric can be complex, but Apstra makes it remarkably simple. In our AI Proving Ground, adding and removing networks is a daily task. To avoid bottlenecks and free up our network team for higher-value work, we built a self-service automation solution that empowers peers outside the network team to provision networks without direct access to Apstra.

Self-service means agility. ATC consumers can autonomously add or remove networks, while network engineers stay focused on strategic initiatives: no tickets, no waiting, just streamlined operations.

 The interface is intentionally simple: 

  • Choose Add or Remove virtual networks.
  • Provide a list of VLAN IDs.
  • Click Submit.

Behind the scenes, Ansible handles the heavy lifting, and the portal provides real-time job feedback. Users never touch Apstra or Ansible directly; it's just a clean, intuitive workflow.
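To make that concrete, here is a minimal, hypothetical example of the kind of payload the portal hands to the Ansible job; the variable names (`network_action`, `vlan_ids`) are illustrative rather than the exact ones our job template uses.

```yaml
# Illustrative extra vars passed from the self-service portal to the job
# template (variable names are hypothetical and may differ from ours)
network_action: add      # or "remove"
vlan_ids:
  - 110
  - 111
  - 112
```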

Ansible and Ansible Automation Platform are already core to many ATC workflows, so our automation leverages those resources. While Juniper provides the `juniper.apstra` Ansible collection, its capabilities were too limited for our needs, so we drive Apstra's RESTful API directly with the `ansible.builtin.uri` module for fine-grained control.
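As a hedged sketch of that approach, the playbook below authenticates to Apstra's REST API with `ansible.builtin.uri` and saves the returned token for later calls. The host and credential variables are placeholders, and the login endpoint and accepted status codes should be confirmed against your Apstra version's API reference.

```yaml
---
# Minimal sketch: obtain an Apstra API token with ansible.builtin.uri.
# apstra_host, apstra_user, and apstra_pass are placeholder variables.
- name: Authenticate to Apstra
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Request an API token from Apstra
      ansible.builtin.uri:
        url: "https://{{ apstra_host }}/api/aaa/login"
        method: POST
        body_format: json
        body:
          username: "{{ apstra_user }}"
          password: "{{ apstra_pass }}"
        validate_certs: false          # lab setting; validate certs in production
        status_code: [200, 201]        # accept either, depending on Apstra version
      register: apstra_login

    - name: Save the token for subsequent API calls
      ansible.builtin.set_fact:
        apstra_token: "{{ apstra_login.json.token }}"
```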

After the user initiates an action through the self-service portal, the backend automation begins by authenticating with Apstra and verifying that there are no uncommitted changes. If there are, the job stops immediately to avoid conflicts. If the check passes, we lock the blueprint and pull the connectivity templates to determine where the networks will live in the fabric.
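Below is a sketch of that pre-flight check as a pair of tasks from the larger playbook, assuming a diff-status style endpoint that reports whether the blueprint has staged (uncommitted) changes; the endpoint path, the `AuthToken` header, and the response field tested in the `when` condition are assumptions to verify against your Apstra API documentation.

```yaml
# Sketch of the pre-flight check for uncommitted (staged) changes.
# Endpoint path and response fields are assumptions; confirm them against
# your Apstra version's API reference.
- name: Check the blueprint for staged changes
  ansible.builtin.uri:
    url: "https://{{ apstra_host }}/api/blueprints/{{ blueprint_id }}/diff-status"
    method: GET
    headers:
      AuthToken: "{{ apstra_token }}"
    validate_certs: false
    return_content: true
  register: diff_status

- name: Stop immediately if uncommitted changes already exist
  ansible.builtin.fail:
    msg: "Blueprint {{ blueprint_id }} has uncommitted changes; aborting to avoid conflicts."
  when: diff_status.json.status != 'undeployed'   # illustrative field and value
```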

From there, the workflow branches:

  • Add: Create the virtual networks and update the templates.
  • Remove: Remove the networks from templates and then delete them.

 If anything goes wrong along the way, the automation rolls back changes and unlocks the blueprint to maintain a clean and consistent state.
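As a concrete illustration of the Add branch, the task below creates one VLAN-backed virtual network per requested VLAN ID against the blueprint's virtual-networks endpoint. The payload is heavily simplified: a real call also needs the routing zone and leaf bindings, and the connectivity-template updates plus the rollback and unlock handling from our full workflow are omitted here.

```yaml
# Simplified sketch of the "Add" branch: one virtual network per VLAN ID.
# A production payload also needs the routing (security) zone and leaf/port
# bindings; field names below are illustrative and may differ by version.
- name: Create VLAN-backed virtual networks
  ansible.builtin.uri:
    url: "https://{{ apstra_host }}/api/blueprints/{{ blueprint_id }}/virtual-networks"
    method: POST
    headers:
      AuthToken: "{{ apstra_token }}"
    body_format: json
    body:
      label: "vlan{{ item }}"
      vn_type: "vxlan"
      vlan_id: "{{ item | int }}"
    validate_certs: false
    status_code: [201, 202]
  loop: "{{ vlan_ids }}"
```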

What's next for this environment?

So what's next for this network? We plan to expand it with additional Cisco and Arista leaf switches, enabling server connectivity for environments that require a full vendor stack and growing the network even larger to support more AIPG offerings.

For more AI Proving Ground content, please refer back to the AIPG Overview page on wwt.com. 
