Solution overview

This hands-on lab teaches enterprise engineers how to systematically evaluate, benchmark, and select the best-fit large language model for each distinct step inside an agentic control plane.

A common architectural mistake when building AI solutions is deploying a single, expensive flagship model to handle every task. In production, autonomous agents are composed of many small, specialized LLM calls. A single user request can trigger anywhere from 5 to 50 sequential LLM calls across the system's control loop. Routing every one of those calls through the same model, regardless of how simple the step is, leaves substantial compute efficiency, latency, and reliability gains unrealized.

Working in JupyterLab on a pre-configured Ubuntu jumpbox, students run canonical agent use cases against models spanning four capability tiers. They measure runtime performance (latency, token consumption, and tail-latency variance) and behavioral compliance (structured-output reliability, tool-use fluency, and grounded faithfulness). By analyzing where smaller, right-sized models outperform expensive flagship tiers, students build a data-driven model selection scorecard and design a multi-tier cascade deployment plan.
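To make these measurements concrete, here is a minimal sketch of a per-model benchmark loop. It is not the lab's actual harness. The names benchmark, summarize, and is_valid_json_object, the required "intent" key, and the two stub models are all hypothetical stand-ins. A real run would replace the stubs with calls to whichever model endpoints the lab provides.

```python
import json
import math
import statistics
import time
from typing import Callable, Dict, Iterable, List, Set


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return median, nearest-rank p95, and spread for a list of latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # nearest-rank percentile
    return {
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "stdev_ms": statistics.pstdev(ordered),
    }


def is_valid_json_object(text: str, required_keys: Set[str]) -> bool:
    """True if text parses as a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except ValueError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj.keys())


def benchmark(
    call_model: Callable[[str], str],
    prompts: Iterable[str],
    required_keys: Set[str],
) -> Dict[str, float]:
    """Time each call and record how often the output is valid structured JSON."""
    latencies: List[float] = []
    valid = 0
    for prompt in prompts:
        start = time.perf_counter()
        output = call_model(prompt)
        latencies.append((time.perf_counter() - start) * 1000.0)
        if is_valid_json_object(output, required_keys):
            valid += 1
    report = summarize(latencies)
    report["json_valid_rate"] = valid / len(latencies)
    return report


# Hypothetical stubs standing in for real model endpoints.
def fake_small_model(prompt: str) -> str:
    return json.dumps({"intent": "lookup", "confidence": 0.9})


def fake_chatty_model(prompt: str) -> str:
    return "Sure! Here is the JSON: {intent: lookup}"


if __name__ == "__main__":
    prompts = ["Where is my order?", "Cancel my subscription."]
    for name, model in [("small", fake_small_model), ("chatty", fake_chatty_model)]:
        print(name, benchmark(model, prompts, {"intent"}))
```

Running the same loop against each tier yields one row per model. Those rows are the kind of raw material a selection scorecard can be built from, alongside token counts and the tool-use and faithfulness checks that this sketch does not cover.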

Lab diagram
