How to Choose the Right LLM

Details

Goals & objectives

Hardware & software

Solution overview

This hands-on lab teaches enterprise engineers how to systematically evaluate, benchmark, and select optimal large language models for distinct steps inside an agentic control plane.

A common architectural mistake when building AI solutions is deploying a single, monolithic, highly expensive flagship model to handle every computational task. In production, autonomous agents are composed of many small, distinct, and highly specialized LLM call primitives. A single user request can trigger anywhere from 5 to 50 sequential LLM calls across the system's control loop. Designing a rigid, single-model architecture leaves immense compute efficiency, latency, and system reliability on the table.

Working in JupyterLab on a pre-configured Ubuntu jumpbox, students execute canonical agent use cases against models spanning four capability tiers. They measure physical performance (latencies, token consumption, tail variance) and behavioral compliance (structured-output reliability, tool-use fluency, and grounded faithfulness). By analyzing where smaller, right-sized models outperform expensive flagship tiers, students construct a data-driven model selection scorecard and design a multi-tier cascade deployment plan.

Lab diagram

Goals and objectives

Phase 1: Mapping & Right-Sizing the Anatomy of an Agent

Instead of treating the 12 primitives as a flat checklist, this phase focuses on isolating individual LLM calls within the Observe → Plan → Decide → Act control sequence to find the lowest-cost model that can successfully execute the job.

Objective 1 (Deconstruct the Loop): Classify and map the 12 core LLM call primitives across their precise operational phases:

Observe: Classification, Extraction, Summarization, Grounded Q&A.

Plan: Routing, Planning, Reasoning over Observations.

Decide: Clarification, Validation/Self-Checking.

Act: Generation, Rewriting, Result Synthesis (with cross-cutting Tool Fluency).

Objective 2 (Right-Sizing Routine Nodes): Identify high-volume baseline paths such as routine classification, short-context extraction, and literal reformatting—where a 3B dense model matches flagship performance in sub-200ms time.

Phase 2: Stress-Testing & Isolating Behavioral Failures

This phase shifts from individual use cases to systemic telemetry. You will actively measure where architectural facts (size, reasoning engine) collide with actual behavior at the API boundary.

Objective 3 (Quantify Tail Latency & Caching): Analyze model latency profiles by evaluating p50 vs. p95 tail distributions. Distinguish between misleading, cache-assisted static benchmarks and volatile, cache-busting production workloads.

Objective 4 (Audit the Core Integration Failures): Run empirical trials to catch and categorize severe structural anomalies:

The Agent Tax: Measure the mathematical decay of system-wide reliability over multi-step sequential tasks when single-step success drops below 95%.

Channel Defiance: Identify models (like the always-on 120B MoE) that completely bypass native API tool_calls payloads to dump raw JSON into visible text body channels.

Structural Typing Drift: Trap occurrences where dense models stringify strict database parameters (e.g., "cpu_cores": "16" instead of 16).

Cognitive Compute Volatility (Payload Starvation): Track how unconstrained reasoning models burn through tight output envelopes (max_tokens) inside hidden thinking channels, causing empty or truncated responses.

Phase 3: Building a Defensive Architecture

Move from diagnostic observation to defensive software engineering, building a blueprint that protects runtime SLAs and accuracy.

Objective 5 (Build Interceptive Circuit Breakers): Defeat "The Agent Tax" by implementing programmatic validation gates. When syntax, schema, or contextual rules are violated, intercept the failure and trigger an instant, feedback-rich self-healing retry on a low-cost tier.

Objective 6 (Formulate the Production Cascade & Routing Policy): Design an evidence-based routing schema. Secure low-latency happy paths on dense 3B/7B models, leverage platform prompt caching where predictable, and reserve expensive reasoning effort tiers exclusively for complex causal planning or manual-intervention synthesis.

Hardware and software

User Workspace: A dedicated Ubuntu Jumpbox featuring an isolated Anaconda/Python environment, JupyterLab, and native JSON parser utilities.

Connectivity: Secure, pre-authenticated API connection mapping directly to AIPG's LLMaaS gateway.

Model Provisioning: Models are hosted on high-density GPU clusters (such as Blackwell or Hopper architectures). They are exposed via OpenAI-compatible endpoints with native server-side schema forcing (response_format={"type": "json_object"}) enabled.