LLMs are at the core of agentic systems

Look at almost any diagram depicting the architecture of an agentic AI system, and you will typically find at least one, but frequently multiple, LLMs.

Example: the architecture of an agentic AI system

Why is that?

To answer that question, let's observe a simple model for how human intent is translated into action.

A picture is worth a thousand words:

 

Are LLMs a new class of processor?

Traditionally, human ideation and intent are encoded into a domain-specific language (DSL) or programming language. The resulting "program" is compiled and/or otherwise packaged into or executed within an application. That application, in turn, runs within the context of some environment (an operating system and runtime), finally executing on top of a (possibly virtual) CPU: human intent arrives at some conclusion.

With the introduction of LLMs, we have a new class of processor: one that is capable of understanding natural language and therefore able to directly drive human intent to achieve an end task (i.e., to arrive at a conclusion). Where it was previously necessary to manually convert human intent into computer code, LLMs now allow humans to express that intent directly in natural language. LLMs then interpret that expression to achieve the goal. In this new paradigm, the underlying "tech stack" supporting the compute differs. Where the stack on the left has compilers, OSs and traditional CPUs, the stack on the right has encoders, models and model runners, and GPUs.

Agentic system components

Modern agentic AI systems are usually built from cooperating building blocks. The red box depicted in the above diagram describes the internals of the model-runner layer at a high level—GPUs or specialized accelerators plus inference runtimes such as Triton or vLLM that execute inference based on a specific model ("model weights" in the diagram). As we saw in the opening diagram of this article, LLMs are often instantiated multiple times. Each instantiation may have different weights (i.e., be a different model) or it may be the same model running under conditions that optimize its behavior toward specific goals.

For example, by giving each instantiation different system prompts, preloaded contexts, access to RAG or protocols like MCP & A2A, or by varying temperature settings, a general-purpose LLM can be specialized for planning, code generation or tool use. Frequently, we find subsidiary LLMs used to critique or score the results of other steps in the agentic pipeline, reducing the cognitive burden on the upstream planning LLM. This type of delegation can be very helpful when dealing with limited context window sizes and helps keep upstream LLMs focused on the core task.

Because each sub-task (planning, reflection, acting) benefits from different priors and decoding regimes, breaking the agent into multiple, narrowly-scoped LLM calls yields better factuality and controllability than would have been achieved by giving a single monolithic prompt to a single LLM. Another crucial difference is that not all of the prompts used to achieve the end goal are specified a-priori. Rather, they are generated by the constituent LLMs as they interpret and break down the original prompt.

An orchestration layer (e.g., LangChain, CrewAI, custom state machines) and/or bolted on protocol capabilities (such as MCP and A2A) route messages between cooperating LLM instances, external tools, and (when present) a persistent memory layer (vector or graph stores) that provides long-term context. The orchestration layer plays a role akin to that of an operating system: it schedules LLM calls, manages context (memory), and mediates access to external I/O tools. Finally, a presentation layer—APIs, chat UIs or autonomous task loops—exposes the whole stack to humans or other software. 

Borrowed from Google's A2A Documentation

The "intent interpreter"

From a systems-engineering perspective, a large language model behaves like a general-purpose "natural-language processor." Where a conventional pipeline compiles intent-expressing source code into machine instructions executed on a CPU, an agent pipeline feeds uncompiled intent—plain English—into an LLM that interprets and executes it in situ.

Viewed this way, an LLM's weight matrix is analogous to microcode burned into a processor, and a prompt is analogous to an instruction stream—only, the "instruction set" is natural language and the execution semantics are probabilistic sequence modeling. This reframing explains why agent diagrams feature multiple LLM blocks: just as complex software decomposes into cooperating threads on a CPU, complex agent behaviour decomposes into cooperating language-level processes running atop a new, linguistically-addressable compute substrate.

Because current realizations of this design pattern involve layers of software that materialize the LLM, and because the LLM itself "runs" the natural language prompt by performing inference within the model's latent space, I am framing a materialized LLM as analogous to a language interpreter: an "intent interpreter."

The role of system prompts and prompt templates

Packaged LLMs (such as the frontier models like Claude and ChatGPT, accessed via cloud APIs) almost always incorporate "system prompts" into every inference interaction. These are natural language instructions typically prepended to the LLM's context, and therefore evaluated at inference time. If you're curious about the content of system prompts, this github repository contains a cache of system prompts, including the Claud 4 system prompt, and you can find an analysis of that prompt at this site. While system prompts are typically injected into the context window for evaluation at inference time, models can also be trained to behave in accordance with such prompts, e.g., via reinforcement learning and other techniques. 

Extending the analogy of LLMs as "intent processors," we can think of their associated system prompts as the firmware or BIOS. Before a single user token is "interpreted," the system prompt establishes behavioral baselines: the model's persona, allowed tools, safety policies, and "storage devices" (RAG endpoints, vector stores). Just as a BIOS protects itself with ring-0 privileges, the system prompt lives in a protected slice of the context window that downstream prompts cannot overwrite—unless an adversary manages the textual equivalent of a firmware exploit.

Similar in role to system prompts are prompt templates. These are fixed prompts into which variable elements are inserted, in order to support a particular goal. Here's an example prompt template that might be used to process a {{task_description}} within the LLM responsible for planning as part of an agentic framework. As you can see from this example, one of the powerful characteristics of LLMs as "intent processors" is that we can use natural language to frame and control their behaviors.

<|system|>
You are an autonomous assistant named xyz. You are reliable, concise, and skilled in
planning, code generation, and using external tools via function calls.
Always think step-by-step. Respond in structured JSON if instructed.
</|system|>

<|context|>
Current task: {{task_description}}

Tools available:
{{tool_descriptions}}

Memory snapshot:
{{retrieved_memory_chunks}}
</|context|>

<|user|>
{{user_query}}
</|user|>

<|scratchpad|>
Internal thoughts:
- Step 1: Understand the user's intent
- Step 2: Break down the task into actions
- Step 3: Use tools or generate content as needed
</|scratchpad|>

<|output_format|>
{{desired_output_schema_or_instructions}}
</|output_format|>

<|response|>

How do LLMs reach out into the environment?

At this point, you may be wondering how LLMs are able to take actions, such as using tools, that affect their environment. As usual in computer science, there isn't any "magic" involved. Instead, we find that familiar mechanisms are in play: Namely, structured tool-use protocols orchestrated by the surrounding software, i.e., the LLM runner, e.g., vLLM.

Borrowed from MCP Documentation: https://modelcontextprotocol.io/introduction

At a high level, the LLM is prompted to emit specific text in a predefined format, such as a JSON object, specific sentinels followed by specific function calling or command syntax, driven by its training or the system prompt, and its cognition of the current task. The runtime environment captures the output tokens of the inference stream, then parses that output and executes the corresponding action in the real world: calling an API, querying a database, invoking a Python function or controlling a robot actuator. From the model's perspective, it's simply continuing a sequence; it has no awareness that the output is "doing" anything. The "magic" lies in the surrounding code that interprets and routes its output. This is where the system prompt can be very important, because it may include instructions to pause generation after every tool use, so that the output of that tool use can be incorporated into a subsequent prompt to continue progress towards the ultimate end-task.

This architecture mirrors how interpreters and operating systems mediate between user input and machine-level execution. Just as a shell interprets a command like rm file.txt and calls the appropriate system routine, an LLM agent might output {"action": "search", "query": "nearest gas station"}, which is then handed off to a Google Maps API wrapper. Tool-use specifications are often baked into the system prompt or provided through intermediate layers like LangChain, CrewAI or OpenAI's function calling system or via clients to servers that implement tool-use protocols such as MCP (Model Context Protocol). By whatever means, in other words, there is a defined contract: if the model outputs this kind of structured output, we promise to parse it and process any requested tool use. In this way, the LLM becomes a kind of "intent generator," and the surrounding agent infrastructure materializes that intent as real-world effects.

… just a mental model

Remember, the parallels drawn in this article are only meant to help you build a mental model for the role of LLMs in complex agentic flows. The research community is still working to completely characterize just what exactly is going on within LLMs engaged in this type of inference. It would be natural for you to project the foregoing onto Chain of Thought (CoT)/reasoning models, to characterize the "reasoning" they do as a relatively "procedural" or inductive exploration of the solution space. Several recent papers argue against that characterization, as does our blog post about DeepSeek R1:

Irrespective of whether the resulting plan stems from compute-like pattern matching (a-la Prolog), retrieval-like pattern matching (a-la query based recall from memorized traces), or truly inductive/procedural/multi-step computations via learned operators in the models' latent spaces; the result is a reduction or (as a compiler engineer would say) "lowering" of the natural language intent, into actionable multi-step sequences that can be implemented by the rest of the agentic system.  

Mechanistic interpretability

Mechanistic interpretability is the study of how a machine learning model does what it does, at a semantic level above the simplest operations that drive the computation (i.e., above the matmuls). Rather than treating the model as a black box, mechanistic interpretability aims to reverse-engineer the network by identifying distinct circuits, mapping out the geometry of its hidden states, and tracing the precise computations performed at each stage. Tools like "logit lenses" (and many others) are used to explore these low-level mechanisms, allowing researchers to better understand why a model behaves as it does, diagnose failure modes, and guide safer, more reliable deployments.

I'm sure you'll be hearing more about mechanistic interpretability as LLMs become more and more pervasive in the environment.