As the well-beloved blog of Simon Willison pointed out, coming up with a definition for agents has been tricky.
Simon eventually settled on:
An LLM agent runs tools in a loop to achieve a goal.

And I think that’s fair enough. Though it still bears some further analysis in order to determine what exactly counts as the agent, what should identify one, and how we should version it.
The answers to these questions carry implications for how we approach managing the deployment and lifecycle of Agents, and for how we reason about architecting Agentic Systems.
So here I’d like to elaborate on those points and try to draw out some conclusions.
However, it’s easier to start with what an agent is not.
An agent is not whatever code has been written to glue its components together: langgraph, crewai, etc. We should package this code, version it and manage it. But we should never mistake this code for the agent.
By runtime identity I mean the unique operational identity of a running agent instance, bound to its context and configuration.
An agent should be identified and versioned according to the components that impact its behavior.
The components implied above are:
- the model (and its version)
- the system prompt
- the set of tools the agent can call
- the configuration bound to the running instance
The smell test here is: if we changed any of these ingredients, would the behavior of the system become fundamentally different?
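As a rough sketch of what versioning against those ingredients might look like, here is an entirely hypothetical snippet (not tied to any particular framework) that derives a single fingerprint from them; that fingerprint, rather than the container image tag alone, is what a deployment could carry as its version label.

```python
import hashlib
import json


def behavioral_fingerprint(model: str, system_prompt: str,
                           tools: list[str], config: dict) -> str:
    """Derive a stable identifier from the components that shape behavior.

    Swapping the container image or orchestration framework leaves this
    fingerprint unchanged; changing the model, prompt, tools, or config does not.
    """
    payload = json.dumps(
        {
            "model": model,
            "system_prompt": system_prompt,
            "tools": sorted(tools),  # tool ordering shouldn't affect identity
            "config": config,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


# This value, not the image tag, is what identifies the agent's behavior.
print(behavioral_fingerprint(
    model="example-llm-2025-01",  # hypothetical model name
    system_prompt="You are a travel booking assistant...",
    tools=["search_flights", "book_calendar_event"],
    config={"temperature": 0.2, "max_tool_iterations": 5},
))
```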
Meanwhile, if you switch from langgraph to crewai you shouldn’t expect a big change in behavior so long as you maintain an equivalent graph of possible state transitions.
In short, our container image alone gives us very minimal information about how that agent will actually behave in production.
Configuration describes potential behavior; context determines actual behavior.

This is a subtle but important difference. With a webserver, for example, you can still meaningfully audit its runtime behavior whether it connects to a Postgres or a MySQL server.
This is simply not the case with Agents, and so we must be careful not to carry over our previous assumptions about operational discipline.
Unless context is captured and propagated alongside each request, we also lose the ability to maintain provenance across network boundaries.
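To make the configuration/context split concrete, here is a minimal sketch with invented names: everything in `AgentConfig` is known at deploy time and belongs in version control, while everything in `RuntimeContext` only exists once a request arrives and therefore has to be captured by observability rather than by the release process.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentConfig:
    """Potential behavior: fixed at deploy time, versioned with the release."""
    model: str
    system_prompt: str
    tools: tuple[str, ...]
    temperature: float = 0.2


@dataclass
class RuntimeContext:
    """Actual behavior: assembled per request, visible only at runtime."""
    conversation_history: list[str] = field(default_factory=list)
    retrieved_documents: list[str] = field(default_factory=list)
    tool_results: list[dict] = field(default_factory=list)


# The same AgentConfig can produce wildly different behavior depending on what
# ends up inside RuntimeContext, which is why auditing configuration alone
# tells us so little about what an agent actually did.
```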
As Simon’s blog suggests, what an “Agent” needs is primarily an LLM.
And the distinguishing feature of LLM-based systems from a platform perspective is their effective stochastic variability.
The mapping of possible inputs to possible outputs in an LLM is unfathomably large and the selection of paths through that space is probabilistic.
This makes the difference between managing Agent workloads and non-Agent workloads akin to the difference between experimental and theoretical physics.
In experimental physics we observe systems in order to learn their behavior over time.
In theoretical physics we start with a “complete” understanding of a system and observe it in order to find deviations from our expectations.
When we manage web servers and databases we begin with a clean set of expectations about how requests should flow through our system and often only capture more detailed information when something goes wrong.
For example, a request results in a 500 error. We record a resulting stack trace and use it to analyze the problem.
In other words, our observability systems are geared primarily toward noticing deviations.
With an LLM involved, we can’t possibly know our expected outputs ahead of time.
Some user could use a formulation of the Trolley problem to goad the system into spitting out limericks about secret keys in its environment and this poetic failure would return clean 200s.
We can’t afford to wait around for a stack trace.
Instead, our observability must primarily aim to continually map the range of behaviors our system is capable of, so that we can answer “why” even when everything seems to be operating well.
Then, if a travel agent (ha ha) quietly books a wrong calendar event, the LLM version, prompt, and tool call chain that led there are easy to find, even weeks later when an angry customer sends an email about missing their cruise.
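As a sketch of what that kind of always-on tracing could look like, here is the OpenTelemetry Python API used to record the model, a prompt hash, and each tool call as span attributes. The attribute names are my own placeholders rather than an established semantic convention, and the tool invocation itself is elided.

```python
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("travel-agent")


def record_agent_step(model: str, prompt: str, tool_name: str, tool_args: dict):
    """Trace every step, not just failures: the trace is the audit trail."""
    with tracer.start_as_current_span("agent.step") as span:
        span.set_attribute("agent.model", model)
        span.set_attribute("agent.prompt.sha256",
                           hashlib.sha256(prompt.encode()).hexdigest())
        with tracer.start_as_current_span("agent.tool_call") as tool_span:
            tool_span.set_attribute("tool.name", tool_name)
            tool_span.set_attribute("tool.args", repr(tool_args))
            # ... invoke the tool here and record its result ...
```

Querying those spans weeks later for the booking in question is a far better position to be in than grepping logs for a stack trace that was never thrown.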
This sort of observability discipline is a key factor in “MLOps”. However, I’ll note that the framing of it is often geared toward improving systems: “we collect detailed runtime data so that we can train our system to become more efficient, or so that we can fine-tune cheaper models to behave correctly”.
The problem with this framing is that it doesn’t emphasize that we need MLOps-level tracing even without fine-tuning. That means capturing model versions, prompts, tool calls, and their results as a matter of course, not only when something visibly fails.
And we must ensure that our workload identity is bound to the components that actually determine behavior and propagated with every request, including across network boundaries.
From there, we can begin to work our way backwards toward recognizing deviations from normal operation, likely by tuning guardrails.
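One toy illustration of working backwards from that map, assuming we have already been recording tool-call chains: flag any run whose sequence of tool calls has never been seen before, and let a human (or, eventually, a guardrail) decide whether it is acceptable.

```python
# Toy sketch: flag agent runs whose tool-call sequence hasn't been observed before.
observed_sequences: set[tuple[str, ...]] = set()


def looks_normal(tool_calls: list[str]) -> bool:
    """Return True if this run matches previously mapped behavior."""
    sequence = tuple(tool_calls)
    if sequence in observed_sequences:
        return True
    # Novel behavior: record it and surface it for review or guardrail tuning.
    observed_sequences.add(sequence)
    return False


# A run that suddenly calls "read_secrets" before "send_email" would surface
# here long before an angry customer emails about a missed cruise.
looks_normal(["search_flights", "book_calendar_event"])
```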
The observability story is something I see Agent Frameworks such as CrewAI leveraging as a selling point. However, I would argue that it can and should mostly be solved at the platform level.
Agents are non-deterministic and require an experimental “MLOps” mindset.
I’m exploring these ideas as part of the Kagenti project, where we’re building primitives on Kubernetes for solving problems around identity and delegated auth flows.
We’re also developing a good observability story that leverages our integration with AI-protocol-aware gateways and service meshes to build the sort of thorough traces that I’m describing above.
If you’re building multi-agent systems and want to work with us on it, please file an issue, leave a comment, or reach out to me.