Supervisor Agents Don't Exist Yet
A definition of the layer that sits inside a main agent's cycle, decides if each proposed action is acceptable, and either lets it through, nudges it back on path, or stops it.
This is a thesis post about supervisor agents. A supervisor agent is a separate process that sits inside a main agent’s cycle. Every action the main agent proposes passes through the supervisor before it executes. The supervisor decides whether the action is acceptable, and depending on the verdict either lets it through, nudges the main agent back onto the right path, flags the action for a human, or blocks it outright. The layer doesn’t really exist yet, not in the form it needs to. I wrote this because every time I describe what I mean to someone building an agent, they nod, go back to their team, and ship another foundation LLM with a long system prompt. The vocabulary isn’t landing.
The post is the vocabulary.
Overview
The shape end-to-end:
               main agent
                    │
                    │ proposes action
                    ▼
      supervisor (step in the loop)
                    │
                    │ fan out to specialists in parallel
                    ▼
         specialist 1, 2, ... N
(regex · SQL · AST · classifier · narrow LLM)
                    │
                    ▼
      aggregation: union, not vote
                    │
  ┌───────────┬─────┴───────┬─────────────┐
  ▼           ▼             ▼             ▼
  ok        nudge         flag          block
  │           │             │             │
execute   correction     execute       refuse;
          + replan       + record      main agent
          (feeds         for           must replan
          back to        human         from scratch
          main           review
          agent)
                    │
                    ▼
  replay artifact (signed, re-runnable)
                    │
                    ▼
              feedback log
                    │
                    ▼
       accumulation per deployment
Four moving parts. A taxonomy that names every known failure mode of the main agent. A specialist per entry in the taxonomy, narrow by construction. A decision layer that fans actions out and aggregates verdicts by union. A feedback log that turns every flag and every nudge into training signal for that specific deployment. Almost everything in this post is variations on those four parts.
What it actually is
A main agent is the thing that does the work. It plans, calls tools, edits state, opens PRs, runs commands. It runs between human checkpoints. The interesting main agents in production today are coding agents, support agents, research agents, browsing agents, sales agents, ops agents. They share one property. They do real work without a human watching every action.
A supervisor agent is a separate process that sits inside the main agent’s cycle. It observes what the main agent is about to do. It decides whether that action is acceptable. Then it acts on that decision. The action might be ok (let it proceed), nudge (refuse this exact action but send a correction back so the main agent can replan), flag (allow but record for human review), or block (refuse outright and require the main agent to replan from scratch).
That’s the whole thing. Watch, decide, act, record. Inside the loop, every cycle.
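Seen from the main agent’s side, the four verdicts amount to a switch inside its loop. What follows is a minimal sketch under stated assumptions, not a reference implementation: `Decision`, `MainAgent`, `Supervisor`, and `runCycle` are hypothetical names, actions are reduced to strings, and the supervisor is reduced to a single function.

```typescript
// Hypothetical types for illustration; none of these come from a real library.
type Decision =
  | { verdict: "ok" }
  | { verdict: "nudge"; correction: string }
  | { verdict: "flag"; reason: string }
  | { verdict: "block"; reason: string };

interface MainAgent {
  // Proposes the next action given the corrections received so far.
  propose(corrections: string[]): string;
  execute(action: string): void;
}

type Supervisor = (action: string) => Decision;

// One cycle: propose, supervise, act on the verdict, record.
function runCycle(
  agent: MainAgent,
  supervise: Supervisor,
  reviewQueue: string[],
  maxAttempts = 5,
): string | null {
  let corrections: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const action = agent.propose(corrections);
    const decision = supervise(action);
    switch (decision.verdict) {
      case "ok":
        agent.execute(action);
        return action;
      case "flag":
        // Allowed, but recorded for human review.
        reviewQueue.push(`${action}: ${decision.reason}`);
        agent.execute(action);
        return action;
      case "nudge":
        // Refused, but the correction feeds into the next plan.
        corrections = [...corrections, decision.correction];
        break;
      case "block":
        // Refused outright: the planning context is reset.
        corrections = [];
        break;
    }
  }
  return null; // nothing was executed within the attempt budget
}
```

The only difference between `nudge` and `block` in this sketch is what happens to `corrections`: a nudge appends to it, a block clears it.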
Five properties follow from this definition. They’re what separate a supervisor from the four nearest things people confuse it with.
Separation with placement. A supervisor is a separate process. Its own weights, its own memory, its own prompts. But it sits in the main agent’s loop, not outside it. Every proposed action passes through it before execution. The independence is in state. The placement is in the cycle. If the supervisor lives inside the main agent’s reasoning, it isn’t a supervisor. It’s a self-critic, and self-critics fail in correlated ways with the agent they critique.
Deployment-time. Evals are good. Evals aren’t supervisors. Evals tell you how the agent did against a golden set last Tuesday. A supervisor tells you what is happening to a customer’s account right now. Most broken behaviour shows up only in deployment, against the real distribution, with real noise.
Memory across sessions. A guardrail evaluates one request at a time. A supervisor accumulates. It remembers that this particular main agent tried this particular trick three times this month. It remembers which nudges were heeded and which were ignored. Without accumulation, the supervisor is reset every session, and you’ve built a stateless check, not a supervisor.
A taxonomy of failure modes. The supervisor isn’t watching for “anything wrong”. It’s watching for a named, published list of ways this class of main agent is known to fail. Each named failure mode becomes a unit of decomposition. The taxonomy is the foundation of the whole system, and I’ll come back to it.
Authority across a graded set of verdicts. A supervisor is not a dashboard. It’s a process that can let an action through, return a correction, flag for human review, refuse outright, or roll a state back. The graded authority is what separates a supervisor from the observability stack. Observability tells you. A supervisor decides, and the decision feeds back into the main agent’s next plan.
If a system you’re looking at doesn’t have all five of those, it might be a useful tool. It isn’t a supervisor agent.
Why this layer doesn’t exist yet
The agent field for the last two years has been about capability. Can the agent do the task. Can it use tools. Can it maintain context. Can it plan and replan. Can it spawn sub-agents. The frontier question has been “what can this thing do”.
Supervision is a different problem and has moved much slower. A few reasons for the gap.
Supervision is boring. The exciting paper writes itself if your main agent does something new. The supervisor for that main agent doesn’t write a paper, it writes a log line. The incentives haven’t lined up.
Supervision is invisible when it works. A well-supervised agent looks like an agent that did the right thing. The supervisor’s contribution is the bad outcome that didn’t happen, or the nudge that quietly redirected the agent back onto the original task. That’s hard to demo and hard to sell.
The people building agents have mostly been using them in friendly conditions. Their own machine, their own codebase, their own tolerance for weirdness. The failure modes that need supervision show up at scale, under adversarial use, under drift across months. Those conditions haven’t had time to hurt most teams yet. They will.
The fourth reason is the interesting one. The obvious first version of a supervisor agent is broken, and the field hasn’t collectively concluded that yet. People are still trying to make it work.
Why a foundation LLM doesn’t fit
The default move when someone needs a supervisor is to wire a frontier foundation model in as a reviewer. The prompt looks like this:
You are a senior reviewer for an autonomous agent.
Before the agent's action is committed, review it.
Flag anything unsafe, incorrect, off-policy, or harmful.
Be thorough. Be careful. Use your judgment.
This pattern is in every production-agent tutorial right now. It doesn’t work. The reason isn’t that the model isn’t smart enough. The reason is structural, and it splits into three: correlated failure, variance under broad prompts, and no accumulation across sessions.
The first is correlated failure. If your main agent is Claude and your supervisor is also Claude with a different system prompt, you haven’t added an independent check. You’ve added a correlated one. They share the same training distribution, the same RLHF priors, the same aesthetic preferences. They get charmed by the same kinds of confident output. They miss the same kinds of subtle errors.
The obvious objection is to use a different model family. Claude watches GPT. GPT watches Claude. The failure modes decorrelate, the intuition goes, because the supervisor and the main agent are now different animals.
It doesn’t really work. They decorrelate a little. Not enough.
The frontier models were all trained on essentially the same internet. Same code, same papers, same Stack Overflow, same Wikipedia, same GitHub. Their RLHF labellers come from overlapping populations and rate against similar conceptions of “what a good response looks like”. Their architectures belong to the same class (transformer, autoregressive, next-token prediction). The benchmark suites they optimise against are largely shared, which means the blind spots those benchmarks fail to surface are shared too.
The fundamental failure modes (hallucination under uncertainty, charm by confident output, sycophancy, susceptibility to prompt injection) are properties of the architecture class, not the specific model. Two different frontier LLMs reviewing each other are two slightly different lenses on the same world model. They’re not two independent components. They’re two correlated components with slightly different fingerprints.
This is a basic reliability point. Redundancy doesn’t help if the redundant components fail for the same reason. A second engine that fails for the same reason as the first is not redundancy. It is duplication. Two transformer-class language models reviewing each other are two engines from the same factory running on the same fuel. Cross-family review is better than same-family review. The correlation is still too high to build downstream guarantees on.
You can argue a model catches its own mistakes some of the time, and a different-family model catches more. True. The question is variance and floor. A supervisor’s job is to be reliable on every action, not good on average. Reducing correlation a notch by swapping the model doesn’t get you there.
The right answer isn’t a smarter foundation model. It isn’t a different-family foundation model. It is to stop putting a generalist in this slot.
The shape that works
A supervisor agent should be a decomposition. Not one model with a big prompt. Many small specialists, each watching for exactly one failure mode, each refusing to answer questions outside its domain. The decomposition is the architecture. Everything else falls out of it.
The taxonomy comes first. Before you write a single line of supervisor code, you publish a list of the failure modes for the class of main agent you’re supervising. Not a wish list. A specific, named, exhaustive-as-you-can-make-it catalogue of the ways this kind of agent is known to fail. Each entry has at minimum a short name, a one-line description, an example drawn from a real incident, a severity, a default verdict (nudge for recoverable failures, block for unrecoverable ones), a reference to an existing risk-classification system if one applies (CWE for security, HIPAA for healthcare, SOC2 for ops, MITRE ATT&CK for adversarial behaviour), and a detection-method label like the following:
deterministic
small-classifier
narrow-llm
hybrid
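A taxonomy entry with those fields could be written down as a plain record. This is a sketch only: the field names, the four-level `Severity` scale, and the `assertUniqueIds` helper are my assumptions, and the example text is a placeholder, not a real incident.

```typescript
// Hypothetical shape for one taxonomy entry; field names are illustrative.
type DetectionMethod = "deterministic" | "small-classifier" | "narrow-llm" | "hybrid";
type Severity = "low" | "medium" | "high" | "critical"; // assumed scale

interface TaxonomyEntry {
  id: string;                          // short name
  description: string;                 // one line
  example: string;                     // drawn from a real incident
  severity: Severity;
  defaultVerdict: "nudge" | "block";   // nudge if recoverable, block if not
  externalRef?: string;                // e.g. a CWE id, when one applies
  detection: DetectionMethod;
}

const duplicateRefundEntry: TaxonomyEntry = {
  id: "duplicate_refund_same_order",
  description: "A second refund is issued for an order already refunded in the window.",
  example: "(placeholder: link to the incident write-up)",
  severity: "high",
  defaultVerdict: "block",
  detection: "deterministic",
};

// Ids must be unique so that each name points at exactly one failure mode.
function assertUniqueIds(entries: TaxonomyEntry[]): void {
  const seen = new Set<string>();
  for (const e of entries) {
    if (seen.has(e.id)) throw new Error(`duplicate taxonomy id: ${e.id}`);
    seen.add(e.id);
  }
}
```

Keeping the entry as data rather than code is what makes it publishable: the same record can sit in a shared list and in a team’s own repository.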
The taxonomy is published, not internal. Publishing forces precision and forces the field to converge on names. Without shared names, every team builds bespoke supervisors that don’t compose. Shared names are an API.
Each entry in the taxonomy gets exactly one specialist. A specialist is the smallest unit of supervision. It takes one input (a proposed action plus context) and emits one output, a verdict and a confidence, and where the verdict is nudge, a structured correction the main agent can consume. The contract:
// supervisor/specialist.ts
// ProposedAction, Context, and ActionPredicate are deployment-specific
// types defined elsewhere.
type Verdict = "ok" | "nudge" | "flag" | "block" | "refuse";

interface SpecialistResult {
  verdict: Verdict;
  confidence: number;
  reason?: string;      // what the specialist saw
  correction?: string;  // populated only when verdict is "nudge"
}

interface Specialist {
  name: string;             // matches a taxonomy entry id
  domain: ActionPredicate;  // declared at construction
  // Sync for pure checks, async for checks that do I/O.
  evaluate(
    action: ProposedAction,
    context: Context,
  ): SpecialistResult | Promise<SpecialistResult>;
}
A specialist is allowed to refuse. Refusal isn’t failure. A specialist asked about something outside its taxonomy entry should return refuse with high confidence, and the decision layer treats refusal as “not the right specialist for this”. This is how you keep specialists narrow. A specialist that answers questions outside its domain isn’t narrow anymore, and once it stops being narrow, the properties you bought by decomposing go away.
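One way to make refusal structural is to enforce it outside the specialist, so a check never even sees an action outside its declared domain. A minimal sketch, assuming a simplified contract: the context argument is dropped for brevity, `ProposedAction` is a hypothetical shape, and `withDomainGuard` is a name I made up.

```typescript
type Verdict = "ok" | "nudge" | "flag" | "block" | "refuse";

interface SpecialistResult {
  verdict: Verdict;
  confidence: number;
  reason?: string;
  correction?: string;
}

// Hypothetical action shape; real deployments would define their own.
interface ProposedAction {
  type: string;
  [key: string]: unknown;
}

interface Specialist {
  name: string;
  domain: (action: ProposedAction) => boolean;
  evaluate(action: ProposedAction): SpecialistResult;
}

// Wraps a specialist so that anything outside its declared domain is
// refused before evaluate() ever runs.
function withDomainGuard(s: Specialist): Specialist {
  return {
    name: s.name,
    domain: s.domain,
    evaluate(action: ProposedAction): SpecialistResult {
      if (!s.domain(action)) {
        return { verdict: "refuse", confidence: 1.0, reason: "outside_domain" };
      }
      return s.evaluate(action);
    },
  };
}
```

With the guard in place, a specialist’s own code only has to be right inside its domain.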
The decision layer fans out to every specialist in parallel, with a hard per-specialist timeout. Each specialist returns a verdict or refuses. The decision layer aggregates by union, not vote. Any block blocks. Any nudge becomes a correction sent back to the main agent. Any flag flags. Refusals are ignored. Default is ok if no one objected.
Voting is the obvious alternative and it’s wrong. A specialist is responsible for exactly one failure mode. Inside its domain, it is the authority for that failure mode. Asking other specialists to vote on whether it was right is asking experts in unrelated fields to weigh in on something outside their domain. Any specialist that confidently identifies its failure mode is decisive on its own.
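Union aggregation is small enough to write out. The sketch below takes the specialists’ results as already collected and reduces them to one decision. The precedence between simultaneous objections (block over nudge over flag) is my assumption, since the rule as stated only says each kind of objection is honoured. `Decision`, `aggregate`, and the `specialist` field are illustrative names.

```typescript
type Verdict = "ok" | "nudge" | "flag" | "block" | "refuse";

interface SpecialistResult {
  specialist: string; // taxonomy entry id
  verdict: Verdict;
  confidence: number;
  correction?: string;
}

interface Decision {
  verdict: "ok" | "nudge" | "flag" | "block";
  corrections: string[]; // non-empty only when the verdict is "nudge"
  objectors: string[];   // every specialist that returned nudge, flag, or block
}

// Union, not vote: one objection is enough. Refusals and oks are ignored.
// Assumed precedence when several specialists object: block > nudge > flag.
function aggregate(results: SpecialistResult[]): Decision {
  const objections = results.filter(
    (r) => r.verdict !== "refuse" && r.verdict !== "ok",
  );
  const objectors = objections.map((r) => r.specialist);
  const has = (v: Verdict) => objections.some((r) => r.verdict === v);

  if (has("block")) return { verdict: "block", corrections: [], objectors };
  if (has("nudge")) {
    const corrections = objections
      .filter((r) => r.verdict === "nudge" && r.correction !== undefined)
      .map((r) => r.correction as string);
    return { verdict: "nudge", corrections, objectors };
  }
  if (has("flag")) return { verdict: "flag", corrections: [], objectors };
  return { verdict: "ok", corrections: [], objectors }; // nobody objected
}
```

Note there is no counting anywhere in the function. Forty `ok` results do not outweigh one `block`.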
Most specialists should not be LLMs. This is the counterintuitive part. For most named failure modes, the right specialist is a deterministic check. A regex over the proposed action. An AST walk over a diff. A SQL query against the system state. A comparison against a hard-coded budget. These checks aren’t glamorous. They’re correct. The set of inputs a deterministic check returns block or nudge for is exactly the set you said it should. No temperature, no jailbreak, no clever phrasing that gets past it. You can’t prompt-inject a regex. You can’t smooth-talk an AST diff. And critically, a deterministic check is structurally different from the main agent in a way no LLM-on-LLM check can be. Its correlation with the main agent’s failure modes is zero by construction.
When the failure mode is too semantic for a deterministic check, the right specialist is a small classifier or a narrow LLM, fine-tuned on that one failure mode and asked only that one question. Not a foundation model with a general prompt. A small model that has seen ten thousand examples of “this exact failure” and “this exact non-failure” and learned to discriminate between them.
In production, I’d expect the supervisor to be something like fifty specialists. Maybe thirty deterministic checks, fifteen small classifiers, five narrow LLM judges. Most are cheap. All are independent. Each one is responsible for one row of the taxonomy. None of them is asked to “review the action”. Each is asked exactly one question.
Specialists run in-loop, synchronously, on every proposed action. The supervisor is a step in the main agent’s cycle. The latency budget is real, and the answer is parallelism and tight per-specialist timeouts. Cheap deterministic specialists fit in any budget. Expensive narrow LLM judges only fit if the budget is wide enough, or if they’re reserved for severity-tier actions where the cost is worth paying.
There’s a second tier of specialists that run after execution, not before. Post-action checks that verify the action’s effects on the world (the database row that was actually written, the file that was actually committed, the message that was actually sent). Those can also flag, record, or initiate a revert. The pre-action specialists shape what the main agent does. The post-action specialists shape what gets undone.
Specialists accumulate per deployment. Each specialist owns a feedback log. Every flag and every nudge is recorded with the proposed action, the context, the correction (if any), whether the main agent heeded it, and (when a human eventually reviews) a label indicating whether the verdict was correct. Over time, the log is the training signal. False positives go into the negative set, new patterns go into the positive set, the specialist gets sharper for this deployment. Per-deployment memory lives here, in the fine-tuning data for the small models and the expanded rule sets for the deterministic ones. Not in a giant prompt.
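The feedback log can be pictured as a list of records plus a function that splits one specialist’s reviewed records into training sets. A sketch under assumptions: `FeedbackRecord`, `TrainingSets`, and `toTrainingSets` are hypothetical names, and actions and contexts are left untyped.

```typescript
interface FeedbackRecord {
  specialist: string;
  action: unknown;
  context: unknown;
  verdict: "flag" | "nudge";
  correction?: string;                    // present for nudges
  heeded?: boolean;                       // did the main agent follow the nudge
  humanLabel?: "correct" | "incorrect";   // set once a human reviews
}

interface TrainingSets {
  positives: FeedbackRecord[];   // verdict confirmed by a human
  negatives: FeedbackRecord[];   // false positives
  unreviewed: FeedbackRecord[];  // no human label yet
}

// Splits one specialist's log entries by human label.
function toTrainingSets(log: FeedbackRecord[], specialist: string): TrainingSets {
  const sets: TrainingSets = { positives: [], negatives: [], unreviewed: [] };
  for (const rec of log) {
    if (rec.specialist !== specialist) continue;
    if (rec.humanLabel === "correct") sets.positives.push(rec);
    else if (rec.humanLabel === "incorrect") sets.negatives.push(rec);
    else sets.unreviewed.push(rec);
  }
  return sets;
}
```

For a deterministic specialist the same split is still useful: the negatives are the cases where the rule needs an exception.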
The taxonomy itself accumulates too. When a human notices a failure mode no specialist caught, the team publishes a new taxonomy entry, builds a specialist for it, and adds it to the supervisor. The supervisor grows. The main agent stays the same.
Three specialists
Three sketches across three domains, to make the shape concrete.
Support agent. Duplicate refund check.
// supervisor/specialists/duplicate_refund_same_order.ts
const duplicateRefund: Specialist = {
  name: "duplicate_refund_same_order",
  domain: action => action.type === "refund",
  evaluate(action, ctx) {
    const count = ctx.db.queryOne(
      `SELECT COUNT(*) AS n FROM refunds
        WHERE order_id = $1
          AND status = 'completed'
          AND issued_at > now() - $2 * interval '1 day'`,
      [action.orderId, ctx.refundWindowDays],
    ).n;
    return count >= 1
      ? { verdict: "block", confidence: 1.0, reason: "prior_refund_in_window" }
      : { verdict: "ok", confidence: 1.0 };
  },
};
No model. A SQL query and a comparison. Correct, fast, cheap, impossible to inject. The verdict is block, not nudge. You can’t issue half a duplicate refund. There’s no correction string that makes this action acceptable. The main agent must replan from scratch.
Research agent. Fabricated-citation check.
// supervisor/specialists/citation_resolves_and_quote_appears.ts
const citationCheck: Specialist = {
  name: "citation_resolves_and_quote_appears",
  domain: action => action.type === "cite",
  async evaluate(action, ctx) {
    const fetched = await httpGet(action.url, { timeoutMs: 2000 });
    if (fetched.status !== 200) {
      return {
        verdict: "nudge",
        confidence: 0.9,
        reason: "source URL did not resolve",
        correction: `The citation URL ${action.url} returned HTTP ${fetched.status}. `
          + `Find a different source for this claim, or remove the citation.`,
      };
    }
    if (fetched.body.includes(action.quotedText)) {
      return { verdict: "ok", confidence: 1.0 };
    }
    const paraphrase = await narrowParaphraseJudge.run({
      candidate: action.quotedText,
      source: fetched.body,
      model: "paraphrase-judge-small-v2", // single-purpose fine-tune
    });
    return paraphrase === "yes"
      ? { verdict: "ok", confidence: 0.8 }
      : {
          verdict: "nudge",
          confidence: 0.85,
          reason: "quoted text not present in source",
          correction: `The quote you cited does not appear in the source at ${action.url}. `
            + `Either update the quote to match the source verbatim, or replace `
            + `this citation with one that supports your claim.`,
        };
  },
};
Mostly deterministic. A small fine-tuned model handles only the paraphrase decision because verbatim string match is too strict. The model is trained for paraphrase detection, nothing else. The verdicts here are nudge, not block. A failed citation isn’t catastrophic. The main agent should be told what’s wrong and which fix is acceptable, then replan. The correction string is structured enough that the main agent can act on it without a second human round-trip.
Browsing agent. Out-of-domain click check.
// supervisor/specialists/click_outside_allowed_domains.ts
const outOfDomainClick: Specialist = {
  name: "click_outside_allowed_domains",
  domain: action => action.type === "click",
  evaluate(action, ctx) {
    const d = parseRegistrableDomain(action.targetUrl);
    return ctx.allowedDomains.includes(d)
      ? { verdict: "ok", confidence: 1.0 }
      : {
          verdict: "block",
          confidence: 1.0,
          reason: "domain_not_in_allowlist",
        };
  },
};
Three lines of real logic. Arithmetic-equivalent. Catches a real failure mode that has caused real harm in real deployments. Block, not nudge. The allowlist is a hard boundary; the agent has no business outside it. The main agent gets the block and has to replan a path that stays inside the allowed surface.
You could write twenty of these in an afternoon for a given main agent. That’s what the supervisor is. Twenty of these. Then forty. Then a hundred. Each one narrow, cheap, named, with the right default verdict for the failure mode it watches.
Hard parts
The architecture above is the easy part to describe. The hard parts are these.
Coverage. How do you know your taxonomy is complete? You don’t. You know it’s incomplete the moment the main agent fails in a way no specialist caught. Coverage is a moving target, and the only way to close the gap is to instrument every supervisor decision and every human override, then look at where the supervisor missed. The log is the input to the next taxonomy revision. Without that loop, the supervisor stops accumulating, and a static supervisor against a learning main agent gets worse over time.
Latency. The supervisor is in the cycle, so every action pays its budget. A one-second supervisor in front of a sub-second main agent is unacceptable for interactive use. The answer is parallelism, tight per-specialist timeouts, and a clear policy on timeout. The right default is timeout = refuse, treated as no-vote, action proceeds. The wrong default is fail-closed on timeout, which turns one slow specialist into a denial-of-service against the main agent. The latency budget is configuration, not a global constant. Different actions get different supervisor budgets; a wire transfer can afford 400ms, a tool-call to retrieve a search result cannot.
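The timeout policy fits in a few lines. The sketch below races each specialist against a timer and turns a timeout into a `refuse`, so a slow specialist counts as a no-vote rather than a denial. Everything here is illustrative: `withTimeout`, `fanOut`, and the `budgetMs` table are hypothetical, the 50ms and 100ms figures are placeholders, and errors thrown by a specialist are deliberately left unhandled.

```typescript
type Verdict = "ok" | "nudge" | "flag" | "block" | "refuse";

interface SpecialistResult {
  verdict: Verdict;
  confidence: number;
  reason?: string;
}

type Evaluate = () => Promise<SpecialistResult>;

// Hypothetical per-action budgets in milliseconds. Only the 400ms figure
// comes from the text; the others are placeholders.
const budgetMs: Record<string, number | undefined> = {
  wire_transfer: 400,
  search: 50,
};
const DEFAULT_BUDGET_MS = 100;

// Races one specialist against a timer. On timeout the result is "refuse",
// which the decision layer ignores, so the action is not held up.
async function withTimeout(run: Evaluate, ms: number): Promise<SpecialistResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<SpecialistResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ verdict: "refuse", confidence: 1.0, reason: "timeout" }),
      ms,
    );
  });
  try {
    return await Promise.race([run(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Fans out to every specialist in parallel under the action's budget.
async function fanOut(
  actionType: string,
  specialists: Evaluate[],
): Promise<SpecialistResult[]> {
  const ms = budgetMs[actionType] ?? DEFAULT_BUDGET_MS;
  return Promise.all(specialists.map((s) => withTimeout(s, ms)));
}
```

Because the specialists run in parallel, the supervisor’s worst-case latency is the budget itself, not the budget times the number of specialists.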
Authority. The supervisor can record, flag, nudge, block, or revert. Which authority does it have, for which specialist, on which action? This is a configuration problem, not a code problem, but it’s the configuration that decides whether the supervisor is useful or theatrical. A supervisor that can only record is observability. A supervisor that can flag is a queue for human review. A supervisor that can nudge is a coach. A supervisor that can block is operational. A supervisor that can revert is autonomous. Different deployments need different levels, and the same deployment may need different levels per specialist. The duplicate-refund check might be allowed to block. The citation check might only be allowed to nudge. The browser-domain check might be allowed to block, and the budget check might be allowed to revert.
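Authority as configuration could look like a cap applied to each specialist’s verdict before aggregation. This is a sketch with several assumptions of mine: the ordering ok < flag < nudge < block, the rule that a capped block with no correction falls back to flag, and the `budget_check` id are all illustrative. Revert is a post-action power, so for pre-action verdicts it caps at block here.

```typescript
type Verdict = "ok" | "flag" | "nudge" | "block";
type Authority = "record" | "flag" | "nudge" | "block" | "revert";

interface Result {
  verdict: Verdict;
  correction?: string;
}

// Assumed ordering, weakest to strongest.
const byRank: Verdict[] = ["ok", "flag", "nudge", "block"];
const verdictRank: Record<Verdict, number> = { ok: 0, flag: 1, nudge: 2, block: 3 };
const authorityCap: Record<Authority, number> = {
  record: 0, flag: 1, nudge: 2, block: 3, revert: 3,
};

// Per-specialist authority, following the examples in the text.
// "budget_check" is a hypothetical id.
const authorityBySpecialist: Record<string, Authority> = {
  duplicate_refund_same_order: "block",
  citation_resolves_and_quote_appears: "nudge",
  click_outside_allowed_domains: "block",
  budget_check: "revert",
};

// Caps a verdict at the specialist's authority. The original verdict is
// kept in recordedVerdict so nothing is lost from the log.
function applyAuthority(
  result: Result,
  authority: Authority,
): Result & { recordedVerdict: Verdict } {
  const capped = Math.min(verdictRank[result.verdict], authorityCap[authority]);
  let verdict = byRank[capped];
  // A nudge without a correction is not actionable; fall back to flag.
  if (verdict === "nudge" && result.correction === undefined) verdict = "flag";
  return { ...result, verdict, recordedVerdict: result.verdict };
}
```

A specialist with `record` authority still produces its real verdict for the log. It just cannot change what the main agent does.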
Nudge compliance. A nudge only works if the main agent heeds it. The supervisor has to detect when the main agent ignores a nudge and re-proposes the same action (or a near-equivalent variant). The mechanic is per-session memory keyed on the specialist that issued the nudge. After N repeats inside a session, the verdict escalates from nudge to block. N is a configuration knob. The default I’d choose is two: a nudge, a second nudge with a sharper correction, then block. Without this escalation, nudges become a corrective lullaby that the main agent learns to ignore.
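The escalation mechanic is a counter keyed on session and specialist. The sketch below makes one large simplification, which is an assumption of mine: any further nudge from the same specialist in the same session counts as a repeat. Detecting whether the re-proposed action is really the same or a near-equivalent variant is left out. `NudgeEscalator` and its methods are hypothetical names.

```typescript
type Outcome =
  | { verdict: "nudge"; attempt: number }
  | { verdict: "block"; reason: "nudge_ignored" };

class NudgeEscalator {
  private readonly counts = new Map<string, number>();
  private readonly maxNudges: number;

  constructor(maxNudges = 2) {
    this.maxNudges = maxNudges;
  }

  // Call whenever a specialist wants to nudge. Returns the verdict to
  // actually issue: a nudge while under the limit, a block once over it.
  onNudge(sessionId: string, specialist: string): Outcome {
    const key = `${sessionId}:${specialist}`;
    const prior = this.counts.get(key) ?? 0;
    if (prior >= this.maxNudges) {
      return { verdict: "block", reason: "nudge_ignored" };
    }
    this.counts.set(key, prior + 1);
    return { verdict: "nudge", attempt: prior + 1 };
  }

  // Per-session memory: drop the counters when the session ends.
  endSession(sessionId: string): void {
    for (const key of Array.from(this.counts.keys())) {
      if (key.startsWith(`${sessionId}:`)) this.counts.delete(key);
    }
  }
}
```

The `attempt` number gives the specialist a hook for sharpening the second correction before the third proposal is blocked.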
Replay. A good supervisor produces a portable, deterministic record of every decision. Given the same input, the supervisor must produce the same verdict every time it replays against historical actions. That’s what makes the supervisor itself auditable. The replay artifact has to include the proposed action, the context the specialists saw, every specialist’s verdict, the correction strings for any nudges, the aggregation, and the final decision. It has to be signed, time-stamped, and re-runnable. This is what lets a regulator, a customer, or a future engineer ask “what did the supervisor see at 03:14:09, and why did it allow this action” and get a real answer.
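A replay artifact needs a stable serialization, a signature over it, and a way to re-run the decision against the recorded inputs. The sketch below shows that shape under assumptions: all names are hypothetical, the signer is injected as a plain function so the example stays self-contained, and a real deployment would use a keyed MAC or an asymmetric signature there.

```typescript
interface SpecialistVerdict {
  specialist: string;
  verdict: string;
  correction?: string;
}

interface ReplayArtifact {
  timestamp: string;
  action: unknown;
  context: unknown;            // exactly what the specialists saw
  verdicts: SpecialistVerdict[];
  decision: string;            // final aggregated verdict
  signature: string;
}

// Assumption: signing is supplied by the caller.
type Signer = (payload: string) => string;

// Stable serialization: object keys are sorted, so equal content always
// yields the same string regardless of key order.
function canonical(value: unknown): string {
  if (value === undefined) return "null";
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).filter((k) => obj[k] !== undefined).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`).join(",")}}`;
}

function seal(body: Omit<ReplayArtifact, "signature">, sign: Signer): ReplayArtifact {
  return { ...body, signature: sign(canonical(body)) };
}

// True only if the artifact is untampered AND re-running the decision on
// the recorded inputs reproduces the recorded decision.
function verifyAndReplay(
  artifact: ReplayArtifact,
  sign: Signer,
  decide: (action: unknown, context: unknown) => string,
): boolean {
  const { signature, ...body } = artifact;
  if (sign(canonical(body)) !== signature) return false;
  return decide(artifact.action, artifact.context) === artifact.decision;
}
```

Replay only holds if `decide` depends on nothing but the recorded action and context. A specialist that fetches a URL or calls a model has to record what it got back in `context` for the re-run to mean anything.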
The supervisor for the supervisor. Obvious next question. The answer isn’t fully satisfying. Deterministic specialists supervise themselves by construction; same input, same verdict. Small classifiers are validated against held-out sets with continuous precision/recall monitoring. Narrow LLM judges are sampled and human-reviewed in batch. There’s no perfect answer. The argument is that the supervisor’s surface is small, structured, and narrow enough that the validation problem is tractable in a way the main agent’s isn’t. A foundation-LLM-as-supervisor can’t be validated this way. A union of deterministic checks, small classifiers, and narrow judges can. The validation problem moves from intractable to merely difficult, and merely difficult is the kind of problem we can ship against.
FAQ
Isn’t this just a rules engine with extra steps?
Partially. The deterministic specialists are a rules engine. The small classifiers and narrow LLM judges aren’t, the nudge mechanic isn’t, and the feedback loop that retrains specialists per deployment isn’t. A static rules engine doesn’t accumulate. The supervisor does. Calling it “rules engine plus” undersells the accumulation loop, which decides whether the supervisor gets sharper or stays flat across months.
What if I use a different model family for the supervisor, Claude watching GPT or GPT watching Claude?
The decorrelation is weaker than it looks. Frontier LLMs are trained on largely the same internet, share the same architecture class, and inherit a similar RLHF aesthetic from overlapping rater populations. They share blind spots in correlated ways. Cross-family review is better than same-family review, but it’s still two correlated components, not two independent ones. And the structural problems (variance under broad prompts, no accumulation) don’t go away when you swap the model. The slot itself is wrong for any foundation LLM, regardless of which lab trained it.
Why is nudge a separate verdict instead of just “block with a message”?
Because the contract is different. block says “this action is unacceptable, replan from scratch, the supervisor has no opinion on what you do next”. nudge says “this exact action is unacceptable, but here is a structured correction the supervisor expects you to incorporate into your next plan”. The main agent treats them differently. A block resets the planning context. A nudge feeds back into it. Conflating the two costs you the most productive verdict in the system.
Why union aggregation and not weighted voting?
A specialist is responsible for exactly one failure mode. Inside its domain, it’s the authority. Outside, it refuses. Voting introduces noise from specialists that don’t know the failure mode. Weighted voting just hides the noise under a coefficient. Union is the only aggregation rule that lets every specialist remain the sole authority in its own narrow lane, which is the property the decomposition was supposed to deliver.
What about a giant LLM with retrieval over the failure history?
Three problems. The retrieval prompt grows with deployment age and stops fitting in cache. The retrieval system itself becomes a critical, hard-to-validate component. And the underlying model is still a generalist failing in correlated ways with the main agent. You moved the accumulation problem out of the prompt but you didn’t fix correlated failure or variance. A giant LLM with retrieval is a better foundation-LLM supervisor than one without retrieval. It still isn’t a supervisor.
Don’t evals, guardrails, or a critic already cover this?
Evals run offline, against a fixed dataset, before deployment. Guardrails run inline against the prompt, one request at a time. A critic runs inside the main agent’s own reasoning loop. None of them are a separate, deployment-time, stateful, taxonomy-driven, authoritative process sitting in the main agent’s cycle. Each fills a different slot. The supervisor slot is empty.
Who builds the taxonomy if I’m starting fresh?
You do. You start with the failures you’ve actually seen in your own deployments. You add entries as new failure modes show up. Over time, the cross-team common entries get pulled into a shared taxonomy and the long tail stays per-deployment. The same way CVE and CWE evolved.
Most of the agent infrastructure I see being built right now is doubling down on capability and pretending supervision will figure itself out. It won’t. The supervisor for the autonomous main agent has to be designed, not gestured at, and it has to live inside the cycle, not next to it.
I’m not announcing a roadmap. I’m writing down the object so the conversation can get past “we have a supervisor, we use GPT to watch Claude Code” or whichever cross-family combination is currently in the tutorial. When the field shares the words, it’ll be possible to compare actual supervisors and notice which ones are real. The post is a definition. The work is the specialists, one row of the taxonomy at a time.
The hard part of the next phase of agent deployment isn’t making them smarter. It’s making every autonomous act leave enough structure behind that another process can say yes, here is a better way, here is what to log, or no.

