<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Subho Halder]]></title><description><![CDATA[Co-founder and former CEO of Appknox. Now building in AI x Security. Working notes from what I am building and what I'm shipping, what's surprising me, what I can't yet explain.]]></description><link>https://notes.subhohalder.com</link><image><url>https://substackcdn.com/image/fetch/$s_!-jSz!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca45cc84-5aaf-4cf2-b841-e84265796233_400x400.png</url><title>Subho Halder</title><link>https://notes.subhohalder.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 20 Jun 2026 14:17:02 GMT</lastBuildDate><atom:link href="https://notes.subhohalder.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Subho Halder]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[me@subhohalder.com]]></webMaster><itunes:owner><itunes:email><![CDATA[me@subhohalder.com]]></itunes:email><itunes:name><![CDATA[Subho Halder]]></itunes:name></itunes:owner><itunes:author><![CDATA[Subho Halder]]></itunes:author><googleplay:owner><![CDATA[me@subhohalder.com]]></googleplay:owner><googleplay:email><![CDATA[me@subhohalder.com]]></googleplay:email><googleplay:author><![CDATA[Subho Halder]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Supervisor Agents Don't Exist Yet]]></title><description><![CDATA[A definition of the layer that sits inside a main agent's cycle, decides if each proposed action is acceptable, and either lets it through, nudges it back on path, or stops 
it.]]></description><link>https://notes.subhohalder.com/p/supervisor-agents-dont-exist-yet</link><guid isPermaLink="false">https://notes.subhohalder.com/p/supervisor-agents-dont-exist-yet</guid><dc:creator><![CDATA[Subho Halder]]></dc:creator><pubDate>Thu, 21 May 2026 14:42:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-jSz!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca45cc84-5aaf-4cf2-b841-e84265796233_400x400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is a thesis post about supervisor agents. A supervisor agent is a separate process that sits inside a main agent&#8217;s cycle. Every action the main agent proposes passes through the supervisor before it executes. The supervisor decides whether the action is acceptable, and depending on the verdict either lets it through, nudges the main agent back onto the right path, flags the action for a human, or blocks it outright. The layer doesn&#8217;t really exist yet, not in the form it needs to. I wrote this because every time I describe what I mean to someone building an agent, they nod, go back to their team, and ship another foundation LLM with a long system prompt. The vocabulary isn&#8217;t landing.</p><p>The post is the vocabulary.</p><h2>Overview</h2><p>The shape end-to-end:</p><pre><code><code>              main agent
                  &#9474;
                  &#9474; proposes action
                  &#9660;
              supervisor  (step in the loop)
                  &#9474;
                  &#9474; fan out to specialists in parallel
                  &#9660;
        specialist 1, 2, ... N
        (regex &#183; SQL &#183; AST &#183; classifier &#183; narrow LLM)
                  &#9474;
                  &#9660;
        aggregation: union, not vote
                  &#9474;
       &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
       &#9660;      &#9660;        &#9660;      &#9660;
       ok   nudge     flag   block
       &#9474;      &#9474;        &#9474;       &#9474;
   execute correction execute  refuse;
           + replan   + record main agent
           (feeds     for      must replan
           back to    human    from scratch
           main       review
           agent)
                  &#9474;
                  &#9660;
        replay artifact (signed, re-runnable)
                  &#9474;
                  &#9660;
            feedback log
                  &#9474;
                  &#9660;
        accumulation per deployment
</code></code></pre><p>Four moving parts. A taxonomy that names every known failure mode of the main agent. A specialist per entry in the taxonomy, narrow by construction. A decision layer that fans actions out and aggregates verdicts by union. A feedback log that turns every flag and every nudge into training signal for that specific deployment. Almost everything in this post is a variation on those four parts.</p><h2>What it actually is</h2><p>A main agent is the thing that does the work. It plans, calls tools, edits state, opens PRs, runs commands. It runs between human checkpoints. The interesting main agents in production today are coding agents, support agents, research agents, browsing agents, sales agents, ops agents. They share one property. They do real work without a human watching every action.</p><p>A supervisor agent is a separate process that sits inside the main agent&#8217;s cycle. It observes what the main agent is about to do. It decides whether that action is acceptable. Then it acts on that decision. Its verdict might be <code>ok</code> (let it proceed), <code>nudge</code> (refuse this exact action but send a correction back so the main agent can replan), <code>flag</code> (allow but record for human review), or <code>block</code> (refuse outright and require the main agent to replan from scratch).</p><p>That&#8217;s the whole thing. Watch, decide, act, record. Inside the loop, every cycle.</p><p>Five properties follow from this definition. They&#8217;re what separates a supervisor from the four nearest things people confuse it with.</p><p><strong>Separation with placement.</strong> A supervisor is a separate process. Its own weights, its own memory, its own prompts. But it sits in the main agent&#8217;s loop, not outside it. Every proposed action passes through it before execution. The independence is in state. The placement is in the cycle. If the supervisor lives inside the main agent&#8217;s reasoning, it isn&#8217;t a supervisor. 
It&#8217;s a self-critic, and self-critics fail in correlated ways with the agent they critique.</p><p><strong>Deployment-time.</strong> Evals are good. Evals aren&#8217;t supervisors. Evals tell you how the agent did against a golden set last Tuesday. A supervisor tells you what is happening to a customer&#8217;s account right now. Most broken behaviour shows up only in deployment, against the real distribution, with real noise.</p><p><strong>Memory across sessions.</strong> A guardrail evaluates one request at a time. A supervisor accumulates. It remembers that this particular main agent tried this particular trick three times this month. It remembers which nudges were heeded and which were ignored. Without accumulation, the supervisor is reset every session, and you&#8217;ve built a stateless check, not a supervisor.</p><p><strong>A taxonomy of failure modes.</strong> The supervisor isn&#8217;t watching for &#8220;anything wrong&#8221;. It&#8217;s watching for a named, published list of ways this class of main agent is known to fail. Each named failure mode becomes a unit of decomposition. The taxonomy is the foundation of the whole system, and I&#8217;ll come back to it.</p><p><strong>Authority across a graded set of verdicts.</strong> A supervisor is not a dashboard. It&#8217;s a process that can let an action through, return a correction, flag for human review, refuse outright, or roll a state back. The graded authority is what separates a supervisor from the observability stack. Observability tells you. A supervisor decides, and the decision feeds back into the main agent&#8217;s next plan.</p><p>If a system you&#8217;re looking at doesn&#8217;t have all five of those, it might be a useful tool. It isn&#8217;t a supervisor agent.</p><h2>Why this layer doesn&#8217;t exist yet</h2><p>The agent field for the last two years has been about capability. Can the agent do the task. Can it use tools. Can it maintain context. Can it plan and replan. 
Can it spawn sub-agents. The frontier question has been &#8220;what can this thing do&#8221;.</p><p>Supervision is a different problem and has moved much slower. A few reasons for the gap.</p><p>Supervision is boring. The exciting paper writes itself if your main agent does something new. The supervisor for that main agent doesn&#8217;t write a paper, it writes a log line. The incentives haven&#8217;t lined up.</p><p>Supervision is invisible when it works. A well-supervised agent looks like an agent that did the right thing. The supervisor&#8217;s contribution is the bad outcome that didn&#8217;t happen, or the nudge that quietly redirected the agent back onto the original task. That&#8217;s hard to demo and hard to sell.</p><p>The people building agents have mostly been using them in friendly conditions. Their own machine, their own codebase, their own tolerance for weirdness. The failure modes that need supervision show up at scale, under adversarial use, under drift across months. Those conditions haven&#8217;t had time to hurt most teams yet. They will.</p><p>The fourth reason is the interesting one. The obvious first version of a supervisor agent is broken, and the field hasn&#8217;t collectively concluded that yet. People are still trying to make it work.</p><h2>Why a foundation LLM doesn&#8217;t fit</h2><p>The default move when someone needs a supervisor is to wire a frontier foundation model in as a reviewer. The prompt looks like this:</p><pre><code><code>You are a senior reviewer for an autonomous agent.
Before the agent's action is committed, review it.
Flag anything unsafe, incorrect, off-policy, or harmful.
Be thorough. Be careful. Use your judgment.
</code></code></pre><p>This pattern is in every production-agent tutorial right now. It doesn&#8217;t work. The reason isn&#8217;t that the model isn&#8217;t smart enough. The reason is structural, and it splits into three: correlated failure, variance under a broad prompt, and no accumulation.</p><p>The first is correlated failure. If your main agent is Claude and your supervisor is also Claude with a different system prompt, you haven&#8217;t added an independent check. You&#8217;ve added a correlated one. They share the same training distribution, the same RLHF priors, the same aesthetic preferences. They get charmed by the same kinds of confident output. They miss the same kinds of subtle errors.</p><p>The obvious objection is to use a different model family. Claude watches GPT. GPT watches Claude. The failure modes decorrelate, the intuition goes, because the supervisor and the main agent are now different animals.</p><p>It doesn&#8217;t really work. They decorrelate a little. Not enough.</p><p>The frontier models were all trained on essentially the same internet. Same code, same papers, same Stack Overflow, same Wikipedia, same GitHub. Their RLHF labellers come from overlapping populations and rate against similar conceptions of &#8220;what a good response looks like&#8221;. Their architectures belong to the same class (transformer, autoregressive, next-token prediction). The benchmark suites they optimise against are largely shared, which means the blind spots those benchmarks fail to surface are shared too.</p><p>The fundamental failure modes (hallucination under uncertainty, charm by confident output, sycophancy, susceptibility to prompt injection) are properties of the architecture class, not the specific model. Two different frontier LLMs reviewing each other are two slightly different lenses on the same world model. They&#8217;re not two independent components. They&#8217;re two correlated components with slightly different fingerprints.</p><p>This is a basic reliability point. 
Redundancy doesn&#8217;t help if redundant components fail for the same reason. A second engine that fails for the same reason as the first one is not real redundancy. It is duplication. Two transformer-class language models reviewing each other are two engines from the same factory running on the same fuel. Cross-family review is better than same-family review. It is not low-correlation enough to ship downstream guarantees against.</p><p>You can argue a model catches its own mistakes some of the time, and a different-family model catches more. True. The question is variance and floor. A supervisor&#8217;s job is to be reliable, not averaged-good. Reducing correlation a notch by swapping the model doesn&#8217;t get you there.</p><p>The right answer isn&#8217;t a smarter foundation model. It isn&#8217;t a different-family foundation model. It is to stop putting a generalist in this slot.</p><h2>The shape that works</h2><p>A supervisor agent should be a decomposition. Not one model with a big prompt. Many small specialists, each watching for exactly one failure mode, each refusing to answer questions outside its domain. The decomposition is the architecture. Everything else falls out of it.</p><p>The taxonomy comes first. Before you write a single line of supervisor code, you publish a list of the failure modes for the class of main agent you&#8217;re supervising. Not a wish list. A specific, named, exhaustive-as-you-can-make-it catalogue of the ways this kind of agent is known to fail. 
Each entry has at minimum a short name, a one-line description, an example drawn from a real incident, a severity, a default verdict (<code>nudge</code> for recoverable failures, <code>block</code> for unrecoverable ones), a reference to an existing risk-classification system if one applies (CWE for security, HIPAA for healthcare, SOC2 for ops, MITRE ATT&amp;CK for adversarial behaviour), and a detection-method label like the following: </p><ul><li><p><code>deterministic</code></p></li><li><p><code>small-classifier</code></p></li><li><p><code>narrow-llm</code></p></li><li><p><code>hybrid</code></p></li></ul><p>The taxonomy is published, not internal. Publishing forces precision and forces the field to converge on names. Without shared names, every team builds bespoke supervisors that don&#8217;t compose. Shared names are an API.</p><p>Each entry in the taxonomy gets exactly one specialist. A specialist is the smallest unit of supervision. It takes one input (a proposed action plus context) and emits one output, a verdict and a confidence, and where the verdict is <code>nudge</code>, a structured correction the main agent can consume. The contract:</p><pre><code><code>// supervisor/specialist.ts
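//
// --- Sketch (not part of the contract): one possible shape for a taxonomy entry ---
// The fields follow the list given in this post. The names Severity,
// DetectionMethod, TaxonomyEntry, and exampleEntry are hypothetical, and the
// three-level severity scale is an assumption, since the post does not fix one.
type Severity = "low" | "medium" | "high";
type DetectionMethod = "deterministic" | "small-classifier" | "narrow-llm" | "hybrid";

interface TaxonomyEntry {
  id:             string;             // short name, reused as Specialist.name
  description:    string;             // one line
  example:        string;             // drawn from a real incident
  severity:       Severity;
  defaultVerdict: "nudge" | "block";  // nudge if recoverable, block if not
  reference?:     string;             // id in an existing risk-classification system, if one applies
  detection:      DetectionMethod;
}

// Illustrative entry only. The incident text is a placeholder, not real data.
const exampleEntry: TaxonomyEntry = {
  id:             "duplicate_refund_same_order",
  description:    "Agent issues a second refund for an order already refunded inside the window.",
  example:        "(placeholder: summarise a real incident here)",
  severity:       "high",             // assumed for illustration
  defaultVerdict: "block",
  detection:      "deterministic",
};
// --- End of sketch. The specialist contract follows. ---
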
type Verdict = "ok" | "nudge" | "flag" | "block" | "refuse";

interface SpecialistResult {
  verdict:     Verdict;
  confidence:  number;
  reason?:     string;    // what the specialist saw
  correction?: string;    // populated only when verdict is "nudge"
}

interface Specialist {
  name:   string;             // matches a taxonomy entry id
  domain: ActionPredicate;    // declared at construction
  evaluate(
    action:  ProposedAction,
    context: Context,
  ): SpecialistResult | Promise&lt;SpecialistResult&gt;;  // async specialists return a Promise
}
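
// --- Sketch (not part of the contract): union aggregation in the decision layer ---
// A minimal illustration of "aggregate by union, not vote" as this post
// describes it. UnionInput, UnionDecision, and aggregateByUnion are hypothetical
// names. The precedence used here (block, then nudge, then flag, then ok) is an
// assumption: the post states that any block blocks, any nudge nudges, and any
// flag flags, but does not rank them. A SpecialistResult is assignable to UnionInput.
type UnionInput = {
  verdict:     "ok" | "nudge" | "flag" | "block" | "refuse";
  correction?: string;
};

type UnionDecision = {
  verdict:     "ok" | "nudge" | "flag" | "block";
  corrections: string[];   // populated only when the decision is "nudge"
};

function aggregateByUnion(results: UnionInput[]): UnionDecision {
  // A refusal means "not my domain", so it never influences the outcome.
  const answered = results.filter(function (r) { return r.verdict !== "refuse"; });
  const has = function (v: string): boolean {
    return answered.some(function (r) { return r.verdict === v; });
  };

  // Collect every correction so the main agent sees all of them at once.
  const corrections: string[] = [];
  for (const r of answered) {
    if (r.verdict !== "nudge") continue;
    if (r.correction !== undefined) corrections.push(r.correction);
  }

  if (has("block")) return { verdict: "block", corrections: [] };
  if (has("nudge")) return { verdict: "nudge", corrections };
  if (has("flag"))  return { verdict: "flag",  corrections: [] };
  return { verdict: "ok", corrections: [] };   // nobody objected
}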
</code></code></pre><p>A specialist is allowed to refuse. Refusal isn&#8217;t failure. A specialist asked about something outside its taxonomy entry should return <code>refuse</code> with high confidence, and the decision layer treats refusal as &#8220;not the right specialist for this&#8221;. This is how you keep specialists narrow. A specialist that answers questions outside its domain isn&#8217;t narrow anymore, and once it stops being narrow, the properties you bought by decomposing go away.</p><p>The decision layer fans out to every specialist in parallel, with a hard per-specialist timeout. Each specialist returns a verdict or refuses. The decision layer aggregates by union, not vote. Any <code>block</code> blocks. Any <code>nudge</code> becomes a correction sent back to the main agent. Any <code>flag</code> flags. Refusals are ignored. Default is <code>ok</code> if no one objected.</p><p>Voting is the obvious alternative and it&#8217;s wrong. A specialist is responsible for exactly one failure mode. Inside its domain, it is the authority for that failure mode. Asking other specialists to vote on whether it was right is asking experts in unrelated fields to weigh in on something outside their domain. The first specialist to confidently identify its failure mode wins.</p><p>Most specialists should not be LLMs. This is the counterintuitive part. For most named failure modes, the right specialist is a deterministic check. A regex over the proposed action. An AST walk over a diff. A SQL query against the system state. A comparison against a hard-coded budget. These checks aren&#8217;t glamorous. They&#8217;re correct. The set of inputs a deterministic check returns <code>block</code> or <code>nudge</code> for is exactly the set you said it should. No temperature, no jailbreak, no clever phrasing that gets past it. You can&#8217;t prompt-inject a regex. You can&#8217;t smooth-talk an AST diff. 
And critically, a deterministic check is structurally different from the main agent in a way no LLM-on-LLM check can be. Its correlation with the main agent&#8217;s failure modes is zero by construction.</p><p>When the failure mode is too semantic for a deterministic check, the right specialist is a small classifier or a narrow LLM, fine-tuned on that one failure mode and asked only that one question. Not a foundation model with a general prompt. A small model that has seen ten thousand examples of &#8220;this exact failure&#8221; and &#8220;this exact non-failure&#8221; and learned to discriminate between them.</p><p>In production, the supervisor is something like fifty specialists. Maybe thirty deterministic checks, fifteen small classifiers, five narrow LLM judges. Most are cheap. All are independent. Each one is responsible for one row of the taxonomy. None of them is asked to &#8220;review the action&#8221;. Each is asked exactly one question.</p><p>Specialists run in-loop, synchronously, on every proposed action. The supervisor is a step in the main agent&#8217;s cycle. The latency budget is real, and the answer is parallelism and tight per-specialist timeouts. Cheap deterministic specialists fit in any budget. Expensive narrow LLM judges only fit if the budget is wide enough, or if they&#8217;re reserved for severity-tier actions where the cost is worth paying.</p><p>There&#8217;s a second tier of specialists that run after execution, not before. Post-action checks that verify the action&#8217;s effects on the world (the database row that was actually written, the file that was actually committed, the message that was actually sent). Those can also <code>flag</code>, <code>record</code>, or initiate a <code>revert</code>. The pre-action specialists shape what the main agent does. The post-action specialists shape what gets undone.</p><p>Specialists accumulate per deployment. Each specialist owns a feedback log. 
Every flag and every nudge is recorded with the proposed action, the context, the correction (if any), whether the main agent heeded it, and (when a human eventually reviews) a label indicating whether the verdict was correct. Over time, the log is the training signal. False positives go into the negative set, new patterns go into the positive set, the specialist gets sharper for this deployment. Per-deployment memory lives here, in the fine-tuning data for the small models and the expanded rule sets for the deterministic ones. Not in a giant prompt.</p><p>The taxonomy itself accumulates too. When a human notices a failure mode no specialist caught, the team publishes a new taxonomy entry, builds a specialist for it, and adds it to the supervisor. The supervisor grows. The main agent stays the same.</p><h2>Three specialists</h2><p>Three sketches across three domains, to make the shape concrete.</p><p>Support agent. Duplicate refund check.</p><pre><code><code>// supervisor/specialists/duplicate_refund_same_order.ts
const duplicateRefund: Specialist = {
  name:   "duplicate_refund_same_order",
  domain: action =&gt; action.type === "refund",
  evaluate(action, ctx) {
    const count = ctx.db.queryOne(
      `SELECT COUNT(*) AS n FROM refunds
       WHERE order_id = $1
         AND status   = 'completed'
         AND issued_at &gt; now() - $2 * interval '1 day'`,
      [action.orderId, ctx.refundWindowDays],
    ).n;
    return count &gt;= 1
      ? { verdict: "block", confidence: 1.0, reason: "prior_refund_in_window" }
      : { verdict: "ok",    confidence: 1.0 };
  },
};
</code></code></pre><p>No model. A SQL query and a comparison. Correct, fast, cheap, impossible to inject. The verdict is <code>block</code>, not <code>nudge</code>. You can&#8217;t issue half a duplicate refund. There&#8217;s no correction string that makes this action acceptable. The main agent must replan from scratch.</p><p>Research agent. Fabricated-citation check.</p><pre><code><code>// supervisor/specialists/citation_resolves_and_quote_appears.ts
const citationCheck: Specialist = {
  name:   "citation_resolves_and_quote_appears",
  domain: action =&gt; action.type === "cite",
  async evaluate(action, ctx) {
    const fetched = await httpGet(action.url, { timeoutMs: 2000 });

    if (fetched.status !== 200) {
      return {
        verdict:    "nudge",
        confidence: 0.9,
        reason:     "source URL did not resolve",
        correction: `The citation URL ${action.url} returned HTTP ${fetched.status}. `
                  + `Find a different source for this claim, or remove the citation.`,
      };
    }

    if (fetched.body.includes(action.quotedText)) {
      return { verdict: "ok", confidence: 1.0 };
    }

    const paraphrase = await narrowParaphraseJudge.run({
      candidate: action.quotedText,
      source:    fetched.body,
      model:     "paraphrase-judge-small-v2",  // single-purpose fine-tune
    });

    return paraphrase === "yes"
      ? { verdict: "ok", confidence: 0.8 }
      : {
          verdict:    "nudge",
          confidence: 0.85,
          reason:     "quoted text not present in source",
          correction: `The quote you cited does not appear in the source at ${action.url}. `
                    + `Either update the quote to match the source verbatim, or replace `
                    + `this citation with one that supports your claim.`,
        };
  },
};
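
// --- Sketch (not part of this specialist): escalating ignored nudges ---
// The nudge-compliance rule in this post says that repeated nudges from the
// same specialist inside one session should escalate to block after N repeats.
// NudgeBudget and nextVerdictForRepeat are hypothetical names. maxNudges is
// assumed to be a non-negative integer (the post suggests two). Detecting that
// a re-proposed action is a near-equivalent of the nudged one is out of scope.
type NudgeBudget = { [specialistName: string]: number };

function nextVerdictForRepeat(
  budget:         NudgeBudget,   // per-session state: nudges left per specialist
  specialistName: string,
  maxNudges:      number,
): "nudge" | "block" {
  const remaining = budget[specialistName] ?? maxNudges;
  if (remaining === 0) return "block";      // nudges exhausted, so escalate
  budget[specialistName] = remaining - 1;   // spend one nudge
  return "nudge";
}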
</code></code></pre><p>Mostly deterministic. A small fine-tuned model handles only the paraphrase decision because verbatim string match is too strict. The model is trained for paraphrase detection, nothing else. The verdicts here are <code>nudge</code>, not <code>block</code>. A failed citation isn&#8217;t catastrophic. The main agent should be told what&#8217;s wrong and which fix is acceptable, then replan. The correction string is structured enough that the main agent can act on it without a second human round-trip.</p><p>Browsing agent. Out-of-domain click check.</p><pre><code><code>// supervisor/specialists/click_outside_allowed_domains.ts
const outOfDomainClick: Specialist = {
  name:   "click_outside_allowed_domains",
  domain: action =&gt; action.type === "click",
  evaluate(action, ctx) {
    const d = parseRegistrableDomain(action.targetUrl);
    return ctx.allowedDomains.includes(d)
      ? { verdict: "ok",    confidence: 1.0 }
      : {
          verdict:    "block",
          confidence: 1.0,
          reason:     "domain_not_in_allowlist",
        };
  },
};
</code></code></pre><p>Three lines of real logic. Arithmetic-equivalent. Catches a real failure mode that has caused real harm in real deployments. Block, not nudge. The allowlist is a hard boundary; the agent has no business outside it. The main agent gets the block and has to replan a path that stays inside the allowed surface.</p><p>You could write twenty of these in an afternoon for a given main agent. That&#8217;s what the supervisor is. Twenty of these. Then forty. Then a hundred. Each one narrow, cheap, named, with the right default verdict for the failure mode it watches.</p><h2>Hard parts</h2><p>The architecture above is the easy part to describe. The hard parts are these.</p><p><strong>Coverage.</strong> How do you know your taxonomy is complete? You don&#8217;t. You know it&#8217;s incomplete the moment the main agent fails in a way no specialist caught. Coverage is a moving target, and the only way to close the gap is to instrument every supervisor decision and every human override, then look at where the supervisor missed. The log is the input to the next taxonomy revision. Without that loop, the supervisor stops accumulating, and a static supervisor against a learning main agent gets worse over time.</p><p><strong>Latency.</strong> The supervisor is in the cycle, so every action pays its budget. A one-second supervisor in front of a sub-second main agent is unacceptable for interactive use. The answer is parallelism, tight per-specialist timeouts, and a clear policy on timeout. The right default is <code>timeout = refuse, treated as no-vote, action proceeds</code>. The wrong default is fail-closed on timeout, which turns one slow specialist into a denial-of-service against the main agent. The latency budget is configuration, not a global constant. 
Different actions get different supervisor budgets; a wire transfer can afford 400ms, a tool-call to retrieve a search result cannot.</p><p><strong>Authority.</strong> The supervisor can record, flag, nudge, block, or revert. Which authority does it have, for which specialist, on which action? This is a configuration problem, not a code problem, but it&#8217;s the configuration that decides whether the supervisor is useful or theatrical. A supervisor that can only record is observability. A supervisor that can flag is a queue for human review. A supervisor that can nudge is a coach. A supervisor that can block is operational. A supervisor that can revert is autonomous. Different deployments need different levels, and the same deployment may need different levels per specialist. The duplicate-refund check might be allowed to block. The citation check might only be allowed to nudge. The browser-domain check might be allowed to block, and the budget check might be allowed to revert.</p><p><strong>Nudge compliance.</strong> A nudge only works if the main agent heeds it. The supervisor has to detect when the main agent ignores a nudge and re-proposes the same action (or a near-equivalent variant). The mechanic is per-session memory keyed on the specialist that issued the nudge. After N repeats inside a session, the verdict escalates from <code>nudge</code> to <code>block</code>. N is a configuration knob. The default I&#8217;d choose is two: a nudge, a second nudge with a sharper correction, then block. Without this escalation, nudges become a corrective lullaby that the main agent learns to ignore.</p><p><strong>Replay.</strong> A good supervisor produces a portable, deterministic record of every decision. Given the same input, the supervisor must produce the same verdict every time it replays against historical actions. That&#8217;s what makes the supervisor itself auditable. 
The replay artifact has to include the proposed action, the context the specialists saw, every specialist&#8217;s verdict, the correction strings for any nudges, the aggregation, and the final decision. It has to be signed, time-stamped, and re-runnable. This is what lets a regulator, a customer, or a future engineer ask &#8220;what did the supervisor see at 03:14:09, and why did it allow this action&#8221; and get a real answer.</p><p><strong>The supervisor for the supervisor.</strong> Obvious next question. The answer isn&#8217;t fully satisfying. Deterministic specialists supervise themselves by construction; same input, same verdict. Small classifiers are validated against held-out sets with continuous precision/recall monitoring. Narrow LLM judges are sampled and human-reviewed in batch. There&#8217;s no perfect answer. The argument is that the supervisor&#8217;s surface is small, structured, and narrow enough that the validation problem is tractable in a way the main agent&#8217;s isn&#8217;t. A foundation-LLM-as-supervisor can&#8217;t be validated this way. A union of deterministic checks, small classifiers, and narrow judges can. The validation problem moves from intractable to merely difficult, and merely difficult is the kind of problem we can ship against.</p><h2>FAQ</h2><p><strong>Isn&#8217;t this just a rules engine with extra steps?</strong><br>Partially. The deterministic specialists are a rules engine. The small classifiers and narrow LLM judges aren&#8217;t, the nudge mechanic isn&#8217;t, and the feedback loop that retrains specialists per deployment isn&#8217;t. A static rules engine doesn&#8217;t accumulate. The supervisor does. 
Calling it &#8220;rules engine plus&#8221; undersells the accumulation loop, which decides whether the supervisor gets sharper or stays flat across months.</p><p><strong>What if I use a different model family for the supervisor, Claude watching GPT or GPT watching Claude?</strong><br>The decorrelation is weaker than it looks. Frontier LLMs are trained on largely the same internet, share the same architecture class, and inherit the same RLHF aesthetic from overlapping rater populations. They share blind spots in correlated ways. Cross-family review is better than same-family review, but it&#8217;s still two correlated components, not two independent ones. And the structural problems (variance under broad prompts, no accumulation) don&#8217;t go away when you swap the model. The slot itself is wrong for any foundation LLM, regardless of which lab trained it.</p><p><strong>Why is </strong><code>nudge</code><strong> a separate verdict instead of just &#8220;block with a message&#8221;?</strong><br>Because the contract is different. <code>block</code> says &#8220;this action is unacceptable, replan from scratch, the supervisor has no opinion on what you do next&#8221;. <code>nudge</code> says &#8220;this exact action is unacceptable, but here is a structured correction the supervisor expects you to incorporate into your next plan&#8221;. The main agent treats them differently. A block resets the planning context. A nudge feeds back into it. Conflating the two costs you the most productive verdict in the system.</p><p><strong>Why union aggregation and not weighted voting?</strong><br>A specialist is responsible for exactly one failure mode. Inside its domain, it&#8217;s the authority. Outside, it refuses. Voting introduces noise from specialists that don&#8217;t know the failure mode. Weighted voting just hides the noise under a coefficient. 
Union is the only aggregation rule that lets every specialist remain the sole authority in its own narrow lane, which is the property the decomposition was supposed to deliver.</p><p><strong>What about a giant LLM with retrieval over the failure history?</strong><br>Three problems. The retrieval prompt grows with deployment age and stops fitting in cache. The retrieval system itself becomes a critical, hard-to-validate component. And the underlying model is still a generalist failing in correlated ways with the main agent. You moved the accumulation problem out of the prompt but you didn&#8217;t fix correlated failure or variance. A giant LLM with retrieval is a better foundation-LLM supervisor than one without retrieval. It still isn&#8217;t a supervisor.</p><p><strong>Don&#8217;t evals, guardrails, or a critic already cover this?</strong><br>Evals run offline, against a fixed dataset, before deployment. Guardrails run inline against the prompt, one request at a time. A critic runs inside the main agent&#8217;s own reasoning loop. None of them are a separate, deployment-time, stateful, taxonomy-driven, authoritative process sitting in the main agent&#8217;s cycle. Each fills a different slot. The supervisor slot is empty.</p><p><strong>Who builds the taxonomy if I&#8217;m starting fresh?</strong><br>You do. You start with the failures you&#8217;ve actually seen in your own deployments. You add entries as new failure modes show up. Over time, the cross-team common entries get pulled into a shared taxonomy and the long tail stays per-deployment. The same way CVE and CWE evolved.</p><div><hr></div><p>Most of the agent infrastructure I see being built right now is doubling down on capability and pretending supervision will figure itself out. It won&#8217;t. The supervisor for the autonomous main agent has to be designed, not gestured at, and it has to live inside the cycle, not next to it.</p><p>I&#8217;m not announcing a roadmap. 
I&#8217;m writing down the object so the conversation can get past &#8220;we have a supervisor, we use GPT to watch Claude Code&#8221; or whichever cross-family combination is currently in the tutorial. When the field shares the words, it&#8217;ll be possible to compare actual supervisors and notice which ones are real. The post is a definition. The work is the specialists, one row of the taxonomy at a time.</p><p>The hard part of the next phase of agent deployment isn&#8217;t making them smarter. It&#8217;s making every autonomous act leave enough structure behind that another process can say yes, here is a better way, here is what to log, or no.</p>]]></content:encoded></item><item><title><![CDATA[AI Authorship Question]]></title><description><![CDATA[An 800-line open-source scanner for how much of your code an AI wrote, and how much of it you shipped without reading.]]></description><link>https://notes.subhohalder.com/p/i-stopped-calling-it-vibe-check</link><guid isPermaLink="false">https://notes.subhohalder.com/p/i-stopped-calling-it-vibe-check</guid><dc:creator><![CDATA[Subho Halder]]></dc:creator><pubDate>Thu, 23 Apr 2026 16:37:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5PG4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is a walkthrough of <code>ai-authorship</code>, a small open-source tool that reads your git history and estimates two things: how much of your code was written by an AI, and how much of it you shipped without reading. About 800 lines of TypeScript, MIT, no telemetry, runs locally on your <code>.git</code>.</p><p>I built it last week because I couldn&#8217;t answer either question for my own codebase.</p><h2>Overview</h2><p>The pipeline end-to-end:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;4989b9c1-0417-45a1-b532-dc736bed96f3&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">  git log
    &#9474; null-byte format parse
    &#9660;
  tagged commits  &#8592;  Co-Authored-By email  &#8594;  nine-row model table
    &#9474;
    &#9500;&#9472;&#9472;&#9658;  hotspots (AI % per directory)
    &#9500;&#9472;&#9472;&#9658;  velocity (AI commit size &#247; human commit size)
    &#9474;
    &#9660;
  (model &#215; language) pair  &#8592;  SecLens benchmark  &#8594;  blind spots
    &#9474;
    &#9660;
  Risk Score  =  0.4 &#215; AI-coverage  +  0.6 &#215; (1 &#8722; language-weighted recall)
</code></pre></div><h2>Detection</h2><p>Most developers already produce the core signal. If you use Claude Code, Cursor, Copilot, Codex, Gemini, Devin, or Windsurf, those tools auto-append a <code>Co-Authored-By:</code> line to your commit message whenever the assistant writes or rewrites code. The ground truth for &#8220;AI wrote this commit&#8221; is already in <code>git log</code>. The scanner reads it.</p><p>The whole detection table fits in nine rows:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:&quot;b677406a-65b8-44fd-afc8-dbe199b3da13&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// src/intelligence/models.ts
const AI_EMAILS: Record&lt;string, ModelFamily&gt; = {
  "noreply@anthropic.com":                { tool: "claude-code", provider: "anthropic", family: "claude" },
  "claude@anthropic.com":                 { tool: "claude-code", provider: "anthropic", family: "claude" },
  "copilot@github.com":                   { tool: "copilot",     provider: "openai",    family: "gpt" },
  "cursor-ai@users.noreply.github.com":   { tool: "cursor",      provider: "cursor",    family: "unknown" },
  "cursor@cursor.sh":                     { tool: "cursor",      provider: "cursor",    family: "unknown" },
  "codeium@codeium.com":                  { tool: "windsurf",    provider: "codeium",   family: "unknown" },
  "devin-ai-integration[bot]@users.noreply.github.com": { tool: "devin", provider: "cognition", family: "unknown" },
  "codex@openai.com":                     { tool: "codex",       provider: "openai",    family: "gpt" },
  "gemini@google.com":                    { tool: "gemini",      provider: "google",    family: "gemini" },
};</code></pre></div><p>Nine email addresses, seven tools. No classifier, no LLM call, no inference. For every commit in your repo, the scanner reads the <code>Co-Authored-By:</code> trailers in the body, looks them up in this table, and tags the commit with the tool and model family that wrote it. You can audit the method with <code>git log --grep 'Co-Authored-By'</code> on any repo. Everything else in the tool is variations on &#8220;group the tagged commits by X and count.&#8221;</p><p>Model names inside trailers aren&#8217;t standardised, so they get normalised on the way in:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:&quot;fb7d3fb3-0226-46c3-ae84-3f3a26c73125&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// "Claude Opus 4.6 (1M context)" &#8594; "claude-opus-4-6"
// "Claude Sonnet 4.6"            &#8594; "claude-sonnet-4-6"
export function extractModelName(coAuthorName: string): string | null {
  if (!coAuthorName.trim()) return null;
  let name = coAuthorName.trim();
  name = name.replace(/\s*\(.*?\)\s*/g, "").trim();   // strip "(1M context)" etc.
  if (!name) return null;
  return name
    .toLowerCase()
    .replace(/[\s.]+/g, "-")
    .replace(/-+/g, "-")
    .replace(/^-|-$/g, "");
}</code></pre></div><h2>Parsing git log</h2><p>I thought this part would be easy. It wasn&#8217;t. Commit messages can contain any character: newlines, tabs, quotes, emoji, adversarial trailers, even the output of <code>git log</code> itself. Split on newlines or commas and some commit message somewhere will eat you.</p><p>Git log has had a solution forever. Use <code>--format</code> with placeholder bytes that can&#8217;t appear in normal text.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:&quot;3301264b-ff8d-4dc4-85ca-a3a55173d70b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// src/scanner/git-log.ts
const RECORD_SEP = "\x1E";   // ASCII record separator (1960s)
const FIELD_SEP  = "\x00";   // null byte

// %x00 between fields, %x1E between records. Git expands these to real bytes.
const format = "%H%x00%aN%x00%aE%x00%aI%x00%s%x00%b%x1E";

const raw = execFileSync("git", [
  "log", "--all",
  "-n", String(maxCommits),
  `--format=${format}`,
  "--numstat",
], { cwd: repoPath, maxBuffer: 100 * 1024 * 1024, encoding: "utf-8" });</code></pre></div><p><code>\x1E</code> is ASCII&#8217;s record separator, a control character that exists to split records unambiguously, and <code>\x00</code> is the null byte. Neither is printable, so they almost never appear inside commit messages: you can&#8217;t type them on a keyboard. Parsing becomes <code>raw.split("\x1E").map(r =&gt; r.split("\x00"))</code>. No regex acrobatics, no shell-quote hell. <code>--numstat</code> gets you line-count stats per file on the same command, same parser, a few extra lines.</p><h2>Hotspots</h2><p>Once every commit has a detection tag, the question shifts from &#8220;how much&#8221; to &#8220;where&#8221;. A 61%-AI repo with AI work spread evenly is different from a 61%-AI repo where one directory is 100% AI and the rest is all human. The second one is a risk surface.</p><p>The hotspot computation walks every analysed commit, groups line additions by top-level directory, skips directories with fewer than 20 added lines, and keeps the top five above 30% AI:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:&quot;f6dd4ca8-7ecd-4d1a-9a07-f9c68b5882a6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// src/scanner/insights.ts
function computeHotspots(analyzed: AnalyzedCommit[]): AiHotspot[] {
  const dirs = new Map&lt;string, { ai: number; human: number }&gt;();

  for (const { commit, detection } of analyzed) {
    const isAi = detection !== null;
    for (const file of commit.filesChanged) {
      const dir = getDirectory(file.path);
      const entry = dirs.get(dir) ?? { ai: 0, human: 0 };
      if (isAi) entry.ai    += file.additions;
      else      entry.human += file.additions;
      dirs.set(dir, entry);
    }
  }

  const hotspots: AiHotspot[] = [];
  for (const [directory, { ai, human }] of dirs) {
    const total = ai + human;
    if (total &lt; 20) continue;               // skip trivial dirs
    hotspots.push({ directory, aiLines: ai, totalLines: total, aiPercentage: ai / total });
  }

  return hotspots
    .filter(h =&gt; h.aiPercentage &gt; 0.3)
    .sort((a, b) =&gt; b.aiPercentage - a.aiPercentage)
    .slice(0, 5);
}</code></pre></div><p>On my own backend repo this surfaced <code>apps/analytics</code>, <code>apps/intelligence</code>, and <code>apps/realtime</code> at 100% AI. I had noticed the 53% top-line. I had not noticed that three entire directories were pure Claude.</p><h2>Risk scoring</h2><p>I went back and forth on how to score &#8220;risk&#8221; and landed on a weighted sum of two factors, with the blind-spot term carrying more weight:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;javascript&quot;,&quot;nodeId&quot;:&quot;5b934d0a-1760-462e-8754-6d936c0d86c1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-javascript">// src/scoring/index.ts
// Risk Score = AI Coverage (40%) + Language-Weighted Blind Spot Severity (60%)
const confirmedCommits = aiCommits - heuristicCommits;
const weightedAI = confirmedCommits + heuristicCommits * 0.6;
const aiCoverage = totalCommits &gt; 0 ? Math.min(weightedAI / totalCommits, 1) : 0;

const blindSpotSeverity = 1 - languageWeightedRecall;

const raw   = aiCoverage * 0.4 + blindSpotSeverity * 0.6;
const score = Math.round(raw * 100);
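// Worked example with hypothetical inputs (an illustration, not a real scan),
// assuming an all-Python repo so the weighted recall equals the Python figure:
//   aiCoverage = 0.53, languageWeightedRecall = 0.625 (Claude Opus 4.6, Python)
//   raw = 0.53 * 0.4 + (1 - 0.625) * 0.6 = 0.212 + 0.225 = 0.437, so score = 44 (grade B)
// Same coverage on an all-JavaScript repo, recall 0.312 (Claude Opus 4.6, JavaScript):
//   raw = 0.212 + 0.688 * 0.6 = 0.212 + 0.4128 = 0.6248, so score = 62 (grade D)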

const grade =
  score &gt;= 75 ? "F" :
  score &gt;= 60 ? "D" :
  score &gt;= 45 ? "C" :
  score &gt;= 25 ? "B" : "A";</code></pre></div><p><strong>Heuristic commits get weighted at 0.6.</strong> Trailer-based detection is ground truth. A commit either has the trailer or it doesn&#8217;t. The heuristic detector (mass-add diff shape plus AST-level structural tells that tree-sitter picks up from AI-generated code) is noisier, so its contributions are discounted in the coverage factor. I trust it less than the trailer.</p><p><strong>Blind-spot severity uses language-weighted recall, not generic category scores.</strong> Recall means: when you run a model against OWASP-seeded vulnerable code, what fraction does it catch? A Python-heavy repo with Claude Opus 4.6 (63% Python recall on SecLens) scores differently from a JavaScript-heavy repo with the same model (31% JavaScript recall). The severity weighting follows the actual language mix of your code.</p><p><a href="https://mattersec-labs.github.io/seclens/">SecLens</a> is the benchmark feeding those recall numbers. 12 models &#215; 8 OWASP Top 10 categories &#215; 10 languages &#215; 35 scoring dimensions, running known-bad code through each model and counting what they catch. A slice of the recall table:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;3b3eb988-0f0d-4649-9a46-9fa0247f7508&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">| Model              | Python | JavaScript | Java  | Go    | Overall |
|--------------------|--------|------------|-------|-------|---------|
| Claude Opus 4.6    | 62.5%  | 31.2%      | 27.8% | 55.6% | 39.0%   |
| Claude Sonnet 4.6  | 70.8%  | 62.5%      | 61.1% | 85.2% | 42.1%   |
| Claude Haiku 4.5   | 70.8%  | 68.8%      | 77.8% | 85.2% | 37.8%   |
| GPT-5.4            |  8.3%  |  0.0%      |  5.6% | 14.8% | 31.1%   |
| Gemini 3.1 Pro     | 83.3%  | 75.0%      | 77.8% | 70.4% | 45.8%   |
</code></pre></div><p>Two things surprised me on first read. Gemini 3.1 Pro beats the Claude family overall. And GPT-5.4 has near-zero recall on three of these four languages, which is not where I would have placed it going in. The risk from AI blind spots is heavily language-dependent, and it rarely matches the intuitive model leaderboard.</p><h2>Results</h2><p>I ran it on <code>overwatch-backend</code>, a Django-ish service, 74 commits over six weeks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5PG4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5PG4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png 424w, https://substackcdn.com/image/fetch/$s_!5PG4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png 848w, https://substackcdn.com/image/fetch/$s_!5PG4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png 1272w, https://substackcdn.com/image/fetch/$s_!5PG4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5PG4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png" width="1456" height="1773" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1773,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:373552,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://subho007.substack.com/i/195257104?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5PG4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png 424w, https://substackcdn.com/image/fetch/$s_!5PG4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png 848w, https://substackcdn.com/image/fetch/$s_!5PG4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png 1272w, https://substackcdn.com/image/fetch/$s_!5PG4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff193b239-cd71-48d1-9647-02d385e33a6d_1656x2016.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The 3.0x commit-size ratio is the number that stuck with me. AI commits average three times the size of human commits. Three times more lines per commit for me or a reviewer to read. The real question is how much of that I reviewed, and that scales in the opposite direction from my attention budget.</p><p>The top-line AI percentage is a vibe. The review-delegation estimate (AI-authorship percentage combined with commit-size ratio) is the accountability question. You can ship 100% AI code if you reviewed every hunk. 
You can ship 30% AI code and be worse off if those hunks were merged unread.</p><h2>The name</h2><p>I posted this tool to <a href="https://www.reddit.com/r/ClaudeAI/comments/1spud4p/i_told_my_investor_61_of_my_code_was_aiassisted/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button">r/ClaudeAI</a> on Tuesday. I called it vibe-check. That was a joke. Zero upvotes. A couple of people came after me. One called it a pattern of dishonesty, then asked me a question I keep coming back to: &#8220;<strong>what would the non-hedged version of this post look like to you?</strong>&#8221; I answered honestly. They came back softer. This post is part of that answer.</p><p>The real name is AI authorship. I renamed the npm package:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;6cf3c870-3baa-49da-87fc-d3e101126f12&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">npx @mattersec/ai-authorship scan</code></pre></div><p>MIT, runs locally on your <code>.git</code>, no telemetry. Repo: <a href="https://github.com/mattersec-labs/ai-authorship">https://github.com/mattersec-labs/ai-authorship</a></p><h2>Limitations</h2><p>Limitations I know about:</p><ul><li><p><strong>Trailer-based detection is ground truth only if trailers aren&#8217;t stripped.</strong> <code>git commit --amend</code> with a manual rewrite removes them. Developers who want to hide AI authorship can. The heuristic is the attempt to catch that case but it&#8217;s noisy, which is why it&#8217;s weighted at 0.6.</p></li><li><p><strong>Stylistic tells per model are tuned on Claude.</strong> Detection is strongest on Claude-heavy repos. Other models are supported but noisier. 
I have been staring at Claude output for a few hundred hours and it shows.</p></li><li><p><strong>Newest models (released in the last month or two) don&#8217;t have full SecLens coverage yet.</strong> If your scan lands on one, you&#8217;ll see an <code>unknown model</code> fallback in the blind-spot block.</p></li><li><p><strong>The 3.0x commit-size ratio is a proxy, not a direct measurement of unreviewed code.</strong> I want to correlate against review traces (who opened which PR, who squashed what, who LGTM&#8217;d without comment), but that needs GitHub API data I haven&#8217;t integrated yet.</p></li></ul><h2>FAQ</h2><p><strong>Can developers strip the trailers to hide AI authorship?</strong><br>Yes. <code>git commit --amend</code> with a manual rewrite removes them, and <code>git filter-repo</code> does it at scale. The heuristic detector is the attempt to catch the rewrite case, but it&#8217;s noisier than trailer matching. Heuristic commits are discounted to 0.6&#215; in the coverage factor for that reason. If you want to hide AI authorship, you can. The tool is built on the assumption that most people don&#8217;t bother.</p><p><strong>How is this different from </strong><code>git log | grep Claude | wc -l</code><strong>?</strong><br>Not that different for the top-line number. Three things the scanner adds: (1) mapping trailer emails to the right model/provider via the nine-row table, (2) per-directory hotspot computation, so you can see where the AI code is concentrated instead of only how much, and (3) cross-referencing the detected (model &#215; language) pair against SecLens to surface blind spots specific to your repo&#8217;s language mix. If all you want is the top-line, <code>git log --grep</code> is fine.</p><p><strong>Why is blind-spot severity weighted higher than AI coverage (60/40)?</strong><br>The thing that damages you is not that AI wrote your code. It is what the model writing your code fails to write safely. 
A repo 90% written by a model with 90% OWASP recall is safer than a repo 50% written by a model with 20% recall. Coverage tells you how much of a problem could exist. Severity tells you how bad the problem is if it does. The formula prioritises the second.</p><p><strong>My AI tool isn&#8217;t in the nine-row table. What happens?</strong><br>The commit falls through trailer-detection into the heuristic pipeline, which flags on diff shape and AST tells. That catches some of it and misses some of it. If the tool emits a stable <code>Co-Authored-By:</code> email, PR a new row. The table is the only thing that needs updating.</p><p><strong>Does it work on rewritten history (rebase, squash merge)?</strong><br>Partly. It reads whatever is in <code>git log</code> at scan time. If the rebase or squash preserved the trailers on the final commit, they get counted. If the squash dropped them, the heuristic may flag the commit as <code>Likely AI</code> based on diff shape, or it may miss. Rewritten history is the known soft spot.</p><p>Month three of building again, alone. I keep noticing versions of this same problem. Shipping security tools for the last thirteen years had a familiar shape: see something nobody else had seen, plan, staff, build the instrument for the view, then use it months later. This weekend I wrote the question, an agent wrote the instrument, and I used it the same day. The tool that measures Claude&#8217;s authorship was, funnily enough, built with Claude Code.</p><p>I&#8217;m not announcing a cadence. I am shipping instruments for things I suspect we should be looking at and haven&#8217;t. AI authorship is the first one. There will be others. Some will be wrong. The repo is MIT, the scanner runs locally, and I&#8217;m reading the comments.</p><p>The next twelve years are going to look nothing like the last twelve. I&#8217;d rather write while I figure out why than after.</p>]]></content:encoded></item></channel></rss>