AI Authorship Question

Subho Halder — Thu, 23 Apr 2026 16:37:59 GMT

This is a walkthrough of ai-authorship, a small open-source tool that reads your git history and estimates two things: how much of your code was written by an AI, and how much of it you shipped without reading. About 800 lines of TypeScript, MIT, no telemetry, runs locally on your .git.

I built it last week because I couldn’t answer either question for my own codebase.

Overview

The pipeline end-to-end:

  git log
    │ null-byte format parse
    ▼
  tagged commits  ←  Co-Authored-By email  →  nine-row model table
    │
    ├──►  hotspots (AI % per directory)
    ├──►  velocity (AI commit size ÷ human commit size)
    │
    ▼
  (model × language) pair  ←  SecLens benchmark  →  blind spots
    │
    ▼
  Risk Score  =  0.4 × AI-coverage  +  0.6 × (1 − language-weighted recall)

Detection

Most developers already produce the core signal. If you use Claude Code, Cursor, Copilot, Codex, Gemini, Devin, or Windsurf, those tools auto-append a Co-Authored-By: line to your commit message whenever the assistant writes or rewrites code. The ground truth for “AI wrote this commit” is already in git log. The scanner reads it.

The whole detection table fits in nine rows:

// src/intelligence/models.ts
const AI_EMAILS: Record<string, { tool: string; provider: string; family: string }> = {
  "noreply@anthropic.com":                { tool: "claude-code", provider: "anthropic", family: "claude" },
  "claude@anthropic.com":                 { tool: "claude-code", provider: "anthropic", family: "claude" },
  "copilot@github.com":                   { tool: "copilot",     provider: "openai",    family: "gpt" },
  "cursor-ai@users.noreply.github.com":   { tool: "cursor",      provider: "cursor",    family: "unknown" },
  "cursor@cursor.sh":                     { tool: "cursor",      provider: "cursor",    family: "unknown" },
  "codeium@codeium.com":                  { tool: "windsurf",    provider: "codeium",   family: "unknown" },
  "devin-ai-integration[bot]@users.noreply.github.com": { tool: "devin", provider: "cognition", family: "unknown" },
  "codex@openai.com":                     { tool: "codex",       provider: "openai",    family: "gpt" },
  "gemini@google.com":                    { tool: "gemini",      provider: "google",    family: "gemini" },
};

Nine email addresses, seven tools. No classifier, no LLM call, no inference. For every commit in your repo, the scanner reads the Co-Authored-By: trailers in the body, looks them up in this table, and tags the commit with the model that wrote it. You can audit the method with git log --grep 'Co-Authored-By' on any repo. Everything else in the tool is variations on “group the tagged commits by X and count.”
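A minimal sketch of what that lookup could look like. The function name detectFromBody, the Detection type, and the trailer regex are made up for illustration and are not taken from the repo; only the two email keys are copied from the table above.

```typescript
// Illustrative subset of the table; the type name AiTool is hypothetical.
interface AiTool { tool: string; provider: string; family: string }

const AI_EMAILS_SUBSET: Record<string, AiTool> = {
  "noreply@anthropic.com": { tool: "claude-code", provider: "anthropic", family: "claude" },
  "copilot@github.com":    { tool: "copilot",     provider: "openai",    family: "gpt" },
};

interface Detection extends AiTool { coAuthorName: string }

// Matches one "Co-Authored-By: Name <email>" line, case-insensitively.
const TRAILER = /^co-authored-by:\s*(.*?)\s*<([^<>\s]+)>\s*$/i;

// Returns the first trailer whose email is in the table, or null if none match.
function detectFromBody(body: string): Detection | null {
  for (const line of body.split(/\r?\n/)) {
    const m = TRAILER.exec(line.trim());
    if (!m) continue;
    const [, name = "", email = ""] = m;
    const hit = AI_EMAILS_SUBSET[email.toLowerCase()];
    if (hit) return { ...hit, coAuthorName: name };
  }
  return null;
}
```

A human co-author whose email is not in the table simply falls through and the commit stays untagged.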

Model names inside trailers aren’t standardised, so they get normalised on the way in:

// "Claude Opus 4.6 (1M context)" → "claude-opus-4-6"
// "Claude Sonnet 4.6"            → "claude-sonnet-4-6"
export function extractModelName(coAuthorName: string): string | null {
  if (!coAuthorName.trim()) return null;
  let name = coAuthorName.trim();
  name = name.replace(/\s*\(.*?\)\s*/g, "").trim();   // strip "(1M context)" etc.
  if (!name) return null;
  return name
    .toLowerCase()
    .replace(/[\s.]+/g, "-")
    .replace(/-+/g, "-")
    .replace(/^-|-$/g, "");
}

Parsing git log

I thought this part would be easy. It wasn’t. Commit messages can contain any character: newlines, tabs, quotes, emoji, adversarial trailers, even the output of git log itself. Split on newlines or commas and some commit message somewhere will eat you.

Git log has had a solution forever. Use --format with placeholder bytes that can’t appear in normal text.

// src/scanner/git-log.ts
import { execFileSync } from "node:child_process";

const RECORD_SEP = "\x1E";   // ASCII record separator (1960s)
const FIELD_SEP  = "\x00";   // null byte

// %x00 between fields, %x1E between records. Git expands these to real bytes.
const format = "%H%x00%aN%x00%aE%x00%aI%x00%s%x00%b%x1E";

const raw = execFileSync("git", [
  "log", "--all",
  "-n", String(maxCommits),
  `--format=${format}`,
  "--numstat",
], { cwd: repoPath, maxBuffer: 100 * 1024 * 1024, encoding: "utf-8" });

\x1E is ASCII’s record separator, a control character that exists to split records unambiguously, and \x00 is the null byte. They almost never appear inside commit messages, because you can’t type them on a keyboard. Parsing becomes raw.split("\x1E").map(r => r.split("\x00")). No regex acrobatics, no shell-quote hell. --numstat gets you line-count stats per file on the same command, same parser, a few extra lines.
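A sketch of how those "few extra lines" might look. It assumes git prints each commit's --numstat lines after that commit's formatted record, so after splitting on \x1E they sit at the front of the next chunk; check that against real output before relying on it. The names parseLog, parseNumstat, ParsedCommit, and FileStat are hypothetical.

```typescript
const RECORD_SEP = "\x1E";
const FIELD_SEP = "\x00";

interface FileStat { path: string; additions: number; deletions: number }
interface ParsedCommit {
  hash: string; authorName: string; authorEmail: string; date: string;
  subject: string; body: string; filesChanged: FileStat[];
}

// A numstat line looks like "12\t3\tsrc/a.ts"; binary files report "-" for both counts.
const NUMSTAT = /^(\d+|-)\t(\d+|-)\t(.+)$/;

function parseNumstat(text: string): FileStat[] {
  const stats: FileStat[] = [];
  for (const line of text.split("\n")) {
    const m = NUMSTAT.exec(line);
    if (!m) continue;
    const [, added = "-", deleted = "-", path = ""] = m;
    stats.push({
      path,
      additions: added === "-" ? 0 : Number(added),
      deletions: deleted === "-" ? 0 : Number(deleted),
    });
  }
  return stats;
}

function parseLog(raw: string): ParsedCommit[] {
  const commits: ParsedCommit[] = [];
  for (const chunk of raw.split(RECORD_SEP)) {
    const fields = chunk.split(FIELD_SEP);
    const head = fields[0] ?? "";
    // Text before the hash is numstat output that belongs to the previous commit.
    const prev = commits[commits.length - 1];
    if (prev) prev.filesChanged.push(...parseNumstat(head));
    if (fields.length < 6) continue; // trailing chunk: numstat only, no new commit
    const hash = head.slice(head.lastIndexOf("\n") + 1);
    const [, authorName = "", authorEmail = "", date = "", subject = "", body = ""] = fields;
    commits.push({ hash, authorName, authorEmail, date, subject, body: body.trim(), filesChanged: [] });
  }
  return commits;
}
```

The parser never splits the commit body on newlines or commas, so a message containing tabs or fake trailers cannot shift the field boundaries.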

Hotspots

Once every commit has a detection tag, the question shifts from “how much” to “where”. A 61%-AI repo with AI work spread evenly is different from a 61%-AI repo where one directory is 100% AI and the rest is all human. The second one is a risk surface.

The hotspot computation walks every analysed commit, groups line additions by directory, skips directories with fewer than 20 added lines, keeps anything above 30% AI, and returns the top five:

// src/scanner/insights.ts
function computeHotspots(analyzed: AnalyzedCommit[]): AiHotspot[] {
  const dirs = new Map<string, { ai: number; human: number }>();

  for (const { commit, detection } of analyzed) {
    const isAi = detection !== null;
    for (const file of commit.filesChanged) {
      const dir = getDirectory(file.path);
      const entry = dirs.get(dir) ?? { ai: 0, human: 0 };
      if (isAi) entry.ai    += file.additions;
      else      entry.human += file.additions;
      dirs.set(dir, entry);
    }
  }

  const hotspots: AiHotspot[] = [];
  for (const [directory, { ai, human }] of dirs) {
    const total = ai + human;
    if (total < 20) continue;               // skip trivial dirs
    hotspots.push({ directory, aiLines: ai, totalLines: total, aiPercentage: ai / total });
  }

  return hotspots
    .filter(h => h.aiPercentage > 0.3)
    .sort((a, b) => b.aiPercentage - a.aiPercentage)
    .slice(0, 5);
}

On my own backend repo this surfaced apps/analytics, apps/intelligence, and apps/realtime at 100% AI. I had noticed the 53% top-line. I had not noticed that three entire directories were pure Claude.
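getDirectory is called in the snippet above but not shown. One way it might work, assuming paths are cut at two segments (which would produce names like apps/analytics); the depth parameter and the "." bucket for root-level files are guesses, not the repo's behaviour.

```typescript
// Hypothetical helper: the real getDirectory is not shown in the post.
function getDirectory(path: string, depth = 2): string {
  const parts = path.split("/");
  parts.pop();                         // drop the file name
  if (parts.length === 0) return ".";  // file sits at the repo root
  return parts.slice(0, depth).join("/");
}
```

With this grouping, apps/analytics/views.py and apps/analytics/models/event.py land in the same apps/analytics bucket.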

Risk scoring

I went back and forth on how to score “risk” and landed on a weighted sum of two factors, with the blind-spot term carrying more weight:

// src/scoring/index.ts
// Risk Score = AI Coverage (40%) + Language-Weighted Blind Spot Severity (60%)
const confirmedCommits = aiCommits - heuristicCommits;
const weightedAI = confirmedCommits + heuristicCommits * 0.6;
const aiCoverage = totalCommits > 0 ? Math.min(weightedAI / totalCommits, 1) : 0;

const blindSpotSeverity = 1 - languageWeightedRecall;

const raw   = aiCoverage * 0.4 + blindSpotSeverity * 0.6;
const score = Math.round(raw * 100);

const grade =
  score >= 75 ? "F" :
  score >= 60 ? "D" :
  score >= 45 ? "C" :
  score >= 25 ? "B" : "A";

Heuristic commits get weighted at 0.6. Trailer-based detection is ground truth. A commit either has the trailer or it doesn’t. The heuristic detector (mass-add diff shape plus AST-level structural tells that tree-sitter picks up from AI-generated code) is noisier, so its contributions are discounted in the coverage factor. I trust it less than the trailer.

Blind-spot severity uses language-weighted recall, not generic category scores. Recall means: when you run a model against OWASP-seeded vulnerable code, what fraction does it catch? A Python-heavy repo with Claude Opus 4.6 (63% Python recall on SecLens) scores differently from a JavaScript-heavy repo with the same model (31% JavaScript recall). The severity weighting follows the actual language mix of your code.
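To make the weighting concrete, here is a small sketch using the two Claude Opus 4.6 figures quoted above. Weighting by lines of code per language and falling back to a default recall for unlisted languages are both assumptions; the post does not say how the scanner measures the language mix. The function name and the 0.39 fallback are illustrative.

```typescript
// Recall per language as fractions (0.625 = 62.5%).
type RecallByLanguage = Record<string, number>;

function languageWeightedRecall(
  linesByLanguage: Record<string, number>,
  recall: RecallByLanguage,
  fallbackRecall: number,
): number {
  let total = 0;
  let weighted = 0;
  for (const [lang, lines] of Object.entries(linesByLanguage)) {
    total += lines;
    weighted += lines * (recall[lang] ?? fallbackRecall);
  }
  return total > 0 ? weighted / total : fallbackRecall;
}

const opusRecall = { python: 0.625, javascript: 0.312 };

// 70% Python / 30% JavaScript: 0.7 × 0.625 + 0.3 × 0.312 ≈ 0.531
const pythonHeavy = languageWeightedRecall({ python: 7000, javascript: 3000 }, opusRecall, 0.39);
// 30% Python / 70% JavaScript: 0.3 × 0.625 + 0.7 × 0.312 ≈ 0.406
const jsHeavy = languageWeightedRecall({ python: 3000, javascript: 7000 }, opusRecall, 0.39);

// Blind-spot severity is 1 − recall, so the JavaScript-heavy repo scores worse.
const pythonHeavySeverity = 1 - pythonHeavy; // ≈ 0.469
const jsHeavySeverity = 1 - jsHeavy;         // ≈ 0.594
```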

SecLens is the benchmark feeding those recall numbers. It covers 12 models × 8 of the OWASP Top 10 categories × 10 languages × 35 scoring dimensions, running known-bad code through each model and counting what it catches. A slice of the recall table:

| Model              | Python | JavaScript | Java  | Go    | Overall |
|--------------------|--------|------------|-------|-------|---------|
| Claude Opus 4.6    | 62.5%  | 31.2%      | 27.8% | 55.6% | 39.0%   |
| Claude Sonnet 4.6  | 70.8%  | 62.5%      | 61.1% | 85.2% | 42.1%   |
| Claude Haiku 4.5   | 70.8%  | 68.8%      | 77.8% | 85.2% | 37.8%   |
| GPT-5.4            |  8.3%  |  0.0%      |  5.6% | 14.8% | 31.1%   |
| Gemini 3.1 Pro     | 83.3%  | 75.0%      | 77.8% | 70.4% | 45.8%   |

Two things surprised me on first read. Gemini 3.1 Pro beats the Claude family overall. And GPT-5.4 has near-zero recall on three of these four languages, which is not where I would have placed it going in. The risk from AI blind spots is heavily language-dependent, and it rarely matches the intuitive model leaderboard.

Results

I ran it on overwatch-backend, a Django-ish service, 74 commits over six weeks:

The 3.0x commit-size ratio is the number that stuck with me. AI commits average three times the size of human commits. Three times more lines per commit for me or a reviewer to read. The real question is how much of that I reviewed, and that scales in the opposite direction from my attention budget.

The top-line AI percentage is a vibe. The review-delegation estimate (AI-authorship percentage combined with commit-size ratio) is the accountability question. You can ship 100% AI code if you reviewed every hunk. You can ship 30% AI code and be worse off if those hunks were merged unread.
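For reference, a sketch of how a commit-size ratio like the one above could be computed. It assumes the ratio is mean lines changed per AI commit divided by mean lines changed per human commit; the post does not say whether the scanner uses mean or median, or additions only, so treat the names and the choice of mean as illustrative. The sample numbers are invented.

```typescript
interface SizedCommit { isAi: boolean; linesChanged: number }

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Returns null when there are no AI commits or the human mean is zero,
// since the ratio is undefined in both cases.
function commitSizeRatio(commits: SizedCommit[]): number | null {
  const ai = commits.filter(c => c.isAi).map(c => c.linesChanged);
  const human = commits.filter(c => !c.isAi).map(c => c.linesChanged);
  const humanMean = mean(human);
  if (ai.length === 0 || humanMean === 0) return null;
  return mean(ai) / humanMean;
}

const sizeRatio = commitSizeRatio([
  { isAi: true,  linesChanged: 300 },
  { isAi: true,  linesChanged: 150 },
  { isAi: false, linesChanged: 60 },
  { isAi: false, linesChanged: 90 },
]);
// sizeRatio → 3 (mean 225 lines per AI commit vs. mean 75 per human commit)
```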

The name

I posted this tool to r/ClaudeAI on Tuesday. I called it vibe-check. That was a joke. Zero upvotes. A couple of people came after me. One called it a pattern of dishonesty, then asked me a question I keep coming back to: “what would the non-hedged version of this post look like to you?” I answered honestly. They came back softer. This post is part of that answer.

The real name is AI authorship. I renamed the npm package:

npx @mattersec/ai-authorship scan

MIT, runs locally on your .git, no telemetry. Repo: https://github.com/mattersec-labs/ai-authorship

Limitations

Limitations I know about:

Trailer-based detection is ground truth only if trailers aren’t stripped. git commit --amend with a manual rewrite removes them. Developers who want to hide AI authorship can. The heuristic is the attempt to catch that case but it’s noisy, which is why it’s weighted at 0.6.
Stylistic tells per model are tuned on Claude. Detection is strongest on Claude-heavy repos. Other models are supported but noisier. I have been staring at Claude output for a few hundred hours and it shows.
Newest models (released in the last month or two) don’t have full SecLens coverage yet. If your scan lands on one, you’ll see an unknown model fallback in the blind-spot block.
The 3.0x commit-size ratio is a proxy, not a direct measurement of unreviewed code. I want to correlate against review traces (who opened which PR, who squashed what, who LGTM’d without comment), but that needs GitHub API data I haven’t integrated yet.

FAQ

Can developers strip the trailers to hide AI authorship?
Yes. git commit --amend with a manual rewrite removes them, and git filter-repo does it at scale. The heuristic detector is the attempt to catch the rewrite case, but it’s noisier than trailer matching. Heuristic commits are discounted to 0.6× in the coverage factor for that reason. If you want to hide AI authorship, you can. The tool is built on the assumption that most people don’t bother.

How is this different from git log | grep Claude | wc -l?
Not that different for the top-line number. Three things the scanner adds: (1) mapping trailer emails to the right model/provider via the nine-row table, (2) per-directory hotspot computation, so you can see where the AI code is concentrated instead of only how much, and (3) cross-referencing the detected (model × language) pair against SecLens to surface blind spots specific to your repo’s language mix. If all you want is the top-line, git log --grep is fine.

Why is blind-spot severity weighted higher than AI coverage (60/40)?
The thing that damages you is not that AI wrote your code. It is what the model writing your code fails to write safely. A repo 90% written by a model with 90% OWASP recall is safer than a repo 50% written by a model with 20% recall. Coverage tells you how much of a problem could exist. Severity tells you how bad the problem is if it does. The formula prioritises the second.
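Plugging those two hypothetical repos into the formula from the scoring section shows the effect. This sketch treats the AI share as the coverage factor and the model's recall as already language-weighted; the function name riskScore is made up for the example.

```typescript
// Same weights and grade cut-offs as the scoring snippet earlier in the post.
function riskScore(aiCoverage: number, recall: number): { score: number; grade: string } {
  const raw = aiCoverage * 0.4 + (1 - recall) * 0.6;
  const score = Math.round(raw * 100);
  const grade =
    score >= 75 ? "F" :
    score >= 60 ? "D" :
    score >= 45 ? "C" :
    score >= 25 ? "B" : "A";
  return { score, grade };
}

// 90% AI-written, model with 90% recall: 0.36 + 0.06 → score 42, grade "B"
const mostlyAiStrongModel = riskScore(0.9, 0.9);
// 50% AI-written, model with 20% recall: 0.20 + 0.48 → score 68, grade "D"
const halfAiWeakModel = riskScore(0.5, 0.2);
```

The repo with less AI code ends up two grades worse, because the recall term carries the larger weight.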

My AI tool isn’t in the nine-row table. What happens?
The commit falls through trailer-detection into the heuristic pipeline, which flags on diff shape and AST tells. That catches some of it and misses some of it. If the tool emits a stable Co-Authored-By: email, PR a new row. The table is the only thing that needs updating.

Does it work on rewritten history (rebase, squash merge)?
Partly. It reads whatever is in git log at scan time. If the rebase or squash preserved the trailers on the final commit, they get counted. If the squash dropped them, the heuristic may flag the commit as Likely AI based on diff shape, or it may miss. Rewritten history is the known soft spot.

Month three of building again, alone. I keep noticing versions of this same problem. Shipping security tools for the last thirteen years had a familiar shape: see something nobody else had seen, plan, staff, build the instrument for the view, then use it months later. This weekend I wrote the question, an agent wrote the instrument, and I used it the same day. The tool that measures Claude’s authorship was, funnily enough, built with Claude Code.

I’m not announcing a cadence. I am shipping instruments for things I suspect we should be looking at and haven’t. AI authorship is the first one. There will be others. Some will be wrong. The repo is MIT, the scanner runs locally, and I’m reading the comments.

The next twelve years are going to look nothing like the last twelve. I’d rather write while I figure out why than after.