
> agentops — the operational layer for coding agents

12-Factor AgentOps

Coding agents don't do their own bookkeeping.

AgentOps is the operational layer that does — so every session starts where the last one left off.

Bookkeeping, validation, primitives, and flows: the four layers that turn ad-hoc agent use into compounding knowledge. 12-Factor AgentOps is the public doctrine — the rules that explain why AgentOps is shaped the way it is.

Inspired by The Twelve-Factor App, 12-Factor Agents, and HumanLayer.

The AgentOps knowledge flywheel

> doctrine — the four layers, the flywheel, the three gaps

What AgentOps actually is.

Under the hood: a context compiler. Raw session signal becomes reusable knowledge, compiled prevention, and better next work. The model stays the same; the environment around it compounds.

01 / Four layers

Four-layer model

Each layer is independently useful, and they compose. Bookkeeping is the foundation: the local, diff-able ledger of what happened. Validation gates the work that flows through it. Primitives are the reusable building blocks every skill and hook is made of. Flows are named compositions that run primitives end-to-end so an operator can say /rpi instead of orchestrating eight steps by hand.

01

Bookkeeping

Learnings, findings, handoffs, and reusable context land in local .agents/ files.

Agent knowledge is managed like code: version-controlled, reviewed, promoted, and decayed instead of trapped in ephemeral chat history. Each session captures learnings scored on specificity, actionability, novelty, context, and confidence.
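The five-axis scoring described above can be sketched in a few lines. This is a minimal illustration, not the shipped scorer: the equal weighting and the 0.6 promotion threshold are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Learning:
    text: str
    # Each axis is scored 0.0-1.0, e.g. by a judge model or heuristic.
    specificity: float
    actionability: float
    novelty: float
    context: float
    confidence: float


def score(l: Learning) -> float:
    """Average the five axes; a real scorer might weight them unequally."""
    axes = (l.specificity, l.actionability, l.novelty, l.context, l.confidence)
    return sum(axes) / len(axes)


def keep(l: Learning, threshold: float = 0.6) -> bool:
    """Hypothetical promotion cutoff: learnings above the threshold
    survive into the registry; the rest decay."""
    return score(l) >= threshold
```

A vague, low-confidence note scores low on every axis and decays; a specific, actionable fix clears the cutoff and gets written to `.agents/`.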

/retro · /forge · /inject · ao lookup · finding registry
02

Validation

/pre-mortem, /vibe, and /council challenge plans and code before they ship.

Multi-model consensus — Claude and Codex judges debate independently — validates plans before build and code before commit. Validation gates block; they don't advise.

/pre-mortem · /vibe · /council · /post-mortem
03

Primitives

Skills, hooks, and the ao CLI give agents reusable building blocks.

Standalone components with clean contracts. Use one (/council validate this PR), compose several, or run the full lifecycle. The same recursive shape repeats at every scale.

/research · /plan · /implement · session hooks · ao
04

Flows

/research, /implement, /validation, and /rpi compose those primitives end to end.

Named compositions that run end-to-end without manual choreography. Manual or hands-free — both shapes write the same audit trail.

/rpi · /crank · /evolve · /dream

02 / Knowledge flywheel

The compiler runs in a loop. Each cycle extracts what happened, scores it on five axes, promotes what matters, surfaces what's actionable, and decays what doesn't. Not memory — compounding.

  1. Capture: Sessions emit learnings, findings, handoffs.

  2. Score: Five axes — specificity, actionability, novelty, context, confidence.

  3. Promote: Learnings become patterns; patterns become planning rules.

  4. Inject: Next session starts with the right context, not a blank slate.

Loop closes when stage 4 feeds the next session's stage 1. Escape velocity: retrieval rate × usage rate > decay rate.
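The escape-velocity condition is a single predicate. A minimal sketch, assuming each rate is measured as a fraction per cycle (these parameter names mirror the prose; the real measurement pipeline is not shown here):

```python
def escape_velocity(retrieval_rate: float, usage_rate: float,
                    decay_rate: float) -> bool:
    """True when knowledge compounds faster than it decays.

    retrieval_rate: fraction of stored learnings surfaced per session
    usage_rate:     fraction of surfaced learnings actually cited in work
    decay_rate:     fraction of stored learnings going stale per session
    """
    return retrieval_rate * usage_rate > decay_rate
```

Retrieval and usage multiply: surfacing learnings that no one cites, or citing only what happens to surface, both fail the condition just as surely as fast decay does.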

03 / Three gaps the doctrine closes

Every factor maps to one or more failure modes the system makes harder to hit. The 12 factors aren't a checklist — they're three proof obligations instantiated twelve different ways.

Three gaps converging

Judgment

Plans are pressure-tested before code is written.

The plan looks coherent but picks the wrong abstraction. The code passes tests but misses the edge case. No one challenged it before it shipped.

Closed by: /pre-mortem · /vibe · /council

Durable Learning

Solved problems stay solved across sessions.

Auth bug fixed Monday. Same auth bug returns Wednesday. The lesson was in a chat transcript that got compacted. Next agent has no idea.

Closed by: /retro · /forge · ao lookup · the finding registry

Loop Closure

Work that ships informs work that comes next.

Session ends with a code diff. No extracted lesson. No constraint hardened. Next session starts cold and re-learns the same lesson — or, more often, doesn’t.

Closed by: /post-mortem · finding compiler · GOALS.md · /evolve

04 / Steelman the moat

We'd rather lose this argument before you try the product than after. Here's the strongest version of the critique we get most often, and the specific gate that rebuts it.

Claim

Compound growth is actually happening — not just possible.

Strongest critique

Plenty of agent frameworks claim a learning loop. Most produce a folder of stale notes. Without proof, this sounds like the same pitch — context windows are bigger than this knowledge base, model providers ship better recall every quarter, and a markdown directory is a step backwards from a vector store.

Our answer

Escape velocity is a measurable condition: retrieval rate × usage rate must exceed decay rate. The flywheel-compounding gate evaluates σρ > δ on every CI run; if learnings stop being retrieved or stop being cited, the gate goes red and the flywheel-lifecycle gate names the broken stage. Markdown is the storage format because diff-able plain text is the only audit substrate that scales; the retrieval layer on top is what does the work.

Evidence: agentops/GOALS.md directives 4–5 (flywheel-lifecycle and complexity-regression gates) + scripts/check-flywheel-lifecycle.sh
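In CI terms the two gates reduce to an exit code. The sketch below is an illustrative reduction of the logic described above — the real check lives in scripts/check-flywheel-lifecycle.sh, whose internals aren't shown here, and the stage names are taken from the flywheel's four stages:

```python
STAGES = ["capture", "score", "promote", "inject"]


def flywheel_gate(sigma: float, rho: float, delta: float,
                  stage_health: dict) -> tuple:
    """Return (exit_code, broken_stage).

    Green (0) only when every lifecycle stage is healthy AND the
    compounding condition sigma * rho > delta holds.
    """
    for stage in STAGES:
        if not stage_health.get(stage, False):
            return 1, stage   # lifecycle gate names the broken stage
    if sigma * rho <= delta:
        return 1, None        # compounding gate red: retrieval or usage fell off
    return 0, None
```

If learnings stop being retrieved (sigma drops) or stop being cited (rho drops), the product falls below delta and the gate goes red without any single stage to blame.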

Foundation (I–III) · Flow (IV–VI) · Knowledge (VII–IX) · Scale (X–XII)

The 12 Factors

I
foundation

Context Is Everything

Manage what enters the context window like you manage what enters production.

Closes: Judgment · Durable Learning
II
foundation

Track Everything in Git

If it's not in git, it didn't happen.

Closes: Durable Learning · Loop Closure
III
foundation

One Agent, One Job

Each agent gets a scoped task and fresh context. Never reuse a saturated window.

Closes: Judgment
IV
workflow

Research Before You Build

Understand the problem space before generating a single line of code.

Closes: Judgment · Durable Learning
V
workflow

Validate Externally

No agent grades its own work. Ever.

Closes: Judgment
VI
workflow

Lock Progress Forward

Once work passes validation, it ratchets — it cannot regress.

Closes: Judgment · Loop Closure
VII
knowledge

Extract Learnings

Every session produces two outputs — the work product and the lessons learned.

Closes: Durable Learning
VIII
knowledge

Compound Knowledge

Learnings must flow back into future sessions automatically.

Closes: Durable Learning · Loop Closure
IX
knowledge

Measure What Matters

Track fitness toward goals, not activity metrics.

Closes: Loop Closure
X
scale

Isolate Workers

Each worker gets its own workspace, its own context, and zero shared mutable state.

Closes: Judgment
XI
scale

Supervise Hierarchically

Escalation flows up, never sideways.

Closes: Judgment · Loop Closure
XII
scale

Harvest Failures as Wisdom

Failed attempts are data. Extract and index them with the same rigor as successes.

Closes: Durable Learning · Loop Closure

convergence — three primitives, multiple implementations

Industry convergence on three primitives.

April 2026 source-level analysis of Claude Code's internal architecture surfaced the same three primitives AgentOps already runs in production. The category is real — and it has a shape.

  • Learning loop

    Industry framing: Memory extraction → off-session consolidation → future-session injection.

    AgentOps surface: Knowledge flywheel: /retro → /forge → /harvest → ao inject. Tiered promotion (learning → pattern → rule). Private overnight compounding via /dream and ao overnight; GitHub nightly is the public proof harness, not your runtime.

    Status: Shipped — on-demand and overnight.

  • Skill packaging

    Industry framing: AI watches recurring patterns and packages them as reusable skills.

    AgentOps surface: Skills system: 69 skills, /heal-skill audit, /converter cross-runtime export. ao flywheel close-loop drafts review-only skills from repeated patterns.

    Status: Prototype — drafting works; promotion polish in progress.

  • Adversarial verification

    Industry framing: Independent agents audit other agents' output and surface verdicts for human review.

    AgentOps surface: Council architecture: /council, /pre-mortem, /vibe, /post-mortem with multi-model consensus and prediction tracking. Behavioral validation (holdout scenarios + satisfaction scoring) fires automatically inside /validation.

    Status: Shipped — on-demand and always-on.

The architecture exists. The question is whether compound growth is happening — and that’s a question with proof gates, not promises.

Steelman the cross-runtime claim

Claim

The same skills work across Claude Code, Codex CLI, Cursor, and OpenCode.

Strongest critique

'Cross-platform' usually means 'tested once on the primary, shimmed for everything else, broken in subtle ways for users on the other three.' This is the kind of claim that gets users into trouble.

Our answer

We don’t claim full parity — we publish a tiered proof model. Tier S (structural/install) is a default CI gate for all four runtimes. Tier I (live inventory) is gated for Claude Code and Codex when local CLIs are available. Tier E (live execution) is opt-in / nightly, not a default gate. The honesty about what is and isn’t proven is the feature.

Evidence: agentops/GOALS.md directive 1 + tests/skills/test-runtime-{claude-code,codex,cursor,opencode}-smoke.sh
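The tier policy above could be encoded as data. The runtime and tier names come from the text; the function shape, parameter names, and return format are illustrative assumptions, not the actual test harness:

```python
RUNTIMES = ["claude-code", "codex", "cursor", "opencode"]

TIERS = {
    # tier: description
    "S": "structural/install checks (default CI gate, all runtimes)",
    "I": "live skill inventory (gated on a local CLI being available)",
    "E": "live execution (opt-in / nightly, never a default gate)",
}


def gates_for(runtime: str, local_cli: bool, nightly: bool) -> list:
    """Which proof tiers gate a given runtime in a given CI context."""
    tiers = ["S"]  # Tier S always gates, for all four runtimes
    if runtime in ("claude-code", "codex") and local_cli:
        tiers.append("I")  # Tier I only where a local CLI can answer
    if nightly:
        tiers.append("E")  # Tier E runs on the nightly harness only
    return tiers
```

The asymmetry is the point: a Cursor or OpenCode user sees exactly which tiers their runtime has and hasn't been proven against, rather than an undifferentiated "cross-platform" badge.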

// ready_to_start.sh

Start with Factor I

Begin with context management and build the foundation for durable bookkeeping, validation, and compounding work.

Begin Factor I