> agentops — the operational layer for coding agents
Coding agents don't do their own bookkeeping.
AgentOps is the operational layer that does — so every session starts where the last one left off.
Bookkeeping, validation, primitives, and flows: the four layers that turn ad-hoc agent use into compounding knowledge. 12-Factor AgentOps is the public doctrine — the rules that explain why AgentOps is shaped the way it is.
Inspired by The Twelve-Factor App, 12-Factor Agents, and HumanLayer.
> doctrine — the four layers, the flywheel, the three gaps
Under the hood: a context compiler. Raw session signal becomes reusable knowledge, compiled prevention, and better next work. The model stays the same; the environment around it compounds.
Each layer is independently useful, and they compose:

- Bookkeeping is the foundation: the local, diff-able ledger of what happened.
- Validation gates the work that flows through it.
- Primitives are the reusable building blocks every skill and hook is made of.
- Flows are named compositions that run primitives end to end, so an operator can say /rpi instead of orchestrating eight steps by hand.
Learnings, findings, handoffs, and reusable context land in local .agents/ files.
Agent knowledge is managed like code: version-controlled, reviewed, promoted, and decayed instead of trapped in ephemeral chat history. Each session captures learnings scored on specificity, actionability, novelty, context, and confidence.
/retro · /forge · /inject · ao lookup · finding registry

/pre-mortem, /vibe, and /council challenge plans and code before they ship.
Multi-model consensus — Claude and Codex judges debate independently — validates plans before build and code before commit. Validation gates block, not advise.
/pre-mortem · /vibe · /council · /post-mortem

Skills, hooks, and the ao CLI give agents reusable building blocks.
Standalone components with clean contracts. Use one (/council validate this PR), compose several, or run the full lifecycle. The same recursive shape repeats at every scale.
/research · /plan · /implement · session hooks · ao

/research, /implement, /validation, and /rpi compose those primitives end to end.
Named compositions that run end to end without manual choreography. Manual or hands-free — both shapes write the same audit trail.
/rpi · /crank · /evolve · /dream

The compiler runs in a loop. Each cycle extracts what happened, scores it on five axes, promotes what matters, surfaces what's actionable, and decays what doesn't. Not memory — compounding.
1. Sessions emit learnings, findings, handoffs.
2. Five-axis scoring: specificity, actionability, novelty, context, confidence (sketched below).
3. Learnings become patterns; patterns become planning rules.
4. Next session starts with the right context, not a blank slate.
Loop closes when stage 4 feeds the next session's stage 1. Escape velocity: retrieval rate × usage rate > decay rate.
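To make stage 2 concrete: a minimal sketch of five-axis scoring, assuming 0–5 ratings per axis, equal weighting, and an arbitrary keep threshold. None of these values are claims about the shipped scorer.

```bash
#!/usr/bin/env bash
# Hypothetical five-axis score for one learning. The axis names come from
# the doctrine; the ratings, equal weighting, and threshold are assumptions.
set -euo pipefail

specificity=4; actionability=5; novelty=2; context=3; confidence=4

score=$(awk "BEGIN { printf \"%.1f\", \
  ($specificity + $actionability + $novelty + $context + $confidence) / 5 }")

echo "learning score: $score / 5"
if awk "BEGIN { exit !($score >= 3.0) }"; then
  echo "keep: candidate for promotion"
else
  echo "decay: below threshold"
fi
```

A learning scoring 3.6 here is kept as a promotion candidate; one scoring 2.4 is left to decay.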
Every factor maps to one or more failure modes the system makes harder. The 12 factors aren't a checklist — they're three proof obligations instantiated twelve different ways.
Judgment
Plans are pressure-tested before code is written.
The plan looks coherent but picks the wrong abstraction. The code passes tests but misses the edge case. No one challenged it before it shipped.
Closed by: /pre-mortem · /vibe · /council
Solved problems stay solved across sessions.
Auth bug fixed Monday. Same auth bug returns Wednesday. The lesson was in a chat transcript that got compacted. Next agent has no idea.
Closed by: /retro · /forge · ao lookup · the finding registry
Work that ships informs work that comes next.
Session ends with a code diff. No extracted lesson. No constraint hardened. Next session starts cold and re-learns the same lesson — or, more often, doesn’t.
Closed by: /post-mortem · finding compiler · GOALS.md · /evolve
We'd rather lose this argument before you try the product than after. Here are the strongest versions of the critiques we get most often, and the specific gates that rebut them.
Claim
Compound growth is actually happening — not just possible.
Strongest critique
Plenty of agent frameworks claim a learning loop. Most produce a folder of stale notes. Without proof, this sounds like the same pitch — context windows are bigger than this knowledge base, model providers ship better recall every quarter, and a markdown directory is a step backwards from a vector store.
Our answer
Escape velocity is a measurable condition: retrieval rate × usage rate must exceed decay rate. The flywheel-compounding gate evaluates σρ > δ (σ = retrieval rate, ρ = usage rate, δ = decay rate) on every CI run; if learnings stop being retrieved or stop being cited, the gate goes red and the flywheel-lifecycle gate names the broken stage. Markdown is the storage format because diff-able plain text is the only audit substrate that scales; the retrieval layer on top is what does the work.
Evidence: agentops/GOALS.md directives 4–5 (flywheel-lifecycle and complexity-regression gates) + scripts/check-flywheel-lifecycle.sh
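For concreteness, a minimal sketch of the escape-velocity condition, assuming hypothetical counters for retrieved, cited, live, and decayed learnings. The shipped gate is scripts/check-flywheel-lifecycle.sh; this sketch is not its implementation.

```bash
#!/usr/bin/env bash
# Hypothetical escape-velocity check: sigma * rho > delta.
# All counter values are illustrative assumptions, not real telemetry.
set -euo pipefail

retrieved=42   # learnings injected into sessions this window (assumed)
cited=28       # injected learnings actually cited in work (assumed)
live=120       # learnings currently live in .agents/ (assumed)
decayed=9      # learnings demoted or expired this window (assumed)

sigma=$(awk "BEGIN { print $retrieved / $live }")   # retrieval rate
rho=$(awk "BEGIN { print $cited / $retrieved }")    # usage rate
delta=$(awk "BEGIN { print $decayed / $live }")     # decay rate

if awk "BEGIN { exit !($sigma * $rho > $delta) }"; then
  echo "flywheel-compounding: GREEN (sigma * rho > delta)"
else
  echo "flywheel-compounding: RED (retrieval or citation has stalled)" >&2
  exit 1
fi
```

With these numbers, σρ ≈ 0.35 × 0.67 ≈ 0.23 against δ = 0.075, so the gate stays green; zero out cited and it goes red.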
I. Manage what enters the context window like you manage what enters production.
II. If it's not in git, it didn't happen.
III. Each agent gets a scoped task and fresh context. Never reuse a saturated window.
IV. Understand the problem space before generating a single line of code.
V. No agent grades its own work. Ever.
VI. Once work passes validation, it ratchets — it cannot regress (see the ratchet sketch after this list).
VII. Every session produces two outputs — the work product and the lessons learned.
VIII. Learnings must flow back into future sessions automatically.
IX. Track fitness toward goals, not activity metrics.
X. Each worker gets its own workspace, its own context, and zero shared mutable state.
XI. Escalation flows up, never sideways.
XII. Failed attempts are data. Extract and index them with the same rigor as successes.
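Factor VI is the easiest to make concrete. A minimal sketch of a ratchet gate, assuming a baseline file under .agents/ratchets/ and a metric script that prints an integer; the repo's real gates may be wired differently.

```bash
#!/usr/bin/env bash
# Hypothetical ratchet gate: once a metric passes validation, its value
# becomes the floor for every later run. The baseline path and
# ./scripts/metric.sh (assumed to print an integer) are illustrative.
set -euo pipefail

baseline_file=".agents/ratchets/metric"
current=$(./scripts/metric.sh)
baseline=$(cat "$baseline_file" 2>/dev/null || echo 0)

if [ "$current" -lt "$baseline" ]; then
  echo "ratchet violated: $current < baseline $baseline" >&2
  exit 1
fi

if [ "$current" -gt "$baseline" ]; then
  mkdir -p "$(dirname "$baseline_file")"
  echo "$current" > "$baseline_file"   # record the new high-water mark
fi
echo "ratchet ok: $current >= $baseline"
```

Because the baseline only ever moves up, a regression cannot pass silently: the gate fails until the metric is restored or the baseline is deliberately, visibly lowered in review.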
> convergence — three primitives, multiple implementations
An April 2026 source-level analysis of Claude Code's internal architecture surfaced the same three primitives AgentOps already runs in production. The category is real — and it has a shape.
| Primitive | Industry framing | AgentOps surface | Status |
| --- | --- | --- | --- |
| Learning loop | Memory extraction → off-session consolidation → future-session injection. | Knowledge flywheel: /retro → /forge → /harvest → ao inject. Tiered promotion (learning → pattern → rule; sketched below). Private overnight compounding via /dream and ao overnight; GitHub nightly is the public proof harness, not your runtime. | Shipped — on-demand and overnight. |
| Skill packaging | AI watches recurring patterns and packages them as reusable skills. | Skills system: 69 skills, /heal-skill audit, /converter cross-runtime export. ao flywheel close-loop drafts review-only skills from repeated patterns. | Prototype — drafting works; promotion polish in progress. |
| Adversarial verification | Independent agents audit other agents' output and surface verdicts for human review. | Council architecture: /council, /pre-mortem, /vibe, /post-mortem with multi-model consensus and prediction tracking. Behavioral validation (holdout scenarios + satisfaction scoring) fires automatically inside /validation. | Shipped — on-demand and always-on. |
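The tiered promotion in the learning-loop row can be sketched, assuming a directory layout under .agents/ and a cited: front-matter counter; both the layout and the thresholds are illustrative, not the shipped pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical tiered promotion: learning -> pattern -> rule.
# Directory layout, the "cited:" field, and thresholds are assumptions.
set -euo pipefail

promote() {
  local src=$1 dst=$2 threshold=$3
  local f cited
  mkdir -p "$dst"
  for f in "$src"/*.md; do
    [ -e "$f" ] || continue
    # How many later sessions cited this note (assumed front-matter field).
    cited=$(awk '/^cited:/ { print $2; exit }' "$f")
    if [ "${cited:-0}" -ge "$threshold" ]; then
      git mv "$f" "$dst/"   # promotion stays a reviewed, diff-able change
    fi
  done
}

promote .agents/learnings .agents/patterns 3   # learning -> pattern
promote .agents/patterns  .agents/rules    5   # pattern -> rule
```

Using git mv keeps every promotion on the audit trail: a learning becomes a pattern only through a commit someone can review and revert.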
The architecture exists. The question is whether compound growth is happening — and that’s a question with proof gates, not promises.
Claim
The same skills work across Claude Code, Codex CLI, Cursor, and OpenCode.
Strongest critique
'Cross-platform' usually means 'tested once on the primary, shimmed for everything else, broken in subtle ways for users on the other three.' This is the kind of claim that gets users into trouble.
Our answer
We don’t claim full parity — we publish a tiered proof model. Tier S (structural/install) is a default CI gate for all four runtimes. Tier I (live inventory) is gated for Claude Code and Codex when local CLIs are available. Tier E (live execution) is opt-in / nightly, not a default gate. The honesty about what is and isn’t proven is the feature.
Evidence: agentops/GOALS.md directive 1 + tests/skills/test-runtime-{claude-code,codex,cursor,opencode}-smoke.sh
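As a sketch, the Tier S pass amounts to running the four smoke tests named in the evidence line and failing closed. The wrapper loop here is an assumption; the actual CI wiring lives in the repo.

```bash
#!/usr/bin/env bash
# Sketch of a Tier S (structural/install) gate across the four runtimes.
# The smoke-test filenames come from the evidence line; the loop is assumed.
set -euo pipefail

for runtime in claude-code codex cursor opencode; do
  bash "tests/skills/test-runtime-${runtime}-smoke.sh" ||
    { echo "Tier S gate RED: ${runtime}" >&2; exit 1; }
done
echo "Tier S gate GREEN: all four runtimes pass structural checks"
```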
# ready_to_start.sh
Begin with context management and build the foundation for durable bookkeeping, validation, and compounding work.
Begin Factor I