Judgment
Pressure-test plans before code.
Plan looks coherent. Code passes tests. Both miss the edge case. No one challenged either.
Closed by: /pre-mortem · /vibe · /council
> agentops · context compiler for coding agents
Compile context. Gate output. Compound knowledge.
Every session reads from the corpus on the way in and writes back on the way out.
Vendor memory follows the chat. The corpus follows the team.
After Twelve-Factor App · 12-Factor Agents · HumanLayer
> doctrine · three layers · one flywheel · three gaps
Vendor memory follows the chat. The corpus follows the team. Every coding session reads from the corpus on the way in and writes back on the way out — typed, versioned, validated, decay-ranked. Your agent’s context is now an engineering artifact, not chat history.
Each layer solves a different problem. All three compound through the Context Development Life Cycle.
Right context, right window, right time.
Assembles phase-scoped context packets, scores knowledge by utility and freshness, trims to the token budget, and delivers at session start automatically. Your agent starts loaded, not cold.
ao injectao context assembleao compile79 skills12 hooksGates that block, not advise.
Multi-model consensus validates plans before build and code before commit. Independent Claude + Codex judges debate and return one auditable verdict. No agent grades its own work.
/pre-mortem/vibe/council63 eval suitesbaseline A/BThe system gets smarter every session.
Every session extracts learnings. Learnings get scored and promoted to permanent patterns. Patterns become planning rules. The flywheel runs overnight unattended. Session 15 starts with everything session 1 learned.
/forgeao flywheel/evolve/dream1,842 learningsCapture. Score. Promote. Inject. Not memory — compounding.
Every session emits learnings.
Five axes: specificity, actionability, novelty, context, confidence.
Learnings become patterns. Patterns become rules.
Next session starts loaded, not cold.
Stage 4 feeds the next session's stage 1. Escape velocity: retrieval × usage > decay.
Twelve factors. Three proof obligations. Each factor closes one or more.
Pressure-test plans before code.
Plan looks coherent. Code passes tests. Both miss the edge case. No one challenged either.
Closed by: /pre-mortem · /vibe · /council
Solved problems stay solved.
Auth bug fixed Monday. Same auth bug returns Wednesday. The lesson lived in a chat transcript that got compacted.
Closed by: /retro · /forge · ao lookup
Shipped work informs the next session.
Code diff lands. No lesson extracted. No constraint hardened. Next session re-learns from scratch.
Closed by: /post-mortem · finding compiler · /evolve
We’d rather lose the argument before you install than after.
Claim
Compound growth is happening, not just possible.
Strongest critique
Every agent framework claims a learning loop. Most ship a folder of stale notes. A markdown directory looks like a step backwards.
Our answer
Escape velocity is measurable: retrieval × usage > decay. CI gates the inequality every run. When learnings stop being cited, the gate goes red and names the broken stage. Diff-able plain text is the only audit substrate that scales.
Evidence: agentops/GOALS.md directives 4–5 + scripts/check-flywheel-lifecycle.sh
the internal operating mechanism
The discipline that compounds across the three layers has a name: the Context Development Life Cycle — seven phases that translate the DevOps SDLC to the context coding agents consume.
> stance · fungibility first
AgentOps 3.0 ships with single-model defaults: one runtime, one harness, install-to-first-value cheap. Cross-vendor coordination stays as an opt-in lane (`/council --mixed`) because the v1 workbench A/B saturated — Δ=+0.0000 at v1 difficulty means forcing it on by default would buy nothing measurable. The default re-opens when the v2 substrate lands realistic agent-task difficulty and the hook layer has room to differentiate; until then, the corpus stays portable, the second vendor stays one flag away, and the receipts live on /sovereignty-proof.
default posture
AgentOps 3.0 defaults to single-model. One runtime, one harness, one judge — install-to-first-value stays cheap and the surface area an operator must trust on day one stays small. Bring whichever frontier vendor you already pay for; the corpus is what's portable, not the harness.
mixed-model opt-in
Cross-vendor coordination is preserved as an explicit opt-in: `/council --mixed` runs Claude and Codex judges over the same evidence packet and returns one auditable verdict. Reach for it when a decision is load-bearing enough that you want a second-vendor disagreement before shipping — not on every commit.
empirical rationale
The posture is empirical, not aesthetic. The v1 workbench A/B (2026-05-06, 12 cases) returned Δ=+0.0000 — both `skill-on` and `skill-off` legs scored 12/12 — because the v1 task floor (off-by-one bugs, simple validators, basic SQLi) doesn't require the hook layer at all. Forcing mixed-model on by default would buy nothing measurable against that floor.
v2 substrate roadmap
Substrate v2 is the explicit roadmap: realistic agent-task difficulty where the planner / implementer / validator separation and the cross-vendor judging actually have room to differentiate. Until that substrate lands, the honest claim is single-model is enough; mixed-model earns its place where artifact evidence already shows the second vendor moving the verdict.
re-evaluation trigger
The default re-opens when v2-substrate evidence ships and the workbench delta separates from zero on realistic tasks. Until then the lever stays: `/council --mixed` is one flag away, the corpus is vendor-agnostic by construction, and the cross-vendor receipts on /sovereignty-proof are the standing artifact log.
Cross-references: /sovereignty-proof (mixed-model artifact evidence) · /factors (vendor-fungibility law in the doctrine)
Manage the context window like production input.
If it's not in git, it didn't happen.
Scoped task. Fresh context. Never reuse a saturated window.
Understand before you generate.
Consequence-bearing work gets an external check.
Validated work ratchets into a recorded checkpoint.
Every session produces two outputs: the work, and the lessons.
Learnings flow into future sessions automatically.
Track fitness, not activity.
Own workspace. Own context. Zero shared mutable state.
Escalation flows up, never sideways.
Failures are data. Index them like successes.
convergence · three primitives · multiple implementations
Any vendor can target the schema. We publish it.
Every harness gets absorbed into the model — memory, learning loops, validation gates included. The corpus is what stays yours. AgentOps is the bridge: build the moat before the harness commoditizes. Vendor-agnostic on purpose — use Claude and Codex side by side.
Primitive
Industry framing
AgentOps surface
Status
Primitive
Learning loop
Industry framing
Extract memory. Consolidate off-session. Inject next session.
AgentOps surface
/retro → /forge → /harvest → ao inject. Private overnight runs via /dream. Tiered promotion: learning → pattern → rule.
Status
Shipped.
Primitive
Skill packaging
Industry framing
Watch recurring patterns. Package them as reusable skills.
AgentOps surface
79 skills. /heal-skill audits. /converter exports cross-runtime. ao flywheel close-loop drafts new skills from repeats.
Status
Drafting works. Promotion polish in flight.
Primitive
Adversarial verification
Industry framing
Independent agents audit other agents. Verdicts go to a human.
AgentOps surface
/council, /pre-mortem, /vibe, /post-mortem. Multi-model consensus with prediction tracking. Behavioral validation fires inside /validation.
Status
Shipped.
Harness commoditizes. Corpus doesn’t. Skills, hooks, and a CLI smooth today’s sharp edges; the corpus is the moat that stays. Audit the contract at /spec/jobspec/v0.
Claim
Same skills across Claude Code, Codex, Cursor, OpenCode.
Strongest critique
'Cross-platform' usually means tested on the primary and shimmed for everything else. Users get burned on three of four.
Our answer
Three test tiers — structural, live inventory, live execution — gate Claude and Codex in CI. The conformance schema is public at /spec/jobspec/v0, so any vendor harness gets audited against the same contract we use. Externally falsifiable beats trust-us.
Evidence: agentops/GOALS.md directive 1 + tests/skills/test-runtime-*-smoke.sh + /spec/jobspec/v0/openapi.yaml
spec · doctrine made executable
JobSpec OpenAPI v0 describes the current AgentOps daemon. Any vendor can target it. Audit it. Implement it. Push back on it.