XII. Harvest Failures as Wisdom
This factor is part of the Scale tier (X-XII) — advanced patterns for multi-agent workflows. Not a prerequisite for getting value from Factors I-IX.
The Rule
Failed attempts are data. Extract and index them with the same rigor as successes.
Every failed approach is a negative result. Negative results are knowledge. Knowledge compounds. Most systems treat failures as noise to be suppressed. 12-Factor AgentOps treats them as signal to be harvested.
When an agent tries three approaches before the fourth works, you don't just have one success — you have three documented learnings about what doesn't work under specific conditions. That's the wisdom that prevents the next agent from burning cycles on the same dead ends.
The Rationale
Failures Are Tuition, Learnings Are the Degree
Compute costs money. Failed attempts cost compute. If you pay tuition but don't extract the learning, you're funding the same education twice.
Example: Agent tries to parse configuration with regex. Fails on nested structures. Tries YAML parser. Fails on custom tags. Tries JSON with preprocessing. Works. Next week, different agent encounters similar config. Without harvested wisdom, they repeat the regex attempt. With harvested wisdom, they skip straight to JSON preprocessing.
The second agent saved two failed attempts' worth of compute. The savings compound across every agent, every similar task, forever. That's the return on harvesting failures.
Most Systems Discard Failed Attempts
Standard logging: "ERROR: Parse failed. Retrying with alternative method."
What you lost:
- Which method failed
- What input triggered the failure
- What the error signature looked like
- What context made this method seem reasonable to try
- Why the next method succeeded where this one failed
Without that detail, the next agent starts from zero knowledge. With it, they start from negative knowledge — "don't try X when Y is true" — which is often more valuable than positive knowledge because it prunes the search space.
Negative Knowledge Accumulates
Positive knowledge: "Use JSON preprocessing for configs with custom tags."
Negative knowledge:
- "Don't use regex on nested structures"
- "Don't use standard YAML parser on custom tags"
- "Don't trust config schema documentation when vendor uses extensions"
Each negative rule prevents a class of failures. As you accumulate negative knowledge, your failure rate drops not because agents get smarter, but because the knowledge base gets wiser.
This is how human expertise works. Experts don't just know what works — they know what doesn't work and why. They've paid the tuition. Harvesting failures lets your agent system build expertise without paying the same tuition repeatedly.
Post-Mortems on Failures, Not Just Successes
Most teams do post-mortems on outages. Few do them on failed agent attempts. But a failed agent attempt is a micro-outage — a local failure to deliver value.
Post-mortem discipline:
- What was attempted
- Why it seemed reasonable
- What failed
- What the failure revealed
- What should be tried instead next time
- What should never be tried again under these conditions
This isn't expensive. It's a structured extraction at the moment of failure when context is fresh. The alternative is agents repeating failures because nobody captured why they happened.
Failure Rate as a Metric
Track your agent system's failure rate over time. If it's not decreasing, you're not harvesting failures effectively.
Good trajectory:
- Month 1: 40% of agent attempts fail before success
- Month 3: 25% fail (harvested wisdom preventing repeats)
- Month 6: 15% fail (negative knowledge accumulating)
- Month 12: 10% fail (only novel failures, harvested immediately)
If your failure rate stays flat, you're paying tuition repeatedly. Every plateau in the curve is wasted learning.
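A minimal sketch of how this metric can be computed, assuming each attempt is logged as a record with a timestamp and an outcome; the field names and log shape here are illustrative assumptions, not an ao schema:

```python
from collections import defaultdict
from datetime import datetime

def failure_rate_by_month(attempts):
    """Group attempt records by month and compute the share that failed.

    `attempts` is a list of dicts like
    {"timestamp": "2026-02-15T14:23:01Z", "outcome": "failed"}.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for attempt in attempts:
        month = datetime.fromisoformat(
            attempt["timestamp"].replace("Z", "+00:00")
        ).strftime("%Y-%m")
        totals[month] += 1
        if attempt["outcome"] == "failed":
            failures[month] += 1
    return {m: failures[m] / totals[m] for m in sorted(totals)}

# Example: two failures and three successes in one month -> 40% failure rate.
print(failure_rate_by_month([
    {"timestamp": "2025-11-03T10:00:00Z", "outcome": "failed"},
    {"timestamp": "2025-11-05T10:00:00Z", "outcome": "failed"},
    {"timestamp": "2025-11-07T10:00:00Z", "outcome": "succeeded"},
    {"timestamp": "2025-11-12T10:00:00Z", "outcome": "succeeded"},
    {"timestamp": "2025-11-20T10:00:00Z", "outcome": "succeeded"},
]))  # {'2025-11': 0.4}
```

Plot the monthly values from this and you have the trend the dashboard below visualizes.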
Olympus Runs Harvested Post-Mortems Through Enhancement Loops for Synthetic Learnings
The harvested failures feed back into knowledge systems. Failed attempts become training data. Patterns in failures become rules. Rules become preventative checks. The system learns not just "this failed" but "this class of approaches fails under these conditions."
What Good Looks Like
Failed Attempt Capture
Every failed attempt is logged with:
- Approach attempted
- Contextual state (what made this seem reasonable)
- Failure mode (what broke, how it broke)
- Agent's hypothesis about why
- Next approach to try
- Timestamp and session ID
Format:
failed_attempt:
  id: "fa-20260215-001"
  session: "ses-abc123"
  task: "Parse custom config format"
  approach: "Regex pattern matching"
  context:
    - "Config appeared to have regular structure"
    - "Prior configs in this repo used simple key=value"
  failure_mode: "Regex failed on nested structures in lines 45-67"
  hypothesis: "Custom format supports nesting not visible in first 40 lines"
  next_approach: "Try YAML parser"
  timestamp: "2026-02-15T14:23:01Z"
This isn't a wall of stack traces. It's structured knowledge.
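One lightweight way to capture these records at the moment of failure is an append-only log. The sketch below is an assumption about shape, not the ao implementation; the failures/attempts.jsonl path and the helper name are hypothetical:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

FAILURE_LOG = Path("failures/attempts.jsonl")  # hypothetical location

def log_failed_attempt(session, task, approach, context, failure_mode,
                       hypothesis, next_approach):
    """Append one structured failed-attempt record to the log."""
    FAILURE_LOG.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "session": session,
        "task": task,
        "approach": approach,
        "context": context,
        "failure_mode": failure_mode,
        "hypothesis": hypothesis,
        "next_approach": next_approach,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with FAILURE_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Called at the moment of failure, while context is still fresh.
log_failed_attempt(
    session="ses-abc123",
    task="Parse custom config format",
    approach="Regex pattern matching",
    context=["Config appeared to have regular structure"],
    failure_mode="Regex failed on nested structures in lines 45-67",
    hypothesis="Custom format supports nesting not visible in first 40 lines",
    next_approach="Try YAML parser",
)
```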
Post-Mortem Extraction
After a task completes (whether it succeeded or was abandoned), run extraction:
ao forge transcript <session-id>
# Extracts:
# - All failed attempts
# - Pattern across attempts
# - What eventually worked
# - Negative rules to prevent repeats
Extraction produces:
- Individual failed attempt records
- Meta-learning: "When X condition is present, approaches Y and Z fail because W"
- Preventative rule: "Before attempting regex on config, check for nested structures"
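A rough sketch of what that extraction might look like, assuming failed attempts were logged to a JSONL file as in the earlier sketch; the candidate rules it emits are drafts, not validated negative knowledge:

```python
import json
from pathlib import Path

def extract_post_mortem(session_id, log_path=Path("failures/attempts.jsonl")):
    """Collect every failed attempt from one session and draft candidate rules."""
    attempts = []
    for line in log_path.read_text().splitlines():
        record = json.loads(line)
        if record["session"] == session_id:
            attempts.append(record)
    return {
        "session": session_id,
        "failed_attempts": attempts,
        "approaches_tried": [a["approach"] for a in attempts],
        # Drafts only: a reviewer or a stronger extraction pass should refine these.
        "candidate_rules": [
            f"Avoid '{a['approach']}' when: {a['failure_mode']}" for a in attempts
        ],
    }
```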
Knowledge Base Integration
Failed attempts feed into knowledge pools:
ao pool list --type=negative-knowledge
# Shows:
# - Documented failures
# - Conditions under which they failed
# - Alternative approaches that worked
# - Confidence (how many times has this pattern held)
When planning new work, agents query negative knowledge:
ao recall "config parsing failures"
# Returns:
# - "Regex fails on nested structures (confidence: 95%, n=12)"
# - "YAML parser fails on custom tags (confidence: 87%, n=8)"
# - "Recommend: JSON with preprocessing (success rate: 89%, n=15)"
Failure Rate Dashboard
Track and display failure metrics:
Agent Failure Rate Trends
┌─────────────────────────────────────────┐
│ Month    Failed   Total    Rate   Trend │
├─────────────────────────────────────────┤
│ 2025-11      45     112   40.2%      -- │
│ 2025-12      38     125   30.4%   ↓ 9.8 │
│ 2026-01      28     134   20.9%   ↓ 9.5 │
│ 2026-02      15     118   12.7%   ↓ 8.2 │
└─────────────────────────────────────────┘
Top Failure Categories (2026-02):
- Parsing novel formats: 6 (harvested: 6)
- Network timeouts: 4 (harvested: 4)
- Dependency conflicts: 3 (harvested: 3)
- API rate limits: 2 (harvested: 2)
Repeat Failures (same pattern as prior month): 0
Zero repeat failures means perfect harvest. Any repeat means you missed a learning.
Negative Knowledge as Documentation
Failed approaches become documentation:
# Config Parsing Patterns
## What Works
- JSON with preprocessing: 89% success rate
- Custom parser for known formats: 95% success rate
## What Doesn't Work
- **Regex on nested structures** (n=12 failures)
- Fails when: Nesting depth > 2, or nesting not visible in first scan
- Why: Regex can't handle recursive structures
- Instead: Use JSON or custom parser
- **Standard YAML parser on custom tags** (n=8 failures)
- Fails when: Vendor uses custom YAML extensions
- Why: Parser doesn't recognize extension syntax
- Instead: Preprocess to strip extensions or use vendor-specific parser
This is more valuable than "here's what works" because it prunes the search space. An agent reading this knows to skip two entire classes of approaches.
Failure Replay for Validation
When you harvest a failure pattern, validate it:
ao validate-failure-pattern "fa-20260215-001"
# Checks:
# - Does this failure pattern match prior failures?
# - Would the negative rule have prevented this failure?
# - Is the confidence threshold justified?
# - Are there edge cases where this rule shouldn't apply?
Validation prevents false negatives — rules that are too broad and prevent valid approaches.
Time-to-Repeat Metric
Measure time between identical failures:
Failure Repeat Analysis
┌───────────────────────────────────────────────┐
│ Pattern                 First   Repeat    Days │
├───────────────────────────────────────────────┤
│ Regex on nested config  11/12   12/08      26 │
│ YAML on custom tags     11/15   12/15      30 │
│ API without rate limit  11/20   (none)     -- │
│ Network without retry   11/22   (none)     -- │
└───────────────────────────────────────────────┘
Harvested patterns with no repeats: 8
Unharvested patterns (repeated within 30d): 2
Harvest effectiveness: 80%
Goal: Increase time-to-repeat until it approaches infinity. Harvested failures shouldn't repeat.
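The metric is simple arithmetic over harvested records. A sketch, assuming each pattern label maps to the sorted dates it was observed:

```python
from datetime import date

def time_to_repeat(failures):
    """Days between the first and second occurrence of each failure pattern.

    `failures` maps a pattern label to a sorted list of observation dates.
    Returns None for patterns that never repeated.
    """
    return {
        pattern: (dates[1] - dates[0]).days if len(dates) > 1 else None
        for pattern, dates in failures.items()
    }

def harvest_effectiveness(failures):
    """Share of patterns that never repeated."""
    repeats = time_to_repeat(failures)
    no_repeat = sum(1 for days in repeats.values() if days is None)
    return no_repeat / len(repeats)

observed = {
    "Regex on nested config": [date(2025, 11, 12), date(2025, 12, 8)],
    "YAML on custom tags": [date(2025, 11, 15), date(2025, 12, 15)],
    "API without rate limit": [date(2025, 11, 20)],
    "Network without retry": [date(2025, 11, 22)],
}
print(time_to_repeat(observed))        # 26 and 30 days; None for the others
print(harvest_effectiveness(observed)) # 0.5 for this small sample
```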
Failure Libraries by Domain
Organize negative knowledge by domain:
knowledge/
├── parsing/
│   ├── config-parsing-failures.md
│   ├── log-parsing-failures.md
│   └── schema-inference-failures.md
├── network/
│   ├── timeout-failures.md
│   ├── retry-failures.md
│   └── rate-limit-failures.md
└── dependencies/
    ├── version-conflict-failures.md
    ├── circular-dependency-failures.md
    └── missing-dependency-failures.md
Each file documents:
- What was attempted
- Why it failed
- What works instead
- Conditions under which the failure pattern applies
Confidence Decay on Stale Failures
Failure patterns age. A failure from 2023 might not apply to 2026 tooling.
failure_pattern:
  id: "fp-regex-nested-config"
  rule: "Don't use regex on nested config structures"
  confidence: 0.95
  last_validated: "2026-01-15"
  first_observed: "2025-11-12"
  observation_count: 12
  decay_rate: 0.02  # 2% confidence loss per month without revalidation
If a failure pattern hasn't been revalidated in 6 months, confidence drops. If it drops below threshold (e.g., 0.6), require revalidation before applying the rule.
This prevents outdated negative knowledge from blocking valid approaches.
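A minimal sketch of that decay, assuming the linear per-month decay_rate shown in the record above and a 0.6 application threshold:

```python
from datetime import date

def decayed_confidence(confidence, last_validated, today, decay_rate=0.02):
    """Reduce confidence by `decay_rate` for each month without revalidation."""
    months_stale = ((today.year - last_validated.year) * 12
                    + (today.month - last_validated.month))
    return max(0.0, confidence - decay_rate * months_stale)

def rule_applies(confidence, last_validated, today, threshold=0.6):
    """A rule below the threshold needs revalidation before it can block anything."""
    return decayed_confidence(confidence, last_validated, today) >= threshold

# The pattern above: 0.95 confidence, last validated 2026-01-15.
print(decayed_confidence(0.95, date(2026, 1, 15), date(2026, 7, 15)))  # 0.83 after 6 months
print(rule_applies(0.95, date(2026, 1, 15), date(2027, 9, 15)))        # False: revalidate first
```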
Pre-Flight Failure Checks
Before attempting an approach, check negative knowledge:
# Agent planning to parse config with regex
ao check-approach "regex config parsing" --context "nested structures present"
# Returns:
# ⚠️ High failure risk (95% confidence, n=12)
# Pattern: "Regex fails on nested config structures"
# Recommendation: Use JSON parser with preprocessing
# Override: Proceed anyway (will be logged as informed decision)
Agents can override, but overrides are logged. If an override succeeds, that's data too — maybe the rule needs refinement.
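A sketch of the pre-flight gate and override logging, assuming a local list of failure patterns rather than the ao CLI; the keyword matching is deliberately crude, just enough to show the shape:

```python
def check_approach(approach, context, patterns, risk_threshold=0.8):
    """Warn when a planned approach matches a high-confidence failure pattern.

    `patterns` is a list of dicts like
    {"rule": "regex nested config", "confidence": 0.95, "recommendation": "..."}.
    """
    text = f"{approach} {context}".lower()
    for p in patterns:
        keywords = p["rule"].lower().split()
        if p["confidence"] >= risk_threshold and all(k in text for k in keywords):
            return {"risk": "high", "pattern": p["rule"],
                    "recommendation": p["recommendation"]}
    return {"risk": "low"}

def attempt_with_preflight(approach, context, patterns, proceed_anyway=False):
    """Block risky approaches by default; log overrides as informed decisions."""
    verdict = check_approach(approach, context, patterns)
    if verdict["risk"] == "high" and not proceed_anyway:
        print(f"Blocked: {verdict['pattern']} (try {verdict['recommendation']})")
        return False
    if verdict["risk"] == "high":
        print(f"Override logged: proceeding despite {verdict['pattern']}")
    return True

patterns = [{"rule": "regex nested config", "confidence": 0.95,
             "recommendation": "JSON parser with preprocessing"}]
attempt_with_preflight("regex config parsing", "nested structures present", patterns)
```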
Failure Clustering
Group similar failures to find patterns:
ao cluster-failures --time-range="last-30d"
# Output:
# Cluster 1: "Parsing custom formats" (15 failures)
# - 8: Regex on nested structures
# - 4: YAML on custom tags
# - 3: Schema inference on dynamic types
# Common theme: Tool mismatch for complexity level
# Recommendation: Add complexity pre-check before tool selection
# Cluster 2: "Network operations" (8 failures)
# - 4: Timeout without retry
# - 2: Rate limit without backoff
# - 2: DNS resolution in flaky network
# Common theme: Missing resilience patterns
# Recommendation: Add network operation wrapper with retries
Clusters reveal meta-patterns. Individual failures are data points. Clusters are insights.
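A toy version of the clustering step. A real system might cluster on embeddings; keyword buckets (the category keywords below are assumptions) are enough to surface the meta-patterns:

```python
from collections import defaultdict

# Hypothetical keyword buckets; swap in embedding-based clustering at scale.
CATEGORIES = {
    "parsing": ["parse", "regex", "yaml", "schema"],
    "network": ["timeout", "rate limit", "dns", "retry"],
    "dependencies": ["version", "dependency", "circular"],
}

def cluster_failures(failure_modes):
    """Bucket free-text failure modes into coarse clusters by keyword."""
    clusters = defaultdict(list)
    for mode in failure_modes:
        lower = mode.lower()
        bucket = next((name for name, kws in CATEGORIES.items()
                       if any(kw in lower for kw in kws)), "uncategorized")
        clusters[bucket].append(mode)
    return dict(clusters)

print(cluster_failures([
    "Regex failed on nested structures",
    "Standard YAML parser rejected custom tags",
    "Request timeout with no retry",
]))
# {'parsing': [two entries], 'network': [one entry]}
```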
Success Rate by Approach
Track which approaches work and which don't:
Approach Success Rates (Config Parsing)
┌─────────────────────────────────────────────────┐
│ Approach              Attempts  Success    Rate │
├─────────────────────────────────────────────────┤
│ JSON preprocessing          18       16   88.9% │
│ Custom parser               12       11   91.7% │
│ YAML standard                9        4   44.4% │
│ Regex                        8        2   25.0% │
└─────────────────────────────────────────────────┘
This isn't just negative knowledge (regex bad) — it's a ranking of approaches. An agent should try the custom parser first (highest success rate), then JSON preprocessing (second highest).
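Ranking is just sorting by observed success rate, with a minimum sample size so one lucky success can't jump the queue (the n >= 3 floor mirrors the anti-pattern guidance below and is an assumption):

```python
def rank_approaches(stats, min_attempts=3):
    """Order approaches by success rate, ignoring those with too few attempts."""
    ranked = [
        (name, successes / attempts, attempts)
        for name, (attempts, successes) in stats.items()
        if attempts >= min_attempts
    ]
    return sorted(ranked, key=lambda row: row[1], reverse=True)

# (attempts, successes) from the table above.
stats = {
    "JSON preprocessing": (18, 16),
    "Custom parser": (12, 11),
    "YAML standard": (9, 4),
    "Regex": (8, 2),
}
for name, rate, n in rank_approaches(stats):
    print(f"{name}: {rate:.1%} (n={n})")
# Custom parser first (91.7%), then JSON preprocessing (88.9%), ...
```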
Failure Retrospectives
Quarterly, review all failures:
ao retro failures --quarter=Q1-2026
# Agenda:
# 1. Failure rate trend (are we learning?)
# 2. Repeat failures (what didn't we harvest?)
# 3. Novel failure patterns (new classes of problems?)
# 4. Obsolete negative knowledge (rules we can retire?)
# 5. Gaps in documentation (where do failures cluster?)
Retrospectives turn failures into process improvements. High failure rate in a domain? Add pre-checks. Lots of repeats? Improve harvest automation. Novel patterns? Update agent training.
Implementation Checklist
Basic harvest:
- Log all failed attempts with context
- Extract structured failure records after task completion
- Store failures in searchable knowledge base
- Track failure rate over time
Advanced harvest:
- Cluster failures to find patterns
- Generate negative knowledge rules
- Integrate negative knowledge into agent planning
- Implement confidence decay for stale failures
- Track time-to-repeat for failure patterns
- Build domain-specific failure libraries
- Add pre-flight failure checks
- Run quarterly failure retrospectives
Expert harvest:
- Automate failure extraction from session transcripts
- Generate synthetic training data from failure patterns
- Build approach ranking based on success rates
- Implement override tracking for negative knowledge
- Create failure dashboards for visibility
- Feed failures into model fine-tuning pipelines
- Measure harvest effectiveness (repeat rate)
- Build tooling to validate failure patterns
Anti-Patterns
Silent Failures
Agent fails, retries, succeeds. No record of the failure. Next agent repeats it.
Why it's bad: You paid tuition and got no degree. Silent failures are wasted learning opportunities.
Fix: Log every failed attempt, even if subsequent attempt succeeds.
Failure Blame Culture
The team sees a high volume of harvested failures as "look how much we're failing."
Why it's bad: Discourages transparent logging. Agents hide failures. Learning stops.
Fix: Reframe failures as tuition payments. Celebrate harvest, not absence of failure. Declining failure rate is the success metric.
Overfitting Negative Knowledge
One failure → "Never do X." Agent blocks valid approaches based on single data point.
Why it's bad: False negatives. You prevent solutions that would work.
Fix: Require minimum observation count (n>3) before promoting failure to rule. Use confidence thresholds.
Harvesting Without Application
You log failures beautifully. Nobody reads them. Agents repeat failures anyway.
Why it's bad: Harvest without application is theater. Knowledge that doesn't inform decisions is trivia.
Fix: Integrate negative knowledge into agent planning. Require agents to query failures before attempting approaches. Track whether harvest reduces repeat rate.
Stale Negative Knowledge
Rule from 2023: "API X is unreliable." API was fixed in 2024. Agent still avoids it.
Why it's bad: Negative knowledge becomes superstition. You block approaches that would work now.
Fix: Implement confidence decay. Revalidate old rules. Retire obsolete negative knowledge.
Analysis Paralysis
Before attempting anything, check 50 failure patterns. Planning takes longer than execution.
Why it's bad: Perfect harvest, zero velocity. The goal is faster learning, not perfect knowledge.
Fix: Quick pre-flight checks on major decisions. Don't block on every micro-decision. Trust agents to log surprises.
Why This Is Optional
Factors I-IX get you 80% of the value: research before implementation, explicit transitions, tight feedback loops, quality pools, knowledge accumulation. You can run effective agent workflows without harvesting failures.
Factor XII is about the last 20%: compound learning efficiency. You're already learning from successes (Factor X: Knowledge Flywheel). Harvesting failures adds negative knowledge — what doesn't work and why.
This matters at scale. One agent, one project? Manual learning is fine. Ten agents, ten projects, six months? Unharvested failures become expensive. Agents repeat mistakes. Compute costs compound.
Start with Factors I-IX. Add Factor XII when:
- You're running multiple agents on similar tasks
- Failure rate isn't declining over time
- You notice agents repeating mistakes
- Compute costs are rising without proportional output increase
- You want compound learning across agent generations
Without Tooling
Since this is a Scale tier factor, the "without tooling" approach is simpler: document failures alongside successes.
The Failure Log
Keep a failures/ directory in your project or a ## What Didn't Work section in every session log:
# Failed Approach: Regex Config Parsing
**Date:** 2026-02-15
**Task:** Parse nested configuration
**What we tried:** Regex-based parsing
**Why it failed:** Nested structures break regex (no recursive matching)
**What worked instead:** JSON preprocessing + standard parser
**Time wasted:** ~45 minutes
Pre-Implementation Failure Check
Before starting any task, search your failure logs:
grep -r "config pars" failures/
If someone already failed at this approach, skip it.
Team Practice
- In code reviews, ask: "What did you try before this approach?"
- In retrospectives, dedicate time to harvested failures
- In onboarding, share the failure log alongside the codebase tour
The failures are the tuition. The log is the degree. Don't pay tuition twice.
The Payoff
Harvesting failures turns compute waste into knowledge assets. Every failed attempt that doesn't repeat is compute saved forever. The return compounds.
Month 1: High failure rate, high learning cost. Month 6: Declining failure rate, harvested wisdom preventing repeats. Month 12: New agents start with negative knowledge from all prior failures.
The system gets smarter not because agents get smarter, but because the knowledge base gets wiser. That's the multiplier effect of harvesting failures.
You paid tuition. Extract the degree.