V. Measure Everything

Track the five core metrics. Know when things break.

You can't improve what you don't measure. Data beats intuition every time.


The Problem

Without Telemetry

Agents without metrics:

  • Fail silently with no trace
  • Can't measure improvement
  • No data for debugging
  • Decisions based on intuition
  • Can't identify bottlenecks

With Telemetry

Comprehensive observability enables:

  • Data-driven decisions
  • Clear failure attribution
  • Continuous optimization
  • Proven improvements
  • Capacity planning

The Solution

Hope-Driven Operations

Agent runs, something breaks

User: "What went wrong?"

Agent: "Unknown"

Result:

  • 2 hours of guessing
  • No data to prove fixes work
  • Repeat same mistakes

Data-Driven Operations

Agent runs, telemetry collected

Dashboard shows:

  • research-agent: 30s, success
  • plan-agent: ERROR at 14:32:15
  • Root cause: Invalid input format

Result:

  • 30 seconds to identify issue
  • 5 minutes to fix
  • 24x faster debugging

The Four Golden Signals

From Google SRE, applied to agents:

Latency

How long did it take?
Target: <2m per phase

Traffic

How much load?
Target: 5-12/hour

Errors

What failed?
Target: >95% success

Saturation

How close to limits?
Target: <40% context
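These targets are easier to enforce when they live in code rather than in a doc, so alert rules and tests can share them. A minimal sketch of encoding them as configuration; the names and structure are hypothetical, the thresholds mirror the targets above:

```python
# Hypothetical module: golden_signals.py
# Thresholds mirror the four golden-signal targets listed above.
GOLDEN_SIGNAL_TARGETS = {
    'latency_seconds_per_phase': 120,        # <2m per phase
    'traffic_executions_per_hour': (5, 12),  # expected load band
    'success_rate_min': 0.95,                # >95% success
    'context_utilization_max_percent': 40,   # <40% context
}

def phase_within_latency_target(duration_seconds: float) -> bool:
    """True if a single phase met the latency target."""
    return duration_seconds < GOLDEN_SIGNAL_TARGETS['latency_seconds_per_phase']
```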


The Three Pillars of Observability

::: info Complete Visibility
Every agent execution must emit:

  1. Metrics - Quantitative data (latency, success rate, resource usage)
  2. Logs - Event streams (what happened, what was decided, what failed)
  3. Traces - Request paths (end-to-end workflow, bottlenecks)
:::

Metrics Example

| Metric | Value | Status |
| --- | --- | --- |
| Research Phase | 45s | Pass |
| Plan Phase | 30s | Pass |
| Implement Phase | 120s | Slow |
| Total Duration | 195s (3m 15s) | Pass |
| Success Rate | 95% (190/200) | Pass |
| Context Usage | 35% | Pass |

Why It Works

::: info SRE Wisdom
Google SRE principle:

"Hope is not a strategy. Data-driven decisions beat intuition every time."

For AI agents:

  • Metrics reveal patterns invisible to humans
  • Logs provide debugging breadcrumbs
  • Traces show system behavior under load
:::

The Feedback Loop

No telemetry: Execute → No data → No improvement → Stagnation

With telemetry: Execute → Collect data → Analyze patterns → Improve → Execute better

Result: Continuous improvement based on real data
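The "analyze" step does not need heavy tooling to start. A minimal sketch, assuming execution records with an invented shape, that answers the one question the loop cares about: did the last change actually help?

```python
from dataclasses import dataclass

@dataclass
class ExecutionRecord:
    # Hypothetical shape of one collected data point
    agent_name: str
    duration_seconds: float
    succeeded: bool

def success_rate(records: list[ExecutionRecord]) -> float:
    """Fraction of successful executions in a window."""
    if not records:
        return 0.0
    return sum(r.succeeded for r in records) / len(records)

def improved(previous: list[ExecutionRecord],
             current: list[ExecutionRecord]) -> bool:
    """Close the loop: compare the latest window to the one before it."""
    return success_rate(current) > success_rate(previous)
```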


Implementation

Metrics Collection

::: code-group

```python [metrics.py]
from prometheus_client import Counter, Histogram, Gauge

# Counters
agent_executions = Counter(
    'agent_executions_total',
    'Total agent executions',
    ['agent_name', 'status']
)

agent_errors = Counter(
    'agent_errors_total',
    'Total agent errors',
    ['agent_name', 'error_type']
)

# Histograms
agent_duration = Histogram(
    'agent_duration_seconds',
    'Agent execution duration',
    ['agent_name', 'phase']
)

# Gauges
context_utilization = Gauge(
    'agent_context_utilization_percent',
    'Current context window utilization',
    ['agent_name']
)
```

```python [research_agent.py]
import time

from metrics import (
    agent_executions, agent_errors,
    agent_duration, context_utilization
)

class ResearchAgent:
    def execute(self, task):
        agent_executions.labels(
            agent_name='research',
            status='started'
        ).inc()
        start_time = time.time()

        try:
            result = self.research(task)

            # Record success
            duration = time.time() - start_time
            agent_duration.labels(
                agent_name='research',
                phase='execute'
            ).observe(duration)

            agent_executions.labels(
                agent_name='research',
                status='success'
            ).inc()

            context_utilization.labels(
                agent_name='research'
            ).set(self.get_context_usage())

            return result

        except Exception as e:
            # Record failure
            agent_errors.labels(
                agent_name='research',
                error_type=type(e).__name__
            ).inc()

            agent_executions.labels(
                agent_name='research',
                status='failure'
            ).inc()
            raise
```

:::
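For Prometheus to collect any of this, the agent process must expose a metrics endpoint. prometheus_client ships a built-in HTTP server for exactly that; the port below is arbitrary:

```python
from prometheus_client import start_http_server

# Expose all registered metrics at http://localhost:8000/metrics
# so a Prometheus scrape job can collect them.
start_http_server(8000)
```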


Structured Logging

::: code-group

```python [structured_logger.py]
import logging
import json
from datetime import datetime

class StructuredLogger:
    def __init__(self, agent_name):
        self.agent_name = agent_name
        self.logger = logging.getLogger(agent_name)

    def log(self, level, message, **kwargs):
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'agent': self.agent_name,
            'level': level,
            'message': message,
            **kwargs
        }
        self.logger.log(
            getattr(logging, level),
            json.dumps(log_entry)
        )
```

```python [usage.py]
from structured_logger import StructuredLogger

logger = StructuredLogger('research-agent')

logger.log('INFO',
    'Research phase started',
    task_id='12345',
    phase='research'
)

logger.log('ERROR',
    'API timeout',
    task_id='12345',
    error='TimeoutError',
    duration_ms=30000
)
```

:::
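One caveat: logging.getLogger emits nothing until a handler is configured. A minimal one-time setup, assuming the log pipeline tails stdout (as Loki-style shippers typically do):

```python
import logging
import sys

# One-time setup: emit each JSON entry as a single line on stdout,
# where a log shipper can pick it up.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format='%(message)s'
)
```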


Distributed Tracing

Workflow Span (top-level span, full execution path)

  • Research Span - child span, 30s duration
  • Plan Span - child span, 45s duration
  • Implement Span - child span, 120s duration

::: code-group

```python [tracing_setup.py]
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: register a tracer provider and ship spans to an OTLP collector
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
span_processor = BatchSpanProcessor(OTLPSpanExporter())
trace.get_tracer_provider().add_span_processor(span_processor)
```

```python [workflow.py]
from tracing_setup import tracer

def execute_workflow(task):
    with tracer.start_as_current_span("workflow") as workflow_span:
        workflow_span.set_attribute("task.id", task.id)

        # Research phase
        with tracer.start_as_current_span("research") as research_span:
            research_result = research_agent.execute(task)
            research_span.set_attribute("result.size", len(research_result))

        # Plan phase
        with tracer.start_as_current_span("plan") as plan_span:
            plan_result = plan_agent.execute(research_result)
            plan_span.set_attribute("plan.complexity", plan_result.complexity)

        # Implement phase
        with tracer.start_as_current_span("implement") as impl_span:
            impl_result = implement_agent.execute(plan_result)
            impl_span.set_attribute("files.changed", len(impl_result.files))

        return impl_result
```

:::
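Before an OTLP collector exists, the SDK's console exporter is a quick way to verify the span hierarchy locally; finished spans print straight to the terminal. A minimal sketch:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Local-dev setup: print each finished span to stdout instead of
# shipping it to an OTLP collector.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
```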


Real-World Evidence

Before Telemetry

Problem diagnosis:

User: "The agent failed"

Dev: "What failed?"

User: "I don't know, it just didn't work"

Dev: "What was the error?"

User: "There was no error message"

Dev: "What step did it fail on?"

User: "Not sure"

Result: 2 hours of guessing and reproducing

After Telemetry

Problem diagnosis:

User: "The agent failed"

Dev: Checks dashboard

  • Trace ID: abc-123
  • Failed at: plan-agent execution
  • Error: ValidationError: missing 'namespace'
  • Context: Task ID 456, 14:32:15 UTC
  • Root cause: research-agent missing namespace

Result: 30 seconds to identify, 5 minutes to fix

Improvement: 24x faster debugging


Production Insights

Telemetry Revealed (200+ Sessions)

| Finding | Impact |
| --- | --- |
| 85% of failures in implementation phase | Added validation gate before implement |
| 95th percentile context at 38% | Confirmed 40% rule is safe |
| ValidationError most common (12%) | Improved input validation, reduced to 3% |
| Average workflow takes 3m 15s | Optimized implement phase: 180s to 120s |
| Peak load 12 executions/hour | Scaled for 15/hour headroom |

Result: Data-driven improvements, not guesswork


Dashboard Queries

::: code-group

```promql [queries.promql]
# Success rate (last 24h)
sum(rate(agent_executions_total{status="success"}[24h]))
  /
sum(rate(agent_executions_total[24h]))

# Average execution time by agent
avg(agent_duration_seconds_sum / agent_duration_seconds_count)
  by (agent_name)

# Error rate by type
sum(rate(agent_errors_total[5m]))
  by (error_type)

# Context utilization (95th percentile over the last 24h)
quantile_over_time(0.95,
  agent_context_utilization_percent[24h]
)
```

:::
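These queries can also be run programmatically against Prometheus's standard HTTP API, for example to feed a nightly report. A sketch, with the Prometheus base URL as a placeholder:

```python
import requests

# Placeholder host; the /api/v1/query endpoint itself is Prometheus's
# standard instant-query API.
PROM_URL = 'http://prometheus:9090/api/v1/query'

query = (
    'sum(rate(agent_executions_total{status="success"}[24h])) '
    '/ sum(rate(agent_executions_total[24h]))'
)

resp = requests.get(PROM_URL, params={'query': query})
resp.raise_for_status()

# Instant queries return a vector of {metric, value} samples.
for sample in resp.json()['data']['result']:
    print('success rate:', sample['value'][1])
```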


Implementation Checklist

  • Instrument agents with metrics, logs, traces
  • Set up collection infrastructure (Prometheus, Loki, Jaeger)
  • Create dashboards for key metrics
  • Define SLOs (Service Level Objectives)
  • Set up alerts for critical failures

Anti-Patterns

The "Logging is Enough" Trap

Wrong: Only collect logs, no metrics

Right: Metrics + Logs + Traces (all three pillars)

The "Too Much Data" Trap

Wrong: Log everything at DEBUG level (noise)

Right: Strategic logging, sample high-volume data

The "Manual Metrics" Trap

Wrong: Manually track metrics in spreadsheets

Right: Automated telemetry collection

The "No Retention Policy" Trap

Wrong: Keep all telemetry forever (storage costs)

Right: Aggregate old data, retain raw 30-90 days


Telemetry Design Patterns

Pattern 1: Hierarchical Metrics

Workflow metrics (top-level)

  • Phase metrics (research, plan, implement)
  • Agent metrics (per agent execution)
  • Tool metrics (per tool invocation)

Pattern 2: Correlation IDs

```python
import uuid

class Workflow:
    def __init__(self):
        self.correlation_id = str(uuid.uuid4())

    def execute(self, task):
        # All logs/metrics share correlation_id
        logger.log('INFO', 'Workflow started',
            correlation_id=self.correlation_id)

        research_result = research_agent.execute(
            task, correlation_id=self.correlation_id)
        plan_result = plan_agent.execute(
            research_result, correlation_id=self.correlation_id)

        logger.log('INFO', 'Workflow completed',
            correlation_id=self.correlation_id)

        return plan_result

# Query: WHERE correlation_id = 'abc-123'
# Result: All related log entries
```

Pattern 3: Telemetry-Driven Alerts

```yaml
# alerts.yml
groups:
  - name: agent_alerts
    rules:
      - alert: HighErrorRate
        # Errors as a fraction of all executions over 5 minutes
        expr: |
          sum(rate(agent_errors_total[5m]))
            / sum(rate(agent_executions_total[5m])) > 0.05
        annotations:
          summary: "Agent error rate above 5%"

      - alert: ContextUtilizationHigh
        expr: agent_context_utilization_percent > 40
        annotations:
          summary: "Context exceeds 40% threshold"

      - alert: SlowExecution
        # 95th-percentile duration computed from histogram buckets
        expr: |
          histogram_quantile(0.95,
            sum(rate(agent_duration_seconds_bucket[10m])) by (le, agent_name)
          ) > 300
        annotations:
          summary: "Agent taking longer than 5 minutes"
```

Related Factors

| Factor | Relationship |
| --- | --- |
| I. Automated Tracking | Git logs decisions, telemetry logs execution |
| IV. Continuous Validation | Measure validation effectiveness |
| VI. Resume Work | Track multi-session workflows |
| IX. Mine Patterns | Telemetry reveals patterns |