V. Measure Everything
You can't improve what you don't measure. Data beats intuition every time.
The Problem
Without Telemetry
Agents without metrics:
- Fail silently with no trace
- Can't measure improvement
- No data for debugging
- Decisions based on intuition
- Can't identify bottlenecks
With Telemetry
Comprehensive observability enables:
- Data-driven decisions
- Clear failure attribution
- Continuous optimization
- Proven improvements
- Capacity planning
The Solution
Hope-Driven Operations
Agent runs, something breaks
User: "What went wrong?"
Agent: "Unknown"
Result:
- 2 hours of guessing
- No data to prove fixes work
- Repeat same mistakes
Data-Driven Operations
Agent runs, telemetry collected
Dashboard shows:
- research-agent: 30s, success
- plan-agent: ERROR at 14:32:15
- Root cause: Invalid input format
Result:
- 30 seconds to identify issue
- 5 minutes to fix
- 24x faster debugging
The Four Golden Signals
From Google SRE, applied to agents:
| Signal | Question | Target |
|---|---|---|
| Latency | How long did it take? | <2 min per phase |
| Traffic | How much load? | 5-12 executions/hour |
| Errors | What failed? | >95% success rate |
| Saturation | How close to limits? | <40% context utilization |
The Three Pillars of Observability
::: info Complete Visibility
Every agent execution must emit:
- Metrics - Quantitative data (latency, success rate, resource usage)
- Logs - Event streams (what happened, what was decided, what failed)
- Traces - Request paths (end-to-end workflow, bottlenecks)
:::
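Taken together, the three pillars can come from a single instrumented call. A minimal sketch under assumed names (`demo_duration`, `run_once`, and the `agent`/`task` objects are illustrative; the Implementation section below defines the full metric set):

```python
import json
import logging
import time

from opentelemetry import trace
from prometheus_client import Histogram

# Illustrative instruments only.
demo_duration = Histogram('demo_agent_duration_seconds',
                          'Demo agent execution duration', ['agent_name'])
tracer = trace.get_tracer(__name__)
log = logging.getLogger('research-agent')


def run_once(agent, task):
    with tracer.start_as_current_span('research') as span:              # trace
        span.set_attribute('task.id', task.id)
        start = time.time()
        result = agent.execute(task)
        duration = time.time() - start
        demo_duration.labels(agent_name='research').observe(duration)   # metric
        log.info(json.dumps({'event': 'research finished',              # log
                             'task_id': task.id,
                             'duration_s': round(duration, 2)}))
        return result
```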
Metrics Example
| Metric | Value | Status |
|---|---|---|
| Research Phase | 45s | Pass |
| Plan Phase | 30s | Pass |
| Implement Phase | 120s | Slow |
| Total Duration | 195s (3m 15s) | Pass |
| Success Rate | 95% (190/200) | Pass |
| Context Usage | 35% | Pass |
Why It Works
::: info SRE Wisdom
The Google SRE principle: "Hope is not a strategy." Data-driven decisions beat intuition every time.
For AI agents:
- Metrics reveal patterns invisible to humans
- Logs provide debugging breadcrumbs
- Traces show system behavior under load
:::
The Feedback Loop
No telemetry: Execute → No data → No improvement → Stagnation
With telemetry: Execute → Collect data → Analyze patterns → Improve → Execute better
Result: Continuous improvement based on real data
Implementation
Metrics Collection
::: code-group
```python
import time

from prometheus_client import Counter, Histogram, Gauge

# Counters
agent_executions = Counter(
    'agent_executions_total',
    'Total agent executions',
    ['agent_name', 'status']
)

agent_errors = Counter(
    'agent_errors_total',
    'Total agent errors',
    ['agent_name', 'error_type']
)

# Histograms
agent_duration = Histogram(
    'agent_duration_seconds',
    'Agent execution duration',
    ['agent_name', 'phase']
)

# Gauges
context_utilization = Gauge(
    'agent_context_utilization_percent',
    'Current context window utilization',
    ['agent_name']
)


class ResearchAgent:
    def execute(self, task):
        agent_executions.labels(
            agent_name='research',
            status='started'
        ).inc()
        start_time = time.time()

        try:
            result = self.research(task)

            # Record success
            duration = time.time() - start_time
            agent_duration.labels(
                agent_name='research',
                phase='execute'
            ).observe(duration)
            agent_executions.labels(
                agent_name='research',
                status='success'
            ).inc()
            context_utilization.labels(
                agent_name='research'
            ).set(self.get_context_usage())

            return result

        except Exception as e:
            # Record failure
            agent_errors.labels(
                agent_name='research',
                error_type=type(e).__name__
            ).inc()
            agent_executions.labels(
                agent_name='research',
                status='failure'
            ).inc()
            raise
```
:::
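To make these metrics scrapeable, the process also needs to expose an HTTP endpoint. A minimal usage sketch, assuming the Prometheus pull model, a free port 8000, and a `task` object supplied by your workflow:

```python
from prometheus_client import start_http_server

# Serve the default registry at :8000/metrics for Prometheus to scrape.
# The port and scrape interval are deployment choices, not requirements.
start_http_server(8000)

agent = ResearchAgent()
result = agent.execute(task)  # metrics are recorded as a side effect of execute()
```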
Structured Logging
::: code-group
```python
import logging
import json
from datetime import datetime


class StructuredLogger:
    def __init__(self, agent_name):
        self.agent_name = agent_name
        self.logger = logging.getLogger(agent_name)

    def log(self, level, message, **kwargs):
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'agent': self.agent_name,
            'level': level,
            'message': message,
            **kwargs
        }
        self.logger.log(
            getattr(logging, level),
            json.dumps(log_entry)
        )


logger = StructuredLogger('research-agent')

logger.log('INFO',
    'Research phase started',
    task_id='12345',
    phase='research'
)

logger.log('ERROR',
    'API timeout',
    task_id='12345',
    error='TimeoutError',
    duration_ms=30000
)
```
:::
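The logger above hands its JSON payload to the standard `logging` module, so something still has to route records to an output. A sketch that writes single-line JSON to stdout (handler choice and log level are assumptions):

```python
import logging
import sys

# Emit each record as the raw JSON string produced by StructuredLogger,
# one per line, so shippers like Loki or Fluent Bit can ingest it directly.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter('%(message)s'))
logging.basicConfig(level=logging.INFO, handlers=[handler])
```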
Distributed Tracing
A workflow produces a span hierarchy:

- Workflow span - top-level trace covering the full execution path
  - Research span - child span, 30s duration
  - Plan span - child span, 45s duration
  - Implement span - child span, 120s duration
::: code-group
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

span_processor = BatchSpanProcessor(OTLPSpanExporter())
trace.get_tracer_provider().add_span_processor(span_processor)


def execute_workflow(task):
    with tracer.start_as_current_span("workflow") as workflow_span:
        workflow_span.set_attribute("task.id", task.id)

        # Research phase
        with tracer.start_as_current_span("research") as research_span:
            research_result = research_agent.execute(task)
            research_span.set_attribute("result.size", len(research_result))

        # Plan phase
        with tracer.start_as_current_span("plan") as plan_span:
            plan_result = plan_agent.execute(research_result)
            plan_span.set_attribute("plan.complexity", plan_result.complexity)

        # Implement phase
        with tracer.start_as_current_span("implement") as impl_span:
            impl_result = implement_agent.execute(plan_result)
            impl_span.set_attribute("files.changed", len(impl_result.files))

        return impl_result
```
:::
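When no OTLP collector or Jaeger backend is running locally, the same spans can be printed to the console. A sketch using the SDK's console exporter (this setup is an assumption, intended for local debugging only):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print every span to stdout as soon as it ends -- handy for verifying
# instrumentation before wiring up an OTLP backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```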
Real-World Evidence
Before Telemetry
Problem diagnosis:
User: "The agent failed"
Dev: "What failed?"
User: "I don't know, it just didn't work"
Dev: "What was the error?"
User: "There was no error message"
Dev: "What step did it fail on?"
User: "Not sure"
Result: 2 hours of guessing and reproducing
After Telemetry
Problem diagnosis:
User: "The agent failed"
Dev: Checks dashboard
- Trace ID: abc-123
- Failed at: plan-agent execution
- Error: ValidationError: missing 'namespace'
- Context: Task ID 456, 14:32:15 UTC
- Root cause: research-agent missing namespace
Result: 30 seconds to identify, 5 minutes to fix
Improvement: 24x faster debugging
Production Insights
Telemetry Revealed (200+ Sessions)
| Finding | Impact |
|---|---|
| 85% of failures in implementation phase | Added validation gate before implement |
| 95th percentile context at 38% | Confirmed 40% rule is safe |
| ValidationError most common (12%) | Improved input validation, reduced to 3% |
| Average workflow takes 3m 15s | Optimized implement phase: 180s to 120s |
| Peak load 12 executions/hour | Scaled for 15/hour headroom |
Result: Data-driven improvements, not guesswork
Dashboard Queries
::: code-group
```promql
# Success rate (last 24h)
sum(rate(agent_executions_total{status="success"}[24h]))
  /
sum(rate(agent_executions_total[24h]))

# Average execution time by agent
avg(agent_duration_seconds_sum / agent_duration_seconds_count) by (agent_name)

# Error rate by type
sum(rate(agent_errors_total[5m])) by (error_type)

# Context utilization (95th percentile over the last 24h)
quantile_over_time(0.95, agent_context_utilization_percent[24h])
```
:::
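The same expressions can be pulled programmatically for reports or automated checks. A sketch against the Prometheus HTTP API, assuming a server reachable at the hypothetical `http://prometheus:9090`:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: point at your own server


def instant_query(promql: str) -> list:
    """Run an instant query and return the raw result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


# Example: the 24h success rate from the dashboard queries above
success_rate = instant_query(
    'sum(rate(agent_executions_total{status="success"}[24h]))'
    ' / sum(rate(agent_executions_total[24h]))'
)
```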
Implementation Checklist
- Instrument agents with metrics, logs, traces
- Set up collection infrastructure (Prometheus, Loki, Jaeger)
- Create dashboards for key metrics
- Define SLOs (Service Level Objectives) - see the sketch after this list
- Set up alerts for critical failures
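For the "Define SLOs" step, one lightweight option is to encode the golden-signal targets as data so they can be checked automatically. A sketch (the structure and names are assumptions; the thresholds mirror the targets above):

```python
# Hypothetical SLO table mirroring the golden-signal targets.
SLOS = {
    "latency_p95_seconds":     {"target": 120,  "direction": "below"},
    "success_rate":            {"target": 0.95, "direction": "above"},
    "context_utilization_pct": {"target": 40,   "direction": "below"},
}


def slo_met(name: str, measured: float) -> bool:
    """Return True if the measured value satisfies the named SLO."""
    slo = SLOS[name]
    if slo["direction"] == "below":
        return measured < slo["target"]
    return measured > slo["target"]


print(slo_met("latency_p95_seconds", 95.0))  # True: a p95 of 95s is under 2 minutes
```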
Anti-Patterns
The "Logging is Enough" Trap
Wrong: Only collect logs, no metrics
Right: Metrics + Logs + Traces (all three pillars)
The "Too Much Data" Trap
Wrong: Log everything at DEBUG level (noise)
Right: Strategic logging, sample high-volume data
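One way to sample high-volume data without losing warnings is a logging filter that drops most low-severity records. A sketch (the 10% sample rate is an arbitrary assumption):

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every WARNING-and-above record, but only a sample of DEBUG/INFO."""

    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.sample_rate


logging.getLogger('research-agent').addFilter(SamplingFilter(0.1))
```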
The "Manual Metrics" Trap
Wrong: Manually track metrics in spreadsheets
Right: Automated telemetry collection
The "No Retention Policy" Trap
Wrong: Keep all telemetry forever (storage costs)
Right: Aggregate old data, retain raw 30-90 days
Telemetry Design Patterns
Pattern 1: Hierarchical Metrics
- Workflow metrics (top-level)
  - Phase metrics (research, plan, implement)
    - Agent metrics (per agent execution)
      - Tool metrics (per tool invocation)
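A sketch of how this hierarchy can map onto the labels of a single duration metric, so queries roll up to any level by dropping labels (the metric and label names are assumptions):

```python
from prometheus_client import Histogram

# One histogram labelled at every level of the hierarchy.
step_duration = Histogram(
    'workflow_step_duration_seconds',
    'Duration of each step, labelled by workflow, phase, agent, and tool',
    ['workflow', 'phase', 'agent', 'tool']
)

# Tool-level observation; aggregate over 'tool' (or 'agent', 'phase')
# in queries to get the higher-level views.
step_duration.labels(
    workflow='research-plan-implement',
    phase='research',
    agent='research-agent',
    tool='web_search'
).observe(4.2)
```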
Pattern 2: Correlation IDs
```python
import uuid


class Workflow:
    def __init__(self):
        self.correlation_id = str(uuid.uuid4())

    def execute(self, task):
        # All logs/metrics share correlation_id
        logger.log('INFO', 'Workflow started',
                   correlation_id=self.correlation_id)

        research_result = research_agent.execute(
            task, correlation_id=self.correlation_id)
        plan_result = plan_agent.execute(
            research_result, correlation_id=self.correlation_id)

        logger.log('INFO', 'Workflow completed',
                   correlation_id=self.correlation_id)


# Query: WHERE correlation_id = 'abc-123'
# Result: All related log entries
```
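Threading the ID through every call signature works but is easy to forget. An alternative sketch using the standard library's `contextvars`, reusing the `StructuredLogger` instance from the logging example (the variable and function names are assumptions):

```python
import contextvars
import uuid

# Visible to any code running in the same thread/async context.
correlation_id = contextvars.ContextVar('correlation_id', default=None)


def start_workflow(task):
    correlation_id.set(str(uuid.uuid4()))
    logger.log('INFO', 'Workflow started',
               correlation_id=correlation_id.get())
    # Downstream agents call correlation_id.get() instead of
    # receiving the ID as an explicit argument.
```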
Pattern 3: Telemetry-Driven Alerts
```yaml
# alerts.yml
groups:
  - name: agent_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(agent_errors_total[5m])) / sum(rate(agent_executions_total{status="started"}[5m])) > 0.05
        annotations:
          summary: "Agent error rate above 5%"

      - alert: ContextUtilizationHigh
        expr: agent_context_utilization_percent > 40
        annotations:
          summary: "Context exceeds 40% threshold"

      - alert: SlowExecution
        expr: histogram_quantile(0.95, sum(rate(agent_duration_seconds_bucket[5m])) by (le, agent_name)) > 300
        annotations:
          summary: "Agent taking longer than 5 minutes"
```
Related Factors
| Factor | Relationship |
|---|---|
| I. Automated Tracking | Git logs decisions, telemetry logs execution |
| IV. Continuous Validation | Measure validation effectiveness |
| VI. Resume Work | Track multi-session workflows |
| IX. Mine Patterns | Telemetry reveals patterns |