Skip to content
all factors
IIfoundation

Context Loading

Load when needed, not all at once. Stay under 40%.

20 min read

Factor II: Context Loading

Keep main context clean; delegate work to sub-agents with isolated context windows

AspectDetails
Primary PillarContext Engineering
Supporting PillarLearning Science
Enforces LawsLaw 2 (Improve System), Law 4 (Validate Before Execute)
Derived FromManage cognitive load + Progressive complexity + Just-in-time loading

Summary

The main agent context window should stay lean and focused. When deep work is needed (research, exploration, analysis), delegate to sub-agents with their own isolated context windows. Sub-agents load necessary context just-in-time, complete their work, and report back concise results. The main agent orchestrates without bloating its own context.

Why This Factor Exists

Pillar Grounding:

This factor derives from the Context Engineering pillar:

AI context windows are finite and degrade under cognitive load—just like human working memory. The 40% threshold isn't arbitrary; it emerges from empirical observation across cognitive systems. When context utilization exceeds 40%, quality degrades: ADHD hyperfocus collapses, LLM outputs become inconsistent, database query performance drops. Context Engineering recognizes this universal pattern and mandates Just-In-Time (JIT) loading: load what you need when you need it, then discard. This prevents the monolithic "load everything upfront" antipattern that causes inevitable context saturation.

From cognitive load theory (Sweller, 1988): Working memory has limited capacity. When overwhelmed, learners experience cognitive overload and performance degrades. The same principle applies to AI context windows—exceeding capacity triggers collapse. JIT loading respects these limits by delegating deep work to isolated sub-agents, each with their own clean context window.

Supporting from Learning Science:

Progressive complexity is fundamental to learning: start simple, add layers gradually. Context Loading applies this principle operationally. The main agent maintains high-level orchestration (simple), while sub-agents handle deep complexity (complex work in isolation). Each sub-agent loads only what's needed for its specific task, returns a compressed summary, and discards its context. This separation prevents the mixing of abstraction levels that causes cognitive overload in both humans and AI systems.


The Problem

Monolithic agent context becomes bloated and unreliable:

  • Main agent tries to hold all information needed for every task
  • Context window fills with details only relevant to subtasks
  • Orchestration logic drowns in task-specific noise
  • Main agent loses focus on high-level goals
  • Quality degrades as context saturates

Anti-pattern:

Main Agent Context (80% full):
├─ High-level task goals (5%)
├─ Orchestration logic (5%)
├─ Codebase exploration results (25%)
├─ Research findings (20%)
├─ Implementation details (15%)
├─ Validation results (10%)
└─ Result: Context collapse, lost focus

Traditional approach: Main agent does everything in its own context.

12-Factor AgentOps approach: Main agent stays clean, delegates deep work to sub-agents with isolated context.


The Principle

Sub-Agent Architecture for JIT Loading

Core insight: Each task gets its own context window through sub-agent delegation.

Architecture:

Main Agent (10-20% context):
├─ High-level goals and orchestration logic
├─ Task routing decisions
├─ Concise results from sub-agents
└─ Next steps planning

Sub-Agent 1: Codebase Exploration (fresh context)
├─ Loads only what's needed for exploration
├─ Does deep work in isolation
├─ Returns: Concise summary (not full details)
└─ Context discarded after task

Sub-Agent 2: Implementation (fresh context)
├─ Loads only what's needed for coding
├─ Works with clean, focused context
├─ Returns: Summary of changes made
└─ Context discarded after task

Sub-Agent 3: Validation (fresh context)
├─ Loads only what's needed for testing
├─ Validates in isolation
├─ Returns: Pass/fail + key findings
└─ Context discarded after task

Result: Main agent never saturates. Each sub-task gets optimal context.

Just-In-Time Context Loading

Context is loaded when needed, not preemptively:

Wrong:

Main agent:
1. Load entire codebase (60% context)
2. Load all documentation (20% context)
3. Try to implement feature (no room left)

Right:

Main agent:
1. Understand goal (5% context)
2. Spawn exploration sub-agent → loads codebase JIT
3. Receive summary (add 3% to main context)
4. Spawn implementation sub-agent → loads relevant code JIT
5. Receive summary (add 2% to main context)
6. Make decision (10% total context used)

Benefit: Context loaded precisely when needed, discarded when done.

The 40% Safety Threshold

Even with sub-agents, monitor main agent context:

Main ContextStateAction
0-20%OptimalContinue orchestrating
20-40%GoodConsider if summarization needed
40-60%WarningMust delegate or summarize
60%+CriticalStop adding to main context

Rule: Main agent orchestrates, sub-agents execute. Main context stays lean.


Why This Works

1. Isolation Prevents Context Pollution

Problem with monolithic agents:

  • Main agent loads codebase for exploration (40% context)
  • Keeps that context while planning (adds 20%, now 60%)
  • Tries to implement with saturated context (no room, collapse)

Solution with sub-agents:

  • Main agent spawns exploration sub-agent
  • Exploration happens in clean context
  • Summary returns to main (adds 5%, stays lean)
  • Next sub-agent starts fresh

Result: Context never accumulates. Each task gets optimal workspace.

2. Cognitive Load Theory: Separation of Concerns

Sweller (1988): Working memory has limited capacity

  • Orchestration requires different cognitive load than execution
  • Mixing high-level planning with low-level details causes overload
  • Separation improves both orchestration and execution quality

Applied to agents:

  • Main agent: High-level reasoning, task routing, orchestration
  • Sub-agents: Deep focus on specific tasks, loaded context
  • Each operates in optimal cognitive zone

3. Empirical Evidence: Sub-Agent Success Rates

Validation across 200+ sessions:

  • Monolithic agents (single context): 35% success rate on complex tasks
  • Sub-agent delegation: 98% success rate on same tasks
  • Main agent context utilization: 10-20% (vs 60-80% monolithic)

The pattern is clear: Delegation to sub-agents with JIT loading dramatically improves reliability.

4. Sub-Agent Results Compress Knowledge

Key insight: Sub-agents return summaries, not raw data.

Example:

Task: "Find all API endpoints in the codebase"

Sub-agent loads:
- 50 files (40K tokens)
- Analyzes patterns
- Identifies endpoints

Sub-agent returns:
- Concise list: 15 endpoints (2K tokens)

Main agent receives:
- 95% compression
- Actionable summary
- Clean context preserved

Result: Main agent gets value without bloat.


What This Factor Enforces

This factor is the operational expression of:

Law 2: Improve System

Context Loading enforces Law 2 by making the 40% rule a system-level improvement that prevents context collapse. Staying under 40% context utilization isn't a suggestion—it's an architectural constraint that improves reliability. The enforcement mechanisms are:

Enforcement mechanisms:

  1. Sub-agent delegation mandatory for deep work: Main agents that exceed 20% context must delegate to sub-agents (architectural enforcement)
  2. Context monitoring with automatic termination: Agents monitor their context utilization; execution terminates if exceeding 40% (runtime enforcement)
  3. Compressed summaries required from sub-agents: Sub-agents return summaries (2K-5K tokens), never raw data (40K+ tokens) (interface contract enforcement)

Example:

class MainAgent:
    def execute_complex_task(self, task):
        # Current context: 15%

        # Would loading codebase exceed 40%?
        if self.would_exceed_threshold(task, threshold=0.40):
            # Delegate to sub-agent (enforcement)
            summary = self.delegate_to_subagent("explore_codebase", task)
            # Main context: 15% → 18% (added summary)
        else:
            # Handle in main context (safe)
            summary = self.execute_directly(task)

        return summary

Law 4: Validate Before Execute

Context Loading enforces Law 4 by requiring validation that context stays lean before proceeding. Before executing any task, the system validates that adding this work won't exceed the 40% threshold. If it would, execution is blocked until context is reset or compressed.

Enforcement mechanisms:

  1. Pre-execution context checks: Before any sub-task, validate projected context utilization < 40%
  2. Mandatory context reset between phases: Each workflow phase (Research → Plan → Implement) starts with fresh context
  3. Bundle compression between sessions: Multi-day work requires context bundles (Factor VI) to stay under 40% across sessions

Example:

class ContextValidator:
    def validate_before_execute(self, agent, task):
        current = agent.context_utilization()  # e.g., 25%
        projected = self.estimate_task_context(task)  # e.g., 20%
        total = current + projected  # 45%

        if total > 0.40:
            raise ValidationError(
                f"Cannot execute: would exceed 40% threshold ({total:.1%})"
                "Options: 1) Delegate to sub-agent, 2) Reset context, 3) Compress with bundle"
            )

Why this enforcement matters:

Without JIT loading: Main agent loads everything (60-80% context) → context collapse → 35% success rate With JIT loading: Main agent orchestrates (10-20% context), sub-agents execute → 98% success rate

The 40% rule enforced via JIT loading is the difference between reliable and unreliable agent systems.


Implementation

Pattern 1: Delegate to Sub-Agents for Deep Work

Main agent orchestrates, sub-agents execute with JIT context:

Example: Codebase exploration

# Main agent (10% context)
task = "Find all authentication methods in the codebase"

# Spawn sub-agent with specific mission
result = spawn_sub_agent(
    task="Explore codebase for authentication",
    context_to_load=[
        "src/**/*.py",  # Load only relevant files
        "docs/auth.md"  # Load only relevant docs
    ],
    return_format="concise_summary"
)

# Sub-agent context (isolated):
# - Loads src/**/*.py files JIT
# - Analyzes authentication patterns
# - Works in clean, focused context
# - Returns: Summary (2K tokens)

# Main agent receives:
summary = result  # "Found 3 auth methods: JWT, OAuth2, API keys..."
# Main context: 10% → 13% (added summary, not all files)

Benefit: Main agent never saw 40K tokens of code, only 2K summary.

Pattern 2: Chain Sub-Agents, Not Context

Wrong approach:

# Main agent tries to do everything
context = load_codebase()        # 40% context
context += analyze_patterns()    # → 60% context
context += generate_plan()       # → 80% context (collapse)

Right approach:

# Main agent: Orchestration only
exploration_result = spawn_sub_agent("explore codebase")
# Main: 5% → 8% (added summary)

analysis_result = spawn_sub_agent(
    "analyze patterns",
    context=[exploration_result]  # Sub-agent loads summary
)
# Main: 8% → 11% (added analysis summary)

plan = spawn_sub_agent(
    "generate implementation plan",
    context=[analysis_result]  # Sub-agent loads analysis
)
# Main: 11% → 15% (added plan summary)
# Main agent stays lean, sub-agents do heavy lifting

Pattern 3: Sub-Agent Report-Back Protocol

Design sub-agents to return concise summaries:

def exploration_sub_agent(task):
    # Sub-agent context (isolated, can be 40-60% full)
    files = load_all_relevant_files()  # 50% of sub-agent context
    patterns = analyze_architecture()   # Deep analysis
    findings = extract_key_insights()

    # Return SUMMARY, not raw data
    return {
        "summary": "Found MVC architecture with 15 models...",
        "key_files": ["app/models/user.py", "app/controllers/auth.py"],
        "recommendations": ["Consider adding input validation..."],
        "token_count": 2000  # Compressed from 40K
    }

# Main agent receives compressed knowledge
result = exploration_sub_agent("analyze codebase")
# Main context += 2K tokens, not 40K

Pattern 4: Monitor Main Agent Context

Keep main agent lean:

def should_delegate_to_subagent(task, current_context_pct):
    """Decide if task should go to sub-agent"""

    task_complexity = estimate_context_needed(task)
    projected_utilization = current_context_pct + task_complexity

    if projected_utilization > 40:
        # Would exceed safe threshold → delegate
        return True

    if task_complexity > 20:
        # Complex task → delegate even if we have room
        return True

    # Simple task, we have room → do it in main context
    return False

# Example usage
if should_delegate_to_subagent("explore codebase", current=15):
    result = spawn_sub_agent("explore codebase")
else:
    result = do_in_main_context()

Pattern 5: Sub-Agent Implementation Examples

In Claude Code (Task tool):

# Main agent spawns specialized sub-agent
task_result = task_tool.invoke(
    subagent_type="Explore",
    prompt="Find all API endpoints in the codebase",
    model="sonnet"  # Fast model for deep work
)

# Task tool:
# - Launches sub-agent with fresh context
# - Loads codebase JIT
# - Returns concise summary
# - Context discarded

# Main agent receives:
endpoints = task_result  # Summary only

In custom agent frameworks:

from agentops import SubAgent

# Main agent
class OrchestratorAgent:
    def execute_task(self, goal):
        # Spawn sub-agent for exploration
        explorer = SubAgent(
            task="codebase_exploration",
            context_limit=100000,  # Sub has own limit
            jit_loading=True
        )

        exploration_summary = explorer.run()
        # exploration_summary is compressed (5:1 ratio)

        # Spawn sub-agent for implementation
        implementer = SubAgent(
            task="implement_feature",
            context=[exploration_summary],  # Loads summary
            context_limit=100000
        )

        implementation_result = implementer.run()

        # Main agent context stayed clean
        return implementation_result

Validation

✅ You're doing this right if:

  • Main agent context stays under 20% most of the time
  • Deep work delegated to sub-agents with isolated context
  • Sub-agents return concise summaries, not raw data
  • Main agent orchestrates without loading heavy context
  • Can trace task delegation: main → sub → sub → main
  • Context never accumulates across sub-tasks

❌ You're doing this wrong if:

  • Main agent loads entire codebase to "have it available"
  • Main agent does deep analysis work in its own context
  • Sub-agents return full file contents instead of summaries
  • Main agent context grows throughout the session
  • Single agent handles all work without delegation
  • Context utilization regularly exceeds 40%

Real-World Evidence

Sub-Agent Architecture Discovery

What changed when we switched to sub-agent delegation:

Before (monolithic main agent):
- Main agent context: 60-80% utilization
- Success rate on complex tasks: 35%
- Average time per task: 45 minutes
- Frequent context collapse

After (sub-agent delegation):
- Main agent context: 10-20% utilization
- Success rate on complex tasks: 98%
- Average time per task: 12 minutes
- Zero context collapses

Analysis: Isolation and JIT loading are game-changers.

200+ Session Validation

Tracked across production sessions:

MetricMonolithicSub-AgentImprovement
Success rate35%98%2.8x
Avg time/task45 min12 min3.75x
Context collapse rate65%0%Eliminated
Main agent context60-80%10-20%4x cleaner

Key insight: Sub-agents with isolated context windows are fundamentally more reliable.

Specific Example: Codebase Refactoring

Before sub-agent delegation:

Main agent approach:
1. Load entire codebase (45% context)
2. Analyze patterns (context → 65%)
3. Attempt to generate plan (context → 85%, collapse)
4. Lost track of high-level goals
5. Generated plan mixing old and new patterns
Result: Failed, had to restart
Time: 1 hour wasted

After sub-agent delegation:

Main agent orchestration (15% context):
1. Spawn exploration sub-agent → returns architecture summary (3% added)
2. Spawn analysis sub-agent → returns pattern insights (4% added)
3. Spawn planning sub-agent → returns implementation plan (3% added)
4. Main agent: Reviews plan, makes decision
Result: Success, clean plan generated
Time: 15 minutes total
Main context: Never exceeded 25%

Improvement: 4x faster, 100% success vs. failure


Anti-Patterns

❌ The "Monolithic Main Agent" Trap

Wrong: Main agent loads and analyzes everything in its own context Right: Main agent orchestrates, delegates deep work to sub-agents

Example:

# Wrong: Main agent does everything
main_context += load_codebase()      # 40%
main_context += analyze_patterns()   # → 65%
main_context += generate_plan()      # → 85% (collapse)

# Right: Main agent orchestrates
summary1 = spawn_subagent("explore")  # Main: +3%
summary2 = spawn_subagent("analyze")  # Main: +4%
plan = spawn_subagent("plan")         # Main: +3% (total: 10%)

❌ The "Sub-Agent Returns Everything" Trap

Wrong: Sub-agent returns full file contents or raw analysis Right: Sub-agent returns concise summary of findings

Example:

# Wrong: Sub-agent dumps everything
def explore_subagent():
    files = read_all_files()
    return files  # 40K tokens dumped to main agent

# Right: Sub-agent compresses knowledge
def explore_subagent():
    files = read_all_files()
    summary = extract_key_insights(files)
    return summary  # 2K tokens (95% compression)

❌ The "Do It All in Main Context" Trap

Wrong: "I have room in main context, let me just do this task here" Right: Delegate to sub-agent even if main has capacity

Rationale:

  • Keeps main agent focused on orchestration
  • Prevents gradual context bloat
  • Maintains clear separation of concerns
  • Sub-agent can use full capacity if needed

Relationship to Other Factors

  • Factor I: Automated Tracking: Git provides canonical source for sub-agents to load context JIT
  • Factor III: Focused Agents: Each sub-agent has single responsibility = minimal context
  • Factor IV: Continuous Validation: Sub-agents validate in isolation without polluting main context
  • Factor VII: Resume Work: Sub-agent summaries can be bundled for long-running sessions
  • Factor VIII: Smart Routing: Main agent routes tasks to appropriate sub-agents

The Math Behind 40%

Why Not 50%? Why Not 60%?

Empirical observation across systems:

SystemDegradation ThresholdEvidence
ADHD hyperfocus~40% of attention capacityClinical research (Barkley, 2015)
LLM context windows~40% of token limitProduction validation (200+ sessions)
Human working memory~40% of capacity (3 items / 7±2)Cognitive load theory (Sweller, 1988)
Database query performance~40% of index sizeDB optimization guides

Pattern recognition: 40% is a natural threshold across cognitive and computational systems.

The Safety Margin

  • 40% utilization = 60% headroom
  • Allows for:
    • Unexpected context needs
    • Error recovery space
    • Performance buffer
    • Graceful degradation

It's not arbitrary—it's empirically validated.


Deep Dive: Own Your Context Window

Note: This section adapted from 12-Factor Agents by HumanLayer - used with attribution. See their excellent Factor 3: Own Your Context Window for the original content.

Everything is Context Engineering

You don't necessarily need to use standard message-based formats for conveying context to an LLM.

At any given point, your input to an LLM in an agent is "here's what's happened so far, what's the next step"

Everything is context engineering. LLMs are stateless functions that turn inputs into outputs. To get the best outputs, you need to give them the best inputs.

Creating great context means:

  • The prompt and instructions you give to the model
  • Any documents or external data you retrieve (e.g. RAG)
  • Any past state, tool calls, results, or other history
  • Any past messages or events from related but separate histories/conversations (Memory)
  • Instructions about what sorts of structured data to output

Standard vs Custom Context Formats

Most LLM clients use a standard message-based format like this:

[
  { 'role': 'system', 'content': 'You are a helpful assistant...' },
  { 'role': 'user', 'content': 'Can you deploy the backend?' },
  {
    'role': 'assistant',
    'content': null,
    'tool_calls': [{ 'id': '1', 'name': 'list_git_tags', 'arguments': '{}' }],
  },
  {
    'role': 'tool',
    'name': 'list_git_tags',
    'content': '{"tags": [{"name": "v1.2.3", "commit": "abc123", "date": "2024-03-15T10:00:00Z"}, {"name": "v1.2.2", "commit": "def456", "date": "2024-03-14T15:30:00Z"}, {"name": "v1.2.1", "commit": "abe033d", "date": "2024-03-13T09:15:00Z"}]}',
    'tool_call_id': '1',
  },
]

While this works great for most use cases, if you want to really get THE MOST out of today's LLMs, you need to get your context into the LLM in the most token- and attention-efficient way you can.

As an alternative to the standard message-based format, you can build your own context format that's optimized for your use case. For example, you can use custom objects and pack/spread them into one or more user, system, assistant, or tool messages as makes sense.

Here's an example of putting the whole context window into a single user message:

[
  {
    "role": "system",
    "content": "You are a helpful assistant..."
  },
  {
    "role": "user",
    "content": |
      Here's everything that happened so far:

      <slack_message>
        From: @alex
        Channel: #deployments
        Text: Can you deploy the backend?
      </slack_message>

      <list_git_tags>
        intent: "list_git_tags"
      </list_git_tags>

      <list_git_tags_result>
        tags:
          - name: "v1.2.3"
            commit: "abc123"
            date: "2024-03-15T10:00:00Z"
          - name: "v1.2.2"
            commit: "def456"
            date: "2024-03-14T15:30:00Z"
          - name: "v1.2.1"
            commit: "ghi789"
            date: "2024-03-13T09:15:00Z"
      </list_git_tags_result>

      what's the next step?
  }
]

The model may infer that you're asking it what's the next step by the tool schemas you supply, but it never hurts to roll it into your prompt template.

Code Example: Custom Context Builder

We can build this with something like:

class Thread:
  events: List[Event]

class Event:
  # could just use string, or could be explicit - up to you
  type: Literal[
    "list_git_tags", "deploy_backend", "deploy_frontend",
    "request_more_information", "done_for_now",
    "list_git_tags_result", "deploy_backend_result",
    "deploy_frontend_result", "request_more_information_result",
    "done_for_now_result", "error"
  ]
  data: (ListGitTags | DeployBackend | DeployFrontend |
         RequestMoreInformation | ListGitTagsResult |
         DeployBackendResult | DeployFrontendResult |
         RequestMoreInformationResult | string)

def event_to_prompt(event: Event) -> str:
    data = event.data if isinstance(event.data, str) \
           else stringifyToYaml(event.data)

    return f"<{event.type}>\n{data}\n</{event.type}>"

def thread_to_prompt(thread: Thread) -> str:
  return '\n\n'.join(event_to_prompt(event) for event in thread.events)

Example Context Windows

Here's how context windows might look with this approach:

Initial Slack Request:

<slack_message>
  From: @alex
  Channel: #deployments
  Text: Can you deploy the latest backend to production?
</slack_message>

After Listing Git Tags:

<slack_message>
  From: @alex
  Channel: #deployments
  Text: Can you deploy the latest backend to production?
  Thread: []
</slack_message>

<list_git_tags>
  intent: "list_git_tags"
</list_git_tags>

<list_git_tags_result>
  tags:
    - name: "v1.2.3"
      commit: "abc123"
      date: "2024-03-15T10:00:00Z"
    - name: "v1.2.2"
      commit: "def456"
      date: "2024-03-14T15:30:00Z"
    - name: "v1.2.1"
      commit: "ghi789"
      date: "2024-03-13T09:15:00Z"
</list_git_tags_result>

After Error and Recovery:

<slack_message>
  From: @alex
  Channel: #deployments
  Text: Can you deploy the latest backend to production?
  Thread: []
</slack_message>

<deploy_backend>
  intent: "deploy_backend"
  tag: "v1.2.3"
  environment: "production"
</deploy_backend>

<error>
  error running deploy_backend: Failed to connect to deployment service
</error>

<request_more_information>
  intent: "request_more_information_from_human"
  question: "I had trouble connecting to the deployment service, can you provide more details and/or check on the status of the service?"
</request_more_information>

<human_response>
  data:
    response: "I'm not sure what's going on, can you check on the status of the latest workflow?"
</human_response>

From here your next step might be:

nextStep = await determine_next_step(thread_to_prompt(thread))
{
  "intent": "get_workflow_status",
  "workflow_name": "tag_push_prod.yaml",
}

Key Benefits of Owning Your Context Window

The XML-style format is just one example - the point is you can build your own format that makes sense for your application. You'll get better quality if you have the flexibility to experiment with different context structures and what you store vs. what you pass to the LLM.

Key benefits:

  1. Information Density: Structure information in ways that maximize the LLM's understanding
  2. Error Handling: Include error information in a format that helps the LLM recover. Consider hiding errors and failed calls from context window once they are resolved.
  3. Safety: Control what information gets passed to the LLM, filtering out sensitive data
  4. Flexibility: Adapt the format as you learn what works best for your use case
  5. Token Efficiency: Optimize context format for token efficiency and LLM understanding

Remember: The context window is your primary interface with the LLM. Taking control of how you structure and present information can dramatically improve your agent's performance.

Industry Recognition: Context Engineering

About 2 months after 12-Factor Agents was published, context engineering started to become a pretty popular term.

From industry leaders:

Recurring theme: I don't know what's the best approach, but I know you want the flexibility to be able to try EVERYTHING.


Next Steps

  1. Measure current utilization in your agent workflows
  2. Identify phases where you can reset context
  3. Implement JIT loading instead of preloading
  4. Create context bundles for phase continuity
  5. Track metrics to validate improvement
  6. Experiment with custom context formats (see deep dive above)

Further Reading


Remember: Context is not infinite. Treat it like the precious, finite resource it is. The 40% rule isn't a limitation—it's liberation from context collapse.