
Factor I: Context Is Everything

Rule: Manage what enters the context window like you manage what enters production.

Every token you load into an agent's context window is a decision. Every file read, every search result, every log line, every instruction appended to the prompt. These decisions compound. Get them wrong and your agent drowns. Get them right and it executes with precision.

Context management is the foundational discipline of operational agent work. It separates agents that ship reliable work from agents that confabulate, repeat themselves, and ignore half of what you tell them.


The Rationale

The Problem: Context Windows Are Not Infinite Memory

LLMs have large context windows now—200K tokens, 500K tokens, millions of tokens on the horizon. This creates a dangerous illusion: that you can dump everything into context and let the model sort it out.

You cannot.

Research on the "lost in the middle" effect (Liu et al., 2023) demonstrates that LLMs attend strongly to the beginning and end of their context window but lose track of information in the middle. Place a critical fact in the middle of a 100K-token context and the model may ignore it entirely, even when that fact is the key to answering the question you just asked.

This isn't a bug. It's how attention mechanisms work. The model has finite computational resources to distribute across the entire context. Early tokens and recent tokens get the most attention. The middle gets less. The longer the context, the more severe this effect becomes.

Observable Symptoms of Context Overload

You know you have a context management problem when you see:

  1. The agent repeats itself. It tells you the same thing multiple times because it can't track what it already said 50K tokens ago.

  2. The agent ignores recent instructions. You tell it to stop doing X, and two responses later it's doing X again. The instruction got lost in the noise.

  3. The agent confabulates details. It invents facts that sound plausible but aren't in the codebase. It's pattern-matching on noise instead of retrieving accurate information.

  4. The agent contradicts its own earlier output. It proposes a solution, then forgets it proposed that solution and suggests something incompatible five messages later.

  5. The agent asks for information it already has. You gave it the schema three prompts ago. Now it's asking for the schema again because it can't find it in the 80K tokens of accumulated transcript.

These are not model failures. These are context management failures.

Cognitive Load Theory Applied to AI

Cognitive load theory (Sweller, 1988) describes how human working memory is limited and how learning is impaired when working memory is overloaded. The principle applies to LLMs as well.

An LLM's "working memory" is its context window. The model can only effectively process a subset of what's in that window at any given time. Load it with 200K tokens of irrelevant code, debug logs, and conversation history, and you impose massive extraneous cognitive load. The model spends its attention budget sifting through noise instead of reasoning about the problem.

The agent that knows less but knows the RIGHT things outperforms the one drowning in context.

The Context Budget Is Finite

Even with a 200K token context window, you have a finite budget. Here's how it gets spent:

  • System prompt and instructions: 2K-5K tokens
  • Tool definitions: 5K-15K tokens (if you have many tools)
  • Conversation history: Grows unbounded if not managed
  • Retrieved context (code, docs, search results): Can easily hit 50K-100K tokens
  • Agent's own output: Accumulates with every response

If you're not actively managing this budget, you'll hit the limit or create a context so diluted that the model can't focus on what matters.
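
One way to make this budget explicit is to track rough token estimates per category as you load material. A minimal sketch, assuming roughly four characters per token; the category names and file paths are hypothetical:

from pathlib import Path

# Sketch of an explicit context budget. The ~4-characters-per-token estimate,
# the category names, and the file paths are assumptions, not exact accounting.
WINDOW = 200_000
TARGET_UTILIZATION = 0.40  # see "The 40% Rule" below

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic for English prose and code

class ContextBudget:
    def __init__(self, window: int = WINDOW) -> None:
        self.window = window
        self.spent: dict[str, int] = {}  # tokens charged per category

    def charge(self, category: str, text: str) -> None:
        self.spent[category] = self.spent.get(category, 0) + estimate_tokens(text)

    def utilization(self) -> float:
        return sum(self.spent.values()) / self.window

budget = ContextBudget()
budget.charge("system_prompt", Path("prompts/system.md").read_text())
budget.charge("retrieved_code", Path("src/auth.py").read_text())
if budget.utilization() > TARGET_UTILIZATION:
    print("Over the 40% target: summarize or start a fresh session before loading more.")

The point is not precise counting. The point is that every charge is a visible, deliberate decision instead of an invisible accumulation.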

Just-In-Time Loading

The solution is just-in-time context loading. Don't load everything the agent might need. Load what it needs NOW, for THIS step.

Example: You're working on a multi-phase refactoring task.

  • Phase 1 (Analysis): Load the target module, its direct dependencies, and test files. Don't load the entire codebase. Don't load the git history. Don't load unrelated modules.

  • Phase 2 (Design): Load the analysis summary from Phase 1. Don't reload the full codebase. The agent doesn't need to re-analyze—it needs to design based on the summary.

  • Phase 3 (Implementation): Load the design document and the specific files being edited. Don't reload the analysis. Don't reload unrelated files.

  • Phase 4 (Validation): Load the implementation summary, the test suite, and the validation checklist. Don't reload the design. Don't reload the codebase.

Each phase gets exactly what it needs, when it needs it. Nothing more.
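
In an orchestration script, this can be as simple as each phase declaring the only files it is allowed to load. A sketch with hypothetical module paths and note filenames:

from pathlib import Path

def read_files(paths: list[str]) -> list[str]:
    return [Path(p).read_text(encoding="utf-8") for p in paths]

# Each phase declares the only context it is allowed to load.
def load_analysis_context() -> list[str]:
    return read_files([
        "src/billing/invoice.py",   # target module
        "src/billing/tax.py",       # direct dependency
        "tests/test_invoice.py",    # its tests
    ])

def load_design_context() -> list[str]:
    # Only the Phase 1 handoff summary: not the codebase, not the transcript.
    return read_files(["notes/analysis_summary.md"])

def load_implementation_context() -> list[str]:
    return read_files(["notes/design.md", "src/billing/invoice.py"])

def load_validation_context() -> list[str]:
    return read_files([
        "notes/implementation_summary.md",
        "tests/test_invoice.py",
        "notes/validation_checklist.md",
    ])

PHASES = {
    "analysis": load_analysis_context,
    "design": load_design_context,
    "implementation": load_implementation_context,
    "validation": load_validation_context,
}

def context_for(phase: str) -> list[str]:
    # Nothing from other phases leaks in.
    return PHASES[phase]()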

Deliberate Compression Between Phases

When you hand off work from one phase to the next—or from one session to the next—don't pass the entire transcript.

Write a summary. This is deliberate compression — you choose what to keep. This is the opposite of automatic compaction, where the tool chooses for you (and gets it wrong).

A good handoff summary is 3-20 lines:

  • What was the goal?
  • What did you do?
  • What decisions did you make?
  • What's the next step?

This is compression. You're distilling 10K tokens of conversation into 200 tokens of essential information. The next agent (or the next session of the same agent) loads the summary, not the full transcript.

This is how you prevent context bloat. This is how you keep agents focused.
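
One way to keep the handoff disciplined is to write it as a small structured record at the end of each phase. A sketch with hypothetical contents, mirroring the four questions above:

from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Handoff:
    goal: str
    done: str
    decisions: list[str] = field(default_factory=list)
    next_step: str = ""

    def render(self) -> str:
        lines = ["HANDOFF:", f"- Goal: {self.goal}", f"- Done: {self.done}"]
        lines += [f"- Decision: {d}" for d in self.decisions]
        lines.append(f"- Next: {self.next_step}")
        return "\n".join(lines)

# ~200 tokens of essential information instead of a 10K-token transcript.
handoff = Handoff(
    goal="Refactor invoice generation to support multiple currencies",
    done="Analyzed src/billing/invoice.py; found 3 call sites that assume USD",
    decisions=["Store amounts as integer cents", "No schema migration needed"],
    next_step="Write the design doc for the currency conversion interface",
)
Path("notes").mkdir(exist_ok=True)
Path("notes/handoff.md").write_text(handoff.render(), encoding="utf-8")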

Real-World Example: The 80K Token Debugging Session

You're debugging a production issue. The agent:

  1. Loads the error logs (5K tokens)
  2. Loads the module where the error occurred (8K tokens)
  3. Loads the test suite (10K tokens)
  4. Loads the git history for that module (15K tokens)
  5. Loads three related modules "just in case" (20K tokens)
  6. Loads the entire conversation history (22K tokens)

Total: 80K tokens.

The agent spends five responses wandering through irrelevant code paths because it can't distinguish signal from noise. It suggests changes that don't match the error. It forgets what you already ruled out.

Now try this:

  1. Load the error log (5K tokens)
  2. Load the failing function and its direct callees (3K tokens)
  3. Load the relevant test (1K tokens)

Total: 9K tokens.

The agent identifies the bug in one response. It proposes a fix. It works.

The difference is focus. The smaller context contained exactly what was needed. The larger context was 90% noise.


What Good Looks Like

1. Explicit Context Budgets

You know what you're loading and why. Before you start a task, you decide:

  • What files does this task require?
  • What conversation history is essential?
  • What can be summarized or discarded?

You treat context like memory allocation. You allocate what you need. You free what you don't.

2. Phase-Based Context Windows

You structure work into phases, and each phase has its own clean context window:

  • Research phase: Load the codebase slices needed for exploration. At the end, write a research summary.
  • Planning phase: Load the research summary (not the full transcript). Write a plan document.
  • Implementation phase: Load the plan (not the research). Load the files being edited. Write an implementation summary.
  • Validation phase: Load the implementation summary and tests. Write a validation report.

Each phase starts fresh with only the essential handoff from the previous phase.

3. Aggressive Summarization

You summarize constantly:

  • After every subtask, write a 3-line summary.
  • After every session, write a handoff note.
  • After every phase, compress the findings into a structured document.

You never pass a 50K token transcript to the next step when you can pass a 500-token summary.

4. Search Before Load

Before loading an entire file, you search. You use grep to find the relevant function. You use git log -S to find the commit that introduced a change. You use AST tools to extract just the class definition.

You load the minimum viable slice of context to answer the current question.
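
For example, instead of handing the agent a whole module, you can extract a single function with Python's ast module. A sketch; the file path and function name are placeholders:

import ast
from pathlib import Path

def extract_function(path: str, name: str) -> str | None:
    """Return the source of one function instead of the whole file."""
    source = Path(path).read_text(encoding="utf-8")
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return ast.get_source_segment(source, node)
    return None

# Load ~30 lines of context instead of a 2,000-line module.
snippet = extract_function("src/auth/handler.py", "refresh_token")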

5. Ephemeral Contexts for Throwaway Work

For one-off tasks (answering a quick question, generating a snippet, checking a fact), you spawn a fresh session with a minimal context. You don't pollute your main session's context with throwaway work.

This is the agent equivalent of using a scratch buffer.

6. Monitoring Context Utilization

You track how much context you're using:

  • How many tokens in the current window?
  • How much is conversation history vs. retrieved content?
  • Is the model still attending to early instructions, or are they lost in the middle?

You adjust based on observed behavior. If the agent starts repeating itself or ignoring instructions, you know you've exceeded the effective context limit.


Without Tooling

You don't need sophisticated context management frameworks to apply this factor. You can do it manually with basic discipline:

1. Keep Sessions Short

Don't run a single agent session for six hours. After 30-60 minutes, or after completing a discrete subtask, wrap up. Write a 3-line summary. Start a fresh session for the next task.

This prevents unbounded transcript growth.

2. Start Fresh for New Tasks

When you switch from Task A to Task B, start a new session. Don't carry forward the entire context from Task A. If Task B needs information from Task A, write a handoff summary and paste it into the new session.

This keeps context focused on one problem at a time.

3. Manually Write Handoff Summaries

When you finish a phase or a session, write:

HANDOFF:
- Goal: [what you were trying to do]
- Done: [what you accomplished]
- Decisions: [key choices made]
- Next: [what comes next]

This takes 60 seconds. It saves you from re-reading 10K tokens of transcript in the next session.

4. Load Files Selectively

Don't tell the agent "read the entire src/ directory." Tell it "read src/auth.py and src/models/user.py."

You manually identify the relevant files before invoking the agent. This is research time well spent.

5. Use Scratch Sessions for Questions

If you need to answer a one-off question (e.g., "what's the signature of this API?"), open a scratch session, ask the question, get the answer, close the session.

Don't pollute your main session's context with every stray question.

6. Regularly Audit Your Context

Every 10-15 minutes, ask yourself:

  • Is the agent still using the information I loaded earlier?
  • Have I loaded things I haven't referenced in the last three exchanges?
  • Is the conversation wandering?

If yes to any of these, prune the context or start fresh.


Anti-Patterns

The Kitchen Sink Session

"Let me just load everything so the agent has all the context it needs."

You open a session and immediately read 40 files, load the full git log, paste in three Slack threads, and dump the entire test suite. The agent now has 150K tokens of context. You ask it to fix a one-line bug. It suggests changes to files that aren't relevant. It proposes a refactoring you didn't ask for. It's lost in the noise.

Fix: Load only the files directly involved in the bug. Add more only when the agent asks for them or when you discover they're needed.

The Eternal Transcript

You've been working with an agent for three hours. The conversation history is 80K tokens. You're still adding tasks to the same session because "it already knows the context."

It doesn't. Not anymore. The early context is buried. The agent is operating on the last 20-30K tokens of working memory. It has forgotten decisions from hour one. It contradicts things it said earlier.

Fix: When you notice drift, wrap up the current task, write a handoff summary, and start fresh. The new session with a 500-token summary will outperform the old session with an 80K-token transcript.

The Copy-Paste Context

You copy-paste a 5,000-word design document, a 3,000-word requirements spec, and a 2,000-word architecture overview into the prompt. "Here's everything you need to know."

The agent reads all 10K tokens but can't distinguish critical requirements from nice-to-have background. It treats everything with equal weight. It misses the three-word constraint buried on page four that makes the entire design feasible.

Fix: Write a focused brief for the specific task. Extract the 10-15 bullet points that actually matter for THIS implementation step. Reference the full documents by path for the agent to read on demand.

The "Just In Case" File Load

"I'll load the database schema, the API routes, the middleware, the models, the tests, the config, and the deployment manifest. The agent might need any of them."

It won't need most of them. But now the agent has 60K tokens of "might need" context competing for attention with the 3K tokens it actually needs.

Fix: Start with the minimum. Let the agent request more context. If it needs the database schema, it'll ask for it (or you'll realize it needs it when the first attempt fails). Reactive loading beats speculative loading.
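
One way to make loading reactive is to start the agent with a minimal context and a file-read tool it can call when it decides it needs more. A sketch of the tool side; the size cap and the initial file list are assumptions, not any particular framework's API:

from pathlib import Path

# Start with the minimum; let the agent request the rest through a tool.
INITIAL_FILES = ["logs/error.log", "src/auth/handler.py"]

def read_file_tool(path: str, max_chars: int = 20_000) -> str:
    """Tool the agent calls when it decides it needs more context.
    Capped so one 'just in case' read can't flood the window."""
    p = Path(path)
    if not p.is_file():
        return f"ERROR: {path} not found"
    text = p.read_text(encoding="utf-8", errors="replace")
    if len(text) > max_chars:
        return text[:max_chars] + "\n[... truncated; request a narrower slice]"
    return text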

The Undifferentiated Dump

You pipe the output of find . -name "*.py" | head -50 into the agent. All 50 file paths with no indication of which ones matter. The agent reads them all dutifully, filling its context with 50 files when only 3 are relevant.

Fix: Curate. Tell the agent which 3 files to read and why. "Read src/auth/handler.py (the failing endpoint), src/auth/middleware.py (the auth check), and tests/test_auth.py (the failing test)."

The Stale Context

You loaded a file at the beginning of the session. Forty messages later, the agent references that file. But you've edited it since then. The agent is reasoning from a stale snapshot without knowing it.

Fix: When you modify files outside the agent's session, explicitly reload them. Or better: work in phases where the agent both reads and writes within the same short session, then hand off to a fresh session for validation.
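
A lightweight guard is to record a content hash when a file enters the context and re-check it before the agent relies on that file again. A sketch; the tracking dictionary is a convention you maintain yourself, not a feature of any particular tool:

import hashlib
from pathlib import Path

loaded_hashes: dict[str, str] = {}  # path -> content hash at load time

def load_and_remember(path: str) -> str:
    data = Path(path).read_bytes()
    loaded_hashes[path] = hashlib.sha256(data).hexdigest()
    return data.decode("utf-8")

def is_stale(path: str) -> bool:
    """True if the file changed since it was loaded into the agent's context."""
    current = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return loaded_hashes.get(path) != current

if is_stale("src/auth/handler.py"):
    print("Reload src/auth/handler.py before the agent reasons about it again.")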

Hitting the Compaction Wall

You've been running a long session. The tool silently summarizes your earlier conversation to make room for new messages. You don't notice — until the agent forgets a critical decision, contradicts its earlier plan, or loses track of files it already edited.

This is automatic context compaction — the tool's emergency measure when conversation history exceeds the context window. It's lossy. It's uncontrolled. You don't choose what gets kept and what gets discarded. The tool's summarizer decides, and it routinely drops the details that matter most: specific file paths, exact error messages, nuanced decisions, partial progress on multi-step work.

Compaction is not compression. Deliberate summarization (writing a handoff summary yourself) preserves what matters because you choose what to keep. Automatic compaction preserves what the summarizer thinks matters — which is often wrong.

The symptoms: Agent contradicts decisions from earlier in the session. Agent re-reads files it already read. Agent loses track of which issues are done vs. pending. Agent forgets the plan and starts improvising.

Fix: Never let a session reach the compaction wall. The 40% rule (described below) exists precisely to prevent this. When you're past 20-30 exchanges, wrap up the current task, write a handoff summary to a file, and start a fresh session. If you're running autonomous workflows (like multi-agent orchestration), have agents write results to files and return only a signal — not their full output into the orchestrator's context.

The rule: If compaction fires, you already failed at context management. Treat compaction like an OOM kill — it means you didn't manage your resources. The goal is to never need it.
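
For the orchestration case, the pattern is: the subagent's full output goes to a file, and only a short signal returns to the orchestrator's context. A sketch, where run_subagent stands in for whatever call actually runs the agent in your setup:

import json
from pathlib import Path
from typing import Callable

def run_research_task(task_id: str, prompt: str,
                      run_subagent: Callable[[str], str]) -> str:
    """Run a subagent, persist its full output, return only a short signal."""
    report = run_subagent(prompt)              # your agent runner goes here
    out_path = Path("artifacts") / f"{task_id}.md"
    out_path.parent.mkdir(exist_ok=True)
    out_path.write_text(report, encoding="utf-8")
    # The orchestrator's context receives ~30 tokens, not the full report.
    return json.dumps({"task": task_id, "status": "done", "artifact": str(out_path)})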


Why This Factor Comes First

Context management is Factor I because it's the foundation. If you get this wrong, nothing else works.

  • You can't plan effectively if the agent is drowning in irrelevant context.
  • You can't implement reliably if the agent forgets its own decisions mid-task.
  • You can't validate thoroughly if the agent can't attend to the checklist you gave it.
  • You can't iterate quickly if every session requires re-loading the entire world.

Every other factor in this framework depends on the agent having a clean, focused context window.

Master context management, and you unlock the agent's ability to reason clearly, execute precisely, and operate autonomously. Ignore it, and you'll spend your time fighting confabulation, repetition, and drift.

Context is everything. Manage it like production.


Key Takeaways

  1. Large context windows are not infinite memory. LLMs suffer from the "lost in the middle" effect. More context is not always better.

  2. Observable symptoms of context overload: repetition, ignored instructions, confabulation, contradictions, asking for already-provided information.

  3. Just-in-time loading: Load what you need for THIS step, not everything you might need for ALL steps.

  4. Deliberate compression between phases: Summarize before handoff. Don't pass full transcripts. Never let automatic compaction decide what to keep — you decide.

  5. The agent that knows less but knows the RIGHT things wins. Focus beats volume.

  6. Without tooling: Keep sessions short, start fresh for new tasks, write handoff summaries, load files selectively, use scratch sessions for one-off questions.

Context management is not a nice-to-have optimization. It's the foundational discipline that determines whether your agent operates or flails.

Treat your context budget like production resources. Allocate deliberately. Monitor actively. Prune aggressively.


The 40% Rule

A practical heuristic: keep your context utilization under 40% of the window size. If your model has 200K tokens, aim to stay under 80K of active context.

Why 40%? Three reasons:

  1. The model needs room to reason. Generated output consumes context too. If you start at 160K tokens in a 200K window, the model has 40K tokens to respond — barely enough for a complex implementation.

  2. Attention quality degrades before the hard limit. The "lost in the middle" effect intensifies as context grows. At 40% utilization, you're in the sweet spot where attention is distributed well. At 80%, the middle is largely ignored.

  3. Buffer for unexpected context. Tool calls return variable amounts of data. A file read might return 500 tokens or 5,000. If you're already at 90% capacity, one unexpected result pushes you over the cliff.

The 40% rule isn't dogma — it's a starting point. Adjust based on your model, your task complexity, and the quality of your results. But if you're regularly seeing the symptoms of context overload, check your utilization first.
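
The arithmetic behind the first reason is easy to check. A short sketch, assuming the 200K-token window from above:

WINDOW = 200_000

def output_headroom(tokens_loaded: int, window: int = WINDOW) -> int:
    """Tokens left for the model's own reasoning and response."""
    return window - tokens_loaded

print(output_headroom(80_000))   # 40% utilization -> 120,000 tokens of headroom
print(output_headroom(160_000))  # 80% utilization -> 40,000 tokens of headroom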

Measuring Context Utilization in Practice

You rarely get an exact token count during a session. Instead, use these proxies (a rough self-check built from them follows the list):

  • Message count: After 20-30 exchanges in a single session, you're likely past the 40% mark. Time to write a handoff summary and start fresh — before the tool compacts for you.
  • Loaded file count: Each source file averages 200-500 tokens. Loading 20 files means 4K-10K tokens just from file reads.
  • Behavioral signals: The symptoms listed above (repetition, contradictions, confabulation) are your real-time meter. When they appear, you've exceeded the effective limit regardless of raw token count.
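
A rough self-check built from these proxies. The per-exchange figure is back-derived from the 20-30 exchange rule of thumb; both numbers are heuristics, not measurements:

def estimated_tokens(exchanges: int, files_loaded: int) -> int:
    # ~3K tokens per exchange is back-derived from "20-30 exchanges ~ the 40%
    # mark of a 200K window"; 350 is the midpoint of the 200-500 range above.
    return exchanges * 3_000 + files_loaded * 350

def time_to_hand_off(exchanges: int, files_loaded: int, symptoms: bool) -> bool:
    # Behavioral symptoms override any estimate.
    return symptoms or estimated_tokens(exchanges, files_loaded) > 80_000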

The Context Is the Product

In traditional software engineering, the code is the product. In agent-driven workflows, the context is the product. The quality of what you put into the context window directly determines the quality of what comes out.

A carefully curated 20K-token context produces better results than a carelessly assembled 200K-token context. Every time.

This is counterintuitive. More information should mean better decisions. But LLMs are not databases — they don't simply look up facts. They attend to information proportionally, and irrelevant information dilutes the attention given to relevant information.

The discipline of context management is the discipline of asking: "What does the agent need to know RIGHT NOW to do THIS task well?" Everything else is noise.


References

  • Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint arXiv:2307.03172.

  • Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257-285.


Next: Factor II: Track Everything in Git