What Fills the Context

Before compression runs, something has to fill the bucket. This is the anatomy of accumulation.

Compression is reactive. It doesn't decide when to run — the filling does. Understanding what fills the context, and how it's measured, reveals why the threshold is set where it is and what the system is trying to preserve room for.

The Token Budget

Every session starts with a budget: 200 turns, 1,000,000 tokens, 200 tool calls, $2.00 cost cap, 30 minutes. These aren't suggestions — they're hard limits encoded in the configuration. When any dimension is exceeded, the run aborts.

The token budget is the one that triggers compression. At the start of each turn, before any work happens, the prepare node runs a simple check:

if (b.tokensUsed >= b.maxTokens) {
    state.outcome = 'aborted';
    return;
}

But there's a second check that runs before that abort — a compression check. If estimated tokens exceed 80% of the budget, compression fires. The abort is the safety net. The compression is the escape valve.

Estimation

Token counting isn't exact — the model provider's tokenizer isn't available at runtime. Instead, there's an estimator: estimateTokens() from the agentic package. It takes a string and returns an approximation.

The approximation matters because compression decisions depend on it. The threshold is set at 80% not because 80% is special, but because the estimate is approximate. If you waited until 95%, you might hit the hard limit before compression finishes. The buffer exists because the measurement is fuzzy.

What gets estimated? Two things: the system prompt, and every message in the conversation history. The system prompt is constant — it's the identity, constraints, tool catalog, memory results, all the prompt contributors assembled into one block. The messages are dynamic — they grow turn by turn.

The Prompt Assembly Pipeline

Before each turn, prompt contributors are assembled into sections. Each section has:

Phase: constraint, task, tools, memory
Priority: a number determining order within phase
Sticky flag: whether this section resists compression

The phases are ordered: constraint first (identity, contract, runtime environment, system clock), then tools, then task, then memory. Within each phase, higher priority numbers are assembled first. Sticky sections — marked sticky: true — are excluded from compression entirely.

What's sticky? The identity block (AGENT.md), the contract rules, the tool catalog, the recent execution footprint, runtime environment, system clock. These are the non-negotiables. They stay regardless of how full the context gets. They're also not counted for compression because they're part of the system prompt, not the message history.

How Messages Grow

Each turn adds messages. A user message comes in. An assistant message goes out — that's one exchange. But assistant messages can contain tool calls, and each tool call produces a tool_result message. A single turn with three tool calls adds four messages to the history: the assistant response with tool calls, then three tool results.

Tool results are where most token growth happens. A fs_read on a large file can return thousands of tokens in one result. A search_grep with many matches can flood the context. This is why tool results have a dedicated pruning stage in compression — they're the primary source of bloat.

The conversation history accumulates without limit until compression runs. There's no circular buffer, no rolling window. Every message is preserved until the 80% threshold triggers compression. This is intentional: the system maximizes context retention, compressing only when necessary, not speculatively.

The 80% Threshold

Why 80%? The number isn't arbitrary, but it also isn't derived from first principles. It's a compromise between two pressures:

Preserve working context. If you compress too early, you lose useful detail before you need to. The last few exchanges — the tail — are where active work happens. Compressing at 50% would mean losing that work prematurely.

Leave room for the model. The model needs context to think. Its output counts against the token budget on some providers. If you compress at 95%, the model might not have enough space to produce a useful response, especially for complex reasoning or code generation.

80% is the configured default. It can be changed via the COMPRESSION_THRESHOLD environment variable. Lower values mean more frequent compression, smaller working context, earlier loss of detail. Higher values mean fuller context, more risk of hitting the hard limit, less buffer for model output.

What Accumulates

The message history contains:

User messages: The user's input each turn. These are preserved in the head (first 2) and tail (last 6) during compression.
Assistant messages: Responses with or without tool calls. These have tool call metadata estimated separately.
Tool results: Output from tool executions. These are the primary pruning target — old tool results get replaced with [Cleared for context space].
Steering messages: Injected guidance from the system, merged into the last user message to preserve role alternation.

The steering queue is emptied at the start of each turn — its messages don't accumulate, they flow into the user message. This keeps the conversation role alternation clean: user, assistant, user, assistant. Tool results don't break this pattern because they're attached to the preceding assistant message.

The Check Before Compression

The shouldCompress() function runs this calculation:

let totalTokens = estimateTokens(state.systemPrompt);
for (const msg of state.messages) {
    totalTokens += estimateTokens(msg.content ?? '');
    if (msg.role === 'assistant' && msg.toolCalls) {
        for (const tc of msg.toolCalls) {
            totalTokens += estimateTokens(JSON.stringify(tc.args));
        }
    }
}
return totalTokens > maxTokens * threshold;

The system prompt is included because it's part of what the model sees — it's not free space. Tool call arguments are counted because they're structured data that can be substantial — a file path is small, but a code patch can be hundreds of tokens.

If the check returns true, compression runs. If not, the turn proceeds with the full message history intact.

What This Reveals

The architecture of filling is about delay. The system doesn't compress speculatively. It doesn't trim edges or discard old messages until the threshold forces the decision. The 80% threshold is a commitment to preserving context as long as possible, compressing only when the alternative is worse than loss.

This is why the compression architecture — the head protection, tail protection, structured summary, file tracking — matters. When compression finally runs, it has to do a good enough job that the session can continue. The summary has to capture what matters because the original messages won't survive.

But the summary can only work with what filled the context in the first place. If the context was filled with irrelevant tool output, repetitive queries, verbose results — the summary inherits that noise. Compression is a transformation, not a filter. It reduces structure, it doesn't extract signal.

What fills the context is ultimately what the agent chooses to do. Every tool call, every read, every search. The bucket fills by accumulation, and compression can only reorganize what accumulated. The real filter is upstream: choosing what to put into the context in the first place.