
Context Engineering Is Moving from Manual Packing to Agent Judgment

My working model for context engineering: the industry moved from adding more context, to pricing every token, to letting agents decide what should enter the working set.


I used to think most agent bugs were prompt bugs.

Now I think that is only a small part of the problem.

The more useful question is usually: what did the model see, what did it not see, and what noisy thing did we force into its working memory?

That is the reason the phrase “context engineering” became useful. It is not just a new name for prompt engineering. A prompt is one part of context. The larger job is deciding what goes into the model’s active working set, what stays outside, what gets compressed, and what gets isolated.

The first stage was addition

The first instinct was obvious: a bigger context window means we can put more in.

More documents. More tools. More memory. More examples. More instructions. More chain state.

This was not stupid. Early agent systems really did fail because they lacked information. If the model cannot see the file, the API shape, the user goal, or the previous decision, it will guess.

So builders started to add.

The file system became context. Scratchpads became context. Long todo lists became context. Tool definitions became context. Browser history became context. Every useful intermediate artifact wanted a place inside the prompt.

For a while, this felt like progress.

But it also created a new problem: the model was not only seeing more signal. It was seeing more everything.

The second stage was subtraction

At some point you start paying attention to the input side: what each step reads, not just what it writes.

Agent runs are input-heavy. The model often reads a large amount of state to produce a small action. That means the cost, latency, and failure modes are not only in the final answer. They live in the material you keep feeding into the system.
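To make "input-heavy" concrete, here is a rough sketch of the arithmetic. The token counts are invented for illustration, not measurements from a real run, and Step and input_share are names I made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    input_tokens: int   # everything the model reads at this step
    output_tokens: int  # the action it emits


def input_share(steps: list[Step]) -> float:
    """Fraction of a run's tokens that were read rather than written."""
    total_in = sum(s.input_tokens for s in steps)
    total_out = sum(s.output_tokens for s in steps)
    total = total_in + total_out
    return total_in / total if total else 0.0


# Hypothetical 10-step run: the context starts at 4,000 tokens and grows
# by 2,000 per step, while each step emits a 100-token action.
run = [Step(input_tokens=4_000 + 2_000 * i, output_tokens=100) for i in range(10)]
print(f"input share: {input_share(run):.1%}")  # prints "input share: 99.2%"
```

Under these made-up numbers, almost every token in the run is something the model was asked to read. The exact ratio will vary by system, but the shape is what matters: the bill is mostly input.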

This changed how I think about agent architecture.

The question is no longer “can I include this?”

The question is “does this token earn its attention cost?”

A repeated todo file may help the model remember the goal, but if it keeps rewriting the same thing, it also becomes a tax. A large tool list may expose capability, but it also increases confusion and cache churn. A manager-agent hierarchy may look like a human organization, but LLMs are not people. Sometimes the better abstraction is simply: call a sub-agent like a function, pass the minimum input, get back a structured result.
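Here is a minimal sketch of the "sub-agent as a function" idea. Everything in it is an assumption for illustration: SubTask, SubResult, the Model callable, and the ANSWER/EVIDENCE/RISK line format are hypothetical, and fake_model stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SubTask:
    goal: str
    inputs: tuple[str, ...]  # only the material this task needs


@dataclass(frozen=True)
class SubResult:
    answer: str
    evidence: tuple[str, ...] = ()
    risks: tuple[str, ...] = ()


# Hypothetical model interface: takes a prompt string, returns raw text.
Model = Callable[[str], str]


def parse_result(raw: str) -> SubResult:
    """Keep only the labeled lines; everything else in the transcript is dropped."""
    answer, evidence, risks = "", [], []
    for line in raw.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().upper(), value.strip()
        if key == "ANSWER":
            answer = value
        elif key == "EVIDENCE":
            evidence.append(value)
        elif key == "RISK":
            risks.append(value)
    return SubResult(answer, tuple(evidence), tuple(risks))


def run_subagent(task: SubTask, model: Model) -> SubResult:
    """Run one task in a fresh context and hand back a structured result."""
    prompt = "\n".join([f"Goal: {task.goal}", *task.inputs])
    raw = model(prompt)  # the noisy transcript never leaves this function
    return parse_result(raw)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs on its own."""
    return (
        "thinking out loud...\n"
        "ANSWER: tests pass\n"
        "EVIDENCE: 42 passed, 0 failed\n"
        "RISK: no coverage for retries"
    )


result = run_subagent(SubTask("Run the test suite", ("repo: demo",)), fake_model)
print(result)
```

The caller sees three short fields and nothing else. The sub-agent's working notes stay inside the call, which is the whole point of treating it like a function rather than a colleague.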

This is the subtraction phase.

Not less context for the sake of minimalism. A smaller, higher-signal context, because attention is finite.

Write, select, compress, isolate

The practical map I use is simple.

Write: decide what should become durable state. Some things belong in memory, a database, a file, a run log, or a scratchpad.

Select: choose which pieces are relevant right now. Most stored context should not enter every step.

Compress: reduce material without deleting the thing that matters. Summaries, pruning, and extraction are useful, but only when they preserve the decision-critical parts.

Isolate: move noisy work into a separate context. Let a sub-agent read ten files, run tests, or inspect logs, then return evidence and risk instead of dumping the whole journey into the main thread.
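The four verbs can be sketched as four small functions. This is a toy under loud assumptions: the store is a plain dict, selection is naive word overlap, compression is truncation standing in for a real summarizer, and all names and sample strings are hypothetical.

```python
from typing import Callable

Store = dict[str, str]  # hypothetical durable state: key -> text


def write(store: Store, key: str, text: str) -> None:
    """Write: persist state outside the prompt."""
    store[key] = text


def select(store: Store, query: str, limit: int = 2) -> list[str]:
    """Select: return the stored texts sharing the most words with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.lower().split())), key) for key, text in store.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [store[key] for _, key in ranked[:limit]]


def compress(text: str, max_words: int = 12) -> str:
    """Compress: naive truncation, standing in for a real summarizer."""
    words = text.split()
    return text if len(words) <= max_words else " ".join(words[:max_words]) + " ..."


def isolate(work: Callable[[], list[str]], keep: Callable[[str], bool]) -> list[str]:
    """Isolate: run noisy work elsewhere and return only the lines worth keeping."""
    return [line for line in work() if keep(line)]


def read_logs() -> list[str]:
    """Stand-in for noisy work such as running tests or reading log files."""
    return ["boot ok", "ERROR login timeout after 3 retries", "teardown ok"]


store: Store = {}
write(store, "goal", "Fix the flaky login test in the auth service")
write(store, "style", "Prefer small commits with descriptive messages")
write(store, "decision", "Login retries are capped at three attempts")

relevant = select(store, "why does the login test fail")
findings = isolate(read_logs, keep=lambda line: "ERROR" in line)
context = [compress(text) for text in relevant] + findings
print("\n".join(context))
```

Three items are stored, two are selected, and one log line out of three comes back from the isolated work. The style note never enters this step, even though it stays available for a later one.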

This map is boring. That is why I like it.

It turns “the agent is getting confused” into a set of questions I can debug.

Did we fail to write down the important state?

Did we select the wrong material?

Did compression erase the useful detail?

Did we isolate too little, or isolate so much that key assumptions never reached the main line?

The third stage is decision transfer

The more interesting direction is what happens next.

Manual context engineering can only go so far. If every token decision depends on the engineer, the system becomes brittle. It may work for one model generation and become a burden after the next one.

So the direction I care about is decision transfer.

Can the agent decide when to branch a subtask and fold the result back?

Can it decide which skill or capability to load only when needed?

Can it update its own playbook from execution feedback?

Can it keep the active context small without becoming blind?
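One of these questions, loading a skill only when needed, is easy to sketch. This is a minimal illustration, not a description of any particular framework: SkillRegistry, its methods, and the sample skills are hypothetical, and the agent's decision to load a skill is simulated by a direct call.

```python
from typing import Callable


class SkillRegistry:
    """Keeps one-line summaries in context; loads full instructions on demand."""

    def __init__(self) -> None:
        self._summaries: dict[str, str] = {}
        self._loaders: dict[str, Callable[[], str]] = {}
        self._loaded: dict[str, str] = {}

    def register(self, name: str, summary: str, loader: Callable[[], str]) -> None:
        self._summaries[name] = summary
        self._loaders[name] = loader

    def index(self) -> str:
        """The cheap part that always sits in the prompt."""
        return "\n".join(f"{name}: {text}" for name, text in sorted(self._summaries.items()))

    def load(self, name: str) -> str:
        """Called only when the agent asks for a skill by name."""
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def active_context(self) -> str:
        """Index plus the bodies of skills loaded so far."""
        return "\n\n".join([self.index(), *self._loaded.values()])


registry = SkillRegistry()
registry.register("sql", "Query the analytics database", lambda: "SQL GUIDE: always add a LIMIT clause")
registry.register("pdf", "Extract text from PDF files", lambda: "PDF GUIDE: check for scanned pages first")

before = registry.active_context()
registry.load("sql")  # in a real system, the agent would trigger this
after = registry.active_context()
print(after)
```

Before the load, the context holds two summary lines. After it, the context holds the same two lines plus one guide. The engineer decides what can be loaded; the agent decides when.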

This is the part that feels like a real shift. Context engineering starts as a human craft, but the valuable version may be a system where the agent owns more of the context boundary.

The engineer still designs the rails. But the agent should make more local decisions about what to read, what to ignore, what to summarize, and what to bring back.

My rule of thumb

When I look at an agent architecture now, I ask one rough question:

Will this architecture get stronger when the model gets stronger?

If the answer is yes, the scaffolding is probably thin enough.

If the answer is no, the architecture may be overfit to the weaknesses of today’s models. It might be a clever trick, but it will become a drag.

That is the uncomfortable part of context engineering. A lot of the work is temporary. We build scaffolding because the current model needs it, and we should be willing to remove it when the model no longer does.

So I still care about prompts.

But when an agent fails, my first question is no longer “what sentence should I add?”

It is: what did we make it look at?