In-context learning research
When context sticks
In-context learning is an AI model's ability to adapt during inference. It helps explain modern LLMs' adaptability and is theorized to be their main learning mechanism once training is done.
However, with longer context windows, another question becomes more central. If we give a model one task and then abruptly switch to another, what happens? Once the model has started using one pattern, how hard is it to adapt to the new one? And how does the training curriculum affect the model's adaptability? This writeup tries to answer those questions in a controlled synthetic environment.
We train models on two different regression tasks and then abruptly switch the task the model has to solve at inference time. This lets us measure the impact of misleading context.
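As a concrete sketch, the two task families could be parameterized as below. The scalar inputs and standard-normal coefficient distributions are illustrative assumptions, not the exact experimental settings.

```python
import numpy as np

def sample_task(family: str, rng: np.random.Generator):
    """Draw one random function from a task family.

    Coefficient distributions here are illustrative assumptions.
    """
    if family == "linear":            # linear family: y = w*x + b
        w, b = rng.normal(size=2)
        return lambda x: w * x + b
    if family == "quadratic":         # quadratic family: y = a*x**2 + b*x + c
        a, b, c = rng.normal(size=3)
        return lambda x: a * x**2 + b * x + c
    raise ValueError(f"unknown family: {family}")
```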
Question: How much does earlier context keep influencing the model after the task has changed, and how quickly does performance on the new task recover?
The experiment
At inference time, we build prompts with a sudden task switch. In the linear-to-quadratic case, the prompt starts with linear examples, then switches to quadratic examples, and finally asks for a quadratic prediction. We also test the reverse direction.
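A minimal sketch of how such a switch prompt could be assembled, reusing sample_task from above; the uniform input range and the single query point are assumptions:

```python
def build_switch_prompt(n_pre: int, n_post: int, rng: np.random.Generator):
    """Build a prompt that switches tasks mid-context: n_pre linear
    (x, y) examples, then n_post quadratic examples, then one query
    point whose quadratic label is the prediction target."""
    f_lin = sample_task("linear", rng)
    f_quad = sample_task("quadratic", rng)
    xs = rng.uniform(-1.0, 1.0, size=n_pre + n_post + 1)
    ys = np.concatenate([f_lin(xs[:n_pre]), f_quad(xs[n_pre:-1])])
    return xs[:-1], ys, xs[-1], f_quad(xs[-1])
```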
The training curricula
We define a curriculum as a rule for choosing which function family generates the training batch at step t. Let F1 be the linear family and F2 the quadratic family. Following common practice, the sequential and mixed curricula start training on the simpler task, as this has been shown to improve performance. All three schedules are sketched in code after the definitions below.
Sequential curriculum
The model sees only the linear family in the first half of training, then only the quadratic family in the second half.
Mixed curriculum
The first half is still linear-only. In the second half, each batch is sampled from a uniform mixture of the two task families, with ξ ~ Unif{1, 2}.
Random curriculum
There is no ordering pressure: at every training step the task family is sampled uniformly, with ξ ~ Unif{1, 2} independently across steps.
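A minimal sketch of the three schedules as a single sampling rule; the helper name and string labels are assumptions:

```python
def family_at_step(t: int, total_steps: int, curriculum: str,
                   rng: np.random.Generator) -> str:
    """Return which function family generates the batch at step t."""
    first_half = t < total_steps // 2
    if curriculum == "sequential":    # linear only, then quadratic only
        return "linear" if first_half else "quadratic"
    if curriculum == "mixed":         # linear, then a uniform mixture
        if first_half:
            return "linear"
        return rng.choice(["linear", "quadratic"])  # ξ ~ Unif{1, 2}
    if curriculum == "random":        # uniform at every step
        return rng.choice(["linear", "quadratic"])
    raise ValueError(f"unknown curriculum: {curriculum}")
```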
The question here is whether the training schedule changes the inner-loop learner that the model implements.
Results
In the linear-to-quadratic sweep, error rises as the number of pre-switch linear examples increases and falls as more post-switch quadratic examples are added. This is the context-stickiness effect.
The recovery curves differ by curriculum. Sequential training recovers fastest, mixed training improves more gradually, and random training recovers the least once the initial linear context is long. Across curricula, the first few quadratic examples help the most; after that, each additional example gives smaller gains.
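To make the sweep concrete, here is how one error surface could be computed, with the trained transformer abstracted behind a model_predict callable; the grids, trial count, and callable interface are assumptions:

```python
def sweep_error(model_predict, n_pre_grid, n_post_grid,
                n_trials=256, seed=0):
    """Mean squared error at the query for each (n_pre, n_post)
    cell; model_predict maps (xs, ys, query_x) -> prediction."""
    rng = np.random.default_rng(seed)
    errs = np.zeros((len(n_pre_grid), len(n_post_grid)))
    for i, n_pre in enumerate(n_pre_grid):
        for j, n_post in enumerate(n_post_grid):
            se = 0.0
            for _ in range(n_trials):
                xs, ys, qx, qy = build_switch_prompt(n_pre, n_post, rng)
                se += (model_predict(xs, ys, qx) - qy) ** 2
            errs[i, j] = se / n_trials
    return errs
```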

Overall 3D error surfaces from the linear-to-quadratic sweep. The axes vary the number of pre-switch linear examples and post-switch quadratic examples, with error shown vertically for each training curriculum.
Limits and next steps
The experiments used small transformers trained from scratch and simple two-dimensional function classes. That makes the setting easy to interpret, but it also means the numbers should not be read as forecasts for large production models.
A good next step would be to compare these curves against closed-form regression baselines, test more task families, and study mitigation strategies that explicitly tell the model when the task has changed.
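For the baseline comparison, one natural reference point is an ordinary least-squares fit on a quadratic basis using only the post-switch examples; the sketch below is under that assumption, not the writeup's implementation:

```python
def quadratic_lstsq_baseline(xs, ys, query_x, n_pre):
    """Closed-form baseline: fit y = a*x**2 + b*x + c by least
    squares to the post-switch points only (needs at least three
    post-switch points for a well-determined fit), then predict
    at the query."""
    x_post, y_post = xs[n_pre:], ys[n_pre:]
    A = np.stack([x_post**2, x_post, np.ones_like(x_post)], axis=1)
    (a, b, c), *_ = np.linalg.lstsq(A, y_post, rcond=None)
    return a * query_x**2 + b * query_x + c
```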