paper reading

Agent Learning via Early Experience (arxiv.org/pdf/2510.08558)

This is interesting. I only skimmed it, but the main idea is to help language models get better at world modeling with two methods:

Implicit World Modeling

I like this general idea of the model being able to learn from $(s, a, s')$ tuples collected from its own experience.

\[L_{IWM} = - \sum_{(s_i, a_j^i, s_j^i) \in D_{rollout}} \log p_{\theta} (s_j^i | s_i, a_j^i),\]

I just don’t like the fact that they still frame it as a next-token prediction problem, but of course this is an obvious limitation of LLMs.

This helps align the model’s internal representation with the actual dynamics of the environment, but it’s still lossy due to the brittle next-token-prediction objective, which, again, is a fundamental issue with LLMs.
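
Just to make the objective concrete for myself, here’s a minimal sketch of how I’d implement $L_{IWM}$, assuming states and actions are serialized to text and scored with a standard HuggingFace-style causal-LM loss masked to the next-state tokens. The names (`lm`, `tokenizer`, `rollouts`) and the prompt format are mine, not the paper’s.

```python
# Minimal sketch (not the paper's code): implicit world modeling as
# next-token prediction over the next-state tokens only.
import torch

def iwm_loss(lm, tokenizer, rollouts, device="cpu"):
    """L_IWM = -sum_{(s, a, s')} log p_theta(s' | s, a), averaged over tuples."""
    losses = []
    for state, action, next_state in rollouts:  # (s, a, s') from the agent's own rollouts
        prompt = f"State: {state}\nAction: {action}\nNext state:"
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        target_ids = tokenizer(
            " " + next_state, add_special_tokens=False, return_tensors="pt"
        ).input_ids.to(device)

        input_ids = torch.cat([prompt_ids, target_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # only score the s' tokens

        losses.append(lm(input_ids=input_ids, labels=labels).loss)
    return torch.stack(losses).mean()
```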

-# What if we directly optimize for latent space matching in LLMs somehow??

Self-Reflection

Given an expert dataset, get to the next state using the current expert action, then sample alternative actions and generate a chain of thought to reflect on which action would have been better.

“In practice, we mix the self-reflection data $D_{refl}$ with the expert dataset $D_{expert}$ and train the model using a standard next-token prediction loss.”
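
My rough mental model of the data pipeline, written out as a sketch. `propose_actions`, `env.step`, and `generate_reflection` are placeholder helpers I made up; the paper’s actual prompting and data format will differ.

```python
# Sketch of building the self-reflection dataset D_refl (placeholder helpers, not the paper's API).
def build_reflection_data(env, lm, expert_trajectories, n_alt=3):
    d_refl = []
    for trajectory in expert_trajectories:
        for state, expert_action in trajectory:
            # Sample alternative actions and observe where each one leads.
            alt_actions = propose_actions(lm, state, n=n_alt)
            alt_outcomes = [(a, env.step(state, a)) for a in alt_actions]
            expert_outcome = env.step(state, expert_action)

            # Chain of thought comparing the outcomes and explaining why the
            # expert action would have been better.
            reflection = generate_reflection(
                lm, state, expert_action, expert_outcome, alt_outcomes
            )
            # The target is the reflection followed by the expert action, so it
            # trains with the same next-token prediction loss as D_expert.
            d_refl.append({"prompt": state, "target": reflection + "\n" + expert_action})
    return d_refl
```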

Overall, the paper contains promising ideas, but the next-token-prediction framing is a big limitation for these kinds of approaches. I’m just a hater.

NorMuon: Making Muon more efficient and scalable (arxiv.org/pdf/2510.05491)

Applying the per-neuron adaptive learning rate idea from Adam to Muon. It’s a good idea and pretty straightforward. They do show empirical evidence of improvement over Muon, but I’m hesitant to say more; for the most part, I believe optimization research can only be validated through diverse empirical results.
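
For my own notes, here’s roughly how I picture the update for a 2D weight matrix: Muon’s Newton-Schulz orthogonalization of the momentum, followed by a per-neuron (row-wise) second-moment normalization in the spirit of Adam. The placement of the normalization, the scaling, and the hyperparameters are my assumptions, not a transcription of the paper.

```python
# Rough sketch of my reading of NorMuon (details are assumptions, not the paper's exact recipe).
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Muon-style Newton-Schulz iteration approximating the orthogonal factor of G."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def normuon_step(W, grad, momentum_buf, second_moment, lr=0.02,
                 beta1=0.95, beta2=0.95, eps=1e-8):
    """One update for a 2D weight W; second_moment has shape (W.shape[0], 1)."""
    momentum_buf.mul_(beta1).add_(grad)           # Muon-style momentum
    O = newton_schulz(momentum_buf)               # orthogonalized update direction
    # Per-neuron second moment: one scalar per output row, aggregated over fan-in.
    row_sq = O.pow(2).mean(dim=1, keepdim=True)
    second_moment.mul_(beta2).add_(row_sq, alpha=1 - beta2)
    O = O / (second_moment.sqrt() + eps)          # equalize per-neuron update norms
    W.add_(O, alpha=-lr)
```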

https://arxiv.org/pdf/2510.00739

Transformers Learn In-Context by Gradient Descent (arxiv.org/pdf/2212.07677)

Very creative methodology to show that it’s possible for transformers to implement gradient descent in context, along with empirical evidence that transformers trained on a regression task develop weights similar to their constructed gradient-descent weights.

One would certainly expect Transformers, as fast weight programmers, to be able to learn whatever algorithm is necessary to solve the task. Perhaps the quadratic attention mechanism creates an inductive bias towards gradient-descent-like algorithms?
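
The core construction, as I understand it, is easy to sanity-check numerically: one gradient-descent step on in-context linear regression starting from $W = 0$ gives the same query prediction as a softmax-free linear attention head whose keys/queries are the $x_i$ and whose values are the $y_i$. A tiny check with my own naming and shapes:

```python
# Numerical sanity check: one GD step on in-context linear regression vs. a
# linear (softmax-free) attention readout. Naming/shapes are mine.
import numpy as np

rng = np.random.default_rng(0)
N, d, eta = 16, 4, 0.1             # context size, input dim, GD learning rate

X = rng.normal(size=(N, d))        # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                     # in-context targets y_i
x_q = rng.normal(size=d)           # query input

# One GD step on L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2, starting from w = 0:
w_one_step = (eta / N) * (y @ X)   # = eta/N * sum_i y_i x_i
pred_gd = w_one_step @ x_q

# Linear attention: queries/keys are the x's, values are the y's, no softmax.
scores = X @ x_q                   # <x_i, x_q>
pred_attn = (eta / N) * (y @ scores)  # sum_i y_i <x_i, x_q>, same rescaling

assert np.isclose(pred_gd, pred_attn)
print(pred_gd, pred_attn)
```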

I’ve been curious about continual learning in transformers for a while, and I really wonder if we can somehow combine these ideas to enable it. An idea that comes to mind is applying gradient descent to the MLP/QKV weights, updating the slow model weights to approximate the updates that the fast weight programmer would have made to the attention matrices.