And, maybe I'm missing something, but to me it seems obvious that the flat top part of the S curve is going to be somewhere below human ability... because of the training data, as you say. How on earth could we train an LLM to be smarter than us, when 100% of the material we use to teach it how to think is human-style thinking?
Maybe if we do a good job, only a little bit below human ability -- and what an accomplishment that would still be!
But still -- that's a far cry from the ideas espoused in articles like this, where AI is just one or two years away from overtaking us.
The standard way to do this is Reinforcement Learning: we don't teach the model how to do the task; we let it discover the _how_ for itself, grade it only on how well it did, and then reinforce the attempts where it did well. This way the model can reach wildly superhuman performance -- e.g. it's how AlphaGo and AlphaZero were trained.
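To make that concrete, here's a toy sketch of the reinforce-only loop (nothing here is from the article -- the bandit task, the payout numbers, and the plain REINFORCE update are all illustrative). The learner is never shown the right answer; it just gets a grade after each attempt, and only the attempts that scored well get reinforced:

```python
import numpy as np

rng = np.random.default_rng(0)
true_payouts = [0.2, 0.5, 0.9]   # hidden from the learner -- it must discover arm 2
logits = np.zeros(3)             # the "policy": one preference score per action
lr = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(2000):
    probs = softmax(logits)
    action = rng.choice(3, p=probs)                       # try something
    reward = float(rng.random() < true_payouts[action])   # graded only on the outcome
    # REINFORCE: nudge the policy toward actions that were followed by reward.
    # Failed attempts (reward == 0) produce no update at all.
    grad = -probs
    grad[action] += 1.0
    logits += lr * reward * grad

print(softmax(logits))  # probability mass typically ends up piled on the best arm
```

The same loop, scaled up with self-play in place of a fixed bandit, is the shape of what trained AlphaZero -- the grading signal (did you win?) doesn't care whether the moves look like anything a human would play.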