i replaced a model’s reasoning with gibberish
the standard story these days: let the model “think” before it answers and it does better. you give it a scratchpad, it writes out a long chain of reasoning, and accuracy goes up. s1, snell et al., and the whole o1 narrative are built on this idea of spending more compute at test time by thinking longer. while running a budget-forcing study i kept wondering about something simple. when thinking helps, is it because of what the model writes in the scratchpad? or is it just because the model spent a few thousand extra tokens before answering (the extra forward passes, the longer context), regardless of what those tokens say? ...
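
to make the question concrete, here's a minimal sketch of the kind of control it implies: keep the scratchpad the same length but swap its content for nonsense. this is an illustration, not the actual setup from the study. `gibberish_like` and `NONSENSE` are made-up names, and treating whitespace-separated words as tokens is an assumption; a real run would match length under the model's own tokenizer.

```python
import random

# hypothetical filler vocabulary; any meaningless strings would do
NONSENSE = ["blorp", "zim", "quax", "fnord", "wug", "snib"]


def gibberish_like(reasoning: str, seed: int = 0) -> str:
    """return nonsense with the same word count as `reasoning`.

    assumption: whitespace-separated words approximate tokens.
    """
    rng = random.Random(seed)  # seeded so the filler is reproducible
    n_words = len(reasoning.split())
    return " ".join(rng.choice(NONSENSE) for _ in range(n_words))


reasoning = "first add 17 and 25 to get 42, then double it to get 84"
filler = gibberish_like(reasoning)

# same length budget, none of the content
assert len(filler.split()) == len(reasoning.split())
```

the comparison would then be accuracy with the real reasoning in the scratchpad versus accuracy with `filler` in its place. if the two came out close, that would point toward token count; a large gap would point toward content.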