Sorry, rereading my own comment I'd like to clarify.

Perhaps time spent thinking isn't a great metric, but just looking at DeepSeek's logs, for example, its chain of thought for many of these "softer" questions is basically just an aggregated Wikipedia article. It'll touch on one concept, then move on, without critically thinking about it.

However, for coding problems, no matter how hard or simple, you can get it to go around in circles, second-guess itself, overthink. And I think this is kind of a good thing? The thinking at least feels human. But it doesn't even attempt to do any of that for "softer" questions, even with a lot of prompting on my part. The longest I was able to get was about 50 seconds, I believe (time isn't exactly the best metric, and I'd rate the intrinsic quality of the CoT lower anyway). Again, when I brought this up, people suggested that math/logic/programming is just intrinsically harder... I don't buy it at all.

I totally agree that it's harder to train for, though. And yes, they are next-token predictors, we shouldn't be hasty to anthropomorphize, etc. But like... it actually feels like it's thinking when it's coding! It genuinely backtracks and explores the search space somewhat organically. My point is that it won't afford the same luxury to softer questions.