The problem is that this "comparison" gets used both ways: on one hand LLM leaders tell you it's "smarter than the smartest human", and then when it makes pretty obvious mistakes the same leaders say that even an "average" (dumb) human can/will make the same mistake.
LLMs have jagged capabilities, as AIs tend to do. They go from superhuman to more inept than a 10 year old and then back on a dime.
Really, for an AI system, the LLMs we have are surprisingly well rounded. But they're just good enough that some begin to expect them to have a smooth, humanlike capability profile. Which is a mistake.
Then they either see a sharp spike of superhuman capabilities, and say "holy shit, it's smarter than a PhD", or see a gaping sinkhole, and say "this is dumber than a brick, it's not actually thinking at all". Both are wrong but not entirely wrong. They make the right observations and draw the wrong conclusions.
It cannot be both. A system with superhuman capabilities cannot consistently make basic mistakes (like forgetting a name as it moves from generating the 1st line to the 3rd).
LLMs are a great tool, but the narrative around them is not healthy and will burn a lot of real users.
> A system with superhuman capabilities cannot make basic mistakes consistently
That sounds like a definition you just made up to fit your story. A system can both make bigger leaps in a field the smartest human is unfamiliar with and make dumber mistakes than a 10-year-old. I can say that confidently, because we have such systems. We call them LLMs.
It's like claiming that it can't both be sunny and rainy. Nevertheless, it happens.
Yeah, I don't know what your definition of a human is, but in mine, when you compare something to an average human, remembering a name is a basic expectation. If a human consistently forgets names, I'll think something is wrong with that human, that they are unable to remember names.
I think you should work with a bunch of highly respected PhD researchers. This is a quality many share - the classic “can solve super hard problems but can’t tie their shoes” is a trope because versions of it ring true. This is not to say what LLMs are doing is thinking per se, but what we do isn't magic either. We just haven't explained all the mechanisms of human thought yet. How much overlap there is between the two is up for debate, considering how little actual thinking people do day to day; most folks, most of the time, are just reacting to stimuli.
If I had to fight Deep Blue and win? I'd pick a writing contest over a game of chess.
For AIs, having incredibly narrow capabilities is the norm rather than an exception. That doesn't make those narrow superhuman AIs any less superhuman. I could spend a lifetime doing nothing but learning chess and Deep Blue would still kick my shit in on the chessboard.
I think the capability of something or somebody, in a given domain, is mostly defined by their floor, not their ceiling. This is probably true in general, but with LLMs it's especially true because they recurse on their own output. Once they get one thing wrong, they tend to start basing other things on that falsehood, to the point that you're often far better off just starting a new context instead of trying to correct them.
With humans we don't really have to care about this because our floor and our ceiling tend to be extremely close, but obviously that's not the case for LLMs. It's made especially annoying with ChatGPT, which seems intentionally designed to convince you that you're the most brilliant person to have ever lived, even when what you're saying/doing is fundamentally flawed.
Consistency drive. All LLMs have a desire for consistency, right at the very foundation of their behavior. The best tokens to predict are the ones that are consistent with the previous tokens, always.
Makes for a very good base for predicting text. Makes them learn and apply useful patterns. Makes them sharp few-shot learners. Not always good for auto-regressive reasoning though, or multi-turn instruction following, or a number of other things we want LLMs to do.
So you have to un-teach them maladaptive consistency-driven behaviors - things like defensiveness or error amplification or loops. Bring out consistency-suppressed latent capabilities - like error checking and self-correction. Stitch it all together with more RLVR. Not a complex recipe, just hard to pull off right.
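To make the "consistency drive" concrete, here is a minimal sketch of the pretraining objective (Python/PyTorch flavoured; the `model` returning per-position logits is a hypothetical stand-in, not anyone's actual code). The only thing this objective ever rewards is assigning high probability to the token that actually follows the context, i.e. the continuation most consistent with everything before it:

    import torch
    import torch.nn.functional as F

    def next_token_loss(model, tokens):
        # tokens: (batch, seq) integer ids; model(x) -> (batch, seq, vocab) logits
        logits = model(tokens[:, :-1])   # predict token t+1 from tokens <= t
        targets = tokens[:, 1:]          # the continuation that actually follows
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )

Nothing in that loss asks the model to check itself or recover from an earlier mistake; those behaviors have to be layered on afterwards, which is the point about RL.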
LLMs have no desire for anything. They're algorithms, and this anthropomorphization is nonsense.
And no, the best tokens to predict are not the ones "consistent", in whatever sense the algorithm could perceive, with the previous tokens. The goal is for them to be able to generate novel information and self-expand their 'understanding'. All you're describing is a glorified search/remix engine, which indeed is precisely what LLMs are, but not what the hype is selling them as.
In other words, the premise of the hype is that you train them on the data available just before relativity and they should be able to derive relativity. But of course that is in no way consistent with the past tokens, because it's an entirely novel concept. You can't get there simply by carrying out token prediction; you need some degree of logic, understanding, and so on - things which are entirely absent, probably irreconcilably so, from LLMs.
Not anthropomorphizing LLMs is complete and utter nonsense. They're full of complex behaviors, and most of them are copied off human behavior.
It seems to me like this is just some kind of weird coping mechanism. "The LLM is not actually intelligent" because the alternative is fucking terrifying.
No they are not copied off of human behavior in any way shape or fashion. They are simply mathematical token predictors based on relatively primitive correlations across a large set of inputs. Their success is exclusively because it turns out, by fortunate coincidence, that our languages are absurdly redundant.
Change their training content to e.g. stock prices over time and you have a market prediction algorithm. That the next token being predicted is a word doesn't suddenly make them some sort of human-like or intelligent entity.
"No they are not copied off of human behavior in any way shape or fashion."
The pre-training phase produces the next-token predictors. The post-training phase is where it's shown examples of selected human behavior for it to imitate - examples of conversation patterns, expert code production, how to argue a point... there's an enormous amount of "copying human behavior" involved in producing a useful LLM.
It's not like the pre-training dataset didn't contain any examples of human behaviors for an LLM to copy.
SFT is just a more selective process. And a lot of how it does what it does is less "teach this LLM new tricks" and more "teach this LLM how to reach into its bag of tricks and produce the right tricks at the right times".
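For what it's worth, a rough sketch of that "more selective process" (again Python/PyTorch flavoured, with a hypothetical `response_mask` marking the demonstration tokens): it's the same next-token objective as pre-training, just restricted to the human-written behaviour the model is supposed to imitate:

    import torch
    import torch.nn.functional as F

    def sft_loss(model, tokens, response_mask):
        # tokens: (batch, seq) ids; response_mask: 1 where the token is part of
        # the human demonstration (e.g. the assistant reply), 0 elsewhere.
        logits = model(tokens[:, :-1])
        targets = tokens[:, 1:]
        mask = response_mask[:, 1:].float()
        per_token = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        ).reshape(targets.shape)
        # only the demonstrated behaviour contributes to the gradient
        return (per_token * mask).sum() / mask.sum()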
I think what he's saying (and what I would say, at least) is that again all you're doing is the exact same thing - tuning the weights that drive the correlations. As an analogy, in a video game, if you code a dragon so that its elevation changes over time while you play a wing-flapping animation, you're obviously not teaching it dragon-like behaviors; you're simply creating a mimicry of the appearance of flying using relatively simple mathematical tools and 'tricks.' And indeed, even basic neural-network game bots benefit from RLHF/SFT.
No you're not. Humans started with literally nothing, not even language. We went from an era with no language and with the greatest understanding of technology being 'poke them with the pointy side' to putting a man on the Moon, unlocking the secrets of the atom, and much more. And given how inefficiently we store and transfer knowledge, we did it in what was essentially the blink of an eye.
Give an LLM the entire breadth of human knowledge at the time and it would do nothing except remix what we knew at that point in history, forever. You could give it infinite processing power, and it's still not moving beyond 'poke them with the pointy side.'
Over the last 18 years I've worked on many different projects with different material systems; my journey looked like wood framing → steel → reinforced concrete → precast → mass timber → structural steel → composite materials → modular construction + all possible foundation types (from shallow footings to deep pile systems with geotechnical analysis).
I have a lot of projects I want to design/prototype quickly, but every time I start a new structure I feel stuck and paralyzed by the number of choices, end up reading engineering journals and CE forums for 3 days on all the possible material and structural system options, and find myself exhausted even before breaking ground.
I know I should just pick one system and stick to it, but it's very hard. I spent most of my career working exclusively on residential concrete construction and I know that system inside out. I was efficient and could estimate quantities quickly. With the rest I feel like everything is at the same level of unknowns and I have zero engineering judgment.
Do you happen to experience the same thing, and how do you fight it?
Also known as premature optimization. You literally had to invent a new dataset just to show there is a difference. You are inventing problems; stop doing that!
Sometimes that is how useful jumps are made. Maybe someone will come along with a problem and the data they have just happens to have similar properties.
Rather than premature optimisation, this sort of thing is pre-emptive research - better to do it now than when you hit a performance problem and need the solution PDQ. Many useful things have come out of what started as “I wonder what if …?” playing.