Consistency is one aspect, but it is not enough. I believe (based partly on arguments from neuroscience and discussions with alignment researchers) that two more are necessary: compression, which demonstrates algorithmic development; and linear representation capacity, since that is the only way we really interpret the world, and we will therefore only call another mind intelligent if it can distill knowledge into the same language we understand.
I think compression is probably a natural consequence of coherent self models? Isn't requiring other minds to package their intelligence in human-interpretable linear narratives like requiring dolphins to demonstrate intelligence through written language?
> So the question is, how intelligent are LLMs when you reduce their training data and training? Since they rapidly devolve into nonsense, the answer must be that they have no internal intelligence
This would be the equivalent of removing all of a human's senses from birth and expecting them to somehow learn things. They will not. Therefore humans are not intelligent?
> LLMs dont do that; they just lack understanding.
You have no idea what they are doing. Since they are smaller than the dataset, they must have learned an internal algorithm. This algorithm is drawing patterns from somewhere - those are its internal, incorrect assumptions. It does not operate in the same way that a human does, but it seems ridiculous to say that it lacks intelligence because of that.
It sounds like you've reached a conclusion, that LLMs cannot be intelligent because they have said really weird things before, and are trying to justify it in reverse. Sure, it may not have grasped that particular thing. But are you suggesting you've never met a human who, feigning understanding of a particular topic, says some really weird things akin to an LLM? I'm an educator, and I have heard the strangest things that I just cannot comprehend no matter how much I dig. It really feels like shifting goalposts. We need to do better than that.
In split-brain experiments this is exactly how one half of the brain retroactively justifies the action of the other half. Maybe it is the case in LLMs that an overpowered latent feature sets the overall direction of the "thought" and then inference just has to make the best of it.
Yeah, that's compression. Although your later comments neglect the many years of physical experience that humans have as well as the billions of years of evolution.
And yes, by this definition, LLMs pass with flying colours.
I hate when people bring up this “billions of years of evolution” idea. It’s completely wrong and deluded in my opinion.
Firstly humans have not been evolving for “billions” of years.
Homo sapiens have been around for maybe 300,000 years, and the Homo genus for 2-3 million years. Before that we shared ancestors with chimps, and that split was 6-7 million years ago.
If you want to count the entire development of the brain, i.e. from mouse-like creatures through to apes and then humans, that's about 200M years.
If you want to think about generations, it's only 50-75M generations, i.e. “training loops”.
That’s really not very many.
Also, the bigger point is this: for 99.9999% of that time we had no writing, nor any need for complex abstract thinking.
So our ability to reason about maths, writing, science etc. dates only to the last 2000-2500 years, i.e. only roughly 200 or so generations.
Our brain was not “evolved” to do science, maths etc.
Most of evolution was us running around just killing stuff and eating and having sex. It’s only a tiny tiny amount of time that we’ve been working on maths, science, literature, philosophy.
So actually, these models have had a massive, massive amount more training than humans to do roughly the same thing, while using insane amounts of computing power and energy.
Our brains were evolved for a completely different world, environment, and daily life than the life we lead now.
So yes, LLMs are good, but they have been exposed to more data and training time than any human could be unless we lived for 100,000 years, and they still perform worse than we do on most problems!
Okay, fine, let's remove the evolution part. We still have an incredible amount of our lifetime spent visualising the world and coming to conclusions about the patterns within. Our analogies are often physical and we draw insights from that. To say that humans only draw their information from textbooks is foolhardy; at the very least, you have to agree there is much more.
I realise upon reading the OP's comment again that they may have been referring to "extrapolation", which is hugely problematic from the statistical viewpoint when you actually try to break things down.
My argument for compression asserts that LLMs see a lot of knowledge, but are actually quite small themselves. To output a vast amount of information in such a small space requires a large amount of pattern matching and underlying learned algorithms. I was arguing that humans are actually incredible compressors because we have many years of history in our composition. It's a moot point though, because it is the ratio of output to capacity that matters.
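To make the ratio concrete, here is a back-of-the-envelope sketch with purely illustrative numbers (these are assumptions, not the figures for any particular model or corpus):

    # Back-of-the-envelope output-to-capacity ratio; every number below is an
    # assumption chosen only to illustrate the order of magnitude.
    tokens_seen = 15e12        # assumed pretraining tokens
    bytes_per_token = 4        # rough average for English text
    params = 70e9              # assumed parameter count
    bytes_per_param = 2        # fp16/bf16 storage

    data_bytes = tokens_seen * bytes_per_token   # ~60 TB of raw text
    model_bytes = params * bytes_per_param       # ~140 GB of weights
    print(f"data/model ratio ~ {data_bytes / model_bytes:.0f}x")  # roughly 400x

Even allowing for heavy redundancy in the corpus, the weights cannot be storing the data verbatim; something has to be squeezed into reusable structure, which is the point of the ratio argument.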
They can't learn iterative algorithms if they cannot execute loops. And blurting out an output which we then feed back in does not count as a loop. That's a separate invocation with fresh inputs, as far as the system is concerned.
They can attempt to mimic the results for small instances of the problem, where there are a lot of worked examples in the dataset, but they will never ever be able to generalize and actually give the correct output for arbitrarily sized instances of the problem. Not with current architectures. Some algorithms simply can't be expressed as a fixed-size matrix multiplication.
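As a toy illustration of that fixed-budget point (this is not a model of any architecture, just a contrast between a genuine loop and a fixed number of unrolled steps):

    # A genuinely iterative algorithm: the number of steps depends on the input.
    def collatz_steps(n: int) -> int:
        steps = 0
        while n != 1:
            n = n // 2 if n % 2 == 0 else 3 * n + 1
            steps += 1
        return steps

    # An "unrolled" variant with a fixed step budget, loosely analogous to a
    # fixed number of layers: it matches the loop on small inputs and fails
    # once the required number of steps exceeds the budget.
    def collatz_steps_fixed(n: int, budget: int = 8):
        for steps in range(budget + 1):
            if n == 1:
                return steps
            n = n // 2 if n % 2 == 0 else 3 * n + 1
        return None  # a correct answer exists, but not within the fixed budget

    print(collatz_steps(6), collatz_steps_fixed(6))    # 8 8
    print(collatz_steps(27), collatz_steps_fixed(27))  # 111 None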
>Most of evolution was us running around just killing stuff and eating and having sex.
Tell Boston Dynamics how to do that.
Mice inherited their brains from their ancestors. You might think you don't need a working brain to reason about math, but that's because you don't know how thinking works; it's an argument from ignorance.
People argue that humans have had the equivalent of training a frontier LLM for billions of years.
But training a frontier LLM involves taking multiple petabytes of data, effectively all of recorded human knowledge and experience, every book ever written, every scientific publication ever written, all of known maths, science, encyclopedias, podcasts, etc. And then training on that for millions of years' worth of GPU-core time.
You cannot possibly equate human evolution with LLM training, it's ridiculous.
Our "training" time didn't involve any books, maths, science, reading, 99.9999% of our time was just in the physical world. So you can quite rationally argue that our brains ability to learn without training is radically better and more efficient that the training we do for LLMs.
Us running around in the jungle wasn't training our brain to write poetry or compose music.
No, I think what he said was true. Human brains have something about them that allow for the invention of poetry or music. It wasn't something learned through prior experience and observation because there aren't any poems in the wild. You might argue there's something akin to music, but human music goes far beyond anything in nature.
We have an intrinsic (and strange) reward system for creating new things, and it's totally awesome. LLMs only started to become somewhat useful once researchers tried to tap into that innate reward system and create proxies for it. We definitely have not succeeded in creating a perfect mimicry of that system though, as any alignment researcher would no doubt tell you.
So you're arguing that "running around in the jungle" is equivalent to feeding the entirety of human knowledge in LLM training?
Are you suggesting that somehow there were books in the jungle, or perhaps boardgames? Perhaps there was a computer lab in the jungle?
Were apes learning to conjugate verbs while munching on bananas?
I don't think I'm suggesting anything crazy here... I think people who say LLM training is equivalent to “billions of years of evolution” need to justify that argument far more than I need to justify that running around in the jungle is not equivalent to mass-processing petabytes of highly rich, complex, dense, and VARIED information.
One year of running around in the same patch of jungle, eating the same fruit, killing the same insects, and having sex with the same old group of monkeys isn't going to be equal to training with the super varied, complete, entirety of human knowledge, is it?
If you somehow think it is though, I'd love to hear your reasoning.
There is no equivalency, only contributing factors. One cannot deny that our evolutionary history has contributed to our current capacity, probably in ways that are difficult to perceive unless you're an anthropologist.
Language is one mode of expression, and humans have many. This is another factor that makes humans so effective. To be honest, I would say that physical observation is far more powerful than all the bodies of text, because it is comprehensive and can respond to interaction. But that is merely my opinion.
No-one should be arguing that an LLM training corpus is the same as evolution. But information comes in many forms.
You're comparing the hyper-specific evolution of one individual (an AI system) to the more general evolution of the entire human species (billions of individuals). It's as if you're forgetting how evolution actually works - natural selection - and that when you have hundreds of billions of individuals over thousands of years, even small insights gained from "running around in the jungle" can compound in ways that are hard to conceptualize.
I'm saying that LLM training is not equivalent to billions of years of evolution because LLMs aren't trained using evolutionary algorithms; there will always be fundamental differences. However, it seems reasonable to think that the effect of that "training" might be more or less around the same level.
I'm so confused as to how you think you can cut an endless chain at the mouse.
Were mammals the first thing? No. Earth was a ball of ice for a billion years - all life at that point existed solely around thermal vents at the bottom of the oceans... that's inside of you, too.
Evolution doesn't forget - everything that all life has ever been "taught" (violently programmed into us over incredible timelines), all that has ever been learned in the chain of DNA from the single cell to human beings - it's ALL still there.
I think you might be missing some appropriate context. I agree that it is ridiculous to expect a language model to be good at symbolic manipulation; that is best done with tool use. However, there is a significant line of work dedicated to algorithm discovery for mathematical problems using neural networks. Transformers are used here due to their popularity, but there is also some theoretical analysis suggesting that they are among the most efficient architectures for learning automata. It's still unclear whether this is truly sound though, which is where this kind of research matters.
I'm a bit confused by this; are you referring to vanishing/exploding gradients during training or iteration at inference? If the former, this is only true if you take too many steps. If the latter, we already know this works and scales well.
The latter, and I would disagree that “this works and scales well” in the general sense. It clearly has very finite bounds by the fact we haven’t achieved agi by running an llm in a loop..
The approach of “try a few more things before stopping” is a great strategy akin to taking a few more stabs at RNG. It’s not the same as saying keep trying until you get there - you won’t.
> It clearly has very finite bounds by the fact we haven’t achieved agi by running an llm in a loop..
That's one hell of a criterion. Test-time inference undergoes a similar scaling law to pretraining, and has resulted in dramatically improved performance on many complex tasks. Law of diminishing returns kicks in of course, but this doesn't mean it's ineffective.
> akin to taking a few more stabs at RNG
Assuming I understand you correctly, I disagree. Scaling laws cannot appear with glassy optimisation procedures (essentially iid trials until you succeed, which is the mental model you seem to be implying here). They only appear if the underlying optimisation landscape is globally connected and roughly convex. It's no different from gradient descent in this regard.
But test-time inference leads to better data to train better models that can generate better test-time inference data.
There's an obvious trend going on here, of course we're still just growing these systems and going with whatever works.
It's worked well so far, even if it's more convoluted than elegant...
What puts my mind at ease is that the current state of these AI systems isn't going to go backwards, because the data they generate contributes to the pool of possible knowledge for more advanced systems.
I never made a claim that it's ineffective, just that it's of limited effectiveness. The diminishing returns kick in quickly, and there are more domains where it isn't applicable than where it is.
It's a hyperparameter much like learning rate. If the learning rate is too high, the training process would not work either. Addressing this is just a matter of a grid search.
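A minimal sketch of what I mean, with a hypothetical `run_and_score` standing in for whatever pipeline and validation metric you actually have (the shape of its curve is made up):

    # Grid search over the number of test-time iterations, treated as a
    # hyperparameter. `run_and_score` is a hypothetical stand-in.
    def run_and_score(num_steps: int) -> float:
        # Pretend quality improves with more steps up to a point, then degrades.
        return -abs(num_steps - 20) / 20.0   # made-up optimum around 20 steps

    grid = [1, 2, 4, 8, 16, 32, 64, 128]
    scores = {n: run_and_score(n) for n in grid}
    best = max(scores, key=scores.get)
    print(best, scores[best])   # 16, the grid point closest to the fake optimum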
Biden pardoned those family members because he knew Trump would personally go after them and make an example of them regardless of what they actually did. It shouldn't have happened, but it's pretty clear why it did.
Also, how on Earth are their "crimes" as egregious as those of the people Trump has pardoned in his recent term? Seriously, what did they do? How many people did they hurt? You can't just say "corruption" in an ambiguous sense, because then it would apply to millions of other people as well. And "I don't know, but it's fishy!" doesn't work when we know literally everything about Hunter Biden's life. That was the whole controversy.
The Binance founder directly impacted people in a significant way. The rioters were violent protesters whose actions resulted in loss of life and left others fearing for their lives. It isn't the pardon itself that is bad, or the number, but the message it sends about why they did it. Biden's message was "if you're in the family of the president, you'll get forgiveness for your poor life decisions". Duh, like it or not, that's always been true, and Biden got roasted for it anyway. He's done, his family is out.
Trump's is "if you're rich or you worship me, literal crime is legal".
I mean, did Kim Jong Un not qualify for this role to you? Or is that one actually horrifying and you put up some arbitrary distinction because... Trump?
Does this count as slander or libel at some point? Surely it must if they produce fake material to give a false impression of the subject. Maybe not in this case, but we must be getting there.
It'd have to make a false statement of fact. The problem with that is, according to the article, Schumer actually did say the words, and the video has a disclaimer that it's AI, though the disclaimer could be more visible since it seems to get partially hidden by the video controls. From the article:
> A small disclaimer tucked in the corner acknowledges its artificial origins.
...
> The video has bewildered those who watched it online, given that the quote itself is real and on the record for Punchbowl News. In the original interview, Schumer explained that Democrats had prepared their healthcare-focused shutdown strategy well in advance, adding: “Their whole theory was – threaten us, bamboozle us and we would submit in a day or two.”
The part that's false here, the idea that it's a real video, is undercut by the disclaimer in the corner that it's AI generated. And it's not a misquote of him, so it seems hard to make a defamation case out of this, even if it feels wrong.
Probably not at this point, because he actually said it.
I mean, it still could be - if the video shows him in a compromising position, if it shows him with a sneer that wasn't how he actually said it, or something else that makes it more negative than just the words.
I mean, look, it's still a lie, because it's pretending to be a video of him saying it, and it's not. But it's probably not slander or libel... at this point.
As a mathematician who also regularly publishes in these conferences, I am a little surprised to hear your take; your experience might be slightly different to mine.
Identifying limitations of LLMs in the context of "it's not AGI yet because X" is huge right now; it gets massive funding, taking away from other things like SciML and uncertainty analyses. I will agree that deep learning theory, in the sense of foundational mathematical theory for developing internal understanding (with limited appeal to numerics), is in the roughest state it has ever been in. My first impression there is that the toolbox has essentially run dry and we need something more to advance the field. My second impression is that empirical researchers in LLMs are mostly junior and significantly less critical of their own work and the work of others, but I digress.
I also disagree that we are disincentivised to find meaning behind the word "understanding" in the context of neural networks: if understanding is to build an internal world model, then quite a bit of work is going into that. Empirically, it would appear that they do, almost by necessity.
Maybe, given our different niches, we interact with different people? But I'm uncertain, because I believe what I'm describing is highly visible. I forget: at which NeurIPS(?) were so many people wearing "Scale is all you need" shirts?
> My first impression there is that the toolbox has essentially run dry and we need something more to advance the field
This is my impression too. Empirical evidence is a great tool and useful, especially when there is no strong theory to provide direction, but it is limited.
> My second impression is that empirical researchers in LLMs are mostly junior and significantly less critical of their own work and the work of others
But this is not my impression. I see this from many prominent researchers. Maybe they claim SIAYN in jest, but then they should come out and say so instead of doubling down. If we take them at their word (and I do), robotresearcher is not a junior (please, read their comments; they're illustrative of my experience. I'm just arguing back far more than I would in person). I've also seen audience members at talks ask questions like mine ("are benchmarks sufficient to make such claims?") and get responses of "we just care that it works." Again, I think this is a non-answer to the question. But its being taken as a sufficient answer, especially in response to peers, is unacceptable. It almost always has no follow-up.
I also do not believe these people are less critical. I've had several works that struggled through publication because my models, at a hundredth the size (and a millionth the data), could perform on par or even better. At face value, asks for "more datasets" and "more scale" are reasonable, yet they form a self-reinforcing paradigm that slows progress. It's like a corn farmer smugly asking why the neighboring soybean farmer doesn't grow anything while the corn farmer is chopping all the soybean stems in their infancy. It is a fine ask of big labs with big money, but it is just gatekeeping and lazy evaluation for everyone else. Even at CVPR this last year they passed out "GPU Rich" and "GPU Poor" hats, so I thought the situation was well known.
> if understanding is to build an internal world model, then quite a bit of work is going into that. Empirically, it would appear that they do, almost by necessity.
I agree a "lot of work is going into it" but I also think the approaches are narrow and still benchmark chasing. I saw as well was given the aforementioned responses at workshops on world modeling (as well as a few presenters who gave very different and more complex answers or "it's the best we got right now", but nether seemed to confident in claiming "world model" either).
But I'm a bit surprised that, as a mathematician, you think these systems create world models. While I see some generalization, it is impossible for me to distinguish from memorization. We're processing more data than can be scrutinized, and we seem to frequently uncover major limitations in our de-duplication processes[0]. We are definitely abusing the terms "Out of Distribution" and "Zero shot". I don't know how any person working with a proprietary LLM (or large model) that they don't own can make a claim of "zero shot" or even "few shot" capabilities. We're publishing papers left and right, yet it's absurd to claim {zero,few}-shot when we don't have access to the learning distribution. We've conflated these terms with biased sampling. Was the data not in training, or is it just in a low-likelihood region of the model? The two are indistinguishable without access to the original distribution.
Idk, I think our scaling is just making the problem harder to evaluate. I don't want to stop that camp, because they are clearly producing things of value, but I do want that camp to not make claims beyond their evidence. It just makes the discussion more convoluted. I mean, the argument would be different if we were discussing small, closed worlds, but we're not. The claims are we've created world models, yet many of them are not self-consistent. Certainly that is a requirement. I admit we're making progress, but the claims were made years ago. Take GameNGen[1] or Diamond Diffusion: neither was the first and neither was self-consistent. Though both are also impressive.
Apologies if I ramble a bit here, this was typed in a bit of a hurry. Hopefully I answer some of your points.
First, regarding robotresearcher and simondota's comments, I am largely in agreement with what they say here. The "toaster" argument is a variant of the Chinese Room argument, and there is a standard rebuttal here. The toaster does not act independently of the human so it is not a closed system. The system as a whole, which includes the human, does understand toast. To me, this is different from the other examples you mention because the machine was not given a list of explicit instructions. (I'm no philosopher though so others can do a better job of explaining this). I don't feel that this is an argument for why LLMs "understand", but rather why the concept of "understanding" is irrelevant without an appropriate definition and context. Since we can't even agree on what constitutes understanding, it isn't productive to frame things in those terms. I guess that's where my maths background comes in, as I dislike the ambiguity of it all.
My "mostly junior" comment is partially in jest, but mostly comes from the fact that LLM and diffusion model research is a popular stream for moving into big tech. There are plenty of senior people in these fields too, but many reviewers in those fields are junior.
> I've also seen audience members at talks ask questions like mine ("are benchmarks sufficient to make such claims?") and get responses of "we just care that it works."
This is a tremendous pain point for me, more than I can convey here, but it's not unusual in computer science. Bad researchers will live and die on standard benchmarks. By the way, if you try to focus on another metric on the grounds that the benchmarks are not wholly representative of a particular task, expect to get roasted by reviewers. Everyone knows it is easier to just chase benchmarks.
> I also do not believe these people are less critical.
I think the fact that the "we just care that it works" argument is enough to get published is a good demonstration of what I'm talking about. If "more datasets" and "more scale" are the major types of criticisms that you are getting, then you are still working in a more fortunate field. And yes, I hate it as much as you do as it does favor the GPU rich, but they are at least potentially solvable. The easiest papers of mine to get through were methodological and often got these kinds of comments. Theory and SciML papers are an entirely different beast in my experience because you will rarely get reviewers that understand the material or care about its relevance. People in LLM research thought that the average NeurIPS score in the last round was a 5. Those in theory thought it was 4. These proportions feel reflected in the recent conferences. I have to really go looking for something outside the LLM mainstream, while there was a huge variety of work only a few years ago. Some of my colleagues have noticed this as well and have switched out of scientific work. This isn't unnatural or something to actively try to fix, as ML goes through these hype phases (in the 2000s, it was all kernels as I understand).
> approaches are narrow and still benchmark chasing
> as a mathematician you think these systems create world models
When I say "world model", I'm not talking about outputs or what you can get through pure inference. Training models to perform next frame prediction and looking at inconsistencies in the output tells us little about the internal mechanism. I'm talking about appropriate representations in a multimodal model. When it reads a given frame, is it pulling apart features in a way that a human would? We've known for a long time that embeddings appropriately encode relationships between words and phrases. This is a model of the world as expressed through language. The same thing happens for images at scale as can be seen in interpretable ViT models. We know from the theory that for next frame prediction, better data and more scaling improves performance. I agree that isn't very interesting though.
> We are definitely abusing the terms "Out of Distribution" and "Zero shot".
Absolutely in agreement with everything you have said. These are not concepts that should be talked about in the context of "understanding", especially at scale.
> I think our scaling is just making the problem harder to evaluate.
Yes and no. It's clear that whatever approach we will use to gauge internal understanding needs to work at scale. Some methods only work with sufficient scale. But we know that completely black-box approaches don't work, because if they did, we could use them on humans and other animals.
> The claims are we've created world models yet many of them are not self-consistent.
For this definition of world model, I see this the same way as how we used to have "language models" with poor memory. I conjecture this is more an issue of alignment than a lack of appropriate representations of internal features, but I could be totally wrong on this.
> The toaster does not act independently of the human so it is not a closed system
I think you're mistaken. No, not about that, but about the premise; I think everyone agrees on that part. Where you're mistaken is this: when I log in to Claude, it says "How can I help you today?"
No one is thinking that the toaster understands things. We're using it to point out how silly the claim "task performance == understanding" is. Techblueberry furthered this by asking if the toaster suddenly becomes intelligent if you wrap it with a cron job. My point was about where the line is drawn. Is it turning on the toaster? No, that would be silly, and you clearly agree. So you have to answer why the toaster isn't understanding toast. That's the ask. Because clearly the toaster toasts bread.
You and robotresearcher have still avoided answering this question. It seems dumb, but that is the crux of the problem. The LLM is claimed to be understanding, right? It meets your criterion of task performance. But these are still tools. They cannot act independently. I still have to prompt them. At an abstract level this is no different from the toaster. So, at what point does the toaster understand how to toast? You claim it doesn't, and I agree. You claim it doesn't because a human has to interact with it. I'm just saying that looping agents onto themselves doesn't magically make them intelligent, just as I could automate the whole process from planting the wheat to toasting the toast.
You're a mathematician. All I'm asking is that you abstract this out a bit and follow the logic. Clearly even our automated seed-to-buttered-toast-on-a-plate machine need not have understanding.
From my physics (and engineering) background there's a key thing I've learned: all measurements are proxies. This is no different. We don't have to worry about this detail in most everyday things because we're typically pretty good at measuring, but if you ever need to do something with precision, it becomes abundantly obvious. You even use this same methodology in math all the time, though I wouldn't say it is equivalent to taking a hard problem, creating an isomorphic map to an easier problem, solving it, then mapping back; with a proxy, the map only goes one way.

A ruler doesn't measure distance; a ruler is a reference for distance. A laser range finder doesn't measure distance either; it is a photodetector and a timer. There is nothing in the world that you can measure directly. It's not like we've directly measured the Higgs either. If we cannot do this with physical things, it seems pretty silly to think we can do it with abstract concepts that we can't even define robustly.

So what, do you think entropy is actually a measurement of intelligible speech? Is perplexity a good tool for identifying an entropy minimizer, or does it just correlate? Is FID a measurement of fidelity, or are we just using a useful proxy? I'm sorry, but I just don't think there are precise mathematical descriptions of things like natural English language or realistic human faces. I've developed some of the best vision models out there, and I can tell you that you have to read more than the paper, because while they will produce fantastic images they also produce some pretty horrendous ones. The fact that they statistically generate realistic images does not imply that they actually understand them.
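To be concrete about the perplexity point: it is just the exponential of the average negative log-likelihood the model assigns, so two sequences the model finds equally unsurprising score the same whether or not a human would call either of them intelligible. A toy sketch with made-up probabilities:

    import math

    def perplexity(token_probs):
        """token_probs: the model's probabilities for each observed token."""
        nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(nll)

    # Hypothetical per-token probabilities a model might assign to two strings.
    sensible   = [0.20, 0.15, 0.30, 0.25]   # a perfectly ordinary sentence
    word_salad = [0.21, 0.16, 0.29, 0.24]   # statistically regular nonsense
    print(perplexity(sensible), perplexity(word_salad))  # nearly identical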
> I'm no philosopher
Why not? It sounds like you are one. Do you not think about metamathematics? About what math means? Do you not think about math beyond the computation? If you do, I'd call you a philosopher. There's a P in PhD for a reason. We're not supposed to be automata. We're not supposed to be machine men, with machine minds and machine hearts.
> This is a tremendous pain point ... researchers will live and die on standard benchmarks.
It is a pain we share. I see it outside CS as well, but I was shocked to see the difference. Most of the other physicists and mathematicians I know that came over to CS were also surprised. And it isn't like physicists are known for their lack of egos lol
> then you are still working in a more fortunate field
Oh, I've gotten the other comments too. That research never found publication, and at the end of the day I had to graduate. Though now it can be revisited. I was once surprised to find that I had saved a paper from Max Welling's group. My fellow reviewers were confident in their rejections, but since they admitted to not understanding differential equations, the AC sided with me (maybe they could see Welling's name? I didn't know until months after). It barely got through a workshop, but it should have been in the main proceedings.
So I guess I'm saying I share this frustration. It's part of the reason I talk so strongly here. I understand why people shift gears. But I think there's a big difference between begrudgingly getting on the train because you need to publish to survive and actively fueling it, shouting that all the other trains are broken and can never be fixed. One train to rule them all? I guess CS people love their binaries.
> world model
I agree that looking at outputs tells us little about internal mechanisms. But proof isn't symmetric in difficulty either. A world model has to be consistent. I like vision because it gives us more clues in our evaluations; it lets us evaluate beyond metrics. If we are seeing video from a POV perspective, and we see a wall in front of us, turn left, then turn back, we should still expect to see that wall, and the same one. A world model is a model beyond what is seen from the camera's view. A world model is a physics model. And I mean /a/ physics model, not "physics"; there is no single physics model. Nor do I mean that a world model needs to have accurate physics. But it does need to make consistent and counterfactual predictions. Even the geocentric model is a world model (literally a model of worlds lol). The model of the world you have in your head is this. We don't close our eyes and conclude that the wall in front of us has disappeared. Someone may spin you around and you still won't conclude that, even if you get your coordinates wrong. The issue isn't so much memory as it is understanding that walls don't just appear and disappear. It is also understanding that this isn't always true about a cat.
I referenced the game engines because while they are impressive they are not self consistent. Walls will disappear. An enemy shooting at you will disappear sometimes if you just stop looking at it. The world doesn't disappear when I close my eyes. A tree falling in a forest still creates acoustic vibrations in the air even if there is no one to hear it.
A world model is exactly that: a model of a world. It is a superset of a model of a camera view. It is a model of the things in the world and how they interact together, regardless of whether they are visible. Accuracy isn't actually the defining feature here, though it is a strong hint, at least for poor world models.
I know this last part is a bit more rambly and harder to convey. But I hope the intention came across.
> You and robotresearcher have still avoided answering this question.
I have repeatedly explicitly denied the meaningfulness of the question. Understanding is a property ascribed by an observer, not possessed by a system.
You may not agree, but you can’t maintain that I’m avoiding that question. It does not have an answer that matters; that is my specific claim.
You can say a toaster understands toasting or you can not. There is literally nothing at stake there.
You said the LLMs are intelligent because they do tasks. But the claim is inconsistent with the toaster example.
If a toaster isn't intelligent because I have to give it bread and press the button to start then how's that any different from giving an LLM a prompt and pressing the button to start?
It's never been about the toaster. You're avoiding answering the question. I don't believe you're dumb, so don't act the part. I'm not buying it.
As someone who is invested in researching said math, I can say with some confidence that it does not exist, or at least not in the form claimed here. That's the whole problem.
I would be ecstatic if it did though, so if anyone has any examples or rebuttal, I would very much appreciate it.
You're right that there is no purely mathematical argument; it's almost nonsensical to claim such.
Instead you can simply make the rather reasonable observation that LLMs are a product of their training distribution, which only has partial coverage of all possible observable states of the world. Some highly regular observable states are thus likely missing, but an embodied agent (like a human) would be able to reason correctly about those states thanks to other sensory modes that observe them. It's very clear that the number of possible states far exceeds the amount of text describing them (the fraction of states transcribed to text is vanishingly small, yet those states are locally coherent and quite regular when observed through other sensory modes like vision). That's the closest you can get to a mathematical argument.
As a result one common failure case for an LLM might involve describing a specific physical environment and associated set of actions that are in an area of low (or no) probability mass in the training data and asking the LLM to reason about it. Here's an example where I asked ChatGPT the following:
imagine i walked into a room through a doorway, then turned around and walked right back out of the room. from the perspective of a person in the room facing the door, which side would my right arm be? please explain.
It responded by incorrectly stating the following:
Let’s break this down step by step.
1. *You walk into the room through the doorway*:
* As you're entering, the person in the room is facing the door. So, you're entering from the **outside**.
* Your **right arm** will be on the **right side** of your body, just as it is normally when you're standing.
2. *You turn around and walk right back out*:
* When you turn around to walk out, you’ll be facing **away** from the room, looking back toward the doorway.
* Since you're walking **back out**, the person inside the room is still facing the door, while you are walking away from them, so your **right arm** is still on the right side of your body.
So, from the perspective of the person in the room, as you’re exiting, your *right arm* will still be on their *left side*. That’s because they’re facing the door, and your right side is to their left when you’re turning around and leaving.
To visualize:
* Imagine a door in front of them.
* You walk through and turn around, so you're now facing away from them.
* Your right arm is on your right side, but to them, it's on their left side since you’re now moving in the opposite direction.
My claim is more that data processing alone is not enough. I was too vague and I definitely did not convey myself accurately. I tried to clarify a bit in a sibling comment to yours, but I'm still unsure if it's sufficient tbh.
For embodiment, I think this is sufficient but not necessary. A key part of the limitation is that the agent cannot interact with its environment. This is a necessary feature for distinguishing competing explanations. I believe we are actually in agreement here, but I do think we need to be careful about how we define embodiment, because even a toaster can be considered a robot. It seems hard to determine what does not qualify as a body once we get into the nitty gritty. But I think in general, when people talk about embodiment, they are discussing the capability of being interventional.
By your elaboration I believe we agree since part of what I believe to be necessary is the ability to self-analyze (meta-cognition) to determine low density regions of its model and then to be able to seek out and rectify this (intervention). Data processing is not sufficient for either of those conditions.
Your prompt is, imo, more about world modeling, though I do think this is related. I asked Claude Sonnet 4.5 with extended thinking enabled and it also placed itself outside the room. Opus 4.1 (again with extended thinking) got the answer right. (I don't use a standard system prompt, though that is mostly to make it not sycophantic, to get it to ask questions when uncertain, and to enforce step-by-step thinking.)
From the perspective of the person in the room, your right arm would be on their right side as you walk out.
Here's why: When you initially walk into the room facing the person, your right arm appears on their left side (since you're facing each other). But when you turn around 180 degrees to walk back out, your back is now toward them. Your right arm stays on your right side, but from their perspective it has shifted to their right side.
Think of it this way - when two people face each other, their right sides are on opposite sides. But when one person turns their back, both people's right sides are now on the same side.
The CoT output is a bit more interesting[0]. Disabling my system prompt gives an almost identical answer fwiw. But Sonnet got it right. I repeated the test in incognito after deleting the previous prompts and it continued to get it right, independent of my system prompt or extended thinking.
I don't think this proves a world model though. Misses are more important than hits, just as counterexamples are more important than examples in any evidence or proof setting. But fwiw I also frequently ask these models variations on river-crossing problems and the results are very shabby. A few appear spoiled now, but they are not very robust to variation, and that, I think, is critical.
I think an interesting variation of your puzzle is as follows
Imagine you walked into a room through a doorway. Then you immediately turn around and walk back out of the room.
From the perspective of a person in the room, facing the door, which side would your right arm be? Please explain.
I think Claude (Sonnet) shows some subtle but important results in how it answers
Your right arm would be on their right side.
When you turn around to walk back out, you're facing the same direction as the person in the room (both facing the door). Since you're both oriented the same way, your right side and their right side are on the same side.
This makes me suspect there's some overfitting. CoT correctly uses "I"[1].
It definitely isn't robust to red herrings[2], and I think that's the kicker here. It is similar to the failure results I see on any of these puzzles: they are quite easy to break with small variations. And we do need to remember that these models are trained on the entire internet (including HN comments), so we can't presume this is a unique puzzle.
Let me clarify. I was too vague and definitely did not express things accurately. That is on me.
We have the math to show that it can be impossible to distinguish two explanations through data processing alone. We have examples of this in science, a long history of them in fact. Fundamentally, there is a great deal that we cannot conclude from processing data alone. Science (the search for knowledge) is active: it requires not just processing existing data but seeking out new data. We propose competing hypotheses that are indistinguishable given the current data and seek out the data that distinguishes them (a pain point for many of the TOEs like String Theory). We know that data processing alone is insufficient for explanation. We know it cannot distinguish confounders. We know it cannot distinguish causal graphs (e.g. distinguish triangular maps: we are able to create them, but not to distinguish them through data processing alone). The problem with scaling alone is that it asserts that data processing is enough, yet we have so much work (and history) telling us that data processing is insufficient.
The scaling math itself also shows drastically diminishing returns with scale, and often does not suggest convergence even with infinite data. The curves are power laws with positive concavity, requiring exponential increases in data and parameters for marginal improvements in test loss. I'm not claiming that we need zero test loss to reach AGI, but the results do tell us that if test loss is strongly correlated with what we care about, then we'll need to spend exponentially more to achieve AGI even if we are close. By our own measures, scaling is not enough unless we are already sufficiently close. Even our empirical results align with this: despite many claiming that scale is all we need, we keep making significant changes to model architectures and training procedures (including optimizers). We make these large changes because throwing the new data at the old models (even when simply increasing the number of parameters) does not work out. It is not just the practicality; it is the results. The scaling claim has always been a myth used to drive investment, since it is a nice simple story that says we can get there by doing what we've already been doing, just more of it. We all know that these new LLMs aren't dramatic improvements over their previous versions, despite being much larger, more efficient, and having processed far more data.
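For reference, the empirical form usually meant here (the constants are the commonly cited Kaplan-style values, quoted purely to illustrate the shape of the curve, not as a claim about any particular model):

    % Rough empirical parameter-count scaling law (illustrative constants):
    L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
    % L is decreasing and convex in N: a 10x increase in parameters multiplies
    % the loss by only 10^{-\alpha_N} \approx 0.84, so each further
    % constant-factor reduction in loss requires a large multiplicative
    % increase in N (and similarly for data and compute).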
[side note]: We even have my namesake, who would argue that there are truths which are not provably true within a system that is both consistent and effective (effectively calculable). But we need not go that far, as omniscience is not a requirement for AGI. Though it is worth noting for the limits of our models, since at the core this matters. Changing our axioms changes the results, even with the same data. But science doesn't exclusively use a formal system, nor does it use a single one.
My apologies for the much-delayed reply; I have recently found myself with little spare time to post adequate responses. Your critiques are very interesting to ponder, so I thank you for posting them. I did want to respond to this one, though.
I believe all of my counterarguments center around my current viewpoint that, given the rapid rate of progress on the engineering side, it is no longer reasonable in deep learning theory to ask what is possible; it is more interesting to try to outline hard limitations. This places deep learning in stark contrast with classical statistics, as the boundaries of the latter are very clear and are not shared by the former.
I want to stress that, at present, nearly every conjectured limitation of deep learning over the last several decades has fallen. This includes many back-of-the-napkin, "clearly obvious" arguments, so I'm wary of them now. I think the skepticism has all along been fueled by reaction to hype cycles, and we must be careful not to make the same mistakes. There is far too much empirical evidence countering any precise argument against the claim that there is an underlying understanding within these models, so it seems we must resort to the imprecise to continue the debate.
Scaling along one axis suggests that a high polynomial degree of additional compute (not exponential) is required for further improvements; this is true. But the progress over the last few years has come from discovering new axes to scale along, which further reduce the error rate and improve performance. There are still many potential axes left untapped. What is significant about scaling, to me, is not how much additional compute is required, but the fact that the predicted floor at the moment is very, very low, far lower than anything we have ever seen, and reaching it doesn't require any more data than we currently have. That should be cause for concern until we find a better lower bound.
> We all know that these new LLMs aren't dramatic improvements over their previous versions
No, I don't agree. This may be evident to many, but to some the differences are stark. Our perceived metrics of performance are nonlinear and person-dependent, and these major differences can be imperceptible to most. The vast majority of attempts at providing more rigorous metrics or benchmarks that are not already saturated show that LLM development is not slowing down by any stretch. I'm not saying that LLMs will "go to the moon", but I don't have anything concrete to say they cannot either.
> We have the math to show that it can be impossible to distinguish two explanations through data processing alone.
Actually, this is a really great point, but I think it highlights the limitations of benchmarks and the need for capacity-based, compression-based, or other alternative data-independent metrics. With these in hand, it can become possible to distinguish two explanations. This could be a fruitful line of inquiry.
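One hedged sketch of the kind of comparison I have in mind, a crude two-part MDL score; the numbers and the `neg_log2_likelihood` values are hypothetical stand-ins, not measurements:

    # Two-part description length in bits: "bits to describe the model" plus
    # "bits to encode the held-out data under the model". Smaller is better.
    def mdl_score(num_params: float, bits_per_param: float,
                  neg_log2_likelihood: float) -> float:
        return num_params * bits_per_param + neg_log2_likelihood

    # Illustrative numbers only: a large model with a slightly better fit vs a
    # small model with a slightly worse fit on the same held-out corpus.
    big   = mdl_score(70e9, 16, 0.60e12)
    small = mdl_score(7e9, 16, 0.72e12)
    print("prefer small" if small < big else "prefer big")  # prefer small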