> One way to resolve this surprise is to find some reason to believe these strange abilities are fundamentally not an understanding of the world. Thus stochastic parrots, this article, Yann LeCun and Chomsky, etc.
I mean, it is kind of obvious that predicting text based on a large corpus of text written by people with a model of the world will, if it works at all, look like having a model of the world.
The question is whether that involves having a model of the world, and secondarily, if there is a difference, what is the practical impact of the difference?
Or maybe that’s not really the question, because the whole “model of the world” thing is metaphysical omphaloskepsis, inherently unanswerable because we can’t unambiguously reduce it to something with testable empirical predictions. It reflects a rationalization for elements of our culture, and particularly of our own view of our special place in the universe. The different answers to it likewise have no real meaning; they simply reflect people’s bias about whether or not they want to share that special place, either in general or with the particular computing systems under discussion.
The one thing that has convinced me that ChatGPT has built a real world model is asking it how to stack different objects. You can ask it for the safest way to stack a pillow, an egg, a nail, a book, and an action figure, and even get more complicated than that. GPT-4 will, most of the time, correctly reason through how to stack these items to form a stable structure without breaking anything. That tells me it knows intimately the physical properties of all of these objects and how they interact with each other.
Or was that one of the training samples provided to GPT-3.5?
Going forward, it is going to be increasingly hard to tell apart what was in the training set and what is creative thinking.
Like, at the very minimum, all the publicly known reasoning samples that were used to probe GPT-3.5 and earlier are going to be unreliable for gauging what GPT-4's reasoning abilities are.
The point of the question is that you can easily come up with combinations of objects that are unlikely, or statistically impossible, to be in their training samples. How about this one? "Tell me how to safely stack the following objects on top of each other without any additional support: A Nikon DSLR, A small succulent, 4 lightbulbs, a pack of gummy worms, and a hardcover copy of Harry Potter and the Half-Blood Prince."
Hardcover copy of Harry Potter and the Half-Blood Prince: Start with the largest, heaviest, and most stable object. This will serve as your base. Make sure to place it on a flat, secure surface.
Pack of gummy worms: This item is fairly light and flexible. It should sit nicely on the book without risking damage to either. You can place it right in the center to maintain balance.
Nikon DSLR: This is a heavy object with a fairly stable shape. Put this on top of the pack of gummy worms, as it's sturdy enough to handle the weight and won't be harmed by the soft package. Make sure to place it so that the lens is pointing upward and it sits flat and stable. Don't use any protruding lenses or parts to balance it, as they could easily be damaged.
Four Lightbulbs: Since light bulbs are round and fragile, you will need to stack them with care. If the bulbs are in a box, you can place the box on top of the DSLR, making sure it's centered and stable. If the bulbs are loose, it's a bit more tricky. You might want to nestle them into the lens of the DSLR (if the lens is large enough), which should provide some natural containment for them. If not, it might be safer not to include the lightbulbs in your stack.
A small succulent: Finally, the succulent can be placed on top. It's likely the lightest object and it is usually quite stable due to its pot. If the lightbulbs were in a box, place the succulent on top of that. If you ended up not including the lightbulbs, place the succulent directly on top of the DSLR.
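For anyone who wants to rerun this kind of probe with fresh object combinations, here is a minimal sketch using the OpenAI Python client (openai >= 1.0). The model name and object list are placeholders; this is not necessarily how the output above was produced.

```python
# Sketch: send a novel stacking prompt to a chat model and print the reply.
# The object list is arbitrary -- swap in any combination that is unlikely
# to appear verbatim in training data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

objects = [
    "a Nikon DSLR",
    "a small succulent",
    "4 lightbulbs",
    "a pack of gummy worms",
    "a hardcover copy of Harry Potter and the Half-Blood Prince",
]

prompt = (
    "Tell me how to safely stack the following objects on top of each other "
    "without any additional support: " + ", ".join(objects) + "."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```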
Reads well if you don't think about it too much... For example: where does the succulent go if the light bulbs are nestled into the lens of the DSLR? Balanced on the light bulbs? Why would the gummy worm package need to be in the center of the book to maintain balance?
I tried to stack all of these objects myself and couldn't really. I think GPT-4's approach is actually really good. It correctly points out that the gummy worms make a flexible base for the DSLR (otherwise the protruding buttons/viewfinder make it wobbly on the hard book), and the light bulbs are able to nestle into the front of the lens. If they were smaller light bulbs I could probably use the four of them as a small base on top of the lens to host the succulent.
The point is that ChatGPT undeniably built a world model good enough to understand the physical and three-dimensional properties of these items pretty well, and it gives me a somewhat workable way to stack them, despite never having seen that in its training data.
You cannot conclude that from the output: the training data likely contains plenty of examples of stacking things, and everyday objects may have stacking properties that make these questions easy to answer even with semi-random answers.
Plus, some stuff clearly makes no sense or is ignored (like the gummy worms in the center, forgetting about the succulent in some cases).
If you want to test world modeling, give it objects it has never encountered, describe them, and then ask it to stack them, etc. For example, a bunch of 7-dimensional objects that can only be stacked a certain way.
> If you want to test world modeling, give it objects it has never encountered, describe them, and then ask it to stack them, etc. For example, a bunch of 7-dimensional objects that can only be stacked a certain way.
And when it does that perfectly, I assume you'll say that was also in the training data? All examples I've seen or tried point to LLMs being able to do some kind of reasoning that is completely dynamic, even when presented with the most outlandish cases.
All examples I tried myself show it failing miserably at reasoning.
It certainly needs better evidence than coming up with one of many possible ways of stacking things, aided by human interpretation on top of the text output. Happy to look at other suggestions for test problems.
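To make the earlier "objects it has never encountered" suggestion concrete, here is one possible shape for such a test, as a sketch only: invented object names, randomly assigned properties, and a simple validity check. The names, properties, and the single enforced rule are all assumptions for illustration, not a validated benchmark.

```python
# Sketch of a stacking test built from objects the model cannot have seen:
# invented names and randomly assigned properties. The only enforced rule
# here is "heavier objects go below lighter ones", which guarantees a valid
# answer always exists; richer constraints (fragility, top shape) could be
# added as long as at least one valid stack remains.
import random

NAMES = ["zorblet", "quintar", "moxapod", "brelquin", "fyndrel"]

def make_objects(rng):
    return {
        name: {
            "weight_kg": round(rng.uniform(0.1, 5.0), 1),
            "fragile": rng.random() < 0.3,
            "flat_top": rng.random() < 0.7,
        }
        for name in NAMES
    }

def describe(objects):
    return "\n".join(
        f"- {name}: weighs {p['weight_kg']} kg, "
        f"{'fragile' if p['fragile'] else 'sturdy'}, "
        f"{'flat top' if p['flat_top'] else 'rounded top'}"
        for name, p in objects.items()
    )

def stack_is_valid(order_bottom_to_top, objects):
    """Check that weights are non-increasing from bottom to top."""
    weights = [objects[n]["weight_kg"] for n in order_bottom_to_top]
    return weights == sorted(weights, reverse=True)

rng = random.Random(0)
objects = make_objects(rng)
prompt = (
    "Stack these objects into a single stable tower and list them "
    "bottom to top, one per line:\n" + describe(objects)
)
print(prompt)
# Send `prompt` to the model, parse the bottom-to-top list it returns,
# and score it with stack_is_valid(parsed_order, objects).
```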
Well, for me personally, the proof is in giving it a few sentences on how it should write the fairly complicated pieces of unique code I need on a daily basis, and seeing it correctly infer things I forgot to specify, in ways that are typically borderline impossible for anything but another human. If that's not reasoning, I don't know what is.
The other one that convinced me was this list: https://i.imgur.com/CQlbaDN.png
I think the leetcode tests are quite indicative, going so far as to say that GPT-4 scores 77% on basic reasoning, 26% on complex reasoning, and 6% on extremely complex reasoning.
Maybe the reasoning is all "baked in," as it were, like in a hypothetical machine doing string matching of questions and answers against a database containing an answer to every possible question. But in the end, correctly using those baked-in thought processes may be good enough for it to be completely indistinguishable from the real thing, if the real thing even exists and we aren't stochastic resamplers ourselves.
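As a toy sketch of the hypothetical being contrasted here: a pure question-to-answer lookup returns nothing for any phrasing it hasn't stored verbatim, which is what makes the novel-object probes above interesting. Everything in the snippet is invented for illustration.

```python
# Toy version of the hypothetical "string matching" machine: answers come
# only from an exact-match table, so any unseen phrasing or unseen object
# combination yields nothing. The stored entries are made up.
ANSWER_TABLE = {
    "how do i stack a book and an egg?": "Put the book down flat, then rest the egg on top.",
    "what is 2 + 2?": "4",
}

def lookup_machine(question: str) -> str:
    return ANSWER_TABLE.get(question.strip().lower(), "<no stored answer>")

print(lookup_machine("How do I stack a book and an egg?"))  # hit: stored verbatim
print(lookup_machine("How do I stack an egg and a book?"))  # miss: merely reworded
```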
> aided by human interpretation on top of the text output
That's an interesting point, actually. I've been trying to do something in that regard recently by having it use an API to do actual things (in a simulated environment), and it seems very promising despite the model not being tuned for it. But given that AutoGPT and plugin usage are a thing, that should be all the evidence you need on that front.
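A rough sketch of what that kind of setup can look like, assuming a toy simulated environment and a plain-text action format. The environment, the action grammar, and the loop are all invented for illustration and are much simpler than AutoGPT or the plugin machinery.

```python
# Sketch: let the model act on a simulated world through a text "API".
# Each turn, the model sees the state and the allowed action format,
# replies with one action, and the environment applies it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class SimWorld:
    """Toy environment: each item sits on the table or on another item."""
    def __init__(self):
        self.on = {"book": "table", "egg": "table", "cup": "table"}

    def state(self) -> str:
        return ", ".join(f"{item} is on {base}" for item, base in self.on.items())

    def apply(self, action: str) -> str:
        # Expected form: "MOVE <item> ONTO <target>"
        parts = action.split()
        if (len(parts) == 4 and parts[0] == "MOVE" and parts[2] == "ONTO"
                and parts[1] in self.on
                and (parts[3] in self.on or parts[3] == "table")):
            self.on[parts[1]] = parts[3]
            return f"ok: {parts[1]} is now on {parts[3]}"
        return "error: could not parse or apply action"

world = SimWorld()
goal = "Goal: stack the egg on the cup and the cup on the book."
for _ in range(3):  # a few turns
    prompt = (
        f"You control a robot arm. {goal}\n"
        f"Current state: {world.state()}\n"
        "Reply with exactly one action of the form: MOVE <item> ONTO <target>."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    print(reply, "->", world.apply(reply))
```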
Coding isn't a use case of mine. For example, for things like financial derivatives replication, it can explain the abstract concept but cannot apply it in a meaningful way.
Why? You need to make sure that a solution requires true understanding and isn't in the training set. If it can reason properly, it shouldn't have a problem with such a task.