> One way to resolve this surprise is to find some reason to believe these strange abilities are fundamentally not an understanding of the world. Thus stochastic parrots, this article, Yann LeCun and Chomsky, etc.
I mean, it is kind of obvious that predicting text based on a large corpus of text written by people with a model of the world will, if it works at all, look like having a model of the world.
The question is whether that involves having a model of the world, and secondarily, if there is a difference, what is the practical impact of the difference?
Or maybe that’s not really the question, because the whole “model of the world” thing is metaphysical omphaloskepsis, inherently unanswerable because we can’t unambiguously reduce it to something with testable empirical predictions. It reflects a rationalization for elements of our culture, and particularly of our own view of our special place in the universe. The different answers to it likewise have no real meaning; they simply reflect people’s bias about whether or not they want to share that special place, either in general or with the particular computing systems under discussion.
The one thing that has convinced me that ChatGPT has built a real world model is asking it how to stack different objects. You can ask it for the safest way to stack a pillow, an egg, a nail, a book, and an action figure, and even get more complicated than that. GPT-4 will, most of the time, correctly reason through how to stack these items to form a stable structure without breaking anything. That tells me it knows intimately the physical properties of all of these objects and how they interact with each other.
Or was that one of the training samples provided to GPT-3.5?
Going forward, it is going to be increasingly hard to tell apart what was in the training set and what is creative thinking.
Like, at the very minimum, all the publicly known reasoning samples that were used to probe GPT-3.5 and earlier are going to be unreliable for gauging what GPT-4's reasoning abilities are.
The point of the question is that you can easily come up with combinations of objects that are unlikely, or statistically impossible, to be in their training samples. How about this one? "Tell me how to safely stack the following objects on top of each other without any additional support: A Nikon DSLR, A small succulent, 4 lightbulbs, a pack of gummy worms, and a hardcover copy of Harry Potter and the Half-Blood Prince."
Hardcover copy of Harry Potter and the Half-Blood Prince: Start with the largest, heaviest, and most stable object. This will serve as your base. Make sure to place it on a flat, secure surface.
Pack of gummy worms: This item is fairly light and flexible. It should sit nicely on the book without risking damage to either. You can place it right in the center to maintain balance.
Nikon DSLR: This is a heavy object with a fairly stable shape. Put this on top of the pack of gummy worms, as it's sturdy enough to handle the weight and won't be harmed by the soft package. Make sure to place it so that the lens is pointing upward and it sits flat and stable. Don't use any protruding lenses or parts to balance it, as they could easily be damaged.
Four Lightbulbs: Since light bulbs are round and fragile, you will need to stack them with care. If the bulbs are in a box, you can place the box on top of the DSLR, making sure it's centered and stable. If the bulbs are loose, it's a bit more tricky. You might want to nestle them into the lens of the DSLR (if the lens is large enough), which should provide some natural containment for them. If not, it might be safer not to include the lightbulbs in your stack.
A small succulent: Finally, the succulent can be placed on top. It's likely the lightest object and it is usually quite stable due to its pot. If the lightbulbs were in a box, place the succulent on top of that. If you ended up not including the lightbulbs, place the succulent directly on top of the DSLR.
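For anyone who wants to rerun this kind of probe with fresh object combinations, here is a minimal sketch using the OpenAI Python client (openai >= 1.0). The model name and object list are placeholders; this is not necessarily how the output above was produced.

```python
# Sketch: send a novel stacking prompt to a chat model and print the reply.
# The object list is arbitrary -- swap in any combination that is unlikely
# to appear verbatim in training data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

objects = [
    "a Nikon DSLR",
    "a small succulent",
    "4 lightbulbs",
    "a pack of gummy worms",
    "a hardcover copy of Harry Potter and the Half-Blood Prince",
]

prompt = (
    "Tell me how to safely stack the following objects on top of each other "
    "without any additional support: " + ", ".join(objects) + "."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```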
Reads well if you don't think about it too much... For example: where does the succulent go if the light bulbs are nestled into the lens of the DSLR? Balanced on the light bulbs? Why would the gummy worm package need to be in the center of the book to maintain balance?
I tried to stack all of these objects myself and couldn't really. I think GPT-4's approach is actually really good. It correctly points out that the gummy worms make a flexible base for the DSLR (otherwise the protruding buttons/viewfinder make it wobbly on the hard book), and the light bulbs are able to nestle into the front of the lens. If they were smaller light bulbs I could probably use the four of them as a small base on top of the lens to host the succulent.
The point is that ChatGPT undeniably built a world model good enough to understand the physical and three-dimensional properties of these items pretty well, and it gives me a somewhat workable way to stack them, despite never having seen that in its training data.
You cannot conclude that from the output: the training data likely contains plenty of examples of stacking things, and everyday objects may have stacking properties that make these questions easy to answer even with semi-random answers.
Plus, some stuff clearly makes no sense or is ignored (like the gummy worms in the center, forgetting about the succulent in some cases).
If you want to test world modeling, give it objects it has never encountered, describe them, and then ask it to stack them, etc. For example, a bunch of 7-dimensional objects that can only be stacked a certain way.
> If you want to test world modeling, give it objects it has never encountered, describe them, and then ask it to stack them, etc. For example, a bunch of 7-dimensional objects that can only be stacked a certain way.
And when it does that perfectly, I assume you'll say that was also in the training data? All examples I've seen or tried point to LLMs being able to do some kind of reasoning that is completely dynamic, even when presented with the most outlandish cases.
All examples I tried myself show it failing miserably at reasoning.
It certainly needs better evidence than coming up with one of many possible ways of stacking things, aided by human interpretation on top of the text output. Happy to look at other suggestions for test problems.
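To make the earlier "objects it has never encountered" suggestion concrete, here is one possible shape for such a test, as a sketch only: invented object names, randomly assigned properties, and a simple validity check. The names, properties, and the single enforced rule are all assumptions for illustration, not a validated benchmark.

```python
# Sketch of a stacking test built from objects the model cannot have seen:
# invented names and randomly assigned properties. The only enforced rule
# here is "heavier objects go below lighter ones", which guarantees a valid
# answer always exists; richer constraints (fragility, top shape) could be
# added as long as at least one valid stack remains.
import random

NAMES = ["zorblet", "quintar", "moxapod", "brelquin", "fyndrel"]

def make_objects(rng):
    return {
        name: {
            "weight_kg": round(rng.uniform(0.1, 5.0), 1),
            "fragile": rng.random() < 0.3,
            "flat_top": rng.random() < 0.7,
        }
        for name in NAMES
    }

def describe(objects):
    return "\n".join(
        f"- {name}: weighs {p['weight_kg']} kg, "
        f"{'fragile' if p['fragile'] else 'sturdy'}, "
        f"{'flat top' if p['flat_top'] else 'rounded top'}"
        for name, p in objects.items()
    )

def stack_is_valid(order_bottom_to_top, objects):
    """Check that weights are non-increasing from bottom to top."""
    weights = [objects[n]["weight_kg"] for n in order_bottom_to_top]
    return weights == sorted(weights, reverse=True)

rng = random.Random(0)
objects = make_objects(rng)
prompt = (
    "Stack these objects into a single stable tower and list them "
    "bottom to top, one per line:\n" + describe(objects)
)
print(prompt)
# Send `prompt` to the model, parse the bottom-to-top list it returns,
# and score it with stack_is_valid(parsed_order, objects).
```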
Well, for me personally, the proof is in giving it a few sentences on how it should write the fairly complicated pieces of unique code I need on a daily basis, and seeing it correctly infer things I forgot to specify, in ways that are typically borderline impossible for anything but another human. If that's not reasoning, I don't know what is.
The other one that convinced me was this list: https://i.imgur.com/CQlbaDN.png
I think the leetcode tests are quite indicative, going so far as to say that GPT-4 scores 77% on basic reasoning, 26% on complex reasoning, and 6% on extremely complex reasoning.
Maybe the reasoning is all "baked in," as it were, like in a hypothetical machine doing string matching of questions and answers against a database containing an answer to every possible question. But in the end, correctly using those baked-in thought processes may be good enough for it to be completely indistinguishable from the real thing, if the real thing even exists and we aren't stochastic resamplers ourselves.
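As a toy sketch of the hypothetical being contrasted here: a pure question-to-answer lookup returns nothing for any phrasing it hasn't stored verbatim, which is what makes the novel-object probes above interesting. Everything in the snippet is invented for illustration.

```python
# Toy version of the hypothetical "string matching" machine: answers come
# only from an exact-match table, so any unseen phrasing or unseen object
# combination yields nothing. The stored entries are made up.
ANSWER_TABLE = {
    "how do i stack a book and an egg?": "Put the book down flat, then rest the egg on top.",
    "what is 2 + 2?": "4",
}

def lookup_machine(question: str) -> str:
    return ANSWER_TABLE.get(question.strip().lower(), "<no stored answer>")

print(lookup_machine("How do I stack a book and an egg?"))  # hit: stored verbatim
print(lookup_machine("How do I stack an egg and a book?"))  # miss: merely reworded
```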
> aided by human interpretation on top of the text output
That's an interesting point, actually. I've been trying to do something in that regard recently by having it use an API to do actual things (in a simulated environment), and it seems very promising despite the model not being tuned for it. But given that AutoGPT and plugin usage are a thing, that should be all the evidence you need on that front.
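A rough sketch of what that kind of setup can look like, assuming a toy simulated environment and a plain-text action format. The environment, the action grammar, and the loop are all invented for illustration and are much simpler than AutoGPT or the plugin machinery.

```python
# Sketch: let the model act on a simulated world through a text "API".
# Each turn, the model sees the state and the allowed action format,
# replies with one action, and the environment applies it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class SimWorld:
    """Toy environment: each item sits on the table or on another item."""
    def __init__(self):
        self.on = {"book": "table", "egg": "table", "cup": "table"}

    def state(self) -> str:
        return ", ".join(f"{item} is on {base}" for item, base in self.on.items())

    def apply(self, action: str) -> str:
        # Expected form: "MOVE <item> ONTO <target>"
        parts = action.split()
        if (len(parts) == 4 and parts[0] == "MOVE" and parts[2] == "ONTO"
                and parts[1] in self.on
                and (parts[3] in self.on or parts[3] == "table")):
            self.on[parts[1]] = parts[3]
            return f"ok: {parts[1]} is now on {parts[3]}"
        return "error: could not parse or apply action"

world = SimWorld()
goal = "Goal: stack the egg on the cup and the cup on the book."
for _ in range(3):  # a few turns
    prompt = (
        f"You control a robot arm. {goal}\n"
        f"Current state: {world.state()}\n"
        "Reply with exactly one action of the form: MOVE <item> ONTO <target>."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    print(reply, "->", world.apply(reply))
```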
Coding isn't a use case of mine. For example, for things like financial derivatives replication, it can explain the abstract concept but cannot apply it in a meaningful way.
Why? You need to make sure that a solution requires true understanding and isn't in the training set. If it can reason properly, it shouldn't have a problem with such a task.