
https://i.imgur.com/feEiiZA.png

I tried to stack all of these objects myself and couldn't really. I think GPT-4's approach is actually really good. It correctly points out that the gummy worms make a flexible base for the DSLR (otherwise the protruding buttons/viewfinder make it wobbly on the hard book), and the light bulbs are able to nestle into the front of the lens. If they were smaller light bulbs I could probably use the four of them as a small base on top of the lens to host the succulent.



You might also use the light bulbs as the base (especially if they're in a box). They're pretty sturdy and can hold a book.


The point is that ChatGPT undeniably built a world model good enough to understand the physical, three-dimensional properties of these items pretty well, and it gives me a somewhat workable way to stack them, despite never having seen this exact scenario in its training data.


You cannot conclude that from the output - the training data will likely contain a lot of examples of stacking things. Everyday objects also tend to have stacking properties that make these questions easy to answer even with semi-random answers.

Plus, some of it clearly makes no sense or gets ignored (like the gummy worms in the center, or forgetting about the succulent in some cases).

If you want to test world modeling, give it objects it will never have encountered, describe them, and then ask it to stack them, and so on. For example, a bunch of 7-dimensional objects that can only be stacked a certain way.
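To make that concrete, here is a rough sketch in ordinary 3-D terms rather than 7 dimensions (the object names, attributes, and constraint checker below are made up purely for illustration):

    import random

    # Sketch of the proposed test: invent objects with made-up names and
    # randomized attributes, then ask the model for a stacking order that
    # satisfies explicit constraints. The exact instance cannot appear
    # verbatim in any training set.

    def make_object(name, rng):
        return {
            "name": name,
            "width_cm": rng.randint(5, 60),
            "weight_kg": round(rng.uniform(0.1, 10.0), 1),
            "max_load_kg": round(rng.uniform(0.0, 20.0), 1),
            "top_is_flat": rng.random() < 0.7,
        }

    def valid_stack(order):
        # Bottom to top: nothing may sit on a rounded top, and each object
        # must bear the total weight stacked above it.
        for i, obj in enumerate(order):
            above = order[i + 1:]
            if above and not obj["top_is_flat"]:
                return False
            if sum(o["weight_kg"] for o in above) > obj["max_load_kg"]:
                return False
        return True

    def make_prompt(objects):
        lines = ["Stack these invented objects into a single stable tower.",
                 "List them bottom to top and justify each placement."]
        for o in objects:
            lines.append(
                f"- {o['name']}: {o['width_cm']} cm wide, {o['weight_kg']} kg, "
                f"holds up to {o['max_load_kg']} kg, "
                f"{'flat' if o['top_is_flat'] else 'rounded'} top")
        return "\n".join(lines)

    rng = random.Random(7)
    objects = [make_object(n, rng) for n in ("florb", "quintel", "drazic", "mubo")]
    print(make_prompt(objects))
    # The model's reply can then be parsed into an ordering and checked
    # mechanically with valid_stack(), instead of relying on human
    # interpretation of free-form text.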


> If you want to test world modeling, give it objects it will never have encountered, describe them, and then ask it to stack them, and so on. For example, a bunch of 7-dimensional objects that can only be stacked a certain way.

And when it does that perfectly, I assume you'll say that was also in the training data? All examples I've seen or tried point to LLMs being able to do some kind of reasoning that is completely dynamic, even when presented with the most outlandish cases.


All examples I tried myself show it failing miserably at reasoning.

It certainly needs better evidence than coming up with one of the many possible ways of stacking things - aided by human interpretation on top of the text output. Happy to look at other suggestions for test problems.


Well, for me personally, the proof is in giving it a few sentences on how it should write the fairly complicated, unique pieces of code I need on a daily basis, and seeing it correctly infer things I forgot to specify in ways that would be borderline impossible for anything but another human. If that's not reasoning, I don't know what is.

The other one that convinced me was this list: https://i.imgur.com/CQlbaDN.png I think the LeetCode tests are quite indicative, going as far as saying that GPT-4 scores 77% on basic reasoning, 26% on complex reasoning, and 6% on extremely complex reasoning.

Maybe the reasoning is all "baked in" as it were, like in a hypothetical machine doing string matching of questions and answers with a database containing an answer to every possible question. But in the end, correctly using those baked in thought processes may be good enough for it to be completely indistinguishable from the real thing, if the real thing even exists and we aren't stochastic resamplers ourselves.

> aided by human interpretation on top of the text output

That's an interesting point, actually. I've been trying something along those lines recently: having it use an API to do actual things in a simulated environment, and it seems very promising despite the model not being tuned for it. But given that AutoGPT and plugin usage are a thing, that should be all the evidence you need on that front.

Google also did this with their old PaLM model, which is vastly inferior even to GPT-3.5: https://www.youtube.com/watch?v=j6O_uePUKKI
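To sketch the kind of setup I mean (call_model() and SimulatedRoom are just placeholders for illustration, not the actual model API or environment I'm using):

    import json

    # Minimal tool-use loop: the model emits one JSON "action" per turn, a
    # simulated environment executes it and returns an observation, and the
    # transcript grows until the model declares it is done.

    class SimulatedRoom:
        def __init__(self):
            self.stack = []

        def execute(self, action):
            if action.get("op") == "place":
                self.stack.append(action["object"])
                return {"ok": True, "stack": list(self.stack)}
            return {"ok": False, "error": "unknown op"}

    def call_model(transcript):
        # Placeholder for an actual LLM call; should return the next action
        # as a JSON string, e.g. '{"op": "place", "object": "book"}'.
        raise NotImplementedError

    def run_episode(goal, max_steps=10):
        env = SimulatedRoom()
        transcript = [f"Goal: {goal}. Respond with one JSON action per turn."]
        for _ in range(max_steps):
            reply = call_model("\n".join(transcript))
            action = json.loads(reply)
            if action.get("op") == "done":
                break
            observation = env.execute(action)
            transcript.append(f"Action: {reply}")
            transcript.append(f"Observation: {json.dumps(observation)}")
        return env.stack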


Coding isn't a use case of mine. For things like financial derivatives replication, for example, it can explain the abstract concept but it cannot apply it in a meaningful way.


> For example, a bunch of 7-dimensional objects that can only be stacked a certain way.

That's a ridiculous example.


Why? You need to make sure that a solution requires true understanding and isn't in the training set. If it can reason properly, it shouldn't have trouble with such a problem.


How well do humans reason about 7-dimensional objects?

I'm already impressed if a computer can reason flexibly about 3-dimensional objects.


Humans with the right mathematical tooling do OK.


What percentage of humans have that mathematical tooling?

The fact that people are even raising these sorts of obscure tests shows just how far AI has advanced.


Please tell me how you would pose a question about a bunch of seven-dimensional objects that can only be stacked in a certain way.



