
This looks like it was posted on Reddit 10 years ago:

https://www.reddit.com/r/math/comments/32m611/logic_question...

So it’s likely that it’s part of the training data by now.



You'd think so, but both Google's AI Overview and Bing's Copilot output wrong answers.

Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."

Whoa. Math is not AI's strong suit...

Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."

Hats???

Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.


That's a non sequitur; they would be stupid to run an expensive _L_LM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.


Yes, putting an expensive LLM response atop each search query would be quite stupid.

You know what would be even stupider? Putting a cheap, wrong LLM response atop each search query.


Google placed its "AI overview" answer at the top of the page.

The second result is this reddit.com answer, https://www.reddit.com/r/math/comments/32m611/logic_question..., where at least the numbers make sense. I haven't examined the logic portion of the answer.

Bing doesn't list any Reddit posts (that Google-exclusive deal), so I'll assume no Stack Exchange-related sites have an appropriate answer (or Bing is only looking for hat-related answers for some reason).


I might have phrased that poorly. With _L_ (or L as intended), I meant their state-of-the-art model, which I presume Gemini 2.5 is (I haven't gotten around to TFA yet). Not sure if this question is just about model size.

I'm eagerly awaiting an article about RAG caching strategies though!


The riddle has a different variant with hats: https://erdos.sdslabs.co/problems/5


There are three toddlers on the floor. You ask them a hard mathematical question. One of the toddlers plays around with pieces of paper on the ground and happens to raise one that has the right answer written on it.

- This kid is a genius! - you yell

- But wait, the kid just picked an answer up off the ground; it didn't actually come up with it...

- But the other toddlers could have done that too, and didn't!


Other models aren't able to solve it, so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65, and Gemini is still able to properly reason through it.


I'm sure you're right that it's more than just it being in the training data, but the fact that it's in the training data means you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.

There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:

* Random chance (these are still statistical machines after all)

* The problem resurfaced recently and shows up more often than it used to.

* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.


Google Gemini 2.5 is able to search the web, so if you're able to find the answer on reddit, maybe it can too.


I think there’s a big push to train LLMs on maths problems - I used to get spammed on Reddit with ads for data tagging and annotation jobs.

Recently these have stopped, and now the ads are about becoming a maths tutor to AI.

Doesn’t seem like a role with long-term prospects.


Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving it (the "answer") is online, and the model has surely seen it.


It gets it wrong when you give it 728. It claims (728, 182, 546). I won't share the answer so it won't appear in the next training set.


With 728 the puzzle doesn't work, since it's divisible by 8.


But then the AI should tell you that, too, if it really understands the problem?


Fair; the question is what possible solutions exist.


This whole answer hinges on knowing that 0 is not a positive integer; that's why I couldn't figure it out...


Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that an LLM can reason. It just regurgitates the input.


Have you given a reasoning model a novel problem and watched its chain of thought process?


I think it can reason. At least if it can work in a loop ("thinking"). It's just that this reasoning is far inferior to human reasoning, despite what some people hastily claim.


I would say that 99.99% of humans do the same. Most people never come up with anything novel.


I would say maybe about 80%, certainly not 99.99%. In college I saw that some people could only solve problems that were pretty much the same as ones they had already seen, while some could easily come up with solutions to complex problems they had never seen before. In my opinion, no human at age 20 can have had the amount of input an LLM has today, and still 20-year-olds come up with very new ideas pretty often (new in the sense that they haven't seen that or anything like it before). Of course, some people are more and some less creative/intelligent...


Reasoning != coming up with something novel.


And if it wasn’t, it is now


[flagged]


Is there a reason for the downvotes here? We can see that having the answer in the training data doesn't help. If it's in there, what's that supposed to show?


It's entirely unclear what you are trying to get across, at least to me.

Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates and why is frowned upon here. I don't think your comment does a great job of the latter.


>> So it’s likely that it’s part of the training data by now.

> I don't think this means what you think it means.

> I did some interacting with the Tencent model that showed up here a couple days ago [...]

> This is a question that obviously was in the training data. How do you get the answer back out of the training data?

What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.

How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.


Your intentions are good, but your execution is poor.

I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.


> How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese.

I can't, and I imagine most of the people who downvoted you couldn't either.

I think asking people to go to Google Translate to parse a random comment that seems to be 90% LLM output by volume is a bit much.


I have translated the Chinese. I still have no idea what point you're trying to make. You ask it questions about some kind of band, and it answers. Are you saying the answers are wrong?


No clue. Perhaps people object to the untranslated Chinese?


> Is there a reason for the downvotes here?

I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from the text you provided.



