How up to date are you on current open weights models? After playing around with...

Nomadeon · 2025-08-05T22:52:05 1754434325

Agree. Concrete example: "What was the Japanese codeword for Midway Island in WWII?"

Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...

dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds

deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds

gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds

gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !

Yea yea it's only one question of nonsense trivia. I'm sure it was billions well spent.

It's possible I'm using a poor temperature setting or something, but since they weren't bothered enough to put it in the model card, I'm not bothered to fuss with it.

anorwell · 2025-08-06T00:07:30 1754438850

I think your example reflects well on oss-20b, not poorly. It may show that they've been successful in separating reasoning from knowledge. You don't _want_ your small reasoning model to waste weights memorizing minutiae.

sailingparrot · 2025-08-06T10:25:49 1754475949

> gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !

To be fair, this is not the type of question that benefits from reasoning: either the model has this info in its parametric memory or it doesn't. Reasoning won't help.

bigmanhank · 2025-08-06T08:53:27 1754470407

Not true: During World War II the Imperial Japanese Navy referred to Midway Island in their communications as “Milano” (ミラノ). This was the official code word used when planning and executing operations against the island, including the Battle of Midway.

12.82 tok/sec 140 tokens 7.91s to first token

openai/gpt-oss-20b

WmWsjA6B29B4nfk · 2025-08-06T09:15:03 1754471703

What's not true? This is a wrong answer

bigmanhank · 2025-08-07T06:21:08 1754547668

This was the answer from my instance. It is true. "Not true" was referring to the poster.

seba_dos1 · 2025-08-06T12:41:37 1754484097

How would asking this kind of question without providing the model with access to Wikipedia be a valid benchmark for anything useful?

nojito · 2025-08-05T23:40:26 1754437226

Why does it need knowledge when it can just call tools to get it?

pxc · 2025-08-06T00:03:28 1754438608

Right... knowledge is one of the things (the one thing?) that LLMs are really horrible at, and that goes double for models small enough to run on normal-ish consumer hardware.

Shouldn't we prefer to have LLMs just search and summarize more reliable sources?

jdiff · 2025-08-06T00:29:06 1754440146

Even large hosted models fail at that task regularly. It's a silly anecdotal example, but I asked the Gemini assistant on my Pixel whether [something] had seen a new release to match the release of [upstream thing].

It correctly chose to search, and pulled in the release page itself as well as a community page on reddit, and cited both to give me the incorrect answer that a release had been pushed 3 hours ago. Later on when I got around to it, I discovered that no release existed, no mention of a release existed on either cited source, and a new release wasn't made for several more days.

moodler · 2025-08-06T03:34:52 1754451292

Reliable sources that are becoming polluted by output from knowledge-poor LLMs, or overwhelmed and taken offline by constant requests from LLMs doing web scraping …

nojito · 2025-08-06T00:51:20 1754441480

Yup which is why these models are so exciting!

They are specifically trained on web browsing and Python calling.

notachatbot123 · 2025-08-06T07:42:00 1754466120

Why do I need "AI" when I can just (theoretically, with the Google of the good old days) Google it?

nojito · 2025-08-06T10:55:34 1754477734

Because now the model can do it for you and you can focus on other more sophisticated tasks.

I am aware that there’s a huge group of people who justify their salary by being able to Google.

iamnotagenius · 2025-08-06T15:33:53 1754494433

Push your point to the absurd and you'll see why. Hint: to analyze data pulled by tools, you need knowledge already baked in. You have very limited context; you cannot just pull and pull data.

kmacdough · 2025-08-06T16:45:18 1754498718

I too am skeptical of these models, but it's a reasoning-focused model. As a result, this isn't a very appropriate benchmark.

Small models are going to be particularly poor when used outside of their intended purpose. They have to omit something.