Hacker News

How up to date are you on current open weights models? After playing around with it for a few hours I find it to be nowhere near as good as Qwen3-30B-A3B. The world knowledge is severely lacking in particular.


Agree. Concrete example: "What was the Japanese codeword for Midway Island in WWII?"

Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...

dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds

deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds

gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds

gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes!

Yeah, yeah, it's only one question of nonsense trivia. I'm sure it was billions well spent.

It's possible I'm using a poor temperature setting or something, but since they weren't bothered enough to put it in the model card, I'm not bothered to fuss with it.


I think your example reflects well on oss-20b, not poorly. It may show that they've been successful in separating reasoning from knowledge. You don't _want_ your small reasoning model to waste weights memorizing minutiae.


> gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !

To be fair, this is not the type of question that benefits from reasoning: either the model has this info in its parametric memory or it doesn't. Reasoning won't help.


Not true: During World War II the Imperial Japanese Navy referred to Midway Island in their communications as “Milano” (ミラノ). This was the official code word used when planning and executing operations against the island, including the Battle of Midway.

12.82 tok/sec 140 tokens 7.91s to first token

openai/gpt-oss-20b


What's not true? This is a wrong answer.


This was the answer from my instance. It is true. "Not true" was referring to the poster.


How would asking this kind of question without providing the model with access to Wikipedia be a valid benchmark for anything useful?


Why does it need knowledge when it can just call tools to get it?


Right... knowledge is one of the things (the one thing?) that LLMs are really horrible at, and that goes double for models small enough to run on normal-ish consumer hardware.

Shouldn't we prefer to have LLMs just search and summarize more reliable sources?


Even large hosted models fail at that task regularly. It's a silly anecdotal example, but I asked the Gemini assistant on my Pixel whether [something] had seen a new release to match the release of [upstream thing].

It correctly chose to search, pulled in the release page itself as well as a community page on Reddit, and cited both to give me the incorrect answer that a release had been pushed 3 hours ago. Later, when I got around to checking, I discovered that no release existed, neither cited source mentioned one, and a new release wasn't made for several more days.


Reliable sources that are becoming polluted by output from knowledge-poor LLMs, or overwhelmed and taken offline by constant requests from LLMs doing web scraping …


Yup, which is why these models are so exciting!

They are specifically trained on web browsing and Python calling.


Why do I need "AI" when I can just Google it (theoretically, with the Google of the good old days)?


Because now the model can do it for you and you can focus on other, more sophisticated tasks.

I am aware that there's a huge group of people who justify their salary by being able to Google.


Push your point to the absurd and you'll see why. Hint: to analyze data pulled by tools, you need knowledge already baked in. You have very limited context; you cannot just keep pulling more and more data.


I too am skeptical of these models, but it's a reasoning-focused model. As a result, this isn't a very appropriate benchmark.

Small models are going to be particularly poor when used outside of their intended purpose. They have to omit something.



