
It's for hidden-layer weights, not for every parameter. From Keller Jordan's Muon GitHub page:

"Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."

And I just looked at the nanochat repo; that's also how it's used there.

https://github.com/karpathy/nanochat/blob/dd6ff9a1cc23b38ce6...
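A minimal sketch of that split (parameter names below are illustrative, not taken from nanochat): 2-D hidden weight matrices go to Muon, while embeddings, the classifier head, and 1-D gains/biases go to AdamW.

```python
# Hypothetical parameter partition following the Muon README's rule:
# Muon for hidden weight matrices, AdamW for everything else.
def split_params(named_params):
    """named_params: iterable of (name, ndim) pairs. Returns (muon, adamw) name lists."""
    muon, adamw = [], []
    for name, ndim in named_params:
        # Muon is meant only for >=2-D hidden-layer weights, so exclude
        # embeddings and the classifier head even though they are matrices.
        is_hidden_matrix = ndim >= 2 and "embed" not in name and "head" not in name
        (muon if is_hidden_matrix else adamw).append(name)
    return muon, adamw

# Illustrative transformer-style parameter list (name, number of dims).
params = [
    ("embed.weight", 2),        # embedding -> AdamW
    ("blocks.0.attn.w_q", 2),   # hidden weight matrix -> Muon
    ("blocks.0.mlp.w_in", 2),   # hidden weight matrix -> Muon
    ("blocks.0.norm.gain", 1),  # gain/bias -> AdamW
    ("head.weight", 2),         # classifier head -> AdamW
]
muon_params, adamw_params = split_params(params)
```

In a real training script each group would then be handed to its own optimizer instance; the string matching here is only a stand-in for however the model exposes its parameter groups.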


