
I haven't heard of this before. Has Muon dethroned Adam and AdamW as the standard general purpose optimizer for deep learning?


It's for hidden layers, not for every parameter. From Keller's Muon GitHub page:

"Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."

And I just looked into the nanochat repo, and that's also how it's used there.

https://github.com/karpathy/nanochat/blob/dd6ff9a1cc23b38ce6...
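A minimal sketch of what that split looks like in practice: 2-D weight matrices of interior layers go to Muon, while embeddings, the classifier head, and 1-D gains/biases go to AdamW. The `Muon` import and its constructor arguments are assumptions stood in for whatever the installed muon package actually exposes; only the AdamW side is plain PyTorch.

```python
import torch
from torch import nn

# Hypothetical import; the real class name / signature depends on the muon package you install.
# from muon import Muon

class TinyMLP(nn.Module):
    def __init__(self, vocab=1000, dim=64, classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)   # embedding       -> AdamW
        self.hidden = nn.Linear(dim, dim)       # hidden weight   -> Muon, its bias -> AdamW
        self.head = nn.Linear(dim, classes)     # classifier head -> AdamW

    def forward(self, x):
        return self.head(torch.relu(self.hidden(self.embed(x))).mean(dim=1))

model = TinyMLP()

muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    # Only 2-D weight matrices of interior ("hidden") layers are candidates for Muon.
    if p.ndim >= 2 and name.startswith("hidden"):
        muon_params.append(p)
    else:
        adamw_params.append(p)  # embeddings, head, biases/gains

# Two optimizers run side by side; step both each iteration.
# muon = Muon(muon_params, lr=0.02, momentum=0.95)  # assumed signature
adamw = torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.01)
```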



