It's for hidden layers and not for every parameter:
From Keller Jordan's Muon GitHub page:
"Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."
And I just looked at the nanochat repo, and that's also how it's used there:
"Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."
And I just looked into this nanochat repo and it's also how it's used here.
https://github.com/karpathy/nanochat/blob/dd6ff9a1cc23b38ce6...
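
Concretely, the split looks something like this in PyTorch. This is a minimal sketch, not nanochat's actual code: the `Muon` import and its constructor signature here are assumptions loosely based on the repo's README, and `ToyLM` is a made-up model for illustration.

    import torch
    import torch.nn as nn

    # Assumed import/API -- Keller Jordan's Muon repo exposes an optimizer
    # class, but treat the exact name and signature here as an assumption.
    from muon import Muon

    class ToyLM(nn.Module):
        def __init__(self, vocab=1000, dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)   # -> AdamW
            self.hidden = nn.Sequential(            # -> Muon (2-D weights)
                nn.Linear(dim, dim), nn.GELU(),
                nn.Linear(dim, dim), nn.GELU(),
            )
            self.lm_head = nn.Linear(dim, vocab)    # -> AdamW

        def forward(self, idx):
            return self.lm_head(self.hidden(self.embed(idx)))

    model = ToyLM()

    # Muon gets only the >= 2-D weight matrices of the hidden stack;
    # everything else (embedding, classifier head, and any 1-D
    # gains/biases) goes to plain AdamW, per the quote above.
    muon_params  = [p for p in model.hidden.parameters() if p.ndim >= 2]
    adamw_params = [p for p in model.hidden.parameters() if p.ndim < 2]
    adamw_params += list(model.embed.parameters())
    adamw_params += list(model.lm_head.parameters())

    opt_muon  = Muon(muon_params, lr=0.02)          # assumed constructor
    opt_adamw = torch.optim.AdamW(adamw_params, lr=3e-4)

    # Each training step then steps both optimizers:
    #   loss.backward()
    #   opt_muon.step(); opt_adamw.step()
    #   opt_muon.zero_grad(); opt_adamw.zero_grad()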