Do you know when this was introduced (or which paper)? AFAIK it's not done that way in the original transformer paper, or in BERT/GPT-2.


All the Llamas have done it (well, 2 and 3, and I believe 1, I don't know about 4). I think they have a citation for it, though it might just be the RoPE paper (https://arxiv.org/abs/2104.09864).

I'm not actually aware of any model that doesn't do positional embeddings on a per-layer basis (excepting BERT and the original transformer; I haven't read the GPT-2 paper in a while, so I'm not sure about that one either).
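
Roughly what "per-layer" means here, as a toy numpy sketch (made-up sizes, single head, no projections; not any particular model's code): nothing position-related gets added at the input, and instead every attention layer rotates its own queries and keys by position.

    import numpy as np

    rng = np.random.default_rng(0)
    seq_len, dim, n_layers = 8, 16, 4        # toy sizes
    h = rng.normal(size=(seq_len, dim))      # stand-in token embeddings; no position added here

    def rope(v, base=10000.0):
        # rotate each (even, odd) channel pair by an angle that grows with position
        pos = np.arange(v.shape[0])[:, None]
        freq = base ** (-np.arange(0, v.shape[1], 2) / v.shape[1])
        theta = pos * freq
        cos, sin = np.cos(theta), np.sin(theta)
        out = np.empty_like(v)
        out[:, 0::2] = v[:, 0::2] * cos - v[:, 1::2] * sin
        out[:, 1::2] = v[:, 0::2] * sin + v[:, 1::2] * cos
        return out

    def attention_block(h):
        # position information enters *inside* the layer, via the rotation of q and k
        q, k, v = h, h, h                    # skipping the linear projections for brevity
        q, k = rope(q), rope(k)
        scores = q @ k.T / np.sqrt(h.shape[1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return h + w @ v                     # residual connection

    for _ in range(n_layers):
        h = attention_block(h)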


Thanks! I'm not super up to date on all the ML stuff :)


Should be in the RoPE paper. The OG transformer used additive sinusoidal embeddings (added to the input once), while RoPE applies a pairwise rotation to the queries and keys.
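
For contrast, the additive side looks roughly like this (toy sketch following the sin/cos formulas from the original paper; sizes are made up): the table is built once and added to the token embeddings at the input, and the layers never see positions again.

    import numpy as np

    def sinusoidal_table(seq_len, dim, base=10000.0):
        # fixed sin/cos table from "Attention Is All You Need"
        pos = np.arange(seq_len)[:, None]
        freq = base ** (-np.arange(0, dim, 2) / dim)
        table = np.zeros((seq_len, dim))
        table[:, 0::2] = np.sin(pos * freq)
        table[:, 1::2] = np.cos(pos * freq)
        return table

    tok = np.random.default_rng(0).normal(size=(8, 16))   # stand-in token embeddings
    h = tok + sinusoidal_table(8, 16)                     # positions enter here, exactly once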

There's also NoPE; I think SmolLM3 "uses NoPE" (i.e. no positional information at all) in every fourth layer.
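
In sketch form it's just a per-layer flag; the layout below is hypothetical, so check the SmolLM3 config for the real schedule.

    import numpy as np

    def block(h, rotate_qk=None):
        # toy attention block: with rotate_qk it's a normal RoPE layer,
        # with None it gets no position signal at all (the NoPE case)
        q = k = v = h
        if rotate_qk is not None:
            q, k = rotate_qk(q), rotate_qk(k)
        scores = q @ k.T / np.sqrt(h.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return h + w @ v

    def forward(h, rotate_qk, n_layers=8):
        # rotate_qk: any RoPE-style rotation, e.g. the rope() sketch upthread
        for i in range(n_layers):
            use_rope = (i % 4 != 3)          # hypothetical: drop positions in every 4th layer
            h = block(h, rotate_qk if use_rope else None)
        return h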


This is normal. RoPE was introduced after BERT/GPT-2.



