
I'm just a layman, but I don't think anyone really expected, or knows, _why_ just stacking a bunch of attention layers works so well. It's not immediately obvious that doing well at predicting a masked token will somehow "generalize" into being able to give coherent answers to prompts. You can sort of squint and try to handwave it, but if it were obvious that this would work, you'd think people would have experimented with it before 2018.
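For concreteness, the objective itself really is that simple: the model is only ever graded on filling in a blanked-out token. Here is a minimal, hypothetical sketch in toy numpy (the `toy_model` stand-in and the tiny vocabulary are made up for illustration, not any real model or library API):

    # Hypothetical sketch of the masked-token objective. The "model" here is a
    # stand-in for a stack of attention layers; the point is only that the
    # entire training signal is cross-entropy on a blanked-out position.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "cat", "sat", "on", "mat", "[MASK]"]
    V = len(vocab)

    def toy_model(tokens):
        # Stand-in for the attention stack: produce logits over the vocabulary
        # for every position (here, just random weights).
        W = rng.normal(size=(V, V))
        one_hot = np.eye(V)[tokens]
        return one_hot @ W  # shape (seq_len, V)

    def masked_lm_loss(tokens, mask_pos, target):
        # Cross-entropy at the masked position only -- the whole pretraining signal.
        logits = toy_model(tokens)[mask_pos]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return -np.log(probs[target])

    # "the cat [MASK] on the mat" -> predict "sat"
    ids = [0, 1, 5, 3, 0, 4]
    print(masked_lm_loss(ids, mask_pos=2, target=2))

Answering a prompt at inference time is the same machinery run repeatedly (predict a token, append it, predict again); the surprising part the comment points at is that this local objective yields globally coherent text at all.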

