
Batching can lead to variance with things like batchnorm, but most transformers use layer norm, which avoids this problem.
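
A minimal PyTorch sketch (not from the thread, just an illustration of the point) showing that batchnorm in training mode makes an example's output depend on what else is in the batch, while layer norm normalizes each example on its own:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(4, 8)       # batch of 4 examples, 8 features
    other = torch.randn(3, 8)   # different examples to batch alongside x[0]

    # BatchNorm in training mode normalizes with statistics computed across
    # the batch, so the output for x[0] depends on its batch-mates.
    bn = nn.BatchNorm1d(8).train()
    a = bn(x)[0]
    b = bn(torch.cat([x[:1], other]))[0]
    print(torch.allclose(a, b))  # False: batch composition changes the result

    # LayerNorm normalizes over the feature dimension of each example alone,
    # so the result for x[0] is the same regardless of batch composition.
    ln = nn.LayerNorm(8)
    a = ln(x)[0]
    b = ln(torch.cat([x[:1], other]))[0]
    print(torch.allclose(a, b))  # True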


Batchnorm only couples examples within a batch during training; at inference it uses fixed running statistics, so batching has no effect.
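
A small sketch of that distinction (again just an assumed illustration, not code from the thread): after switching a BatchNorm module to eval mode, each example is normalized with the stored running mean/variance, so its output no longer depends on the rest of the batch.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(4, 8)
    other = torch.randn(3, 8)

    bn = nn.BatchNorm1d(8)
    bn(x)  # one training-mode pass to populate the running statistics

    # Eval (inference) mode uses the fixed running statistics instead of
    # batch statistics, so batching introduces no cross-example variance.
    bn.eval()
    a = bn(x)[0]
    b = bn(torch.cat([x[:1], other]))[0]
    print(torch.allclose(a, b))  # True: x[0]'s output is independent of the batch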



