Correct me if I’m wrong, but you need more than just closed captions. You need precise timing too. I’d think the text needs to line up exactly with the audio, so that when the voice makes an “A” sound, the aligned text is “A” as well.
So while having the closed captions saves some of the work, there is probably much more needed to get everything lined up.
But I’m absolutely not an expert at all. In fact this is the first I’ve ever even thought about it!
Author here. Speech-to-text is more or less solved; it's easy to automatically get captions with precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio.
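For a rough idea of what that looks like, here's a minimal sketch using whisper-timestamped; the file name and model size are placeholders, not what was actually used:

```python
# Sketch: word-level timestamps with whisper-timestamped
# (https://github.com/linto-ai/whisper-timestamped).
import whisper_timestamped as whisper

audio = whisper.load_audio("episode.wav")          # any audio file ffmpeg can decode
model = whisper.load_model("small", device="cpu")  # larger models give better accuracy
result = whisper.transcribe(model, audio)

# Each segment carries per-word start/end times in seconds plus a confidence score,
# which is exactly the alignment you'd need to sync text with the audio.
for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:7.2f} -> {word["end"]:7.2f}  {word["text"]}')
```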