At what point did AI-generated human speech become so remarkably realistic?

I recall that just a couple of years ago, even the best models, like WaveNet, still had a subtle robotic quality.

What architectures or models have led to this breakthrough? Or is it possible that, as a non-native English speaker, I’m missing some nuances?