To create specific people and spatiotemporally consistent content, we would use conditional image generation, conditioning on both the text and a seed vector, and proceed autoregressively. For a random person, we would sample a random seed vector. For a specific person, we would use 3D GAN inversion to find the seed vector that reproduces the person in a provided query image. We would also likely condition the latent space so that facial expressions can be retargeted from one person to another by manipulating the differences between latent codes.
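The two latent-space operations described here, inversion and expression retargeting, can be sketched on a toy problem. This is only an illustration under strong assumptions: the "generator" is a fixed random linear map standing in for a real 3D GAN, inversion is plain gradient descent on a squared reconstruction error, and every name (`make_toy_generator`, `generate`, `invert`, `retarget_expression`) is hypothetical rather than taken from any library.

```python
import numpy as np


def make_toy_generator(image_dim=64, latent_dim=8, seed=0):
    """Return the weights of a toy linear 'generator' (assumption: stands in for a real GAN)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((image_dim, latent_dim))


def generate(weights, z):
    """Map a latent (seed) vector z to a flattened toy 'image'."""
    return weights @ z


def invert(weights, target_image, steps=500):
    """Find a latent vector whose generated image matches target_image.

    Minimizes 0.5 * ||generate(z) - target_image||^2 by gradient descent.
    """
    # Step size 1 / sigma_max^2 keeps gradient descent stable for this quadratic loss.
    lr = 1.0 / np.linalg.norm(weights, ord=2) ** 2
    z = np.zeros(weights.shape[1])
    for _ in range(steps):
        residual = generate(weights, z) - target_image
        z -= lr * (weights.T @ residual)
    return z


def retarget_expression(z_source_neutral, z_source_expressive, z_target_neutral):
    """Transfer an expression by adding the source's latent difference to the target."""
    expression_direction = z_source_expressive - z_source_neutral
    return z_target_neutral + expression_direction


# Usage: recover the seed vector behind a "query image", then move it along
# an expression direction taken from a different (hypothetical) person.
weights = make_toy_generator()
rng = np.random.default_rng(1)
z_person = rng.standard_normal(8)
query_image = generate(weights, z_person)
z_recovered = invert(weights, query_image)

z_other_neutral = rng.standard_normal(8)
z_other_smiling = z_other_neutral + 0.5 * rng.standard_normal(8)
z_person_smiling = retarget_expression(z_other_neutral, z_other_smiling, z_recovered)
```

With a real generator the inversion objective is non-convex and usually combines pixel and perceptual losses, and a simple latent difference transfers an expression cleanly only if the latent space is reasonably disentangled, which is why conditioning the latent space matters.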