To create specific people and spatiotemporally consistent content, we would use conditional image generation, conditioning on both the text and a seed vector, and proceed autoregressively. For a random person, we would sample a random seed vector. For a specific person, we would use 3D GAN inversion to find the seed vector that reproduces the person in a provided query image. We would also likely condition the latent space so that facial expressions can be retargeted from one person to another by manipulating the differences between latent codes.
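As a rough illustration of the two latent-space operations described above, here is a minimal sketch. It does not use a real 3D GAN: the generator is assumed to be a fixed linear map, purely so the example runs on its own, and the names `generate`, `invert`, and `retarget` are hypothetical. Inversion is shown as gradient descent on the reconstruction error, and retargeting as adding the offset between two latent codes of the source person.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, IMAGE_DIM = 8, 32

# Assumption: a toy linear "generator" standing in for a pretrained 3D GAN.
W = rng.standard_normal((IMAGE_DIM, LATENT_DIM))


def generate(z):
    """Map a seed (latent) vector to a flattened toy 'image'."""
    return W @ z


def invert(query_image, steps=2000):
    """Find a seed vector whose output matches query_image.

    Minimizes 0.5 * ||generate(z) - query_image||^2 by gradient descent.
    """
    # Step size 1/L, where L is the largest eigenvalue of W^T W.
    lr = 1.0 / np.linalg.norm(W, 2) ** 2
    z = np.zeros(LATENT_DIM)
    for _ in range(steps):
        residual = generate(z) - query_image
        z -= lr * (W.T @ residual)  # gradient of the squared error w.r.t. z
    return z


def retarget(z_target, z_source_neutral, z_source_expr):
    """Apply the source person's expression offset to the target's code."""
    return z_target + (z_source_expr - z_source_neutral)


# Random person: sample a random seed vector.
z_random = rng.standard_normal(LATENT_DIM)

# Specific person: recover the seed vector from a query image.
query_image = generate(z_random)
z_recovered = invert(query_image)

# Expression retargeting between two people.
z_other = rng.standard_normal(LATENT_DIM)
z_other_smiling = z_other + 0.5 * rng.standard_normal(LATENT_DIM)
z_random_smiling = retarget(z_random, z_other, z_other_smiling)
```

With a real nonlinear generator, inversion would typically need a perceptual loss and a good initialization, and latent differences would only transfer cleanly if the latent space is reasonably disentangled, which is the motivation for conditioning it.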