Ideally, in that scenario you'd have a model that unified vision, language, and an understanding of 'doing things' and manipulating objects. So it wouldn't just be an LLM, it would be a language-vision-doing-things model. There's no reason why we can't build one. A toy sketch of what that interface might look like is below.
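
Just to make the idea concrete, here's a minimal sketch (in PyTorch, with made-up layer sizes, a placeholder token vocabulary, and a discrete action space I invented for illustration) of a model whose forward pass takes an image plus a language instruction and outputs an action. Real systems of this kind are vastly more elaborate; this only shows the shape of the interface.

```python
import torch
import torch.nn as nn

class VisionLanguageActionModel(nn.Module):
    """Toy sketch: fuse image and text features, then predict an action.

    All sizes, encoders, and the action space are hypothetical placeholders,
    not a reference to any real system.
    """

    def __init__(self, vocab_size=1000, embed_dim=256, num_actions=16):
        super().__init__()
        # Toy vision encoder: a small conv stack pooled to one vector.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Toy language encoder: embed tokens and mean-pool them.
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        # The 'doing things' head: maps the fused embedding to action logits.
        self.action_head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_actions),
        )

    def forward(self, image, tokens):
        img_feat = self.vision(image)                    # (B, embed_dim)
        txt_feat = self.token_embed(tokens).mean(dim=1)  # (B, embed_dim)
        fused = torch.cat([img_feat, txt_feat], dim=-1)  # joint representation
        return self.action_head(fused)                   # logits over actions

if __name__ == "__main__":
    model = VisionLanguageActionModel()
    image = torch.randn(1, 3, 64, 64)         # one RGB frame
    tokens = torch.randint(0, 1000, (1, 8))   # one tokenised instruction
    print(model(image, tokens).shape)         # torch.Size([1, 16])
```

The point isn't the specific layers, it's that nothing structural stops you from bolting an action head onto a joint vision-language representation and training the whole thing end to end.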