So the problem with that is you're visualising the space using only points that exist in the image dataset. The language embedding carries extra information from language that isn't contained in the images.
It handles "bad", and it handles "anatomy". If there are no single images that cover that, that's exactly what language embeddings solve for.
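Rough toy sketch of what I mean, nothing to do with the real model: the 3-d vectors, the axis meanings and the file names below are all made up for illustration. The only point is that a text query can sit somewhere in the space that no single image occupies, and you can still rank images against it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical image embeddings. Axes are invented: (quality, anatomy, other).
image_index = {
    "blurry_landscape.jpg": [-0.9, 0.0, 0.4],  # "bad", but no anatomy
    "anatomy_diagram.jpg": [0.8, 0.9, 0.1],    # anatomy, but not "bad"
    "cat_photo.jpg": [0.5, 0.0, 0.9],          # neither
}

# Hypothetical text embedding for "bad anatomy": low quality AND anatomy.
text_query = [-0.8, 0.9, 0.0]

def rank(query, index):
    """Return image names sorted from most to least similar to the query."""
    return sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)

ranking = rank(text_query, image_index)
best_score = cosine(text_query, image_index[ranking[0]])
```

In this toy index no image lines up with the query, so even the top match scores well below 1. If you only plot the image points, that region of the space just looks empty.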
Try it on clip-front: https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2...