
> "DALL-E 2's works very simply: ... a model called the prior maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt contained in the text encoding. Finally, an image decoding model stochastically generates an image which is a visual manifestation of this semantic information."

> "The fundamental principles of training CLIP are quite simple: First, all images and their associated captions are passed through their respective encoders, mapping all objects into an m-dimensional space."

Not scared to admit I don't find this simple at all, and I'm probably not in the target audience. I'd love a description that doesn't assume machine learning basics. Is there one?



https://ml.berkeley.edu/blog/posts/dalle2/

it's "simple" because how it works is "just" brute-fucking-force. of course coming up with the architecture and making it fast (so it scales up well) is the challenge.

And scaling works... because... well, no one knows why (but likely because it's just a nice architecture for learning; evolution also converged on something similar without knowing why).

See also: https://www.gwern.net/Scaling-hypothesis



