Using CLIP for search is better than direct text indexing for a variety of reasons; here, for example, because it more closely matches what Stable Diffusion sees.
Still, it's interesting to have a different view of the dataset!
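To make the difference between the two views concrete, here is a minimal sketch, not the actual search code. The captions, the 3-d vectors, and the names `keyword_search` and `embedding_search` are all made up for illustration; real CLIP embeddings would come from a CLIP model and have hundreds of dimensions.

```python
import math

# Toy, hand-written 3-d vectors standing in for CLIP embeddings.
# These values are hypothetical; they are not produced by any real model.
CAPTIONS = {
    "a photo of a kitten": [0.9, 0.1, 0.0],
    "a dog running on the beach": [0.1, 0.9, 0.1],
    "city skyline at night": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_search(query, captions):
    """Text-index style lookup: captions sharing a whole word with the query."""
    words = set(query.lower().split())
    return [c for c in captions if words & set(c.lower().split())]


def embedding_search(query_vec, captions, k=1):
    """Embedding style lookup: the k captions closest to the query vector."""
    ranked = sorted(
        captions,
        key=lambda c: cosine(query_vec, captions[c]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embedding for the query "cat", placed near the kitten vector.
QUERY_TEXT = "cat"
QUERY_VEC = [0.85, 0.15, 0.05]

text_hits = keyword_search(QUERY_TEXT, CAPTIONS)
clip_hits = embedding_search(QUERY_VEC, CAPTIONS)
```

In this toy setup the word match finds nothing for "cat" because no caption contains that exact word, while the nearest-vector lookup returns the kitten caption. The text index is still useful when you want exact words, which is the "different view" part.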
If you want to scale this out, you could use Elasticsearch.
It works for your example.
I guess I'll disable it by default, since it seems to confuse people.