OP here. Yes, that's right. We also insert the current text's embedding on misses to expand the boundaries of the cluster.

For instance (similarity to the first text in parentheses): "I love McDonalds" (1), "I love burgers" (0.99), "I love cheeseburgers with ketchup" (?).

This is a bad example, but here the last text could land right at the similarity boundary of the first label if we had not stored the second, which could cause a cluster miss we don't want.
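To make the boundary idea concrete, here is a minimal sketch (not our actual code). Everything in it is an assumption for illustration: the `SemanticCache` class, the `embed` and `label_with_llm` callables, the 0.9 threshold, and the toy 2D "embeddings", whose angles are made up to make the effect visible and do not match the similarities in the example above.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache: nearest stored embedding above a threshold is a hit."""

    def __init__(self, embed: Callable[[str], Vector],
                 label_with_llm: Callable[[str], str],
                 threshold: float = 0.9, store_hits: bool = False):
        self.embed = embed                    # assumed embedding function
        self.label_with_llm = label_with_llm  # assumed expensive LLM call
        self.threshold = threshold            # assumed similarity cutoff
        self.store_hits = store_hits          # optionally store hits as well
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, text: str) -> Tuple[str, bool]:
        """Return (label, was_hit). On a miss, call the LLM and store the result."""
        vec = self.embed(text)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            if self.store_hits:
                self.entries.append((vec, best[1]))
            return best[1], True
        label = self.label_with_llm(text)
        self.entries.append((vec, label))  # storing the miss widens the cluster
        return label, False


def unit(deg: float) -> Vector:
    """Toy 2D unit vector at the given angle, standing in for a real embedding."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]


# Made-up vectors: each text sits a bit further from the first one.
toy = {
    "I love McDonalds": unit(0),
    "I love burgers": unit(30),
    "I love cheeseburgers with ketchup": unit(50),
}
llm_calls: List[str] = []


def fake_llm(text: str) -> str:
    """Stand-in for the real model; records each call and returns a fixed label."""
    llm_calls.append(text)
    return "fast food"


cache = SemanticCache(embed=toy.__getitem__, label_with_llm=fake_llm)
results = [cache.lookup(t) for t in toy]
print(results, "LLM calls:", len(llm_calls))
```

With these toy vectors, the third text is too far from the first to hit on its own, but it is close enough to the second. Because the second text was stored when it missed, the third one lands inside the cluster and no extra LLM call is needed.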

We only store the text on cache misses, though you could store it on hits as well. I had not considered that idea, but it makes sense. I'm not very concerned about the dataset size because vector storage is generally cheap (~$2/mo for 1M vectors), and the money saved by not generating tokens more than covers that expense.