Yeah, totally, I get you. I'm not trying to do a takedown; ER is just hard.
The point I was trying to make is that at scale one does not simply:
> compare these embeddings with all the other embeddings you have
You just can't; similarity metrics (especially cosine) on 768-dimensional vectors are prohibitively slow when you apply them to every pair at that scale.
Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years; you have to do something else to refine the comparison sets first.
> You just can't; similarity metrics (especially cosine) on 768-dimensional vectors are prohibitively slow when you apply them to every pair at that scale.
Any reason you couldn't just dump it in FAISS or Annoy? No need to do pairwise comparisons. Annoy claims to handle even up to 1,000 dimensions reasonably. https://github.com/facebookresearch/faiss/issues/95
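Something like this is roughly what I have in mind with Annoy; the dimensionality, item count, and tree count below are placeholders, not tuned values:

```python
# Sketch of the "dump it in Annoy" idea: build an approximate index once,
# then pull back nearest neighbors instead of comparing every pair.
import random
from annoy import AnnoyIndex

dim = 768
index = AnnoyIndex(dim, "angular")        # angular distance tracks cosine similarity
for item_id in range(10_000):             # stand-in for real entity embeddings
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(dim)])
index.build(50)                           # 50 trees: more trees, better recall, bigger index

neighbors = index.get_nns_by_item(0, 20)  # top-20 approximate neighbors of entity 0
```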
Yeah, ANN algos provide a logical way to do blocking. It's not quite as simple as just running it though.
First, you have to be able to actually run the server. Most ANN algorithms (not DiskANN or SPTAG, but basically everything else) run in memory, so if your dataset is really big you'll need machines that can handle the load. The search engine companies and the researchers working on toy problems generally don't run into this.
Second, Approximate Nearest Neighbors are, unsurprisingly, approximate, so if you want the top N true neighbors you actually have to 1) collect the top K*N approximate neighbors, then 2) run true cosine similarity on those K*N neighbors to refine down to N.
So your new O(kn) ER algorithm (not the same k or n; sorry, I overloaded my variables) is basically this, with a rough code sketch after the list:
- set up ANN server
- for each entity, collect K*N approximate neighbors
- refine to N true neighbors
- perform matching step
- somehow handle the fact that you have merged entities in your ANN server (this can be weirdly tricky).
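Here's a rough FAISS-based sketch of the second and third bullets (the K*N over-fetch plus exact rerank); the data is random, and the index parameters, K, and N are values I made up, not recommendations:

```python
# Collect K*N approximate neighbors from an ANN index, then refine to the
# true top N with exact cosine similarity on the small candidate set.
import numpy as np
import faiss

dim, n_entities, N, K = 768, 50_000, 10, 5
embeddings = np.random.rand(n_entities, dim).astype("float32")
faiss.normalize_L2(embeddings)          # unit vectors: inner product == cosine

quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
index.train(embeddings)                 # IVF needs a training pass; the index lives in RAM
index.add(embeddings)
index.nprobe = 16                       # clusters scanned per query: recall vs. speed

def true_top_n(query_id: int) -> list[tuple[int, float]]:
    """True top-N cosine neighbors of one entity via a K*N approximate over-fetch."""
    query = embeddings[query_id : query_id + 1]
    _, approx_ids = index.search(query, K * N + 1)  # collect K*N approximate neighbors (+ self)
    cand_ids = [int(i) for i in approx_ids[0] if i not in (query_id, -1)]
    scores = embeddings[cand_ids] @ query[0]        # exact cosine on the candidates only
    keep = np.argsort(-scores)[:N]                  # refine down to the true top N
    return [(cand_ids[i], float(scores[i])) for i in keep]

# The matching step then runs only on these candidate pairs, and anything you
# merge afterwards still has to be reconciled back into the index.
```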
Also, I might be splitting hairs, but ANN isn't REALLY performing a similarity measure. It's doing clustering or building search trees that try to guarantee the leaf nodes are within some distance threshold of each other. In that sense it has more in common with a fuzzy hash or a blocking algorithm than with an exact similarity calculation.
As you may be able to tell, I have spent more time than I would like thinking about and implementing this sort of thing.