Yeah, totally, I get you. I'm not trying to do a takedown; ER is just hard.
The point I was trying to make is that at scale one does not simply:
> compare these embeddings with all the other embeddings you have
You just can't; similarity metrics (especially cosine) on 768-dimensional vectors are prohibitively slow when you apply them to every pair at that scale.
Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years; you have to do something else to refine the comparison sets first.
> You just can't; similarity metrics (especially cosine) on 768-dimensional vectors are prohibitively slow when you apply them to every pair at that scale.
Any reason you couldn't just dump it in FAISS or Annoy? No need to do pairwise comparisons. Annoy claims to handle even up to 1,000 dimensions reasonably. https://github.com/facebookresearch/faiss/issues/95
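Something like this is roughly what I have in mind with Annoy; the dimensionality, item count, and tree count below are placeholders, not tuned values:

```python
# Sketch of the "dump it in Annoy" idea: build an approximate index once,
# then pull back nearest neighbors instead of comparing every pair.
import random
from annoy import AnnoyIndex

dim = 768
index = AnnoyIndex(dim, "angular")        # angular distance tracks cosine similarity
for item_id in range(10_000):             # stand-in for real entity embeddings
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(dim)])
index.build(50)                           # 50 trees: more trees, better recall, bigger index

neighbors = index.get_nns_by_item(0, 20)  # top-20 approximate neighbors of entity 0
```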
Yeah, ANN algos provide a logical way to do blocking. It's not quite as simple as just running it though.
First, you have to be able to actually run the server. Most ANN algorithms (not DiskANN or SPTAG, but basically everything else) run in memory, so if your dataset is really big you'll need machines that can handle the load. The search engine companies and the researchers working on toy problems generally don't run into this.
Second, Approximate Nearest Neighbors are, unsurprisingly, approximate, so if you want the top N true neighbors you actually have to 1) collect the top K*N approximate neighbors, then 2) run true cosine similarity on those K*N neighbors to refine down to N.
So your new O(kn) ER algorithm (not the same k or n; sorry, I overloaded my variables) is basically this, with a rough code sketch after the list:
- set up ANN server
- for each entity, collect K*N approximate neighbors
- refine to N true neighbors
- perform matching step
- somehow handle the fact that you have merged entities in your ANN server (this can be weirdly tricky).
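Here's a rough FAISS-based sketch of the second and third bullets (the K*N over-fetch plus exact rerank); the data is random, and the index parameters, K, and N are values I made up, not recommendations:

```python
# Collect K*N approximate neighbors from an ANN index, then refine to the
# true top N with exact cosine similarity on the small candidate set.
import numpy as np
import faiss

dim, n_entities, N, K = 768, 50_000, 10, 5
embeddings = np.random.rand(n_entities, dim).astype("float32")
faiss.normalize_L2(embeddings)          # unit vectors: inner product == cosine

quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
index.train(embeddings)                 # IVF needs a training pass; the index lives in RAM
index.add(embeddings)
index.nprobe = 16                       # clusters scanned per query: recall vs. speed

def true_top_n(query_id: int) -> list[tuple[int, float]]:
    """True top-N cosine neighbors of one entity via a K*N approximate over-fetch."""
    query = embeddings[query_id : query_id + 1]
    _, approx_ids = index.search(query, K * N + 1)  # collect K*N approximate neighbors (+ self)
    cand_ids = [int(i) for i in approx_ids[0] if i not in (query_id, -1)]
    scores = embeddings[cand_ids] @ query[0]        # exact cosine on the candidates only
    keep = np.argsort(-scores)[:N]                  # refine down to the true top N
    return [(cand_ids[i], float(scores[i])) for i in keep]

# The matching step then runs only on these candidate pairs, and anything you
# merge afterwards still has to be reconciled back into the index.
```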
Also, I might be splitting hairs, but ANN isn't REALLY performing a similarity measure. It's doing clustering or building search trees that try to guarantee the leaf nodes are within some distance threshold of each other. In that sense it has more in common with a fuzzy hash or a blocking algorithm than with an exact similarity calculation.
As you may be able to tell, I have spent more time than I would like thinking about and implementing this sort of thing.