Danger awaits all ye who enter this tutorial and have large datasets.
The tutorial is fun marketing material and all, but it's FAR too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example.
ER wants to be an O(n^2) problem, and you have to fight very hard to keep it from turning into one. Most people doing this at scale are following basically the same playbook:
1) Use a very complicated, speed-optimized, non-ML algorithm to find groups of entities that are highly likely to be the same, usually based on string similarity, hashing, or extremely complicated heuristics. This process is called blocking or filtering (a rough sketch follows below).
2) Use fancy ML to determine matches within these blocks.
If you try to combine 1 & 2, skip 1, or try to do 1 with ML on large datasets, you are guaranteed to have a bad time. The difference between mediocre and amazing ER is how well you do the blocking/filtering.
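For anyone who wants the shape of it, here's a minimal sketch of the playbook in Python. The blocking key (first token of the lowercased name) is a deliberately naive stand-in; real systems use phonetic codes, q-grams, hashes, or those extremely complicated heuristics:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(name: str) -> str:
    # Toy key: first token of the lowercased name.
    # Real blocking uses phonetic codes, q-grams, hashing, etc.
    return name.lower().split()[0]

def candidate_pairs(records):
    # Step 1: bucket records by blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec["name"])].append(rec)
    # Step 2 only ever compares pairs within a block, so the
    # O(n^2) blowup is limited to the size of each block.
    for bucket in blocks.values():
        yield from combinations(bucket, 2)

records = [
    {"name": "Acme Corp"}, {"name": "Acme Corporation"}, {"name": "Globex Inc"},
]
for a, b in candidate_pairs(records):
    print(a["name"], "<->", b["name"])  # feed these to the ML matcher
```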
So much this. I work on spatiotemporal entity resolution, and it requires extreme care and cleverness not to turn it into a proverbial "boiling the ocean" compute problem for all but the most trivial cases. The implied state that needs to be managed is, on its own, often effectively intractable. At one time supercomputers were used for large-scale entity resolution problems; they may still be.
The beautiful thing about spatiotemporal entities is that physical time provides "out of the box common sense" decent locality, and physical space excellent locality, for the blocking process your parent comment mentions.
It sucks to work on a whole product catalog or rolodex.
Cool, TIL about blocking/filtering. We do entity resolution such that if we have a "process id" (ex: 1234) in two different event logs, we can determine if they're the same process (since pids get reused). We don't have to compare every process to every other process, only ones that share the same pid, which drastically reduces the data size, and then we do filtering on top of that based on other optional attributes.
When I wrote this system I didn't know what ER even was, an article like this would have helped a lot, even just the first line defining ER would have helped a lot.
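If it helps anyone else, the shape of what we do is roughly this (the field names and time-window rule here are made up for illustration):

```python
from collections import defaultdict

events = [
    {"pid": 1234, "start": 100, "host": "a"},
    {"pid": 1234, "start": 100, "host": "a"},   # same process, same block
    {"pid": 1234, "start": 900, "host": "a"},   # pid reused later
    {"pid": 5678, "start": 100, "host": "b"},
]

# Blocking: only events sharing a pid are ever compared.
blocks = defaultdict(list)
for ev in events:
    blocks[ev["pid"]].append(ev)

def same_process(a, b):
    # Made-up filtering rule: same host and start times within a window.
    return a["host"] == b["host"] and abs(a["start"] - b["start"]) < 60

for pid, evs in blocks.items():
    for i in range(len(evs)):
        for j in range(i + 1, len(evs)):
            if same_process(evs[i], evs[j]):
                print(f"pid {pid}: events {i} and {j} are the same process")
```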
You are 100% correct, this is a toy example that I decided to put together for fun after talking to a bunch of people who mentioned it as a problem that they experience.
The main idea I wanted to add to the discussion (which is not that crazy of an addition) is that you can possibly use sentence embeddings, instead of fuzzy matching on the actual characters, to get more "domain expertise".
How to actually compare these embeddings with all the other embeddings you have in a large dataset is a problem that is completely out of scope for this tutorial.
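For the curious, the core idea looks roughly like this with sentence-transformers; the model name here is just a common default, not necessarily what the tutorial uses:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice
names = ["International Business Machines", "IBM Corp.", "Apple Inc."]

emb = model.encode(names, normalize_embeddings=True)

# Cosine similarity between every pair. Embeddings can capture that
# "IBM" and "International Business Machines" are related even though
# character-level fuzzy matching would score them poorly.
sim = util.cos_sim(emb, emb)
print(sim)
```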
Yeah, totally, I get you. I'm not trying to do a takedown, ER is just hard.
The point I was trying to make is that at scale one does not simply:
> compare these embeddings with all the other embeddings you have
You just can't, similarity metrics (especially cosine) on 768 dim arrays are prohibitively slow.
Using embeddings is quite common in the literature and in deployment (I have, in fact, deployed ER that uses embeddings), but only as part of #2, the matching step. In many projects, doing full pairwise comparisons would take on the order of years; you have to do something else to refine the comparison sets first.
> You just can't, similarity metrics (especially cosine) on 768 dim arrays are prohibitively slow.
Any reason you couldn't just dump it in FAISS or Annoy? No need to do pairwise comparisons. Annoy claims to handle even up to 1,000 dimensions reasonably. https://github.com/facebookresearch/faiss/issues/95
Yeah, ANN algos provide a logical way to do blocking. It's not quite as simple as just running it though.
First, you have to be able to actually run the server. Most ANN algos (not DiskANN or SPTAG, but basically everything else) are run in memory, so if you've got really beefy data you'll need machines that can handle the load. The search engine companies and researchers using toy problems generally don't run into this problem.
Second, Approximate Nearest Neighbors are, unsurprisingly, approximate, so if you want the top N true neighbors you actually have to 1) collect the top K*N approximate neighbors then 2) run true cosine similarity on the K*N neighbors to refine down to N.
So your new O(kn) ER algorithm (not the same k or n; sorry, I overloaded my variables) is basically the following, sketched in code after the list:
- set up ANN server
- for each entity, collect K*N approximate neighbors
- refine to N true neighbors
- perform matching step
- somehow handle the fact that you have merged entities in your ANN server (this can be weirdly tricky).
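A minimal sketch of that pipeline with FAISS. Caveats: a flat index is used here for brevity (which is actually exact search); a real deployment would swap in an approximate index like IVF or HNSW, and the over-fetch factor K is a tuning knob:

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)
emb = rng.standard_normal((10_000, 768)).astype("float32")
faiss.normalize_L2(emb)            # unit vectors: inner product == cosine

index = faiss.IndexFlatIP(768)     # flat for brevity; use IVF/HNSW at scale
index.add(emb)

N, K = 10, 4                       # want top-10 true neighbors; over-fetch 4x
scores, ids = index.search(emb[:3], K * N)   # candidates for a few entities

# Refine: exact cosine on the K*N candidates, keep the true top N.
for i in range(3):
    cand = ids[i]
    exact = emb[cand] @ emb[i]     # exact cosine (vectors are normalized)
    top_n = cand[np.argsort(-exact)[:N]]
    print("entity", i, "->", top_n)
```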
Also, I might be splitting hairs, but ANN isn't REALLY performing a similarity measure. It's doing clustering or building search trees that try to guarantee the leaf nodes will be close to within some threshold. In that sense it has more in common with a fuzzy hash or blocking algorithm than it does with an exact similarity calculation.
As you may be able to tell, I have spent more time than I would like thinking about and implementing this sort of thing.
My suspicion is O(kn), where k is both constant and rather large. This would be highly reliant on an extremely good blocking algorithm. Maybe trending towards O(n log n)? I don't know. It's really tough.
The blocking algorithm for this hypothetical solution would most likely be an approximate hashing algorithm sort of resembling a Bloom filter. The input data would have to be amenable to hashing, like a single string field. Add multiple fields and shit gets complicated fast. Look up q-gram blocking if you want to read a sort of simple example of a blocking algorithm like this (toy sketch below).
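Here's a toy version of q-gram blocking (q=3; real implementations add padding, key filtering, and limits on block size):

```python
from collections import defaultdict

def qgrams(s: str, q: int = 3):
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

names = ["Jonathan Smith", "Jonathon Smith", "Acme Inc.", "Acme Incorporated"]

# Every record lands in one block per q-gram, so near-duplicates
# share many blocks even when no field matches exactly.
blocks = defaultdict(set)
for idx, name in enumerate(names):
    for g in qgrams(name):
        blocks[g].add(idx)

# Candidate pairs = pairs that co-occur in at least one block.
pairs = set()
for members in blocks.values():
    ordered = sorted(members)
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            pairs.add((ordered[i], ordered[j]))
print(sorted(pairs))
```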
Thinking about this, you would end up with something like O(s^2 * b), where s is the size of your biggest block (e.g. all the "Smith"s, or company names containing "Inc." if that doesn't get removed) and b is the number of blocks. That product is at least n, so O(kn) with a big k makes some sense. If you have some way to avoid comparing every element in a block to every other element (e.g. exact string match only), I think you could get to O(s * log(s) * b), but most of the time the matches are not going to be good enough.
Entity resolution/record linkage/deduplication is an oddly specialized domain of knowledge given that it's such a common problem. I put together a page of resources a while back if anyone is interested: https://github.com/ropeladder/record-linkage-resources
I was expecting the use of sentence-transformers (sbert.net). If you have a long list of entities, you could use an approximate similarity search library such as Annoy. The authors store the embeddings in a database and decode JSON for each comparison. Very inefficient in my opinion. At least load the whole table of embeddings into an np.array from the start; np.dot is plenty fast if your list is not huge.
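For reference, the in-memory version is only a few lines (random data here so it runs; in practice you'd load the precomputed table once):

```python
import numpy as np

# In practice: load the precomputed table once, e.g. np.load("embeddings.npy").
rng = np.random.default_rng(0)
emb = rng.standard_normal((50_000, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit rows: dot == cosine

query = emb[0]
scores = emb @ query            # one matvec = all 50k cosine similarities
top10 = np.argsort(-scores)[:10]
print(top10, scores[top10])
```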
The problem is still not solved. Having a list of the most similar entities does not tell you which are the same and which are just related. You need a classifier. For that you can label a few hundred pairs of positive and negative examples and use the same sbert.net to fine-tune a transformer. The authors take the easier route of thresholding the cosine similarity score at 0.8, but that threshold might not work for your case.
I remember helping my little sister who got an entity resolution homework assignment (people's names and company names) for a programming class 26 years ago (she is an economics major and I am CS). That was infuriating and intellectually challenging at the same time. We came up with a combination of n-grams, Levenshtein distance, and canonicalization of common abbreviations (think “Inc.” and “Corp.”). It worked reasonably well.
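Reconstructing the gist of that approach as a sketch (difflib stands in for Levenshtein distance here, and the abbreviation table and threshold are invented for illustration):

```python
from difflib import SequenceMatcher

ABBREV = {"inc.": "incorporated", "corp.": "corporation", "co.": "company"}

def canonicalize(name: str) -> str:
    # Expand common abbreviations, lowercase, collapse whitespace.
    words = [ABBREV.get(w, w) for w in name.lower().split()]
    return " ".join(words)

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    # SequenceMatcher ratio as a stand-in for Levenshtein similarity.
    return SequenceMatcher(None, canonicalize(a), canonicalize(b)).ratio() >= threshold

print(similar("Acme Corp.", "Acme Corporation"))   # True
print(similar("Acme Corp.", "Globex Inc."))        # False
```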
The reason why I love this problem is because of this! I feel like there are a lot of fun ways to be creative here, but as the other comments mentioned -- to get a scalable and really good solution is extremely difficult.
What are people doing with entity resolution/record linkage? At Doximity we use it to match messy physician, hospital, and medical publication data from various sources into more coherent data sets to power profiles and research tools. Mostly with https://dedupe.io/ but with some custom tooling to handle larger (1m+ entities) datasets.
Replacing sensitive entity content with pseudorandom seeded junk for subsequent training and transforms in exposed settings. Not the main use case for ER.
To me, this looks very similar to local sequence similarity search (e.g. BLAST), where there are very rapid methods that use tuple-lookup and banded alignment to quickly identify "homologs" (the same entity). The nice thing about similarity searching algorithms is that they give you a very accurate probability of whether two strings are "homologous" (belong to the same entity). Perhaps I have the scale wrong, but it is routine to look for thousands of queries (new entities) among hundreds of millions of sequences (known entities) in an hour or so (and sequences are typically an order of magnitude longer than entity names). The problem is embarrassingly parallel, and very efficient algorithms are available for entities that are usually very similar.
BLAST is so deeply embedded in modern genome biology that it is hard to find short descriptions. Basically, there are two parts:
(1) local sequence (string) similarity scores (scores that allow mismatches, but need not extend to the end of either string) are distributed as the "extreme value distribution", which means that it is very easy to distinguish between similarity by chance and similarity because of common ancestry (I think of the names for the same entity as ancestral). This is a very powerful concept that allows one to accurately set false-positive rates (it does not help with false negatives).
(2) BLAST calculates local similarity scores at O(n^2 / K), where K is very, very large, by first identifying significant conserved "words" using a lookup table, then looking for runs of those words along an alignment diagonal, and then trying to extend that diagonal using an efficient banded dynamic programming algorithm.
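The word-lookup-plus-diagonal idea, stripped down to a toy (k=3 here; real BLAST adds scoring matrices, two-hit seeding, and banded extension):

```python
from collections import defaultdict

def kmer_index(text: str, k: int = 3):
    # Lookup table: every k-length word -> positions where it occurs.
    idx = defaultdict(list)
    for i in range(len(text) - k + 1):
        idx[text[i:i + k]].append(i)
    return idx

def seed_hits(query: str, target: str, k: int = 3):
    # Shared words, grouped by the diagonal (q_pos - t_pos) they sit on.
    # Runs of hits on one diagonal are candidates for banded extension.
    idx = kmer_index(target, k)
    hits = defaultdict(list)
    for qpos in range(len(query) - k + 1):
        for tpos in idx.get(query[qpos:qpos + k], []):
            hits[qpos - tpos].append(qpos)
    return dict(hits)

print(seed_hits("jonathan smith", "jonathon smyth"))
```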
this doesn't make any sense though, your write-up starts with an assumption that multiple records in the dataset are "obviously" the same entity ... so we wouldn't even need entity resolution then ...
"entity resolution" as a process should really imply how to determine whether two records that aren't obviously the same entity are actually the same entity ... and how you would go about discovering, proving, and declaring that ...