
I remember helping my little sister with an entity resolution homework assignment (people’s names and company names) for a programming class 26 years ago (she is an economics major and I am CS). It was infuriating and intellectually challenging at the same time. We came up with a combination of n-grams, Levenshtein distance, and canonicalization of common abbreviations (think “Inc.” and “Corp.”). It worked reasonably well.
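For anyone curious what that kind of combination might look like, here is a rough sketch in Python. It is not the original homework code: the suffix table, the 50/50 weighting, and the 0.8 threshold are all made up for illustration, and the function names are hypothetical.

```python
import re

# Hypothetical suffix table; a real one would be far longer.
SUFFIXES = {
    "incorporated": "inc", "inc": "inc",
    "corporation": "corp", "corp": "corp",
    "company": "co", "co": "co",
    "limited": "ltd", "ltd": "ltd",
}

def canonicalize(name):
    """Lowercase, strip punctuation, and map known suffixes to one form."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

def ngrams(s, n=3):
    """Set of character n-grams, padded so word edges count too."""
    padded = f" {s} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def same_entity(a, b, threshold=0.8):
    """Assumed scoring: equal-weight blend of the two similarities."""
    ca, cb = canonicalize(a), canonicalize(b)
    score = 0.5 * ngram_similarity(ca, cb) + 0.5 * edit_similarity(ca, cb)
    return score >= threshold
```

With this, "Acme Corporation" and "ACME Corp." canonicalize to the same string and match, while "Acme Inc." and "Apple Inc." do not. The threshold would need tuning on real data; small typos in short names can easily fall below 0.8.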


This is exactly why I love this problem! There are a lot of fun ways to be creative here, but as the other comments mentioned, getting a solution that is both scalable and really good is extremely difficult.



