Just a couple of things:

- We are actually using a weighted average.

- This is just one of the techniques used for recall. The aim at this point is to be super fast and scalable, and not super accurate. Once we reach the precision stage of the ranking (as defined in https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...), we can afford to do fancier matching.
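Roughly, the split looks like this (an illustrative sketch only, not our production code; the cosine scoring, the top-k cutoff and the expensive_score callback are placeholders):

    import numpy as np

    def recall_stage(query_vec, page_vecs, k=1000):
        # Cheap and scalable: one matrix-vector product against precomputed
        # page embeddings, keep the top-k most similar pages as candidates.
        sims = page_vecs @ query_vec / (
            np.linalg.norm(page_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
        return np.argsort(-sims)[:k]

    def precision_stage(query, candidate_ids, expensive_score):
        # Accurate but slow: the fancier matching only runs on the recalled
        # candidates, never on the full index.
        return sorted(candidate_ids, key=lambda pid: -expensive_score(query, pid))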

/ One of the authors



Weighted based on what? Do you keep IDF values of some sort?

Even then it's hard to imagine how this performs well if these are plain Word2vec vectors. Saying it's just the recall step is a bit hand-wavy, as this step actually selects the documents you will perform additional scoring on and may very well end up excluding a multitude of great results.
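For reference, the kind of weighting I have in mind is something like an IDF-weighted average over the per-token vectors (purely a guess at what "weighted" could mean here, not their actual scheme):

    import math
    import numpy as np

    def weighted_query_vector(tokens, word_vectors, doc_freq, n_docs):
        # word_vectors: token -> np.ndarray, doc_freq: token -> document frequency.
        # Rare tokens get larger weights, so "jaguar" dominates "the" in the average.
        vecs, weights = [], []
        for tok in tokens:
            if tok in word_vectors:
                vecs.append(word_vectors[tok])
                weights.append(math.log((1 + n_docs) / (1 + doc_freq.get(tok, 0))))
        if not vecs:
            return None
        return np.average(np.stack(vecs), axis=0, weights=weights)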

In any case, these are once again very interesting to read, and as a search nerd I can't help but wonder about all the alternatives that were considered.


We do have our own custom word (piece) embeddings that we have trained on <good query, bad query>-pairs. There are a few more details about them in https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr....
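Schematically, training embeddings from such pairs can look like the following margin-ranking setup (a simplified illustration; the reference query, the bag-of-pieces encoder and the loss are stand-ins, not our exact objective):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PieceEncoder(nn.Module):
        # Averages learned word-piece embeddings into one normalized query vector.
        def __init__(self, vocab_size, dim=128):
            super().__init__()
            self.emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")

        def forward(self, piece_ids, offsets):
            return F.normalize(self.emb(piece_ids, offsets), dim=-1)

    def pair_loss(encoder, ref, good, bad, margin=0.2):
        # Push the reference query closer to the "good" query than to the
        # "bad" one by at least `margin` (each argument is (piece_ids, offsets)).
        s_good = (encoder(*ref) * encoder(*good)).sum(dim=-1)
        s_bad = (encoder(*ref) * encoder(*bad)).sum(dim=-1)
        return F.relu(margin - s_good + s_bad).mean()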



