In the first post in this series, we built a search engine in just a few lines of code, powered by BM25, the algorithm used in many of the largest enterprise search engines today.

In this post, we want to go beyond this and create a truly smart search engine. We will describe the process and provide template code that can be applied to any dataset.

But what do we mean by ‘smart’? We are defining this as a search engine which is able to:

  • Return relevant results even when the user has not searched for the specific words contained in those results.
  • Be location aware: understand UK postcodes and the geographic relationships between towns and cities in the UK.
  • Scale to larger datasets (we will move to a dataset of 212k records, larger than in our previous example, but we need to be able to scale to much larger data still).
  • Be orders of magnitude faster than our last implementation, even when searching over large datasets.
  • Handle spelling mistakes, typos and previously ‘unseen’ words in an intelligent way.

In order to achieve this, we will need to combine a number of techniques:

  • fastText word vectors. We will train a fastText model on our dataset to create vector representations of words.
  • BM25. We will still use this algorithm to power our search, but we will now apply it to our word vector representations.
  • Superfast searching of our results using the lightweight and highly efficient Non-Metric Space Library (NMSLIB).

This will look something like the following:

[Figure: an overview of the pipeline we will be creating in this post]

This article will walk through each of these areas and describe how they can be brought together to create a smart search engine.
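
To make this concrete before we dive in, here is a minimal sketch of how the three pieces could fit together. It assumes gensim for the fastText model, the rank_bm25 package for the BM25 weights and nmslib for the nearest-neighbour index; the example documents, the BM25-idf weighting of the word vectors and all parameter values are illustrative placeholders rather than the exact implementation we will build later.

```python
# A minimal sketch of the pipeline: fastText vectors + BM25 weighting + NMSLIB.
# Library choices (gensim, rank_bm25, nmslib) and all values are illustrative.
import numpy as np
import nmslib
from gensim.models import FastText
from rank_bm25 import BM25Okapi

documents = [
    "plumber covering leeds and york",
    "electrician available in central london",
    "roofing specialist based in manchester",
]
tokenized = [doc.split() for doc in documents]

# 1. Train fastText word vectors on our own corpus (subword information
#    lets the model build vectors even for typos and unseen words).
ft_model = FastText(sentences=tokenized, vector_size=100, window=5, min_count=1)

# 2. Fit BM25 so we can weight each word's vector by its importance.
bm25 = BM25Okapi(tokenized)

def doc_vector(tokens):
    """Idf-weighted average of the fastText vectors of a document's words."""
    weights = np.array([bm25.idf.get(tok, 1.0) for tok in tokens])
    vectors = np.array([ft_model.wv[tok] for tok in tokens])
    return np.average(vectors, axis=0, weights=weights)

doc_matrix = np.vstack([doc_vector(toks) for toks in tokenized])

# 3. Index the document vectors in NMSLIB (HNSW graph, cosine similarity).
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(doc_matrix)
index.createIndex({"post": 2})

# Query: embed the search terms the same way and ask the index for neighbours.
query = "electrician london".split()
query_vec = np.mean([ft_model.wv[tok] for tok in query], axis=0)
ids, distances = index.knnQuery(query_vec, k=2)
for i, dist in zip(ids, distances):
    print(f"{documents[i]}  (distance: {dist:.3f})")
```

Because the query is embedded with the same word vectors, searching for "sparky in the capital" could still surface the London electrician even though none of those words appear in the record; the NMSLIB index is what keeps that lookup fast as the dataset grows.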

