Semantic search is an approach to information retrieval that focuses on the meaning of sentences rather than on conventional keyword matching. Although many text embeddings can be used for this purpose, scaling them up to build low-latency APIs that fetch results from a huge collection of data is seldom discussed. In this article, I will discuss how we can implement a minimal semantic search engine using SOTA sentence embeddings (sentence transformers) and FAISS.

Sentence transformers

Sentence transformers is a framework, or set of models, that gives dense vector representations of sentences or paragraphs. These models are transformer networks (BERT, RoBERTa, etc.) fine-tuned specifically for the task of semantic textual similarity, since BERT doesn’t perform well out of the box on these tasks. Given below is the performance of different models on the STS benchmark.

[Image: performance of different models on the STS benchmark. Image source: Sentence transformers]

We can see that the Sentence transformer models outperform the other models by a large margin.

But if you look at the leaderboards on Papers with Code and GLUE, you will see many models scoring above 90. So why do we need sentence transformers?

Well, in those models, semantic textual similarity is treated as a regression task. This means that whenever we need to calculate the similarity score between two sentences, we need to pass them into the model together, and the model outputs a numerical score for the pair. While this works well for benchmarking, it scales badly for a real-life use case, for the following reasons.

  1. When you need to search over, say, 10k documents, you would need to perform 10k separate inference computations per query; it's not possible to compute the embeddings separately and then calculate just the cosine similarity. See the author’s explanation.
  2. The maximum sequence length (the total number of words/tokens the model can take in one pass) is shared between the two documents, which causes the representations to be diluted due to chunking.
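
To make the scaling argument concrete, here is a minimal sketch of the alternative that sentence embeddings allow: embed every document once, then run a single encoder pass per query and compare vectors with cosine similarity. The `CountingEncoder` class is a hypothetical toy stand-in (a plain bag-of-words counter) and not the sentence-transformers API; it exists only so that the number of encoder calls can be counted. A pairwise regression model would instead need one model pass per document for every query.

```python
import math
import re
from collections import Counter


class CountingEncoder:
    """Hypothetical toy encoder (bag of words); counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


documents = [
    "FAISS builds indexes for fast vector search",
    "BERT is a transformer network",
    "Cosine similarity compares two embeddings",
]

encoder = CountingEncoder()

# Offline step: each document is embedded exactly once.
doc_vectors = [encoder.encode(d) for d in documents]
calls_after_indexing = encoder.calls


def search(query, top_k=2):
    """One encoder pass for the query, then cheap vector comparisons."""
    q = encoder.encode(query)
    scored = sorted(
        ((cosine(q, v), i) for i, v in enumerate(doc_vectors)), reverse=True
    )
    return [(documents[i], score) for score, i in scored[:top_k]]


results = search("fast vector search with FAISS")
calls_per_query = encoder.calls - calls_after_indexing

print(results[0][0])
print("encoder passes for this query:", calls_per_query)
print("passes a pairwise model would need:", len(documents))
```

With a real sentence encoder in place of the toy one, the structure stays the same: the expensive model runs once per document at indexing time and once per query at search time.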


Billion-scale semantic similarity search with FAISS+SBERT