Deconstructing Scoring In Elasticsearch

This article explains the basics of relevance scoring in Elasticsearch (ES). Because Elasticsearch is built on Lucene, we first look at the classic TF-IDF (Term Frequency-Inverse Document Frequency) algorithm and then at BM25, which has been the default similarity algorithm since Lucene 6.0.

Introduction

Simply put, relevance ranking is the process of sorting the result documents so that those most likely to be relevant to the query appear at the top. Relevance is integral to search engines because, as end users, we want results that closely match our intended query. Information retrieval as a field contributes heavily to this process.

Elasticsearch uses one such scoring scheme to rank the documents returned by a query. The higher a document’s score, the more relevant it is to the query.

Let’s dive deep into the methods

To score documents, Elasticsearch first finds the subset of documents that match the query. Matching is binary: a document either matches the query or it does not.

Once this subset has been found, Elasticsearch scores each document in it by relevance. A document’s score is broadly a function of how its fields match the query, plus any additional modifiers such as boosting.
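The two phases described above (binary matching, then scoring only the matches) can be sketched in a few lines of Python. This is an illustrative toy, not how Elasticsearch is implemented: the names `search` and `count_scorer` are hypothetical, documents are assumed to be plain lists of already-analyzed terms, and the scorer is a placeholder that simply counts occurrences of query terms.

```python
def count_scorer(query_terms, doc_terms):
    """Placeholder scorer (an assumption): count occurrences of query terms."""
    return sum(doc_terms.count(t) for t in query_terms)


def search(query_terms, docs, score_fn=count_scorer):
    """Toy two-phase search over {doc_id: [terms]}.

    Phase 1: keep only documents that contain at least one query term.
    Phase 2: score the survivors and sort by score, highest first.
    """
    # Phase 1: binary match, a document is either in or out.
    matches = {
        doc_id: terms
        for doc_id, terms in docs.items()
        if any(t in terms for t in query_terms)
    }
    # Phase 2: score only the matching documents.
    scored = [(doc_id, score_fn(query_terms, terms)) for doc_id, terms in matches.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    1: ["quick", "brown", "fox"],
    2: ["lazy", "dog"],
    3: ["quick", "quick", "fox"],
}
results = search(["quick"], docs)
```

Document 2 never reaches the scoring phase because it fails the match, which is the point of separating the two steps: the scoring function only ever runs on documents that already matched.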

TF-IDF : Classic Method

As noted earlier, Elasticsearch is built on Lucene, so it primarily uses Lucene’s scoring function. This method was the default before Lucene 6.0. Lucene’s practical scoring formula is based mainly on the concepts of term frequency and inverse document frequency from information retrieval.


Lucene’s practical scoring formula:

score(q,d) = 
 queryNorm(q) 
 · coord(q,d) 
 · ∑ ( 
 tf(t in d) 
 · idf(t)² 
 · t.getBoost() 
 · norm(t,d) 
 ) (t in q)

Where:

q : query
d : document
t : term

So score(q,d) is the relevance score of document d for query q.
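To make the formula concrete, here is a minimal Python sketch of it. It assumes the factor definitions commonly given for Lucene's classic similarity: tf as the square root of the term's frequency in the document, idf as 1 + ln(numDocs / (docFreq + 1)), norm as 1 / sqrt(number of terms in the field), queryNorm as 1 / sqrt(sum of squared weights), and coord as the fraction of query terms found in the document. The function names and the data layout are hypothetical, and real Lucene stores norms in a lossy encoding, so actual scores will differ slightly.

```python
import math


def tf(freq):
    # Term frequency factor: square root of the raw count in the document.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms get a larger weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    # Field-length norm: shorter fields weigh each match more heavily.
    return 1.0 / math.sqrt(num_terms)


def score(query_terms, doc_terms, doc_freqs, num_docs, boosts=None):
    """Sketch of score(q,d) for a list of query terms and a list of document terms."""
    boosts = boosts or {}

    # queryNorm(q): normalizes by the sum of squared term weights.
    sum_sq = sum(
        (idf(doc_freqs.get(t, 0), num_docs) * boosts.get(t, 1.0)) ** 2
        for t in query_terms
    )
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 0.0

    # coord(q,d): rewards documents that contain more of the query terms.
    matched = [t for t in query_terms if t in doc_terms]
    coord = len(matched) / len(query_terms)

    # Sum over matching terms of tf * idf^2 * boost * norm.
    norm = field_norm(len(doc_terms))
    total = sum(
        tf(doc_terms.count(t))
        * idf(doc_freqs.get(t, 0), num_docs) ** 2
        * boosts.get(t, 1.0)
        * norm
        for t in matched
    )
    return query_norm * coord * total


# Hypothetical three-document index: "quick" appears in 2 docs, "fox" in 1.
doc_freqs = {"quick": 2, "fox": 1}
query = ["quick", "fox"]
score_a = score(query, ["the", "quick", "brown", "fox"], doc_freqs, num_docs=3)
score_b = score(query, ["the", "quick", "brown", "dog"], doc_freqs, num_docs=3)
score_c = score(query, ["a", "lazy", "dog"], doc_freqs, num_docs=3)
```

Here `score_a` comes out higher than `score_b` because the first document matches both query terms, including the rarer term "fox", while `score_c` is zero because nothing matches.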
