In this article, I’ll show you how to explore indices and compute deeply nested aggregations on data indexed in Elasticsearch, using the pandagg library.

After explaining the motivations for writing this library (head to the “Let’s go” section if you are in a hurry), we’ll work on the IMDB dataset and compute aggregations that answer questions requiring queries of increasing complexity.

This article assumes you have a basic knowledge of Elasticsearch concepts.

All concepts covered here are explained in more detail in the library documentation. The GitHub repository is available here.


Motivations

Elasticsearch provides a powerful API to compute aggregated metrics on your indexed data: aggregations. One of its killer features is the ability to nest aggregation clauses, using the aggs parameter available in bucket aggregations.

{
    "per_genre": {
        "terms": {"field": "genres", "size": 3},
        "aggs": {
            "rating_average": {"avg": {"field": "rank"}},
            "nb_roles_average": {"avg": {"field": "nb_roles"}}
        }
    }
}

But if you have already tried to run deeply nested aggregations, you may have struggled to parse the output of your query.
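To make the parsing problem concrete, here is a minimal sketch in plain Python that flattens a response to the aggregation above into one row per genre. The sample_response dictionary is hand-written in the shape Elasticsearch returns for a terms aggregation with avg sub-aggregations, and its numbers are invented for illustration. The helper name flatten_per_genre is hypothetical and is not part of pandagg.

```python
from typing import Any, Dict, List

# Hand-written sample in the shape Elasticsearch returns for a `terms`
# aggregation with two `avg` sub-aggregations.
# The keys and values below are made up for illustration only.
sample_response: Dict[str, Any] = {
    "aggregations": {
        "per_genre": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "Drama",
                    "doc_count": 10,
                    "rating_average": {"value": 6.5},
                    "nb_roles_average": {"value": 12.0},
                },
                {
                    "key": "Comedy",
                    "doc_count": 4,
                    "rating_average": {"value": 5.0},
                    "nb_roles_average": {"value": 8.0},
                },
            ],
        }
    }
}


def flatten_per_genre(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the nested `per_genre` buckets into flat rows (hypothetical helper)."""
    rows = []
    for bucket in response["aggregations"]["per_genre"]["buckets"]:
        rows.append(
            {
                "genre": bucket["key"],
                "doc_count": bucket["doc_count"],
                # Each metric sub-aggregation is nested under its own name.
                "rating_average": bucket["rating_average"]["value"],
                "nb_roles_average": bucket["nb_roles_average"]["value"],
            }
        )
    return rows


rows = flatten_per_genre(sample_response)
for row in rows:
    print(row)
```

This works for one level of nesting, but the helper hard-codes every aggregation name. Each additional nested bucket aggregation would need its own loop, which is where hand-written parsing becomes tedious.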


Introducing pandagg: pandas-inspired library