A closer look at the cost-based optimizer in Spark.

The Spark SQL optimizer uses two types of optimizations: rule-based and cost-based. The former relies on heuristic rules, while the latter can also use statistical properties of the data. In this article, we will explain how these statistics are used in Spark under the hood, see in which situations they are useful, and learn how to take advantage of them.

Most of the optimizations that Spark performs are based on heuristic rules that do not take into account the properties of the data being processed. For example, predicate pushdown is based on the heuristic assumption that it is better to first reduce the data by filtering and only then apply computation to it. There are, however, situations in which Spark can also use statistical information about the data to come up with an even better plan, and this is referred to as cost-based optimization, or CBO. In this article, we will explore it in more detail.
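As a quick preview of what the rest of the article covers, here is a minimal sketch of how statistics collection and CBO are typically switched on in Spark SQL. The table name `sales` and the column names are illustrative placeholders, and the default values of the configuration flags may differ between Spark versions:

```sql
-- Enable the cost-based optimizer (off by default in many Spark versions)
SET spark.sql.cbo.enabled = true;

-- Collect table-level statistics (row count, size in bytes)
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Collect column-level statistics (distinct count, min/max, null count, ...)
-- for the columns used in filters and joins
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount;

-- Inspect the collected statistics
DESCRIBE EXTENDED sales;
```

With these statistics available, the optimizer can estimate the selectivity of filters and the sizes of join inputs instead of relying on heuristics alone.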

#spark #data-science #sql #database #developer

Statistics in Spark SQL Explained