Querying data in Spark has become much more convenient since Spark 2.x thanks to SQL support and the declarative DataFrame API. Just a few lines of high-level code are enough to express quite complex logic and carry out complicated transformations. The big benefit of the API is that users don't need to think about execution details and can let the optimizer figure out the most efficient way to run the query. Efficient query execution is often a requirement, not only because resources can be costly, but also because it makes the end user's work more comfortable by reducing the time they have to wait for the result of the computation.
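To illustrate, here is a minimal sketch of such a declarative query in PySpark (the dataset path and column names are hypothetical): a filter followed by a grouped aggregation, expressed in a few lines without saying anything about how it should be executed.

```python
# Minimal sketch of a declarative DataFrame query.
# The input path and columns (user_id, amount, country) are assumed for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

df = spark.read.parquet("/data/transactions")

# The user states *what* should be computed; Spark decides *how* to compute it.
result = (
    df.filter(f.col("country") == "US")
      .groupBy("user_id")
      .agg(f.sum("amount").alias("total_amount"))
)
result.show()
```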
The Spark SQL optimizer is indeed quite mature, especially now with the upcoming version 3.0, which will introduce new internal optimizations such as dynamic partition pruning and adaptive query execution. The optimizer works internally with a query plan and is usually able to simplify and optimize it by applying various rules. For example, it can change the order of some transformations or drop them completely if they are not necessary for the final output. Despite all these clever optimizations, there are still situations in which a human brain can do better. In this article we will take a look at one of these cases and see how, using a simple trick, we can steer Spark towards a more efficient execution.
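As a rough illustration of this reordering and dropping of transformations (again with hypothetical datasets and columns), the sketch below writes a filter after a join and selects only two columns; the optimized plan will typically push the filter below the join and prune the columns that are never read, which can be inspected with explain:

```python
# Sketch of how the optimizer rearranges a query plan.
# Dataset paths and column names are assumed for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("optimizer-example").getOrCreate()

users = spark.read.parquet("/data/users")    # assumed columns: id, name, country, bio
orders = spark.read.parquet("/data/orders")  # assumed columns: user_id, amount

query = (
    users.join(orders, users["id"] == orders["user_id"])
         .filter(f.col("country") == "US")   # written after the join...
         .select("name", "amount")           # ...and only two columns are needed
)

# Compare the parsed/analyzed plans with the optimized and physical plans:
# the filter is typically pushed below the join, and unused columns (e.g. bio)
# are pruned from the scan.
query.explain(True)
```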

