Adaptive Query Execution on Apache Spark 3.0.0

Spark 3.0 comes with a lot of exciting new features and enhancements. Some of the key highlights of the new release are Adaptive Query Execution, Dynamic Partition Pruning, Disk-persisted RDD blocks served by shuffle service.

There are also significant improvements in panda’s API and up to 40X speedups in invoking R user-defined functions. Scala 2.12 is Generally Available while Scala 2.11 is removed in the latest version.

In the following sections, we will dive deeper into some of the supported functionality

Adaptive Query Execution

Adaptive query execution (AQE) changes the Spark execution plan at runtime based on the statistics available from intermediate data generated and stage runs. The optimized plan can convert a sort-merge join to broadcast join, optimize the reducer count, and/or handle data skew during the join operation.

Configuration property:

spark.sql.adaptive.enabled — which is Disabled by default

Dynamic Partition Pruning

Dynamic Partition Pruning (DPP) optimization improves the job performance for the queries where the join condition is on the partitioned column by selecting the specific partitions within the table that need to be read at runtime. This the amount of data read and processed.

#hadoop #science #spark #sql #data #data-science