Apache Spark splits data into partitions and executes tasks on those partitions in parallel, so that your computations run concurrently. The number of partitions therefore has a direct impact on the run time of Spark computations.

Oftentimes your Spark computations involve cross joining two Spark DataFrames, i.e., creating a new DataFrame containing every combination of rows from the two input DataFrames. When cross joining large DataFrames, Spark multiplies the number of partitions of the input DataFrames, which can leave the cross joined DataFrame with a very large number of partitions. Running computations on such a DataFrame can be very slow due to the overhead of scheduling and managing many small tasks.
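
As a rough illustration of the multiplication, consider the sketch below. The numbers are made up, and the spark.sql.autoBroadcastJoinThreshold setting is used only to stop Spark from broadcasting one side, so that it plans a cartesian product; in that plan the output partition count is the product of the inputs' partition counts:

scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) // rule out broadcast so Spark plans a cartesian product
scala> val aDF = (1 to 1000000).toList.toDF("a").repartition(200)
scala> val bDF = (1 to 1000000).toList.toDF("b").repartition(200)
scala> aDF.crossJoin(bDF).rdd.partitions.size
// expected: 200 * 200 = 40000 partitions -- far too many small tasks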

This blog post will demonstrate how repartitioning the large input DataFrames to a smaller number of partitions before the cross join can make computations on the resulting cross joined DataFrame faster.
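
In essence, the technique looks like the sketch below; largeXDF and largeYDF are hypothetical large inputs, and 50 is an illustrative partition count you would tune for your data size and cluster:

scala> val xSmall = largeXDF.repartition(50) // shrink the partition count before joining
scala> val ySmall = largeYDF.repartition(50)
scala> val joined = xSmall.crossJoin(ySmall) // at most 50 * 50 = 2500 partitions to manage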

Let’s consider two scenarios to understand how partitioning works when cross joining DataFrames:

Scenario 1: Smaller DataFrames

If your input DataFrames are small, then the cross joined DataFrame will have the same number of partitions as the input DataFrames.

scala> val xDF = (1 to 1000).toList.toDF("x")
scala> xDF.rdd.partitions.size
res11: Int = 2
scala> val yDF = (1 to 1000).toList.toDF("y")
scala> yDF.rdd.partitions.size
res12: Int = 2
scala> val crossJoinDF = xDF.crossJoin(yDF)
scala> crossJoinDF.rdd.partitions.size
res13: Int = 2

In this case,

Partitions of xDF == Partitions of yDF == Partitions of crossJoinDF

If the input DataFrames have unequal partition counts, i.e. the partitions of xDF and yDF differ, then the cross joined DataFrame will have the same number of partitions as one of the input DataFrames.
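
A quick way to check this in the shell is sketched below; which input's count wins depends, in my understanding, on which side Spark decides to broadcast:

scala> val xDF5 = xDF.repartition(5)
scala> xDF5.crossJoin(yDF).rdd.partitions.size
// returns the partition count of one of the inputs (here 5 or 2),
// typically that of the streamed (non-broadcast) side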

