There are several different ways to create a DataFrame in Apache Spark. Which one should you use, and which one performs best? In this post, we will look at a few different options using Scala, the programming language Apache Spark itself is written in.

As a first step, we want to create a simple DataFrame in Spark. It can be done like this:

val df = (1 to 100).toDF("id")

(1 to 100) creates a range of 100 integer values, and the .toDF(“id”) call converts this range into a Spark DataFrame with a single column named “id”.
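
If you run this in the spark-shell or a notebook, the spark session and its implicits are already in scope. In a standalone application you would have to set them up yourself first; the following is a minimal sketch of the setup this post assumes (the application name and local master are just placeholders, adjust them to your environment):

import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession; adjust appName and master as needed
val spark = SparkSession.builder()
  .appName("dataframe-creation-demo")
  .master("local[*]")
  .getOrCreate()

// Needed for the .toDF() conversions used throughout this post
import spark.implicits._

val df = (1 to 100).toDF("id")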

When we work with DataFrames, we sometimes want to extend them by adding more columns. For example, we might call an external REST API to retrieve a result for each ID in our DataFrame and then merge the results back into it for further processing or storage.

In our code example, we will simulate the API call by generating additional columns in a simplified way. But you can imagine their values could instead come from an external API or data source:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def addColumns(df: DataFrame): DataFrame =
  df.collect                              // bring every row to the driver
    .map(_.getInt(0))                     // extract the integer id from each row
    .map(id =>
      Seq(id).toDF("id")                  // build a one-row DataFrame per id
        .withColumn("animal", lit("dog")) // simulate values from an external API
        .withColumn("age", lit(id + 10))
    )
    .reduce(_ union _)                    // combine all single-row DataFrames

val slowDf = spark.time(addColumns(df))   // measure how long this takes
slowDf.show(5)

First we need to import the DataFrame type because we want to use it in the signature of the addColumns() function. Next we need to import the lit() function for creating new columns with literal values.

Our addColumns() function receives a DataFrame as a parameter and returns a new DataFrame with the added columns as a result. To simulate the external REST API call, we first collect the DataFrame on the Spark driver and extract the integer value of each row’s “id” column by calling .map(_.getInt(0)). Afterwards we can map over the IDs and create a new one-row DataFrame for each ID, before adding the new columns “animal” and “age” to it.
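
To make the driver-side part of this explicit, here is a small sketch of what the first two steps produce, assuming the df with the IDs 1 to 100 from above (the value names are purely illustrative):

// .collect materializes the whole DataFrame as an Array[Row] on the driver
val rows: Array[org.apache.spark.sql.Row] = df.collect

// .getInt(0) reads the first (and only) column of each row as an Int
val ids: Array[Int] = rows.map(_.getInt(0))   // Array(1, 2, 3, ..., 100)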

This approach looks pretty nice, because we see each column name on the same line as its column value. That can prevent mix-ups between columns and values, especially when we work with a long list of columns and have to add a few new ones. But we will soon see the downside of this convenience.

Lastly, we combine the individual DataFrames we created in each step of the map() function into a single DataFrame by calling reduce(_ union _). If you are less familiar with Scala’s placeholder syntax: this shortcut does the same as calling reduce((a, b) => a.union(b)).
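
If you want to see that equivalence in isolation, here is a minimal sketch; the dfs value is just an illustrative stand-in for the single-row DataFrames produced inside addColumns():

// A small, hypothetical collection of DataFrames to combine
val dfs: Seq[DataFrame] = Seq(Seq(1).toDF("id"), Seq(2).toDF("id"), Seq(3).toDF("id"))

// The placeholder syntax ...
val combinedShort = dfs.reduce(_ union _)

// ... is shorthand for naming both arguments explicitly
val combinedLong  = dfs.reduce((a, b) => a.union(b))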

