In this blog post, we'll take a deep dive into Apache Spark window functions. You may also be interested in my earlier posts on Apache Spark.

First, let's look at what window functions are and when we should use them. Apache Spark provides various record-level functions like month (returns the month from a date), round (rounds off a value), and floor (returns the floor value for a given input), which are applied to each record and return a value for each record. Then we have aggregate functions like sum, avg, min, max, and count, which are applied to a group of records and return a single value per group. But what if we would like to perform an operation on a group of records and still get a single value/result for each record? We can use window functions in such cases. They can compute the ranking of records, cumulative distribution, and moving averages, or identify the records prior to or after the current record.
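To make the distinction concrete, here is a minimal sketch contrasting an aggregate function with the same aggregate used as a window function. The SparkSession setup and the sample salary data are hypothetical, chosen only for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("window-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data: (name, department, salary)
val df = Seq(
  ("Alice", "Sales", 5000),
  ("Bob",   "Sales", 4000),
  ("Carol", "IT",    6000),
  ("Dave",  "IT",    7000)
).toDF("name", "dept", "salary")

// Aggregate function: one result row per group.
df.groupBy("dept").agg(avg("salary")).show()

// Window function: the same aggregate, but one value attached to each record.
val byDept = Window.partitionBy("dept")
df.withColumn("dept_avg_salary", avg("salary").over(byDept)).show()
```

Note how the window version preserves every input row while still exposing the group-level average on each of them.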

Let's use some Scala API examples to learn about the following window functions (a combined sketch follows the list):

  • Aggregate: min, max, avg, count, and sum
  • Ranking: rank, dense_rank, percent_rank, row_number, and ntile
  • Analytical: cume_dist, lag, and lead
  • Custom boundary: rangeBetween and rowsBetween
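As a preview, here is a hedged sketch that exercises one function from each of the ranking, analytical, and custom-boundary categories. It reuses the hypothetical df and byDept definitions (and the imports) from the snippet above:

```scala
// Order each department's records by salary, highest first.
val ordered = byDept.orderBy($"salary".desc)

df.withColumn("rank", rank().over(ordered))                    // ranking
  .withColumn("prev_salary", lag("salary", 1).over(ordered))   // analytical
  .withColumn(
    "running_total",                                           // custom boundary
    sum("salary").over(
      ordered.rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
  )
  .show()
```

The rowsBetween clause here explicitly frames each row's sum over all rows from the start of the partition up to the current row, producing a running total per department.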

For easy reference, a Zeppelin notebook (exported as a JSON file) and a Scala file are available on GitHub.
