First Steps With PySpark and Big Data Processing

First Steps With PySpark and Big Data Processing

In this tutorial for Python developers, you'll take your first steps with Spark, PySpark, and Big Data processing concepts using intermediate Python concepts.

In this tutorial for Python developers, you'll take your first steps with Spark, PySpark, and Big Data processing concepts using intermediate Python concepts.

It’s becoming more common to face situations where the amount of data is simply too big to handle on a single machine. Luckily, technologies such as Apache Spark, Hadoop, and others have been developed to solve this exact problem. The power of those systems can be tapped into directly from Python using PySpark!

Efficiently handling datasets of gigabytes and more is well within the reach of any Python developer, whether you’re a data scientist, a web developer, or anything in between.

In this tutorial, you’ll learn:

  • What Python concepts can be applied to Big Data
  • How to use Apache Spark and PySpark
  • How to write basic PySpark programs
  • How to run PySpark programs on small datasets locally
  • Where to go next for taking your PySpark skills to a distributed system

Big Data Concepts in Python

Despite its popularity as just a scripting language, Python exposes several programming paradigms like array-oriented programming, object-oriented programming, asynchronous programming, and many others. One paradigm that is of particular interest for aspiring Big Data professionals is functional programming.

Functional programming is a common paradigm when you are dealing with Big Data. Writing in a functional manner makes for embarrassingly parallel code. This means it’s easier to take your code and have it run on several CPUs or even entirely different machines. You can work around the physical memory and CPU restrictions of a single workstation by running on multiple systems at once.

This is the power of the PySpark ecosystem, allowing you to take functional code and automatically distribute it across an entire cluster of computers.

Luckily for Python programmers, many of the core ideas of functional programming are available in Python’s standard library and built-ins. You can learn many of the concepts needed for Big Data processing without ever leaving the comfort of Python.

The core idea of functional programming is that data should be manipulated by functions without maintaining any external state. This means that your code avoids global variables and always returns new data instead of manipulating the data in-place.

Another common idea in functional programming is anonymous functions. Python exposes anonymous functions using the lambda keyword, not to be confused with AWS Lambda functions.

Now that you know some of the terms and concepts, you can explore how those ideas manifest in the Python ecosystem.

Lambda Functions

lambda functions in Python are defined inline and are limited to a single expression. You’ve likely seen lambda functions when using the built-in sorted() function:

>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(sorted(x))
['Python', 'awesome!', 'is', 'programming']
>>> print(sorted(x, key=lambda arg: arg.lower()))
['awesome!', 'is', 'programming', 'Python']

The key parameter to sorted is called for each item in the iterable. This makes the sorting case-insensitive by changing all the strings to lowercase before the sorting takes place.

This is a common use-case for lambda functions, small anonymous functions that maintain no external state.

Other common functional programming functions exist in Python as well, such as filter(), map(), and reduce(). All these functions can make use of lambda functions or standard functions defined with def in a similar manner.

filter(), map(), and reduce()

The built-in filter(), map(), and reduce() functions are all common in functional programming. You’ll soon see that these concepts can make up a significant portion of the functionality of a PySpark program.

It’s important to understand these functions in a core Python context. Then, you’ll be able to translate that knowledge into PySpark programs and the Spark API.

filter() filters items out of an iterable based on a condition, typically expressed as a lambda function:

>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(list(filter(lambda arg: len(arg) < 8, x)))
['Python', 'is']

filter() takes an iterable, calls the lambda function on each item, and returns the items where the lambda returned True.

Note: Calling list() is required because filter() is also an iterable. filter() only gives you the values as you loop over them. list() forces all the items into memory at once instead of having to use a loop.

You can imagine using filter() to replace a common for loop pattern like the following:

def is_less_than_8_characters(item):
    return len(item) < 8

x = ['Python', 'programming', 'is', 'awesome!']
results = []

for item in x:
    if is_less_than_8_characters(item):


This code collects all the strings that have less than 8 characters. The code is more verbose than the filter() example, but it performs the same function with the same results.

Another less obvious benefit of filter() is that it returns an iterable. This means filter() doesn’t require that your computer have enough memory to hold all the items in the iterable at once. This is increasingly important with Big Data sets that can quickly grow to several gigabytes in size.

map() is similar to filter() in that it applies a function to each item in an iterable, but it always produces a 1-to-1 mapping of the original items. The new iterable that map() returns will always have the same number of elements as the original iterable, which was not the case with filter():

>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(list(map(lambda arg: arg.upper(), x)))

map() automatically calls the lambda function on all the items, effectively replacing a for loop like the following:

results = []

x = ['Python', 'programming', 'is', 'awesome!']
for item in x:


The for loop has the same result as the map() example, which collects all items in their upper-case form. However, as with the filter() example, map() returns an iterable, which again makes it possible to process large sets of data that are too big to fit entirely in memory.

Finally, the last of the functional trio in the Python standard library is reduce(). As with filter() and map(), reduce()applies a function to elements in an iterable.

Again, the function being applied can be a standard Python function created with the def keyword or a lambda function.

However, reduce() doesn’t return a new iterable. Instead, reduce() uses the function called to reduce the iterable to a single value:

>>> from functools import reduce
>>> x = ['Python', 'programming', 'is', 'awesome!']
>>> print(reduce(lambda val1, val2: val1 + val2, x))

This code combines all the items in the iterable, from left to right, into a single item. There is no call to list() here because reduce() already returns a single item.

Note: Python 3.x moved the built-in reduce() function into the functools package.

lambda, map(), filter(), and reduce() are concepts that exist in many languages and can be used in regular Python programs. Soon, you’ll see these concepts extend to the PySpark API to process large amounts of data.


Sets are another common piece of functionality that exist in standard Python and is widely useful in Big Data processing. Sets are very similar to lists except they do not have any ordering and cannot contain duplicate values. You can think of a set as similar to the keys in a Python dict.

python big-data pyspark

Bootstrap 5 Complete Course with Examples

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Building a simple Applications with Vue 3

Deno Crash Course: Explore Deno and Create a full REST API with Deno

How to Build a Real-time Chat App with Deno and WebSockets

Convert HTML to Markdown Online

HTML entity encoder decoder Online

Silly mistakes that can cost ‘Big’ in Big Data Analytics

‘Data is the new science. Big Data holds the key answers’ - Pat Gelsinger The biggest advantage that the enhancement of modern technology has brought

Big Data can be The ‘Big’ boon for The Modern Age Businesses

We need no rocket science in understanding that every business, irrespective of their size in the modern-day business world, needs data insights for its expansion. Big data analytics is essential when it comes to understanding the needs and wants of a significant section of the audience.

Role of Big Data in Healthcare - DZone Big Data

In this article, see the role of big data in healthcare and look at the new healthcare dynamics. Big Data is creating a revolution in healthcare, providing better outcomes while eliminating fraud and abuse, which contributes to a large percentage of healthcare costs.

How you’re losing money by not opting for Big Data Services?

Big Data Analytics is the next big thing in business, and it is a reality that is slowly dawning amongst companies. With this article, we have tried to show you the importance of Big Data in business and urge you to take advantage of this immense...

Basic Data Types in Python | Python Web Development For Beginners

In the programming world, Data types play an important role. Each Variable is stored in different data types and responsible for various functions. Python had two different objects, and They are mutable and immutable objects.