Callum Slater


PySpark Cheat Sheet: Spark DataFrames in Python

This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python and includes code samples.

You probably already know about Apache Spark, the fast, general-purpose, open-source engine for big data processing. It has built-in modules for streaming, SQL, machine learning, and graph processing, and it can run some analytic applications up to 100 times faster than other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python.

Now, it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R.

Without further ado, here's the cheat sheet:

PySpark SQL cheat sheet

This PySpark SQL cheat sheet covers the basics of working with Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, and grouping, filtering or sorting data. You'll also see how to run SQL queries programmatically, how to save your data to parquet and JSON files, and how to stop your SparkSession.

Spark SQL is Apache Spark's module for working with structured data.

Initializing SparkSession 
 

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
      .builder \
      .appName("Python Spark SQL basic example") \
      .config("spark.some.config.option", "some-value") \
      .getOrCreate()

Creating DataFrames
 

From RDDs

>>> from pyspark.sql.types import *
>>> from pyspark.sql import Row

Infer Schema

>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema

>>> people = parts.map(lambda p: Row(name=p[0],
               age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()

 

From Spark Data Sources
JSON

>>> df = spark.read.json("customer.json")
>>> df.show()

>>> df2 = spark.read.load("people.json", format="json")

Parquet files

>>> df3 = spark.read.load("users.parquet")

TXT files

>>> df4 = spark.read.text("people.txt")

Filter 

#Filter entries of age, only keep those records of which the values are >24
>>> df.filter(df["age"]>24).show()

Duplicate Values 

>>> df = df.dropDuplicates()

Queries 
 

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import explode

Select

>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName","lastName") \
      .show()
>>> df.select("firstName", #Show all entries in firstName, age and type
              "age",
              explode("phoneNumber") \
              .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"], df["age"] + 1) \
      .show() #Show all entries in firstName and age, add 1 to the entries of age
>>> df.select(df['age'] > 24).show() #Show all entries where age >24

When

>>> df.select("firstName", #Show firstName and 0 or 1 depending on age >30
               F.when(df.age > 30, 1) \
              .otherwise(0)) \
      .show()
>>> df[df.firstName.isin("Jane", "Boris")] \
      .collect() #Show firstName if in the given options

Like 

>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
     .show()

Startswith - Endswith 

>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
                .startswith("Sm")) \
      .show()
>>> df.select(df.lastName.endswith("th")) \
      .show() #Show last names ending in th

Substring 

>>> df.select(df.firstName.substr(1, 3) \
                          .alias("name")) \
      .collect() #Return substrings of firstName

Between 

>>> df.select(df.age.between(22, 24)) \
      .show() #Show age: values are TRUE if between 22 and 24

Add, Update & Remove Columns 

Adding Columns

>>> df = df.withColumn('city', df.address.city) \
           .withColumn('postalCode', df.address.postalCode) \
           .withColumn('state', df.address.state) \
           .withColumn('streetAddress', df.address.streetAddress) \
           .withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
           .withColumn('telePhoneType', explode(df.phoneNumber.type))

Updating Columns

>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns

>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)
 

Missing & Replacing Values 
 

>>> df.na.fill(50).show() #Replace null values
>>> df.na.drop().show() #Return new df omitting rows with null values
>>> df.na \
      .replace(10, 20) \
      .show() #Return new df replacing one value with another

GroupBy 

>>> df.groupBy("age") \
      .count() \
      .show() #Group by age, count the members in the groups

Sort 
 

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"], ascending=[0,1]) \
      .collect()

Repartitioning 

>>> df.repartition(10) \
      .rdd \
      .getNumPartitions() #df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition

Running Queries Programmatically 
 

Registering DataFrames as Views

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views

>>> df5 = spark.sql("SELECT * FROM customer")
>>> df5.show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")
>>> peopledf2.show()

Inspect Data 
 

>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return the first row (pass n for the first n rows)
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows
>>> df.schema #Return the schema of df
>>> df.describe().show() #Compute summary statistics
>>> df.columns #Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans

Output

Data Structures 
 

>>> rdd1 = df.rdd #Convert df into an RDD
>>> df.toJSON().first() #Convert df into an RDD of strings
>>> df.toPandas() #Return the contents of df as a pandas DataFrame

Write & Save to Files 

>>> df.select("firstName", "city") \
      .write \
      .save("nameAndCity.parquet")
>>> df.select("firstName", "age") \
      .write \
      .save("namesAndAges.json", format="json")

Stopping SparkSession 

>>> spark.stop()

Have this Cheat Sheet at your fingertips

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #dataframes #python #bigdata
