This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python, complete with code samples.
You probably already know about Apache Spark, the fast, general, open-source engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Spark can run analytic applications up to 100 times faster than other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python.
Now, it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R.
Without further ado, here's the cheat sheet:
This PySpark SQL cheat sheet covers the basics of working with Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, and grouping, filtering or sorting data. You'll also see how to run SQL queries programmatically, how to save your data to Parquet and JSON files, and how to stop your SparkSession.
Spark SQL is Apache Spark's module for working with structured data.
A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
      .builder \
      .appName("Python Spark SQL basic example") \
      .config("spark.some.config.option", "some-value") \
      .getOrCreate()
>>> from pyspark.sql.types import *
Infer Schema
>>> from pyspark.sql import Row
>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)
Specify Schema
>>> people = parts.map(lambda p: (p[0], p[1].strip()))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True)
              for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()
From Spark Data Sources
>>> df = spark.read.json("customer.json")
>>> df.show()
>>> df2 = spark.read.load("people.json", format="json")
>>> df3 = spark.read.load("users.parquet")
>>> df4 = spark.read.text("people.txt")
>>> df.filter(df["age"] > 24).show()  #Filter entries of age, only keeping records where the value is > 24
>>> df = df.dropDuplicates()
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.functions import explode
>>> df.select("firstName").show() #Show all entries in firstName column >>> df.select("firstName","lastName") \ .show() >>> df.select("firstName", #Show all entries in firstName, age and type "age", explode("phoneNumber") \ .alias("contactInfo")) \ .select("contactInfo.type", "firstName", "age") \ .show() >>> df.select(df["firstName"],df["age"]+ 1) #Show all entries in firstName and age, .show() add 1 to the entries of age >>> df.select(df['age'] > 24).show() #Show all entries where age >24
>>> df.select("firstName", #Show firstName and 0 or 1 depending on age >30 F.when(df.age > 30, 1) \ .otherwise(0)) \ .show() >>> df[df.firstName.isin("Jane","Boris")] #Show firstName if in the given options .collect()
>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith df.lastName.like("Smith")) \ .show()
Startswith - Endswith
>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm df.lastName \ .startswith("Sm")) \ .show() >>> df.select(df.lastName.endswith("th"))\ #Show last names ending in th .show()
>>> df.select(df.firstName.substr(1, 3)  #Return substrings of firstName
                .alias("name")) \
      .collect()
>>> df.select(df.age.between(22, 24)) \
      .show()  #Show age: values are TRUE if between 22 and 24
>>> df = df.withColumn('city', df.address.city) \
           .withColumn('postalCode', df.address.postalCode) \
           .withColumn('state', df.address.state) \
           .withColumn('streetAddress', df.address.streetAddress) \
           .withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
           .withColumn('telePhoneType', explode(df.phoneNumber.type))
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')
>>> df = df.drop("address", "phoneNumber") >>> df = df.drop(df.address).drop(df.phoneNumber)
>>> df.na.fill(50).show()  #Replace null values
>>> df.na.drop().show()  #Return new df omitting rows with null values
>>> df.na \
      .replace(10, 20) \
      .show()  #Return new df replacing one value with another
>>> df.groupBy("age")\ #Group by age, count the members in the groups .count() \ .show()
>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age", "city"], ascending=[0, 1]) \
      .collect()
>>> df.repartition(10).rdd.getNumPartitions()  #df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()  #df with 1 partition
Registering DataFrames as Views
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
>>> df5 = spark.sql("SELECT * FROM customer")
>>> df5.show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")
>>> peopledf2.show()
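As noted earlier, a SparkSession can also cache tables. A minimal sketch using the catalog API, assuming the "customer" view registered above:
>>> spark.catalog.cacheTable("customer")  #Cache the table in memory on first access
>>> spark.catalog.isCached("customer")  #Returns True once the table is cached
>>> spark.catalog.uncacheTable("customer")  #Free the cached data again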
>>> df.dtypes  #Return df column names and data types
>>> df.show()  #Display the content of df
>>> df.head()  #Return first row (or first n rows with head(n))
>>> df.first()  #Return first row
>>> df.take(2)  #Return the first n rows
>>> df.schema  #Return the schema of df
>>> df.describe().show()  #Compute summary statistics
>>> df.columns  #Return the columns of df
>>> df.count()  #Count the number of rows in df
>>> df.distinct().count()  #Count the number of distinct rows in df
>>> df.printSchema()  #Print the schema of df
>>> df.explain()  #Print the (logical and physical) plans
>>> rdd1 = df.rdd  #Convert df into an RDD
>>> df.toJSON().first()  #Convert df into an RDD of strings
>>> df.toPandas()  #Return the contents of df as a pandas DataFrame
Write & Save to Files
>>> df.select("firstName", "city")\ .write \ .save("nameAndCity.parquet") >>> df.select("firstName", "age") \ .write \ .save("namesAndAges.json",format="json")
Original article source at https://www.datacamp.com
#pyspark #cheatsheet #spark #dataframes #python #bigdata