Callum Slater

Callum Slater

1653465344

PySpark Cheat Sheet: Spark DataFrames in Python

This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python and includes code samples.

You'll probably already know about Apache Spark, the fast, general and open-source engine for big data processing; It has built-in modules for streaming, SQL, machine learning and graph processing. Spark allows you to speed analytic applications up to 100 times faster compared to other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python. 

Now, it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R.

Without further ado, here's the cheat sheet:

PySpark SQL cheat sheet

This PySpark SQL cheat sheet covers the basics of working with the Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, grouping, filtering or sorting data. You'll also see that this cheat sheet also on how to run SQL Queries programmatically, how to save your data to parquet and JSON files, and how to stop your SparkSession.

Spark SGlL is Apache Spark's module for working with structured data.

Initializing SparkSession 
 

A SparkSession can be used create DataFrame, register DataFrame as tables, execute SGL over tables, cache tables, and read parquet files.

>>> from pyspark.sql import SparkSession
>>> spark a SparkSession \
     .builder\
     .appName("Python Spark SQL basic example") \
     .config("spark.some.config.option", "some-value") \
     .getOrCreate()

Creating DataFrames
 

Fromm RDDs

>>> from pyspark.sql.types import*

Infer Schema

>>> sc = spark.sparkContext
>>> lines = sc.textFile(''people.txt'')
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(nameap[0],ageaint(p[l])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema

>>> people = parts.map(lambda p: Row(name=p[0],
               age=int(p[1].strip())))
>>>  schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()

 

From Spark Data Sources
JSON

>>>  df = spark.read.json("customer.json")
>>> df.show()

>>>  df2 = spark.read.load("people.json", format="json")

Parquet files

>>> df3 = spark.read.load("users.parquet")

TXT files

>>> df4 = spark.read.text("people.txt")

Filter 

#Filter entries of age, only keep those records of which the values are >24
>>> df.filter(df["age"]>24).show()

Duplicate Values 

>>> df = df.dropDuplicates()

Queries 
 

>>> from pyspark.sql import functions as F

Select

>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName","lastName") \
      .show()
>>> df.select("firstName", #Show all entries in firstName, age and type
              "age",
              explode("phoneNumber") \
              .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"],df["age"]+ 1) #Show all entries in firstName and age, .show() add 1 to the entries of age
>>> df.select(df['age'] > 24).show() #Show all entries where age >24

When

>>> df.select("firstName", #Show firstName and 0 or 1 depending on age >30
               F.when(df.age > 30, 1) \
              .otherwise(0)) \
      .show()
>>> df[df.firstName.isin("Jane","Boris")] #Show firstName if in the given options
.collect()

Like 

>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
     .show()

Startswith - Endswith 

>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
                .startswith("Sm")) \
      .show()
>>> df.select(df.lastName.endswith("th"))\ #Show last names ending in th
      .show()

Substring 

>>> df.select(df.firstName.substr(1, 3) \ #Return substrings of firstName
                          .alias("name")) \
        .collect()

Between 

>>> df.select(df.age.between(22, 24)) \ #Show age: values are TRUE if between 22 and 24
          .show()

Add, Update & Remove Columns 

Adding Columns

 >>> df = df.withColumn('city',df.address.city) \
            .withColumn('postalCode',df.address.postalCode) \
            .withColumn('state',df.address.state) \
            .withColumn('streetAddress',df.address.streetAddress) \
            .withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
            .withColumn('telePhoneType', explode(df.phoneNumber.type)) 

Updating Columns

>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns

  >>> df = df.drop("address", "phoneNumber")
 >>> df = df.drop(df.address).drop(df.phoneNumber)
 

Missing & Replacing Values 
 

>>> df.na.fill(50).show() #Replace null values
 >>> df.na.drop().show() #Return new df omitting rows with null values
 >>> df.na \ #Return new df replacing one value with another
       .replace(10, 20) \
       .show()

GroupBy 

>>> df.groupBy("age")\ #Group by age, count the members in the groups
      .count() \
      .show()

Sort 
 

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"],ascending=[0,1])\
     .collect()

Repartitioning 

>>> df.repartition(10)\ #df with 10 partitions
      .rdd \
      .getNumPartitions()
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition

Running Queries Programmatically 
 

Registering DataFrames as Views

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views

>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
               .show()

Inspect Data 
 

>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return first n rows
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows >>> df.schema Return the schema of df
>>> df.describe().show() #Compute summary statistics >>> df.columns Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans

Output

Data Structures 
 

 >>> rdd1 = df.rdd #Convert df into an RDD
 >>> df.toJSON().first() #Convert df into a RDD of string
 >>> df.toPandas() #Return the contents of df as Pandas DataFrame

Write & Save to Files 

>>> df.select("firstName", "city")\
       .write \
       .save("nameAndCity.parquet")
 >>> df.select("firstName", "age") \
       .write \
       .save("namesAndAges.json",format="json")

Stopping SparkSession 

>>> spark.stop()

Have this Cheat Sheet at your fingertips

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #dataframes #python #bigdata

What is GEEK

Buddha Community

PySpark Cheat Sheet: Spark DataFrames in Python
Ray  Patel

Ray Patel

1619510796

Lambda, Map, Filter functions in python

Welcome to my Blog, In this article, we will learn python lambda function, Map function, and filter function.

Lambda function in python: Lambda is a one line anonymous function and lambda takes any number of arguments but can only have one expression and python lambda syntax is

Syntax: x = lambda arguments : expression

Now i will show you some python lambda function examples:

#python #anonymous function python #filter function in python #lambda #lambda python 3 #map python #python filter #python filter lambda #python lambda #python lambda examples #python map

Callum Slater

Callum Slater

1653465344

PySpark Cheat Sheet: Spark DataFrames in Python

This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python and includes code samples.

You'll probably already know about Apache Spark, the fast, general and open-source engine for big data processing; It has built-in modules for streaming, SQL, machine learning and graph processing. Spark allows you to speed analytic applications up to 100 times faster compared to other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python. 

Now, it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R.

Without further ado, here's the cheat sheet:

PySpark SQL cheat sheet

This PySpark SQL cheat sheet covers the basics of working with the Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, grouping, filtering or sorting data. You'll also see that this cheat sheet also on how to run SQL Queries programmatically, how to save your data to parquet and JSON files, and how to stop your SparkSession.

Spark SGlL is Apache Spark's module for working with structured data.

Initializing SparkSession 
 

A SparkSession can be used create DataFrame, register DataFrame as tables, execute SGL over tables, cache tables, and read parquet files.

>>> from pyspark.sql import SparkSession
>>> spark a SparkSession \
     .builder\
     .appName("Python Spark SQL basic example") \
     .config("spark.some.config.option", "some-value") \
     .getOrCreate()

Creating DataFrames
 

Fromm RDDs

>>> from pyspark.sql.types import*

Infer Schema

>>> sc = spark.sparkContext
>>> lines = sc.textFile(''people.txt'')
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(nameap[0],ageaint(p[l])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema

>>> people = parts.map(lambda p: Row(name=p[0],
               age=int(p[1].strip())))
>>>  schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()

 

From Spark Data Sources
JSON

>>>  df = spark.read.json("customer.json")
>>> df.show()

>>>  df2 = spark.read.load("people.json", format="json")

Parquet files

>>> df3 = spark.read.load("users.parquet")

TXT files

>>> df4 = spark.read.text("people.txt")

Filter 

#Filter entries of age, only keep those records of which the values are >24
>>> df.filter(df["age"]>24).show()

Duplicate Values 

>>> df = df.dropDuplicates()

Queries 
 

>>> from pyspark.sql import functions as F

Select

>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName","lastName") \
      .show()
>>> df.select("firstName", #Show all entries in firstName, age and type
              "age",
              explode("phoneNumber") \
              .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"],df["age"]+ 1) #Show all entries in firstName and age, .show() add 1 to the entries of age
>>> df.select(df['age'] > 24).show() #Show all entries where age >24

When

>>> df.select("firstName", #Show firstName and 0 or 1 depending on age >30
               F.when(df.age > 30, 1) \
              .otherwise(0)) \
      .show()
>>> df[df.firstName.isin("Jane","Boris")] #Show firstName if in the given options
.collect()

Like 

>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
     .show()

Startswith - Endswith 

>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
                .startswith("Sm")) \
      .show()
>>> df.select(df.lastName.endswith("th"))\ #Show last names ending in th
      .show()

Substring 

>>> df.select(df.firstName.substr(1, 3) \ #Return substrings of firstName
                          .alias("name")) \
        .collect()

Between 

>>> df.select(df.age.between(22, 24)) \ #Show age: values are TRUE if between 22 and 24
          .show()

Add, Update & Remove Columns 

Adding Columns

 >>> df = df.withColumn('city',df.address.city) \
            .withColumn('postalCode',df.address.postalCode) \
            .withColumn('state',df.address.state) \
            .withColumn('streetAddress',df.address.streetAddress) \
            .withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
            .withColumn('telePhoneType', explode(df.phoneNumber.type)) 

Updating Columns

>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns

  >>> df = df.drop("address", "phoneNumber")
 >>> df = df.drop(df.address).drop(df.phoneNumber)
 

Missing & Replacing Values 
 

>>> df.na.fill(50).show() #Replace null values
 >>> df.na.drop().show() #Return new df omitting rows with null values
 >>> df.na \ #Return new df replacing one value with another
       .replace(10, 20) \
       .show()

GroupBy 

>>> df.groupBy("age")\ #Group by age, count the members in the groups
      .count() \
      .show()

Sort 
 

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"],ascending=[0,1])\
     .collect()

Repartitioning 

>>> df.repartition(10)\ #df with 10 partitions
      .rdd \
      .getNumPartitions()
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition

Running Queries Programmatically 
 

Registering DataFrames as Views

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views

>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
               .show()

Inspect Data 
 

>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return first n rows
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows >>> df.schema Return the schema of df
>>> df.describe().show() #Compute summary statistics >>> df.columns Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans

Output

Data Structures 
 

 >>> rdd1 = df.rdd #Convert df into an RDD
 >>> df.toJSON().first() #Convert df into a RDD of string
 >>> df.toPandas() #Return the contents of df as Pandas DataFrame

Write & Save to Files 

>>> df.select("firstName", "city")\
       .write \
       .save("nameAndCity.parquet")
 >>> df.select("firstName", "age") \
       .write \
       .save("namesAndAges.json",format="json")

Stopping SparkSession 

>>> spark.stop()

Have this Cheat Sheet at your fingertips

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #dataframes #python #bigdata

Shardul Bhatt

Shardul Bhatt

1626775355

Why use Python for Software Development

No programming language is pretty much as diverse as Python. It enables building cutting edge applications effortlessly. Developers are as yet investigating the full capability of end-to-end Python development services in various areas. 

By areas, we mean FinTech, HealthTech, InsureTech, Cybersecurity, and that's just the beginning. These are New Economy areas, and Python has the ability to serve every one of them. The vast majority of them require massive computational abilities. Python's code is dynamic and powerful - equipped for taking care of the heavy traffic and substantial algorithmic capacities. 

Programming advancement is multidimensional today. Endeavor programming requires an intelligent application with AI and ML capacities. Shopper based applications require information examination to convey a superior client experience. Netflix, Trello, and Amazon are genuine instances of such applications. Python assists with building them effortlessly. 

5 Reasons to Utilize Python for Programming Web Apps 

Python can do such numerous things that developers can't discover enough reasons to admire it. Python application development isn't restricted to web and enterprise applications. It is exceptionally adaptable and superb for a wide range of uses.

Robust frameworks 

Python is known for its tools and frameworks. There's a structure for everything. Django is helpful for building web applications, venture applications, logical applications, and mathematical processing. Flask is another web improvement framework with no conditions. 

Web2Py, CherryPy, and Falcon offer incredible capabilities to customize Python development services. A large portion of them are open-source frameworks that allow quick turn of events. 

Simple to read and compose 

Python has an improved sentence structure - one that is like the English language. New engineers for Python can undoubtedly understand where they stand in the development process. The simplicity of composing allows quick application building. 

The motivation behind building Python, as said by its maker Guido Van Rossum, was to empower even beginner engineers to comprehend the programming language. The simple coding likewise permits developers to roll out speedy improvements without getting confused by pointless subtleties. 

Utilized by the best 

Alright - Python isn't simply one more programming language. It should have something, which is the reason the business giants use it. Furthermore, that too for different purposes. Developers at Google use Python to assemble framework organization systems, parallel information pusher, code audit, testing and QA, and substantially more. Netflix utilizes Python web development services for its recommendation algorithm and media player. 

Massive community support 

Python has a steadily developing community that offers enormous help. From amateurs to specialists, there's everybody. There are a lot of instructional exercises, documentation, and guides accessible for Python web development solutions. 

Today, numerous universities start with Python, adding to the quantity of individuals in the community. Frequently, Python designers team up on various tasks and help each other with algorithmic, utilitarian, and application critical thinking. 

Progressive applications 

Python is the greatest supporter of data science, Machine Learning, and Artificial Intelligence at any enterprise software development company. Its utilization cases in cutting edge applications are the most compelling motivation for its prosperity. Python is the second most well known tool after R for data analytics.

The simplicity of getting sorted out, overseeing, and visualizing information through unique libraries makes it ideal for data based applications. TensorFlow for neural networks and OpenCV for computer vision are two of Python's most well known use cases for Machine learning applications.

Summary

Thinking about the advances in programming and innovation, Python is a YES for an assorted scope of utilizations. Game development, web application development services, GUI advancement, ML and AI improvement, Enterprise and customer applications - every one of them uses Python to its full potential. 

The disadvantages of Python web improvement arrangements are regularly disregarded by developers and organizations because of the advantages it gives. They focus on quality over speed and performance over blunders. That is the reason it's a good idea to utilize Python for building the applications of the future.

#python development services #python development company #python app development #python development #python in web development #python software development

Dylan  Iqbal

Dylan Iqbal

1561091460

PySpark Cheat Sheet: Spark DataFrames in Python

This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python and includes code samples.

You’ll probably already know about Apache Spark, the fast, general and open-source engine for big data processing; It has built-in modules for streaming, SQL, machine learning and graph processing. Spark allows you to speed analytic applications up to 100 times faster compared to other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python.

Now, it’s time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R. If you want to know more about the differences between RDDs, DataFrames, and DataSets,

Without further ado, here’s the cheat sheet:

PySpark SQL cheat sheet

This PySpark SQL cheat sheet covers the basics of working with the Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, grouping, filtering or sorting data. You’ll also see that this cheat sheet also on how to runSQL Queries programmatically, how to save your data to parquet and JSON files, and how to stop your SparkSession.

#python #spark #pyspark #data-science

Kasey  Turcotte

Kasey Turcotte

1623927960

Pandas DataFrame vs. Spark DataFrame: When Parallel Computing Matters

With Performance Comparison Analysis and Guided Example of Animated 3D Wireframe Plot

Python is famous for its vast selection of libraries and resources from the open-source community. As a Data Analyst/Engineer/Scientist, one might be familiar with popular packages such as NumpyPandasScikit-learnKeras, and TensorFlow. Together these modules help us extract value out of data and propels the field of analytics. As data continue to become larger and more complex, one other element to consider is a framework dedicated to processing Big Data, such as Apache Spark. In this article, I will demonstrate the capabilities of distributed/cluster computing and present a comparison between the Pandas DataFrame and Spark DataFrame. My hope is to provide more conviction on choosing the right implementation.

Pandas DataFrame

Pandas has become very popular for its ease of use. It utilizes DataFrames to present data in tabular format like a spreadsheet with rows and columns. Importantly, it has very intuitive methods to perform common analytical tasks and a relatively flat learning curve. It loads all of the data into memory on a single machine (one node) for rapid execution. While the Pandas DataFrame has proven to be tremendously powerful in manipulating data, it does have its limits. With data growing at an exponentially rate, complex data processing becomes expensive to handle and causes performance degradation. These operations require parallelization and distributed computing, which the Pandas DataFrame does not support.

Introducing Cluster/Distribution Computing and Spark DataFrame

Apache Spark is an open-source cluster computing framework. With cluster computing, data processing is distributed and performed in parallel by multiple nodes. This is recognized as the MapReduce framework because the division of labor can usually be characterized by sets of the mapshuffle, and reduce operations found in functional programming. Spark’s implementation of cluster computing is unique because processes 1) are executed in-memory and 2) build up a query plan which does not execute until necessary (known as lazy execution). Although Spark’s cluster computing framework has a broad range of utility, we only look at the Spark DataFrame for the purpose of this article. Similar to those found in Pandas, the Spark DataFrame has intuitive APIs, making it easy to implement.

#pandas dataframe vs. spark dataframe: when parallel computing matters #pandas #pandas dataframe #pandas dataframe vs. spark dataframe #spark #when parallel computing matters