1660163400
We have been steadily advancing our PySpark series; so far we have covered data preprocessing techniques, a range of ML algorithms, and real-world consulting projects. In this article we will work through another consulting project. Imagine a scenario in which a dog food company hires us, and our task is to predict why the food they manufacture is spoiling well before its shelf life. We will tackle this particular problem statement using PySpark's MLlib.
The introduction makes it clear what needs to be done; in this section we dig into the "how" and "why" of the project.
The dataset contains 4 feature columns labeled A, B, C, and D, plus one target column labeled "Spoiled", so there are 5 columns in total. Let's look at a short description of each column.
You can find the source of the dataset here.
Note: In this particular project we will not follow the usual machine learning pipeline approach (train-test split); instead we will use a different approach, which you will encounter as you work through the article and which will give you another template for problems of this kind.
Installing PySpark: To run a predictive analysis on the spoiling chemicals, we only need to install one library, the heart and soul of this project, namely PySpark, which ultimately sets up the environment for the MLlib library and establishes the connection with Apache Spark.
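A minimal installation sketch, assuming a pip-based Python notebook environment (outside a notebook, drop the leading "!" and run the command in a shell):
# Install PySpark into the current environment
!pip install pyspark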
In this part of the article we start the Spark session, since creating a new session through PySpark is one of the mandatory steps for setting up the Apache Spark environment.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dog_food_project').getOrCreate()
spark
Output:
Inference: First, SparkSession is imported from the pyspark.sql module.
Then the builder object constructs the session (naming the application dog_food_project via appName), and the SparkSession is created with the getOrCreate() function.
When the spark object is called at the end, we can see the Spark UI summary of the whole setup.
This is another step that must be followed, because no data science project can proceed without the relevant dataset; it would be like "trying to build a house without thinking about the bricks". The code below reads the dataset, which is in CSV format.
data_food = spark.read.csv('dog_food.csv',inferSchema=True,header=True)
data_food.show()
Output:
Inference: The output above confirms what we stated in the "About the dataset" section: there are 4 ingredients/chemicals (A, B, C, D) and one target variable, Spoiled.
The read.csv function is used with the inferSchema and header parameters set to True so that the columns come back with the appropriate data types.
data_food.printSchema()
Output:
root
|-- A: integer (nullable = true)
|-- B: integer (nullable = true)
|-- C: double (nullable = true)
|-- D: integer (nullable = true)
|-- Spoiled: double (nullable = true)
Inference: While reading the dataset we set the inferSchema parameter to True, so printSchema now shows the correct data type for each column. As the schema above confirms, features A, B, and D are integers, while C and Spoiled (the target) hold double-precision values.
data_food.head(10)
Output:
[Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0),
Row(A=5, B=6, C=12.0, D=7, Spoiled=1.0),
Row(A=6, B=2, C=13.0, D=6, Spoiled=1.0),
Row(A=4, B=2, C=12.0, D=1, Spoiled=1.0),
Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0),
Row(A=10, B=3, C=13.0, D=9, Spoiled=1.0),
Row(A=8, B=5, C=14.0, D=5, Spoiled=1.0),
Row(A=5, B=8, C=12.0, D=8, Spoiled=1.0),
Row(A=6, B=5, C=12.0, D=9, Spoiled=1.0),
Row(A=3, B=3, C=12.0, D=1, Spoiled=1.0)]
Inference: Another way to inspect the dataset is the familiar head function, which returns not only the column names but also the values associated with them, row by row, as a list of Row objects.
data_food.describe().show()
Output:
Inference: What if we want statistical information about the dataset? For that, PySpark provides the describe() function on the chosen DataFrame. As the output shows, it returns the count, mean, standard deviation, minimum, and maximum for every column.
When working with the MLlib library, we need to make sure all the features are stacked into a single column, with the target kept in a separate column. To achieve this, PySpark ships with VectorAssembler, which sorts things out for us without much manual work.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
In the cell above we imported the Vectors module from pyspark.ml.linalg and the VectorAssembler module from pyspark.ml.feature in one go. Next, we will see them put to use.
assembler_data = VectorAssembler(inputCols=['A', 'B', 'C', 'D'],outputCol="features")
output = assembler_data.transform(data_food)
output.show()
Output:
Code breakdown: VectorAssembler is instantiated with the four chemical columns (A, B, C, D) as inputCols and "features" as outputCol; calling transform on the DataFrame then appends the assembled "features" vector column alongside the original columns.
Now we reach the model-building stage; specifically, we will use a tree-based method to achieve the goal of this article. Note that this model-building stage differs from the traditional approach: we do not need a train-test split, because we only want to find out which feature matters most.
from pyspark.ml.classification import RandomForestClassifier,DecisionTreeClassifier
rfc = DecisionTreeClassifier(labelCol='Spoiled',featuresCol='features')
Inference: Before using a tree classifier, we import RandomForestClassifier and DecisionTreeClassifier from the classification module.
Then a DecisionTreeClassifier object is initialized, passing in the label column (Spoiled) and the features column (features).
final_data = output.select('features','Spoiled')
final_data.show()
Output:
The step above selects only the features and target columns, giving us the final data to be passed to the training stage, as the output confirms.
rfc_model = rfc.fit(final_data)
Output:
Inference: Finally comes the training stage, for which the fit() method is used. Also note that here we pass in the final_data we prepared above.
rfc_model.featureImportances
Output:
SparseVector(4, {1: 0.0019, 2: 0.9832, 3: 0.0149})
Inference: Look closely at the output: importances are reported at three indices, and index 2 holds by far the highest value (0.9832). Index 2 corresponds to chemical C, so chemical C is the most important feature and the main cause of the dog food spoiling early.
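To make that index-to-column mapping explicit, here is a small sketch that pairs each importance with its column name; it simply reuses the assembler input columns from earlier and reads the trained model's featureImportances vector:
# Map each assembler input column to its importance score from the trained tree
input_cols = ['A', 'B', 'C', 'D']
importances = rfc_model.featureImportances  # SparseVector of length 4
for idx, col_name in enumerate(input_cols):
    print(col_name, round(importances[idx], 4))
On this data, only C carries meaningful weight, which matches the SparseVector shown above.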
We are now at the final stage; this closing section of the article sums up everything we have done so far to deliver what the dog food company (hypothetically) hired us for: predicting which chemical causes the dog food to spoil early.
Here is the repo link for this article. I hope you enjoyed my article on predicting dog food quality with MLlib. If you have any comments or questions, please leave them below.
Source: https://www.analyticsvidhya.com/blog/2022/07/introduction-to-classification-problem-using-pyspark/
1601766000
Data is now growing faster than processing speeds. One of the many solutions to this problem is to parallelise our computing on large clusters. A language that allows us to do just that is PySpark.
However, PySpark requires you to think about data differently.
Instead of looking at a dataset row-wise, PySpark encourages you to look at it column-wise. This was a difficult transition for me at first. I'll tell you the main tricks I learned so you don't have to waste your time searching for the answers.
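As a quick, hedged illustration of that column-wise mindset (the column names and toy data here are made up for the example, not taken from the post):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('columnwise_example').getOrCreate()
df = spark.createDataFrame([(2.0, 3), (5.0, 1)], ['price', 'quantity'])

# Think in column expressions, not Python loops over rows
df = df.withColumn('total', F.col('price') * F.col('quantity'))
df.show()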
#data-transformation #introduction-to-pyspark #pyspark #data-science
1595680020
In the previous blog, we started with the introduction to Apache Spark, why it is preferred, its features and advantages, and its architecture and working, along with its industrial use cases. In this article we'll get started with PySpark — Apache Spark using Python! By the end of this article, you'll have a better understanding of what PySpark is, why we choose Python for Spark, and its features and advantages, followed by a quick installation guide to set up PySpark on your own computer. Finally, this article will shed some light on some of the important concepts in Spark in order to proceed further.
Source: Databricks
As we have already discussed, Apache Spark supports Python along with other languages, to make things easier for developers who are more comfortable working with Python. Since Python is relatively easier to learn and use than Spark's native language, Scala, many prefer it for developing Spark applications. Python is the de facto language for many data analytics workloads. Apache Spark is the most extensively used big data framework today, and Python is one of the most widely used programming languages, especially for data science, so why not integrate them? This is where PySpark, Python for Spark, comes in: it was released to support Python on Apache Spark. Since many data scientists and analysts rely on Python for its rich libraries, integrating it with Spark gives us the best of both worlds. With strong support from the open source community, PySpark was developed using the Py4j library to interface with Spark's RDDs from Python. High-speed data processing, powerful caching, real-time and in-memory computation, and low latency are some of the features that make PySpark stand out among data processing frameworks.
Source: becominghuman.ai
Python is easier to learn and use than many other programming languages, thanks to its syntax and standard libraries. Because Python is dynamically typed, Spark's RDDs can hold objects of multiple types. Moreover, Python has an extensive and rich set of libraries for a wide range of utilities such as machine learning, natural language processing, visualization, local data transformations, and much more.
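A tiny sketch of that point about dynamic typing, assuming a local Spark session (the values are arbitrary):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('mixed_rdd_example').getOrCreate()

# A single RDD holding Python objects of several different types
mixed_rdd = spark.sparkContext.parallelize([1, 'spark', 3.14, (1, 2)])
print(mixed_rdd.collect())  # [1, 'spark', 3.14, (1, 2)]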
While Python has many libraries like Pandas, NumPy, and SciPy for data analysis and manipulation, these libraries are memory-bound and tied to a single-node system. Hence, they are not ideal for working with very large datasets on the order of terabytes and petabytes. With Pandas, scalability is an issue. In cases of real-time or near real-time data flows, where large amounts of data need to be brought into an integrated space for transformation, processing, and analysis, Pandas would not be an optimal choice. Instead, we need a framework that does the work faster and more efficiently by means of distributed and pipelined processing. This is where PySpark comes into action.
#pyspark #introduction-to-pyspark #data-science #python #apache-spark #apache
1619517660
Python Pandas encouraged us to leave Excel tables behind and to look at data from a coder's perspective instead. Data sets became bigger and bigger, turning from databases into data files and into data lakes. Some smart minds at Apache blessed us with the Scala-based framework Spark to process these larger volumes in a reasonable time. Since Python is the go-to language for data science nowadays, a Python API called PySpark soon became available.
For a while now I have been trying to conquer this Spark interface, with its non-Pythonic syntax, that everybody in the big data world praises. It took me a few attempts and it's still work in progress. However, in this post I want to show you, if you are also starting to learn PySpark, how to replicate the same analysis you would otherwise do with Pandas.
The data analysis example we are going to look at can be found in the book "Python for Data Analysis" by Wes McKinney. In that analysis, the aim is to find the top-ranked movies in the MovieLens 1M data set, which is acquired and maintained by the GroupLens Research project at the University of Minnesota.
As a coding framework I used Kaggle, since it comes with the convenience of notebooks that have the basic data science modules installed and are ready to go with two clicks.
You can also find the complete analysis and the PySpark code in this Kaggle notebook, and the Pandas code in this one. We won't replicate the same analysis here, but instead focus on the syntax differences when handling Pandas and PySpark dataframes. I will always show the Pandas code first, followed by the PySpark equivalent.
The basic functions that we need for this analysis are:
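The original list of functions is not reproduced in this excerpt; as a hedged sketch, these are the kinds of Pandas-to-PySpark pairs such an analysis typically relies on (the file and column names are purely illustrative):
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('movielens_example').getOrCreate()

# Reading data
pdf = pd.read_csv('ratings.csv')                                                # Pandas
sdf = spark.read.csv('ratings.csv', header=True, inferSchema=True)              # PySpark

# Aggregating: mean rating per movie
pdf_mean = pdf.groupby('movie_id')['rating'].mean()                             # Pandas
sdf_mean = sdf.groupBy('movie_id').agg(F.mean('rating').alias('mean_rating'))   # PySpark

# Sorting to surface the top-ranked movies
pdf_top = pdf_mean.sort_values(ascending=False)                                 # Pandas
sdf_top = sdf_mean.orderBy(F.col('mean_rating').desc())                         # PySpark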
#python #data-science #pyspark #introduction-to-pyspark #pandas-dataframe
1615450927
In this PySpark tutorial for beginners video, you will learn what PySpark is, the components of Spark, Spark architecture, methods of Spark deployment, PySpark installation, PySpark DataFrames, RDD concepts, and the features, operations, and transformations of RDDs in PySpark in detail.
Why should you watch this PySpark tutorial?
This PySpark tutorial is designed in a way that you learn it from scratch. This Intellipaat PySpark tutorial will help you develop custom, feature-rich applications using Python and Spark.
Why is PySpark important?
This PySpark tutorial will show you how Python for Spark has an elegant syntax and is easy to code, debug, and run. You will learn how PySpark is deployed across industry verticals by going through this video. The Intellipaat PySpark tutorial is easy to understand, has real-world PySpark examples, and thus helps you understand why PySpark is so important, why you should learn it, and why you should go for a PySpark career.
Why should you opt for a PySpark career?
If you want to fast-track your career, you should strongly consider PySpark. It is one of the fastest-growing and most widely used big data technologies, there is huge demand for PySpark programmers, and salaries for PySpark programmers are very good. There is plenty of room to grow in this domain as well. Hence this Intellipaat PySpark tutorial is your stepping stone to a successful career!
#pyspark