What is TensorFrames? TensorFlow + Apache Spark

Originally published by Adi Polak at towardsdatascience.com

First things first: what is TensorFrames?

TensorFrames is an open-source library created by Apache Spark contributors. Its functions and parameters are named the same as in the TensorFlow framework. Under the hood, it is an Apache Spark DSL (domain-specific language) wrapper for Apache Spark DataFrames. It lets us manipulate DataFrames with TensorFlow functionality. And no, it is not a pandas DataFrame; it is based on the Apache Spark DataFrame.

…but wait, what is TensorFlow (TF)?

TensorFlow is an open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library and is also used for machine learning applications such as neural networks.

…and Apache Spark?

Apache Spark is an open-source distributed general-purpose cluster-computing framework.

A word about scale

Today when we talk about scale, we usually mean one of two options: scaling horizontally or scaling vertically.

·        Horizontal scaling — adding more machines with roughly the same computing power.

·        Vertical scaling — adding more resources to the machine(s) we are already working with, such as upgrading the processor from CPU to GPU or adding more memory (RAM).

With TensorFrames, we can do both: more processing power and more machines. Whereas with TensorFlow alone we would usually focus on adding power by scaling vertically, with Apache Spark support we can scale both vertically and horizontally. But how do we know how much of each we actually need? To answer this question, we need to understand the full usage profile of our applications and plan accordingly.

Each change, like adding a machine or upgrading from CPU to GPU, involves downtime. In the cloud, resizing a cluster or adding more compute power is a matter of minutes, whereas on-prem, where we need to deal with acquiring new machines and upgrading processors, it can take days and sometimes months.

So the more flexible solution is the public cloud.

In the picture below, horizontal scaling is the X-axis and vertical scaling is the Y-axis.

Slide from Tim Hunter's presentation at an Apache Spark conference

Before jumping to the functions, let’s understand some important TensorFlow vocabulary:

Tensor

A statically typed multi-dimensional array whose elements are of a generic type.

GraphDef

Graph, or Computational Graph, is the core concept of TensorFlow for representing computation. When we use TensorFlow, we first create our own computation graph and pass the graph to TensorFlow. GraphDef is the serialized version of the Graph.

Operation

A Graph node that performs computation on Tensors. An Operation takes zero or more Tensors (produced by other Operations in the Graph) as input and produces zero or more Tensors as output.

Identity

tf.identity is used when we want to explicitly transport a tensor between devices (for example, from a GPU to a CPU). The operation adds nodes to the graph and makes a copy when the devices of the input and the output differ.

Constant

A constant is like a variable, except that its value can't be changed. It has the following arguments, which can be tweaked as required:

·        value: A constant value (or list) of output type dtype.

·        dtype: The type of the elements of the resulting tensor.

·        shape: Optional dimensions of resulting tensor.

·        name: Optional name for the tensor.

·        verify_shape: Boolean that enables verification of a shape of values.

Placeholders

Placeholders allocate storage for data (such as image pixel data during a feed). Initial values are not required, but can be set (see tf.placeholder_with_default). This is unlike variables, where you must declare an initial value.
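These building blocks surface almost unchanged in the TensorFrames Scala DSL used later in this post. Below is a minimal sketch, assuming the DSL mirrors the TensorFlow names constant, placeholder, and identity; the exact Scala signatures are assumptions and may differ between versions.

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._

// A constant node holding fixed values (mirrors tf.constant).
val c = tf.constant(Seq(1.0, 2.0)) named "c"

// A placeholder: storage is declared, no initial value is required
// (mirrors tf.placeholder; the type parameter and arguments are assumptions).
val p = tf.placeholder[Double]() named "p"

// An identity node that forwards its input, the hook used when a tensor
// has to be moved explicitly between devices (mirrors tf.identity).
val copied = tf.identity(p) named "p_copy"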

Some Apache Spark Vocabulary

Dataframe

This is a distributed collection of data organized into named columns that provides operations to filter, group, or compute aggregates. DataFrame data is often spread across multiple machines and can live either in memory or on disk.
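For readers coming from the TensorFlow side, this is what basic DataFrame work looks like in a spark-shell or standalone Scala session (plain Spark, no TensorFrames involved):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataframe-example")
  .master("local[*]")   // local session, just for illustration
  .getOrCreate()
import spark.implicits._

// A distributed collection of data with a single named column "x".
val df = Seq(1.0, 2.0, 3.0).toDF("x")

// Filtering, grouping and aggregating are built in.
df.filter($"x" > 1.0).groupBy().sum("x").show()   // sum(x) = 5.0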

RelationalGroupedDataset

A set of methods for aggregations on a DataFrame, created by groupBy, cube, or rollup.

The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean and sum for convenience.
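For example, calling groupBy on a DataFrame yields a RelationalGroupedDataset, and agg turns it back into a DataFrame. The column names below are made up for illustration:

import org.apache.spark.sql.functions.{mean, sum}
import spark.implicits._   // assumes a SparkSession named spark is in scope

val sales = Seq(("us", 10.0), ("us", 20.0), ("eu", 5.0)).toDF("region", "amount")

// groupBy returns a RelationalGroupedDataset...
val grouped = sales.groupBy("region")
// ...and agg returns a regular DataFrame again.
grouped.agg(sum("amount"), mean("amount")).show()   // columns: region, sum(amount), avg(amount)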

Now that we understand the terminology better, let’s look at the functionality.

The Functionality — TensorFrames version 0.6.0

Apache Spark is known as an analytics platform for data at scale; together with TensorFlow, we get TensorFrames, which offers three categories of data manipulation: mapping, reducing, and aggregation.

Let’s understand each functionality.

-1- Mapping

Mapping operations transform and/or add columns to a given DataFrame.

Each functionality is accessed through two APIs: one that receives Operations, and one that receives a DataFrame, a GraphDef, and a ShapeDescription.

Exposed API:

MapRows

def mapRows(o0: Operation, os: Operation*): DataFrame

For the user, this is the function that will be used most often, since there is no need to create the GraphDef and ShapeDescription objects directly. This form is also more readable for experienced TensorFlow developers:

mapRows receives two parameters, an operation and operation*, which means the second parameter can be a collection of operations. It turns them into a sequence, translates the sequence into a graph, creates the ShapeDescription from the graph, and sends it together with the DataFrame to an internal function, which transforms the distributed data row by row according to the transformations given in the graph. Every input in the graph must be fed with data from the given DataFrame or with constants; we can't use null. At the end, the function returns a new DataFrame whose schema contains the original schema plus new columns that correspond to the graph output. The ShapeDescription provides the shape of the output; behind the scenes it is used for optimization and for working around kernel limitations.
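A minimal sketch of mapRows in the Scala DSL, assuming a spark-shell session where spark is in scope, and assuming the DSL exposes a placeholder helper that is bound to a column by carrying the column's name (as in the Python API):

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._
import spark.implicits._

val df = Seq(1.0, 2.0).toDF("x")

val df2 = tf.withGraph {
  // Scalar placeholder fed row by row from column "x" (assumption: binding by name).
  val x = tf.placeholder[Double]() named "x"
  // A new output column "y", computed for every row.
  val y = (x + 1.0) named "y"
  df.mapRows(y)
}

df2.show()   // columns: x, y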

MapBlock

Performs a similar task to mapRows; however, since it is optimized for compactness, it applies the graph transformations to blocks of data rather than row by row.

def mapBlocks(o0: Operation, os: Operation*): DataFrame

The more commonly used form is:

Code example: we create val df, a DataFrame with two rows, one containing the value 1.0 and the other containing the value 2.0, in a column named x.

val x declares the placeholder bound to that column; y is the identity operation for transporting tensors from CPU to GPU or from machine to machine, and it receives val x as its value.

z is the computation itself. Here, the df.mapBlocks function gets two operations, y and z, and returns a new DataFrame named df2 with an extra column z. Column z is the sum x + x. In the output, column x holds the original value, column y the identity value, and column z the output of the graph.
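A sketch of the example just described, assuming a spark-shell session and assuming the DSL exposes identity under that name:

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._
import spark.implicits._

val df = Seq(1.0, 2.0).toDF("x")

val df2 = tf.withGraph {
  // Placeholder bound to a whole block of column "x".
  val x = df.block("x")
  // Identity node, shown only to illustrate explicit tensor transport.
  val y = tf.identity(x) named "y"   // assumption: identity is exposed by the DSL
  // The computation itself: z = x + x.
  val z = (x + x) named "z"
  df.mapBlocks(y, z)
}

df2.show()   // columns: x, y, z (z holds 2.0 and 4.0)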

MapBlocksTrimmed

This is the same as mapBlocks, BUT it drops the original DataFrame columns from the result, meaning the output DataFrame will contain only the calculated columns.

def mapBlocksTrimmed(o0: Operation, os: Operation*): DataFrame

Let’s look at an example:

Code example: we create a DataFrame named df with two rows holding the values 3.0 and 4.0. Notice that we create a constant named out with the values 1.0 and 2.0; this constant is TensorFrames DSL functionality that mimics the TensorFlow API. Then we call df.mapBlocksTrimmed. The output schema will contain only the result column, named "out", which in our case holds the constant values 1.0 and 2.0.

Important note: in the first line of code we import the TensorFrames DSL and alias it as tf, which stands for TensorFlow. We do this because it is how TensorFlow users are used to working, and it keeps us aligned with TensorFlow best practices.
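A sketch of that example, including the import-and-alias line the note refers to; the constant helper's exact Scala spelling is an assumption:

import org.tensorframes.{dsl => tf}   // the TensorFrames DSL, aliased as tf
import org.tensorframes.dsl.Implicits._
import spark.implicits._

val df = Seq(3.0, 4.0).toDF("x")

val df2 = tf.withGraph {
  // A constant block named "out" (assumption: constant(Seq(...)) mirrors tf.constant).
  val out = tf.constant(Seq(1.0, 2.0)) named "out"
  // Only the computed column survives; the original column "x" is dropped.
  df.mapBlocksTrimmed(out)
}

df2.show()   // single column "out": 1.0, 2.0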

-2- Reducing

Reduction operations coalesce a pair or a collection of rows into a single row, repeating the same operation until only one row is left. Under the hood, TensorFrames minimizes data transfer between machines by first reducing all the rows on each machine and then sending the remainder over the network for the final reductions.

The reduce function must be associative: the order in which the reductions are applied should not matter. In mathematical terms, given some function f and some inputs a, b, c, the following must hold:

f(f(a, b), c) == f(a, f(b, c))

Map reduce schema by Christopher Scherb

The reduce functionality, like the rest, exposes two APIs for each function. The one that receives Operations is more intuitive; however, TensorFlow has no direct reduce-rows operation. Instead, it has many reduce operations such as tf.math.reduce_sum and tf.reduce_sum.

ReduceRows

This functionality uses TensorFlow operations to merge two rows at a time until only one row is left. It receives the DataFrame, a graph, and a ShapeDescription.

def reduceRows(o0: Operation, os: Operation*): Row

User interface:

In the next code example, we create a DataFrame with a column named in and two rows. x1 and x2 are placeholders of the same dtype, and x is an add operation over x1 and x2. reduceRows returns a Row with the value 3.0, which is the sum of 1.0 and 2.0.
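A sketch of that example. The placeholder helper and the <column>_1 / <column>_2 naming convention for the pair of rows being merged are assumptions carried over from the Python API:

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._
import spark.implicits._

val df = Seq(1.0, 2.0).toDF("in")

// Two scalar placeholders standing for the two rows being merged.
val x1 = tf.placeholder[Double]() named "in_1"
val x2 = tf.placeholder[Double]() named "in_2"
// The merged value must carry the column name ("in").
val x = (x1 + x2) named "in"

val row = df.reduceRows(x)   // Row containing 3.0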

ReduceBlocks

Works the same as reduceRows, BUT it operates on a vector of rows rather than row by row.

def reduceBlocks(o0: Operation, os: Operation*): Row

The more commonly used form:

Code example: here we create a DataFrame with two columns, key2 and x, one placeholder named x1, and one reduce_sum TensorFlow operation named x. The reduce functionality returns the sum of the rows in the DataFrame for the column the reduce_sum operation is named after, which is x.
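A sketch of that example, assuming the DSL mirrors reduce_sum under the same name:

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._
import spark.implicits._

val df = Seq(("a", 1.0), ("b", 2.0)).toDF("key2", "x")

// Block placeholder over column "x".
val x1 = df.block("x")
// The reduced output must be named after the column it sums ("x").
val x = tf.reduce_sum(x1) named "x"   // assumption: reduce_sum is exposed by the DSL

val row = df.reduceBlocks(x)   // Row containing 3.0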

-3- Aggregation

def aggregate(data: RelationalGroupedDataset, graph: GraphDef, shapeHints: ShapeDescription): DataFrame

Aggregation is an extra operation on top of Apache Spark and TensorFlow. It is different from TensorFlow's own aggregation functionality and works with a RelationalGroupedDataset. API functionality:

aggregate receives a RelationalGroupedDataset, an Apache Spark object that wraps a DataFrame and adds aggregation functionality, a sequence of expressions, and a group type.

The aggregate function receives the graph and a ShapeDescription. It aggregates rows together using a reducing transformation on grouped data. This is useful when the data is already grouped by key. At the moment, only numerical data is supported.

Code example: in the example we have a DataFrame with two columns, key and x, a placeholder named x1, and x as the reduce_sum operation (named x).

Using the groupBy functionality we group the rows by key, and then we call aggregate with the operations. We can see in the output that the aggregation was calculated per key: for the key with value 1 we get 2.1 as the value of column x, and for the key with value 2 we get 2.0.
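A sketch of that example; whether aggregate is added to RelationalGroupedDataset by the DSL implicits, and the exact reduce_sum spelling, are assumptions here:

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._
import spark.implicits._

val df = Seq((1, 1.0), (1, 1.1), (2, 2.0)).toDF("key", "x")

// Block placeholder over column "x" and a sum reduction carrying the column name.
val x1 = df.block("x")
val x = tf.reduce_sum(x1) named "x"   // assumption: reduce_sum is exposed by the DSL

// Aggregate the grouped data (assumption: aggregate is added via the implicits).
val df2 = df.groupBy("key").aggregate(x)
df2.show()   // key = 1 -> x = 2.1, key = 2 -> x = 2.0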

TensorFrames basic process

In all TensorFrames functionality, the DataFrame is sent together with the computation graph. The DataFrame represents distributed data, meaning every machine holds a chunk of the data that will go through the graph operations/transformations, and this happens on every machine holding relevant data. The Tungsten binary format is the actual binary in-memory representation of the data; it is first converted to Apache Spark Java objects and from there sent to the TensorFlow Java API for the graph calculations. This all happens inside the Spark worker process, which can spin up many tasks, meaning various calculations run at the same time over the in-memory data.

Noteworthy

·        The Scala DataFrame support is currently an experimental version.

·        The Scala DSL only features a subset of TensorFlow transforms.

·        TensorFrames is open source and can be supported here.

·        Python was the first client language supported by TensorFlow and currently supports the most features. More and more of that functionality is being moved into the core of TensorFlow (implemented in C++) and exposed via a C API, which is later surfaced through other languages' APIs, such as Java and JavaScript.

·        Interested in working with Keras? check out Elephas: Distributed Deep Learning with Keras & Spark.

·        Interested in the TensorFrames project on the public cloud? Check this and this.

Now that you know more about TensorFrames, how will you take it forward?

Originally published by Adi Polak at towardsdatascience.com


