How to optimize loading 150 GB of data into a Hive table?

I have a 150 GB file in a Hive staging table that uses the following table properties:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
       "separatorChar" = "|",
       "quoteChar"     = "'",
       "escapeChar"    = "\\"
    )  
    STORED AS TEXTFILE;

Now, when I load this data into my main table, the job fails with a Java heap error after running for about an hour. The main table is partitioned, and there are about 12,000 partitions in the data. For loading the data I am using a simple HQL statement:

    insert overwrite table mainTable partition(date)
    select * from stage_table;

I have also tried increasing the map memory to 15 GB, but it still fails. Is there any way to optimize this? Any solution using Spark or Hive would work.
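
For reference, this is the kind of dynamic-partition tuning I have come across for such loads (table and column names as above; the limit values are only illustrative), though I am not sure it is the right direction:

    -- Allow dynamic partitioning and raise the partition limits (values are illustrative).
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.exec.max.dynamic.partitions=20000;
    SET hive.exec.max.dynamic.partitions.pernode=20000;
    -- Sort rows by the partition key so each task keeps only one partition writer open at a time.
    SET hive.optimize.sort.dynamic.partition=true;

    INSERT OVERWRITE TABLE mainTable PARTITION (date)
    SELECT * FROM stage_table
    DISTRIBUTE BY date;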

Integrating Kafka With Spark Structured Streaming

Learn how to integrate Kafka with Spark for consuming streaming data, and discover how to meet your streaming analytics needs.

Kafka is a messaging broker system that facilitates the passing of messages between producers and consumers. Spark Structured Streaming, on the other hand, consumes static and streaming data from various sources (like Kafka, Flume, Twitter, etc.), processes and analyzes it with high-level algorithms for machine learning, and pushes the results out to an external storage system. The main advantage of Structured Streaming is that the result is updated incrementally and continuously as streaming data keeps arriving.

Kafka has its own stream-processing library (Kafka Streams) and is best for topic-to-topic transformations within Kafka, whereas Spark streaming can be integrated with almost any type of system. For more detail, you can refer to this blog.

In this blog, I’ll cover an end-to-end integration of Kafka with Spark structured streaming by creating Kafka as a source and Spark structured streaming as a sink.

Let’s create a Maven project and add the following dependencies to pom.xml.

 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-core_2.11</artifactId>
     <version>2.1.1</version>
 </dependency>
 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-sql_2.11</artifactId>
     <version>2.1.1</version>
 </dependency>
 <dependency>
     <groupId>org.apache.kafka</groupId>
     <artifactId>kafka-clients</artifactId>
     <version>0.10.2.0</version>
 </dependency>
 <!-- Kafka source for Structured Streaming -->
 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
     <version>2.1.1</version>
 </dependency>

Now, we will be creating a Kafka producer that produces messages and pushes them to the topic. The consumer will be the Spark structured streaming DataFrame.

First, setting the properties for the Kafka producer.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  • bootstrap.servers: The list of host/port pairs used to establish the initial connection to the Kafka cluster, in the form host1:port1,host2:port2, and so on.

  • key.serializer: Serializer class for the key that implements the Serializer interface.

  • value.serializer: Serializer class for the value that implements the Serializer interface.

Creating a Kafka producer and sending messages to the topic:

val topic = "mytopic"   // the topic the Spark consumer subscribes to below
val producer = new KafkaProducer[String, String](props)
for (count <- 0 to 10)
  producer.send(new ProducerRecord[String, String](topic, "title " + count.toString, "data from topic"))
println("Messages sent successfully")
producer.close()

The send is asynchronous, and this method will return immediately once the record has been stored in the buffer of records waiting to be sent. This allows sending many records in parallel without blocking to wait for the response after each one. The result of the send is a RecordMetadata specifying the partition the record was sent to and the offset it was assigned. After sending the data, close the producer using the close method.
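
If the metadata is needed synchronously, the Future returned by send can be blocked on; here is a minimal sketch using the same producer and topic (before close() is called):

// send() returns a java.util.concurrent.Future[RecordMetadata];
// get() blocks until the broker acknowledges the record.
val metadata = producer.send(new ProducerRecord[String, String](topic, "key", "value")).get()
println(s"Record stored in partition ${metadata.partition()} at offset ${metadata.offset()}")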

Kafka as a Source 

Now, Spark will be a consumer of streams produced by Kafka. For this, we need to create a Spark session.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("sparkConsumer")
  .config("spark.master", "local")
  .getOrCreate()

Spark gets the data by subscribing to a particular Kafka topic, which is provided in the subscribe option. The following code subscribes to the topic and reads it as a stream using readStream.

val dataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "mytopic")
  .load()

Printing the schema of the DataFrame:

dataFrame.printSchema()

The output for the schema includes all the fields related to Kafka metadata.

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

Create a dataset from the DataFrame by casting the key and value from the topic to strings:

import org.apache.spark.sql.Dataset
import spark.implicits._   // provides the encoder needed by .as[(String, String)]

val dataSet: Dataset[(String, String)] = dataFrame
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

Write the data in the dataset to the console and keep the program from exiting using the awaitTermination method:

import org.apache.spark.sql.streaming.StreamingQuery

val query: StreamingQuery = dataSet.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()
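
As an aside, with Spark 2.2 or later the same dataset could be written back out to another Kafka topic instead of the console; here is a minimal sketch ("outputtopic" and the checkpoint path are placeholders):

// The Kafka sink expects string or binary "key"/"value" columns and a checkpoint location.
val kafkaQuery = dataSet.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "outputtopic")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()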

The complete code is on my GitHub.

Originally published by Jatin Demla at https://dzone.com


Using Apache Spark to Query a Remote Authenticated MongoDB Server

Apache Spark is one of the most popular open source tools for big data. Learn how to use it to ingest data from a remote MongoDB server.


1. Download and Extract Spark

$ wget http://apache.spinellicreations.com/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
$ tar -xf spark-2.4.0-bin-hadoop2.7.tgz
$ cd spark-2.4.0-bin-hadoop2.7

Create a spark-defaults.conf file by copying spark-defaults.conf.template in conf/.

Add the below line to the conf file.

spark.debug.maxToStringFields=1000

2. Connect to Mongo via a Remote Server

We use the MongoDB Spark Connector.

First, make sure the Mongo instance on the remote server has bindIp set to the appropriate value, i.e. the correct local IP (not just localhost). The root username and password below stand for the credentials of your authenticated Mongo database, and 192.168.1.32 is your remote server’s private IP (i.e., the server where Mongo is running). We are reading the oplog.rs collection in the local database; change these accordingly. Similarly, we are writing the outputs to the sparkoutput database.

$ ./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://root:password@192.168.1.32:27017/local.oplog.rs?readPreference=primaryPreferred" --conf "spark.mongodb.output.uri=mongodb://root:password@192.168.1.32:27017/sparkoutput" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0
Python 2.7.5 (default, Oct 30 2018, 23:45:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/pkathi2/.ivy2/cache
The jars for the packages stored in: /home/pkathi2/.ivy2/jars
:: loading settings :: url = jar:file:/home/pkathi2/spark-2.4.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-33a37e02-1a24-498d-9217-e7025eeebd10;1.0
        confs: [default]
        found org.mongodb.spark#mongo-spark-connector_2.11;2.4.0 in central
        found org.mongodb#mongo-java-driver;3.9.0 in central
:: resolution report :: resolve 256ms :: artifacts dl 5ms
:: modules in use:
        org.mongodb#mongo-java-driver;3.9.0 from central in [default]
        org.mongodb.spark#mongo-spark-connector_2.11;2.4.0 from central in [default]
:: retrieving :: org.apache.spark#spark-submit-parent-33a37e02-1a24-498d-9217-e7025eeebd10
        confs: [default]
        0 artifacts copied, 2 already retrieved (0kB/6ms)
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using built-in Java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.5
SparkSession available as 'spark'.

>>> from pyspark.sql import SparkSession
>>> my_spark = SparkSession \
...     .builder \
...     .appName("myApp") \
...     .config("spark.mongodb.input.uri", "mongodb://root:password@192.168.1.32:27017/local.oplog.rs?authSource=admin") \
...     .config("spark.mongodb.output.uri", "mongodb://root:password@192.168.1.32:27017/sparkoutput?authSource=admin") \
...     .getOrCreate()

Make sure you are using the correct authentication source (i.e., where you authenticate yourself in the Mongo server).

3. Perform Queries on the Mongo Collection

Now you can perform queries on your remote Mongo collection through the Spark instance. For example, the snippet below loads the collection and prints its schema.

>>> df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
>>> df.printSchema()
root
 |-- h: long (nullable = true)
 |-- ns: string (nullable = true)
 |-- o: struct (nullable = true)
 |    |-- $set: struct (nullable = true)
 |    |    |-- lastUse: timestamp (nullable = true)
 |    |-- $v: integer (nullable = true)
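
The loaded DataFrame supports the usual Spark SQL operations as well; here is a minimal sketch using the ns field from the schema above ("mydb.mycoll" is just a placeholder namespace):

>>> # Count oplog entries per namespace ("ns" comes from the schema above).
>>> df.groupBy("ns").count().show()
>>> # Filter to a single (placeholder) namespace and count its entries.
>>> df.filter(df["ns"] == "mydb.mycoll").count()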

Originally published by Pradeeban Kathiravelu at https://dzone.com


Data Lake & Hadoop: How can they power your Analytics?

Powering analytics through a data lake and Hadoop is one of the most effective ways to increase ROI. It’s also an effective way to ensure that the analytics team has all the right information moving forward.

There are many challenges that research teams have to face regularly, and Hadoop can aid in effective data management.

From storage to analysis, Hadoop provides the necessary framework to enable research teams to do their work. Hadoop is also not confined to any single model of working or any single language. That's why it's a useful tool when it comes to scaling up. As companies perform more research, more data is generated, and that data can be fed back into the system to create unique results for the final objective.

Data lakes are essential to maintain as well. Since the core data lake enables your organization to scale, it's necessary to have a single repository of all enterprise data. Over 90% of the world’s data has been generated over the last few years, and data lakes have been a positive force in the space.

Why is Hadoop effective?

From a research standpoint, Hadoop is useful in more ways than one. It runs on a cluster of commodity servers and can scale up to support thousands of nodes. This means that the quantity of data being handled can be massive, and many data sources can be processed at the same time. This increases the effectiveness of Big Data, especially in the cases of IoT, artificial intelligence, machine learning, and other new technologies.

It also provides rapid data access across the nodes in the cluster. Users can get authorized access to a large subset of the data or the entire database. This makes the job of the researcher and the admin that much easier. Hadoop can also be scaled up as requirements increase over time.

If an individual node fails, the rest of the cluster can take over its work. That’s one of the best parts about Hadoop and why companies across the world use it for their research activities. Hadoop is being redefined year over year and has been an industry standard for over a decade. Its full potential can be discovered best in the research and analytics space with data lakes.

HDFS – The Hadoop Distributed File System (HDFS) is the primary storage system that Hadoop employs, using a NameNode and DataNode architecture. It provides higher performance across the board and acts as a data distribution system for the enterprise.

YARN – YARN is the cluster resource manager that allocates system resources to apps and jobs. This simplifies the process of mapping out the adequate resources necessary. It’s one of the core components within the Hadoop infrastructure and schedules tasks around the nodes.

MapReduce – A highly effective framework that converts data into a simpler set, with individual elements broken down into key/value tuples. The reduce step then aggregates those tuples into a smaller, combined result. This is an excellent component for making sense of large data sets within the research space.

Hadoop Common – A collection of standard utilities and libraries that support the other Hadoop modules. It’s a core component of the Hadoop framework and provides the shared code that the other components rely on for processing data and information.

Hadoop and Big Data Research

Hadoop is highly effective when it comes to Big Data because of the advantages that come with using the technology to its fullest potential. Researchers can access a higher tier of information and leverage insights based on Hadoop resources. Hadoop also enables better processing of data across various systems and platforms.

Anytime there are complex calculations to be done and difficult simulations to execute, Hadoop should be put in place. Hadoop can help parallelize computation across various coding environments to enable Big Data to create novel insights. Otherwise, there may be overlaps in processing, and the architecture could fail to produce useful insights.

From a BI perspective, Hadoop is crucial. This is because while researchers can produce raw data over a significant period, it's essential to have streamlined access to it. Additionally, from a business perspective, it's necessary to have strengths in Big Data processing and storage. The availability of data is as important as access to it. This increases the load on the server, and a comprehensive architecture is required to process the information.

That's where Hadoop comes in. Hadoop enables better processing and handling of the data being produced. It can also integrate different systems into a single data lake foundation. Added to that, Hadoop enables better configuration across the enterprise architecture. Hadoop can take raw data and convert it into more useful insights, and anytime there are complexities and challenges, it can provide more clarity.

Hadoop is also a more enhanced version of simple data management tools. Hadoop can take raw data and insight and present it in a more consumable format. From here, researchers can make their conclusions and prepare intelligence reports that signify results. They can also accumulate on-going research data and feed it back into the central system. This makes for greater on-going analysis, while Hadoop becomes the framework to accomplish it on.

Security on Hadoop and Implementing Data Lakes

There are a significant number of attacks on big data warehouses and data lakes on an on-going basis. It’s essential to have an infrastructure that has a steady security feature built-in. This is where Hadoop comes in. Hadoop can provide those necessary security tools and allow for more secure data transitions.

In the healthcare space, data is critical to protect. If patient data leaks out, it could lead to complications and health scares. Additionally, in the financial services domain, if data on credit card information and customer SSNs leaks out, then there is a legal and PR problem on the rise. That’s why companies opt for greater control using the Hadoop infrastructure. Hadoop also provides a better framework for cybersecurity and interoperability. Data integrity is preserved throughout the network, and there is increased control via the dashboards provided.

From issuing Kerberos tickets to introducing physical authentication, the Hadoop cluster is increasingly useful in its operations. There is an additional layer of security built into the cluster, giving rise to a more consistent database environment. Individual tickets can be granted through the Kerberos framework, and users can be authenticated using the module.

Security can be enhanced by working with third-party developers to improve your overall Hadoop and Data lake security. You can also increase the security parameters around the infrastructure by creating a stricter authentication and user-management portal and policy. From a cyber-compliance perspective, it’s a better mechanism to work through at scale.

Apache Ranger is also a useful tool to monitor the flow of data as well. This is increasingly important when performing research on proprietary data in the company. Healthcare companies know all too well the value of data, which is why the Ranger can monitor the flow of data throughout the organization. Apache YARN has enabled an exact Data lake approach when it comes to information architecture.

That's why the Ranger is effective in maintaining security. The protocol can be set at the admin level, and companies can design the right tool to take their research ahead. The Ranger can also serve as the end-point management system for when different devices connect onto the cluster.

The Apache Ranger is also a handy centralized interface. This gives greater control to researchers, and all stakeholders in the research and analytics space are empowered. Use-cases emerge much cleaner when there is smooth handling of all data. There is also a more systematic approach to analytics, as there is an access terminal of all authorized personnel. Certain tiers of researchers can gain access to certain types of data and others can get a broader data overview. This can help streamline the data management process and make the analytics process that much more effective.

The Ranger serves as a visa processing system that gives access based on the required authorization. This means that junior researchers don't get access to highly classified information. Senior level researchers can gain the right amount of insight into the matter at hand and dig deep into core research data. Additionally, analysts can gain access to the data they've been authorized to use.

This enables researchers to use Hadoop as an authorization management portal as well. Data can be back-tracked to figure out who used the data portal last. The entire cluster can become unavailable to increase security against outsiders. However, when researchers want a second opinion, they can turn towards consultants who can gain tertiary access to the portal.

Recognizing the analytics needs of researchers and data scientists

It’s important for researchers to understand the need for analytics and vice versa. Hadoop provides that critical interface connecting disconnected points in the research ecosystem. Additionally, it creates a more collaborative environment within the data research framework.

Healthcare, Fintech, Consumer Goods and Media & Research companies need to have a more analytical approach when conducting research. That’s why Hadoop becomes critical to leverage, as it creates a more robust environment. The analytics needs are fully recognized by Hadoop, providing more tools for greater analytics.

For forming the right data lake, there needs to be a search engine in place. This helps in streamlining the data and adding a layer of analysis to the raw information. Additionally, researchers can retrieve specific information through the portal and perform deeper analysis of the core data that is readily available to them.

Data scientists can uncover accurate insights when they’re able to analyze larger data sets. With emerging technologies like Spark and HBase, Hadoop becomes that much more advanced as an analytics tool. There are more significant advantages to operating with Hadoop and data scientists can see more meaningful results. Over time, there is more convergence with unique data management platforms providing a more coherent approach to data.

Analysis is well served by Hadoop, owing to its scale and scope of work. Hadoop can become the foundation of a data lake ecosystem, as it has a broad range of applications. There is also greater emphasis on the integrity of data, which is what all researchers need. From core principles to new technology additions, there are components within Hadoop that make it that much more reliable.

Democratizing data for Researchers & Scientists

Having free access to data within the framework is essential, and Hadoop helps in developing that democratic data structure within the network. Everything from forecasting to trend analysis can be made that much more straightforward with a more democratic approach to data. Data can indeed be sorted and retrieved based on the access provided.

Data can also be shared with resources to enable a more collaborative environment in the research process. Otherwise, data sets may get muddied as more inputs stream into the data lake. The lake needs to have a robust democratized approach so that researchers can gain access to that when needed. Additionally, it's essential to have more streamlined access to the data, which is another advantage of using Hadoop. Researchers need to deploy the technology at scale to obtain benefits that come along with it.

Data scientists can also acquire cleaner data that is error-free. This is increasingly important when researchers want to present their findings to stakeholders as there is no problem with integrity. The democratization of data ensures that everyone has access to the data sets that they're authorized to understand. Outsiders may not gain access and can be removed from the overall architecture.

Scientists can also study specific aspects of the data lake and acquire unique insights that come with it. From a healthcare perspective, a single outbreak or an exceptional case can bring in new ideas that weren’t there before. This also adds immense value to distributed instances wherein there is no single source identified. Unique participants can explore the data lake and uncover what is required from it.

It’s essential to have a more democratic approach when it comes to data integrity and data lake development. When the data lake is well maintained, it creates more opportunities for analysis within the research space. Researchers can be assured that their data is being presented in the best light possible. They can also uncover hidden trends and new insights based on that initial connection. The information is also sorted and classified better, using Hadoop’s extensive line of solutions and tools built-in.

Researchers have an affinity for using Hadoop, owing to its scale-readiness and broad solution base. It can also be used to present information via other platforms, giving it a more democratic outlook. The data can also be transmitted and shared via compatible platforms across the board. Researchers within the ecosystem can even compile data based on the initial findings. This helps in maintaining a clean record and a leaner model of data exploration.

Benefits & Challenges of Hadoop for enterprise analysis

Hadoop is one of the best solutions in the marketplace for extensive research and enterprise adoption because of its scale and the tools available to accomplish complex tasks. Researchers can also leverage the core technology to obtain its benefits across a wide range of solution models.

The data can also be shared from one platform to another, creating a community data lake wherein different participants can emerge. However, overall integrity is maintained throughout the ecosystem. This enables better communication within the system, giving rise to an enhanced approach to systems management.

Hadoop benefits the research community across four main data categories:

Core research information – This is data produced during trials, research tests and any algorithms that may be running on Machine Learning or Artificial Intelligence. This also includes raw information that is shared with another resource. It also provides information that can be presented across the board.

Manufacturing and batches – This data is essential to maintain, as it aids in proper verification of any tools or products being implemented. It also helps process owners and supply chain leads with their checks.

Customer care – For enterprise-level adoption, customer care information must be presented and stored effectively.

Public records – Security is vital when it comes to handling public records. This is why Hadoop is used to provide security measures adequately.

One of the main problems with Hadoop is distilling massive data sets into detailed insights. Since Hadoop requires the right talent to uncover insights, it can become a complicated procedure. Owing to its complexity, it also needs better compliance frameworks that can define specific rules for specific instances.

Additionally, as Hadoop is scaled up, there are challenges with storage and space management. That’s why cloud computing is emerging as a viable solution to prevent data loss due to storage errors. Hadoop is also facing complexities regarding data synchronization. That's why researchers need to ensure that all systems are compliant with Hadoop and can leverage the scope of the core platform. For best results, it's ideal to have a more holistic approach to Hadoop.

Conclusion

With the vast quantities of data being generated every day, there is a need for greater analytics and insight in the research space. While every industry, from Healthcare to Automobile, relies on data in some form, it’s essential to have a logical research architecture built into the system. Otherwise, there may be data inefficiencies and chances of the data lake getting contaminated. It’s best to opt for a hybrid Hadoop model with proper security and networking capabilities. When it comes to performing accurate research, it’s essential to have all the right tools with you.

Get started with Apache Spark and TensorFlow on Azure Databricks

Get started with Apache Spark and TensorFlow on Azure Databricks - TensorFlow is now available on the Apache Spark framework, but how do you get started? It's called TensorFrames.

TL;DR

This is a step-by-step tutorial on how to get the new Spark TensorFrames library running on Azure Databricks.

Big Data is a huge topic that consists of many domains and areas of expertise, all the way from DevOps and data engineers to data scientists, AI, machine learning, and algorithm developers, and many more. We all struggle with massive amounts of data. When we deal with a massive amount of data, we need the best minds and tools. This is where the magical combination of Apache Spark and TensorFlow takes place, and we call it TensorFrames.

Apache Spark took over the Big Data world, giving answers and helping data engineers to be more successful, while data scientists had to work around the limitations of the machine learning library that Spark provides, Spark MLlib.

But no more: now there is TensorFlow support for Apache Spark users. Combined, these tools make the work of data scientists more productive, more accurate, and faster, taking outcomes from research to development to production faster than ever.

Before we start, let’s align the terms:

  • TensorFlow is an open source machine learning framework for high-performance numerical computation created by Google. It comes with strong support for AI: machine learning and deep learning.
  • Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks also acts as Software as a Service (SaaS) / Big Data as a Service (BDaaS).
  • TensorFrames is an Apache Spark component that enables us to create our own scalable TensorFlow learning algorithms on Spark clusters.
1- the workspace:

First, we need to create the workspace. We are using a Databricks workspace, and here is a tutorial for creating it.

2- the cluster:

After we have the workspace, we need to create the cluster itself. Let’s create our Spark cluster using this tutorial, and make sure the cluster configuration uses a suitable Databricks runtime version.

Press Start to launch the cluster.

3- import the library:

Under Azure Databricks, go to Common Tasks and click Import Library:

TensorFrames can be found in the Maven repository, so choose the Maven option. Under Coordinates, insert the library of your choice; for now, it will be:

databricks:tensorframes:0.6.0-s_2.11

Click the Create button, then click Install.

Boom! You have TensorFrames on your Databricks cluster.

4- the notebook:

The notebook is where we write our code and run it directly on our Spark cluster.

Now that we have a running cluster, let’s run a notebook:

Click New Notebook and choose the programming language of your choice (here we chose Scala).

This is what the notebook portal looks like with Scala code using TensorFrames:

The code example is from the Databricks Git repository.

You can check it here as well:

import org.tensorframes.{dsl => tf}
import org.tensorframes.dsl.Implicits._

val df = spark.createDataFrame(Seq(1.0->1.1, 2.0->2.2)).toDF("a", "b")

// As in Python, scoping is recommended to prevent name collisions.
val df2 = tf.withGraph {
    val a = df.block("a")
    // Unlike python, the scala syntax is more flexible:
    val out = a + 3.0 named "out"
    // The 'mapBlocks' method is added using implicits to dataframes.
    df.mapBlocks(out).select("a", "out")
}

// The transform is all lazy at this point, let's execute it with collect:
df2.collect()
// res0: Array[org.apache.spark.sql.Row] = Array([1.0,4.0], [2.0,5.0]) 

-The End-

Now that you have everything up and running, create your own triggered/scheduled job that uses TensorFrames in an Apache Spark cluster.


Spark MLlib tutorial | Machine Learning On Spark | Apache Spark Tutorial

This video on Spark MLlib Tutorial will help you learn about Spark's machine learning library. You will understand the different types of machine learning algorithms - supervised, unsupervised, and reinforcement learning.

Then, you will get an idea about the various tools that Spark's MLlib component provides. You will see the different data types and some fundamental statistical analysis that you can perform using MLlib.

Finally, you will learn about classification and regression algorithms and implement them using linear and logistic regression; a short code sketch follows the topic list below. Now, let's get started and learn Spark MLlib.

The following topics are explained in this Spark MLlib tutorial:

  1. What is Spark MLlib? 00:42
  2. What is Machine Learning? 02:27
  3. Machine Learning Algorithms 04:51
  4. Spark MLlib Tools 09:14
  5. Spark MLlib Data Types 09:55
  6. Machine Learning Pipelines 22:18
  7. Classification & Regression 24:13
  8. Spark MLlib Use Case Demo 31:51
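
As a taste of items 7 and 8, here is a minimal, hypothetical sketch of logistic regression with Spark ML (the toy data and column names are made up for illustration):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("mllibSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical toy data: two feature columns and a binary label.
val data = Seq((0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.2, 1.3, 1.0), (0.1, 1.2, 0.0))
  .toDF("f1", "f2", "label")

// Spark ML expects a single vector column of features.
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
val training = assembler.transform(data)

// Train a simple logistic regression model and show its predictions.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)
model.transform(training).select("features", "label", "prediction").show()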