Karim Aya

1559750657

Integrating Kafka With Spark Structured Streaming

Learn how to integrate Kafka with Spark to consume streaming data, and discover how to meet your streaming analytics needs.

Kafka is a message broker that passes messages between producers and consumers. Spark Structured Streaming, on the other hand, consumes static and streaming data from various sources (such as Kafka, Flume, and Twitter), processes and analyzes it with high-level algorithms, including machine learning, and pushes the result out to an external storage system. The main advantage of Structured Streaming is that the result is updated continuously and incrementally as streaming data continues to arrive.

Kafka has its own streams library, which is best suited to transforming data from one Kafka topic to another, whereas Spark Streaming can be integrated with almost any type of system. For more detail, you can refer to this blog.

In this blog, I’ll cover an end-to-end integration of Kafka with Spark Structured Streaming, with Kafka as the source of the stream and Spark Structured Streaming as its consumer.

Let’s create a Maven project and add the following dependencies to pom.xml.

 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-core_2.11</artifactId>
     <version>2.1.1</version>
 </dependency>
 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-sql_2.11</artifactId>
     <version>2.1.1</version>
 </dependency>
 <dependency>
     <groupId>org.apache.kafka</groupId>
     <artifactId>kafka-clients</artifactId>
     <version>0.10.2.0</version>
 </dependency>
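 <!-- Kafka source for Structured Streaming; needed by format("kafka"), same Scala and Spark version as above -->
 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
     <version>2.1.1</version>
 </dependency>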

Now, we will be creating a Kafka producer that produces messages and pushes them to the topic. The consumer will be the Spark structured streaming DataFrame.

First, set the properties for the Kafka producer.

import java.util.Properties

// Minimal producer configuration: where to connect and how to serialize keys and values.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

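  • bootstrap.servers: The list of servers, with hostname and port, used to establish the initial connection to the Kafka cluster. The list should be in the form host1:port,host2:port, and so on. It does not have to name every broker, because the client discovers the rest of the cluster from the ones listed.
  • key.serializer: Serializer class for the key; it must implement the Serializer interface.
  • value.serializer: Serializer class for the value; it must implement the Serializer interface.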
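
As a minimal sketch of the three settings above, the helper below assembles the bootstrap.servers string from (host, port) pairs and fills in the two string serializers. The names ProducerConfigSketch, bootstrapServers, and producerProps are hypothetical and not part of the Kafka API; only the property keys and the serializer class name come from the snippet above.

```scala
import java.util.Properties

// Hypothetical helper for illustration; not part of the Kafka API.
object ProducerConfigSketch {
  private val StringSerializer =
    "org.apache.kafka.common.serialization.StringSerializer"

  // Joins (host, port) pairs into the "host1:port1,host2:port2" form.
  def bootstrapServers(brokers: Seq[(String, Int)]): String =
    brokers.map { case (host, port) => s"$host:$port" }.mkString(",")

  // Builds producer properties that send both keys and values as strings.
  def producerProps(brokers: Seq[(String, Int)]): Properties = {
    require(brokers.nonEmpty, "at least one broker is required")
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers(brokers))
    props.put("key.serializer", StringSerializer)
    props.put("value.serializer", StringSerializer)
    props
  }
}
```

For the single local broker used in this blog, ProducerConfigSketch.producerProps(Seq(("localhost", 9092))) should produce the same three properties as the snippet above.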

Creating a Kafka producer and sending messages to the topic:

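import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Topic name assumed here; it must match the topic the Spark consumer subscribes to.
val topic = "mytopic"
val producer = new KafkaProducer[String, String](props)
for (count <- 0 to 10)
  producer.send(new ProducerRecord[String, String](topic, "title " + count, "data from topic"))
println("Messages sent successfully")
producer.close()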

The send method is asynchronous: it returns immediately once the record has been stored in the buffer of records waiting to be sent. This allows many records to be sent in parallel without blocking to wait for the response after each one. The call returns a Future whose result is a RecordMetadata specifying the partition the record was sent to and the offset it was assigned. After sending the data, close the producer using the close method.
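
To make the asynchronous behavior concrete, here is a hedged sketch that passes a Callback to send so the partition and offset are reported once the record is acknowledged. It assumes the kafka-clients dependency from the pom.xml above; SendWithCallbackSketch and sendLogged are hypothetical names, while Producer, Callback, ProducerRecord, and RecordMetadata are classes from the Kafka client library.

```scala
import org.apache.kafka.clients.producer.{Callback, Producer, ProducerRecord, RecordMetadata}

// Hypothetical helper for illustration; not part of the Kafka API.
object SendWithCallbackSketch {
  // Sends one record and logs the outcome when the send completes.
  def sendLogged(producer: Producer[String, String],
                 topic: String, key: String, value: String): Unit = {
    val record = new ProducerRecord[String, String](topic, key, value)
    producer.send(record, new Callback {
      // Called once the record is acknowledged or the send has failed.
      override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
        if (exception != null)
          println(s"Send failed: ${exception.getMessage}")
        else
          println(s"Stored in partition ${metadata.partition()} at offset ${metadata.offset()}")
    })
  }
}
```

If a blocking send is preferred, calling get() on the Future returned by send waits for the acknowledgement, at the cost of sending one record at a time.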

Kafka as a Source 

Now, Spark will be a consumer of streams produced by Kafka. For this, we need to create a Spark session.

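import org.apache.spark.sql.{Dataset, SparkSession}

// Local-mode session; "local" runs Spark in a single JVM for testing.
val spark = SparkSession
  .builder
  .appName("sparkConsumer")
  .config("spark.master", "local")
  .getOrCreate()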

Spark reads the messages by subscribing to a particular Kafka topic, which is provided through the subscribe option. The following code subscribes to the topic and reads it as a stream using readStream.

val dataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "mytopic")
  .load()
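
One detail worth knowing when running the producer before the consumer: a streaming query on the Kafka source starts from the latest offsets by default, so records published before the query starts are not read. The sketch below is one possible way to make that choice explicit through the source's startingOffsets option. It assumes the Structured Streaming Kafka connector is on the classpath; KafkaSourceSketch, sourceOptions, and readTopic are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for illustration; not part of the Spark API.
object KafkaSourceSketch {
  // Builds the option map for the Kafka source.
  // fromBeginning = true asks a fresh query to read the topic from its earliest offsets.
  def sourceOptions(servers: String, topic: String, fromBeginning: Boolean): Map[String, String] =
    Map(
      "kafka.bootstrap.servers" -> servers,
      "subscribe" -> topic,
      "startingOffsets" -> (if (fromBeginning) "earliest" else "latest")
    )

  // Creates the streaming DataFrame from the given options.
  def readTopic(spark: SparkSession, options: Map[String, String]): DataFrame =
    spark.readStream.format("kafka").options(options).load()
}
```

With the values used in this blog, the call would be KafkaSourceSketch.readTopic(spark, KafkaSourceSketch.sourceOptions("localhost:9092", "mytopic", fromBeginning = true)).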

Printing the schema of the DataFrame:

 dataFrame.printSchema()

The output for the schema includes all the fields related to Kafka metadata.

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

Create a dataset from DataFrame by casting the key and value from the topic as a string:

import spark.implicits._ // provides the encoder needed by .as[(String, String)]
val dataSet: Dataset[(String, String)] = dataFrame
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
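
The cast can also be tried without a running broker. The following sketch applies the same projection to a small static DataFrame whose key and value columns hold raw bytes, the way Kafka delivers them. The object CastSketch, the method keyValueStrings, and the sample row are made up for illustration; only the selectExpr expression is taken from the code above.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical helper for illustration; not part of the Spark API.
object CastSketch {
  // Same projection as above: decode the binary key and value columns as strings.
  def keyValueStrings(df: DataFrame): Dataset[(String, String)] = {
    import df.sparkSession.implicits._
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("castSketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Made-up row standing in for a Kafka record; both columns are byte arrays.
    val raw = Seq(("title 0".getBytes("UTF-8"), "data from topic".getBytes("UTF-8")))
      .toDF("key", "value")

    raw.printSchema()                // key and value show up as binary
    keyValueStrings(raw).show(false) // after the cast they are readable strings
    spark.stop()
  }
}
```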

Write the data in the dataset to the console and keep the program from exiting by calling the awaitTermination method:

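import org.apache.spark.sql.streaming.StreamingQuery

val query: StreamingQuery = dataSet.writeStream
  .outputMode("append")
  .format("console")
  .start()
// Block the main thread until the query is stopped or fails.
query.awaitTermination()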

The complete code is on my GitHub.

 

#bigdata #apachespark #kafka 

Teresa  Jerde

1597452410

Spark Structured Streaming – Stateful Streaming

Welcome back, folks, to this blog series on Spark Structured Streaming. This blog continues the earlier blog, “Internals of Structured Streaming”, and covers Stateful Streaming in Spark Structured Streaming. So let’s get started.

Let’s start with a basic understanding of what Stateful Stream Processing is. But to understand that, let’s first understand what Stateless Stream Processing is.

In my previous blogs of this series, I’ve discussed Stateless Stream Processing.

You can check them before moving ahead – Introduction to Structured Streaming and Internals of Structured Streaming

#analytics #apache spark #big data and fast data #ml #ai and data engineering #scala #spark #streaming #streaming solutions #tech blogs #stateful streaming #structured streaming

Roberta  Ward

1595344320

Wondering how to upgrade your skills in the pandemic? Here's a simple way you can do it.

The coronavirus pandemic has brought the world to a standstill.

Countries are under major lockdown. Schools, colleges, theatres, gyms, clubs, and all other public places are shut down, the economy is suffering, human health is at stake, people are losing their jobs, and nobody knows how much worse it can get.

Since most places are on lockdown, and you are working from home or have enough time to nourish your skills, you should use this time wisely! We always complain that we want some ‘time’ to learn and upgrade our knowledge but don’t get it due to our ‘busy schedules’. So, now is the time to make a ‘list of skills’ and learn and upgrade your skills at home!

And for the technology-loving people like us, Knoldus Techhub has already helped us a lot in doing it in a short span of time!

If you are still not aware of it, don’t worry as Georgia Byng has well said,

“No time is better than the present”

– Georgia Byng, a British children’s writer, illustrator, actress and film producer.

No matter if you are a developer (be it front-end or back-end), a data scientist, a tester, a DevOps person, or a learner with a keen interest in technology, Knoldus Techhub has brought it all for you under one common roof.

From technologies like Scala, Spark, and Elasticsearch to Angular, Go, and machine learning, it covers a total of 20 technologies, including some recently added ones: DAML, test automation, Snowflake, and Ionic.

How to upgrade your skills?

Every technology in Tech-hub has a number of templates. Once you click on a specific technology, you’ll be able to see all of its templates. Since these templates are downloadable, you need to provide your email address to receive the download link by mail.

These templates help you learn the practical implementation of a topic with ease. Using them, you can learn and kick-start your development in no time.

Apart from learning, there are some out-of-the-box templates that can help solve your business problem and have all the basic dependencies and implementations already plugged in. Tech hub calls these templates xlr8rs (pronounced “accelerators”).

xlr8rs speed up your development: you just add your core business logic to the template.

If you are looking for a template that’s not available, you can also request one, whether for learning or as a solution to your business problem, and tech-hub will connect with you to provide the solution. Isn’t this helpful 🙂

Confused with which technology to start with?

To keep you updated, the Knoldus tech hub shows you the most trending technology and the most downloaded templates at present. This way you’ll stay informed and can learn the one that’s trending most.

Since we believe:

“There’s always scope for improvement”

If you still feel like it isn’t helping you in learning and development, you can provide your feedback in the feedback section in the bottom right corner of the website.

#ai #akka #akka-http #akka-streams #amazon ec2 #angular 6 #angular 9 #angular material #apache flink #apache kafka #apache spark #api testing #artificial intelligence #aws #aws services #big data and fast data #blockchain #css #daml #devops #elasticsearch #flink #functional programming #future #grpc #html #hybrid application development #ionic framework #java #java11 #kubernetes #lagom #microservices #ml # ai and data engineering #mlflow #mlops #mobile development #mongodb #non-blocking #nosql #play #play 2.4.x #play framework #python #react #reactive application #reactive architecture #reactive programming #rust #scala #scalatest #slick #software #spark #spring boot #sql #streaming #tech blogs #testing #user interface (ui) #web #web application #web designing #angular #coronavirus #daml #development #devops #elasticsearch #golang #ionic #java #kafka #knoldus #lagom #learn #machine learning #ml #pandemic #play framework #scala #skills #snowflake #spark streaming #techhub #technology #test automation #time management #upgrade

akshay L

1572344038

Kafka Spark Streaming | Kafka Tutorial

In this Kafka Spark Streaming tutorial, you will learn what Apache Kafka is, the architecture of Apache Kafka and how to set up a Kafka cluster, what Spark is and its features, and the components of Spark, followed by a hands-on demo of integrating Spark Streaming with Apache Kafka and integrating Spark Flume with Apache Kafka.

# Kafka Spark Streaming #Kafka Tutorial #Kafka Training #Kafka Course #Intellipaat

Stream Data Pipeline using Apache Kafka and Spark Structured Streaming with Python

Objective: 

The main purpose of this session is to help the audience become familiar with developing stream data processing applications with Apache Kafka and Spark Structured Streaming, in order to encourage them to start playing with these technologies. 

Description: 

In the Big Data era, a massive amount of data is generated at high speed by various types of devices. Stream processing technology plays an important role in making such data consumable by real-time applications. 

In this talk, Takanori will present how to implement a stream data pipeline and its applications using Apache Kafka and Spark Structured Streaming with Python. He will focus on how to develop applications rather than on system architecture design, in order to help the audience become familiar with implementing stream processing in Python. 

Takanori will introduce example applications that use Tweet data and pseudo-data from mobile devices. In addition, he will explain how to integrate streaming data into other data store technologies such as Apache Cassandra and Elasticsearch. 

Note: The Python code for these applications will be uploaded to GitHub.

#python #apache #kafka #spark #streaming

 

Tyrique  Littel

1608883639

Throttle Spark-Kafka Streaming Volume

Here’s how to avoid streaming bottlenecks in your Apache Spark loads. Using Kafka data loads as an example, here’s how to tweak your settings.

Time to dive into the settings to configure your loads.

This article will help any new developer who wants to control the volume of Spark Kafka streaming.

A Spark streaming job internally uses a micro-batch processing technique to stream and process data. The job starts in the “queued” status, then moves to the “processing” status, and is finally marked with the “completed” status.

Prerequisites

  • The developer should be familiar with Spark streaming.
  • The developer should have some knowledge of Kafka and Spark.

#spark-kafka #kafka #spark #developer