Hi Folks!! In this blog, we are going to learn how we can integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline.
Spark Structured Streaming is a component of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant processing of data streams.
Apache Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data, much like a messaging system.
Apache Cassandra is a distributed and wide-column NoSQL data store.
To start the application, we’ll need Kafka, Spark and Cassandra installed locally on our machine. The minimum requirements for the application:
Java 1.8+, Scala 2.12.10, SBT 1.3.8, Spark 2.4.0, Kafka 2.3.0, Cassandra 3.10
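If you build with sbt, the versions above could be wired into a build.sbt roughly like this. Treat it as a sketch and not the original project's build file: the project name is made up, and the DataStax Spark Cassandra Connector (with the version shown) is an assumption, so check it against the connector's compatibility table before using it.

```scala
// build.sbt (sketch, not the original project's build definition)
name := "kafka-to-cassandra-sketch" // hypothetical project name
scalaVersion := "2.12.10"

val sparkVersion = "2.4.0"
// Assumption: a connector release compatible with Spark 2.4 and Scala 2.12.
val cassandraConnectorVersion = "2.4.2"

libraryDependencies ++= Seq(
  // Spark SQL, which includes Structured Streaming
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  // Kafka source for Structured Streaming
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
  // Assumed way of talking to Cassandra from Spark
  "com.datastax.spark" %% "spark-cassandra-connector" % cassandraConnectorVersion
)
```

The SBT version itself is usually pinned separately, in project/build.properties (sbt.version=1.3.8).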
Reading data from the Kafka topic gives us a Dataset[Car] as a result. We can apply s
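// Needed for col, from_json and the Car encoder used by .as[Car]
import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // local Kafka broker
  .option("subscribe", "kafkaToCassandra")             // topic to read from
  .option("startingOffsets", "earliest")               // start from the earliest available offset
  .load()
  .selectExpr("cast(value as string) as value")        // Kafka values arrive as bytes
  .select(from_json(col("value"), carSchema).as[Car])  // parse the JSON into Car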
In the above code snippet, we read JSON data from the Kafka topic “kafkaToCassandra”, which contains information about cars. The Car model looks like this:
case class Car(
  Name: String,
  Miles_per_Gallon: Option[Double],
  Cylinders: Option[Long],
  Displacement: Option[Double],
  Horsepower: Option[Long],
  Weight_in_lbs: Option[Long],
  Acceleration: Option[Double],
  Year: String,
  Origin: String
)
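The read snippet refers to carSchema, which is not defined in this post. One possible way to get it, sketched here as an assumption and not necessarily how the original application does it, is to derive the schema from the Car case class with Encoders.product. The object name CarSchemaSketch, the app name and the parse helper are hypothetical; the helper applies the same from_json step to in-memory JSON strings so you can try it without a running Kafka broker.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

// Same Car model as above, repeated so the sketch compiles on its own.
case class Car(
  Name: String,
  Miles_per_Gallon: Option[Double],
  Cylinders: Option[Long],
  Displacement: Option[Double],
  Horsepower: Option[Long],
  Weight_in_lbs: Option[Long],
  Acceleration: Option[Double],
  Year: String,
  Origin: String
)

object CarSchemaSketch {
  // Assumption: a local SparkSession; the app name is a placeholder.
  val spark: SparkSession = SparkSession.builder()
    .appName("car-schema-sketch")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // Derive the schema from the case class; Option fields map to nullable columns.
  val carSchema: StructType = Encoders.product[Car].schema

  // Same parsing step as the streaming read, applied to in-memory JSON strings.
  // toDS() on a Seq[String] produces a single column named "value".
  def parse(jsonLines: Seq[String]): Dataset[Car] =
    jsonLines.toDS()
      .select(from_json(col("value"), carSchema).as[Car])
}
```

For a quick check, call CarSchemaSketch.parse(Seq("""{"Name":"example car","Cylinders":8,"Year":"1970-01-01","Origin":"USA"}""")).show(). Fields missing from the JSON come back as None for the Option columns.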