Hi folks! In this blog, we are going to learn how to integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline.

Spark Structured Streaming is a component of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant processing of data streams.

Apache Kafka is a scalable, high-performance, low-latency platform for reading and writing streams of data, much like a messaging system.

Apache Cassandra is a distributed, wide-column NoSQL data store.

Minimum Requirements and Installations

To start the application, we’ll need Kafka, Spark and Cassandra installed locally on our machine. The minimum requirements for the application:

Java 1.8+, Scala 2.12.10, SBT 1.3.8, Spark 2.4.0, Kafka 2.3.0, Cassandra 3.10

Dependencies

To talk to Kafka and Cassandra from Spark, the application needs the Spark SQL module, the Kafka source for Structured Streaming, and the DataStax Spark Cassandra Connector on the classpath, as shown in the sketch below.
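A sketch of the corresponding build.sbt entries is shown here; the Cassandra connector version is our assumption and should be matched to the Spark and Scala versions listed above:

// build.sbt (sketch; versions assumed to line up with the requirements above)
scalaVersion := "2.12.10"

val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"                 % sparkVersion,
  "org.apache.spark"   %% "spark-sql-kafka-0-10"      % sparkVersion,
  // connector version is an assumption; pick the release matching your Spark/Scala versions
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.2"
)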

Connecting to Kafka and Reading Streams

// Assumes an existing SparkSession named `spark`, plus:
//   import org.apache.spark.sql.functions.{col, from_json}
//   import spark.implicits._   // provides the encoder for Car
val carsDS = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // local Kafka broker
  .option("subscribe", "kafkaToCassandra")             // topic to consume
  .option("startingOffsets", "earliest")               // read the topic from the beginning
  .load()
  .selectExpr("cast(value as string) as value")        // Kafka value bytes -> JSON string
  .select(from_json(col("value"), carSchema).as[Car])  // parse JSON into a Dataset[Car]

In the above code snippet, we read JSON data from the Kafka topic “kafkaToCassandra”, which contains information about cars, and parse it into a Dataset[Car] that we can then work with like any other Dataset. The Car model looks like this:

case class Car(
  Name: String,
  Miles_per_Gallon: Option[Double],
  Cylinders: Option[Long],
  Displacement: Option[Double],
  Horsepower: Option[Long],
  Weight_in_lbs: Option[Long],
  Acceleration: Option[Double],
  Year: String,
  Origin: String
)
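Each JSON message on the topic is expected to have fields matching this case class. The carSchema referenced in the read snippet is not shown above; one simple way to obtain it (our assumption, not necessarily how the original pipeline defines it) is to derive it from the Car case class itself:

import org.apache.spark.sql.Encoders

// Derive the Spark SQL schema for the JSON parser directly from the case class.
val carSchema = Encoders.product[Car].schema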

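To complete the pipeline described in the introduction, the streaming Dataset[Car] still has to be written to Cassandra. That part is not shown in this section, but a minimal sketch using foreachBatch and the DataStax Spark Cassandra Connector could look like the following; the keyspace public and table cars are illustrative names, and the connector reads the Cassandra host from spark.cassandra.connection.host (localhost by default):

import org.apache.spark.sql.{Dataset, SaveMode}

// An explicitly typed function value sidesteps the foreachBatch overload
// ambiguity that affects Scala 2.12 on Spark 2.4.
val writeToCassandra: (Dataset[Car], Long) => Unit = (batch, _) =>
  batch.write
    .format("org.apache.spark.sql.cassandra") // DataStax Spark Cassandra Connector data source
    .option("keyspace", "public")             // hypothetical keyspace
    .option("table", "cars")                  // hypothetical table
    .mode(SaveMode.Append)
    .save()

carsDS.writeStream
  .foreachBatch(writeToCassandra) // write each micro-batch to Cassandra
  .start()
  .awaitTermination()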
