Creating a Data Pipeline with Spark Streaming, Kafka and Cassandra

Hi folks! In this blog, we are going to learn how to integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline.

Spark Structured Streaming is a component of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant processing of data streams.

Apache Kafka is a scalable, high-performance, low-latency platform for reading and writing streams of data, much like a messaging system.

Apache Cassandra is a distributed, wide-column NoSQL data store.

Minimum Requirements and Installations

To start the application, we’ll need Kafka, Spark and Cassandra installed locally on our machine. The minimum requirements for the application are:

Java 1.8+, Scala 2.12.10, SBT 1.3.8, Spark 2.4.0, Kafka 2.3.0, Cassandra 3.10
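
For reference, the versions listed above could translate into an SBT build along these lines. This is only a sketch: the project name is hypothetical, and the Cassandra connector version (2.4.3) is an assumption that should be checked against the Spark and Scala versions you actually run.

```scala
// build.sbt (sketch, not the original project's build file)
name := "kafka-to-cassandra-pipeline" // hypothetical project name

scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // Spark SQL provides the Structured Streaming API
  "org.apache.spark" %% "spark-sql" % "2.4.0",
  // Kafka source for Structured Streaming
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0",
  // Cassandra connector; the exact version is an assumption
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.3"
)
```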


Connecting to Kafka and Reading Streams

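      // Assumes a SparkSession named `spark`, plus imports of spark.implicits._
      // and org.apache.spark.sql.functions.{col, from_json}.
      val cars = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "kafkaToCassandra")
        .option("startingOffsets", "earliest")
        .load()
        .selectExpr("cast(value as string) as value")
        .select(from_json(col("value"), carSchema).as[Car])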

The snippet above reads JSON data from the Kafka topic “kafkaToCassandra”, which contains information about cars, and parses each message into a Car. The Car model looks like this:

      case class Car(
        Name: String,
        Miles_per_Gallon: Option[Double],
        Cylinders: Option[Long],
        Displacement: Option[Double],
        Horsepower: Option[Long],
        Weight_in_lbs: Option[Long],
        Acceleration: Option[Double],
        Year: String,
        Origin: String
      )
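
The read snippet refers to a carSchema that is not shown. One way it might be obtained is to derive it from the Car case class with Encoders.product, as in the sketch below. This is an assumption, not necessarily how the original application defines it. The object name, the app name and the sample JSON record are made up for illustration, and the sketch parses a static one-row DataFrame rather than a Kafka stream so it can be tried without a broker.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

// Same fields as the Car model above.
case class Car(
  Name: String,
  Miles_per_Gallon: Option[Double],
  Cylinders: Option[Long],
  Displacement: Option[Double],
  Horsepower: Option[Long],
  Weight_in_lbs: Option[Long],
  Acceleration: Option[Double],
  Year: String,
  Origin: String
)

object CarSchemaSketch {
  // Assumption: the schema is derived from the case class instead of being written by hand.
  val carSchema: StructType = Encoders.product[Car].schema

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("car-schema-sketch") // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative record standing in for one Kafka message value.
    val sample = Seq(
      """{"Name":"sample car","Miles_per_Gallon":18.0,"Cylinders":8,"Displacement":307.0,"Horsepower":130,"Weight_in_lbs":3504,"Acceleration":12.0,"Year":"1970-01-01","Origin":"USA"}"""
    ).toDF("value")

    // Parse the JSON string into a struct, flatten it, and convert to Dataset[Car].
    val cars = sample
      .select(from_json(col("value"), carSchema).as("car"))
      .select("car.*")
      .as[Car]

    cars.show(truncate = false)
    spark.stop()
  }
}
```

Deriving the schema this way keeps it in step with the case class: the Option fields become nullable columns, so a message with a missing value parses to None instead of failing.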

