Following the previous post, in which we explored Apache Kafka, let us now take a look at Apache Spark. This blog post covers working in Spark’s interactive shell environment, launching applications (including on a standalone cluster), streaming data and, lastly, structured streaming using Kafka. To get started right away, all of the examples run inside Docker containers.

Spark


Spark was initially developed at UC Berkeley’s AMPLab in 2009 by Matei Zaharia and open-sourced in 2010. In 2013 its codebase was donated to the Apache Software Foundation, which graduated it to a top-level project, Apache Spark, in 2014.

“Apache Spark™ is a unified analytics engine for large-scale data processing”

It offers APIs for Java, Scala, Python and R. Furthermore, it provides the following tools (a minimal example follows the list):

  • Spark SQL: used for SQL and structured data processing.
  • MLlib: used for machine learning.
  • GraphX: used for graph processing.
  • Structured Streaming: used for incremental computation and stream processing.
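
To make the first of these concrete, here is a minimal PySpark sketch that creates a SparkSession and runs a Spark SQL query over a small in-memory DataFrame. It assumes pyspark is installed locally (e.g. pip install pyspark); the app name, view name and data are purely illustrative.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession; "local[*]" uses all available cores.
spark = (
    SparkSession.builder
    .appName("spark-sql-example")  # illustrative app name
    .master("local[*]")
    .getOrCreate()
)

# Build a small DataFrame and expose it to SQL as a temporary view.
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

# Run standard SQL over the view; the result is itself a DataFrame.
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
```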
