PySpark ETL from MySQL and MongoDB to Cassandra

In Apache Spark/PySpark we work with abstractions, and the actual processing is performed only when we want to materialize the result of an operation. To connect to different databases and file systems we mostly use ready-made libraries. In this story you will learn how to combine data from MySQL and MongoDB and then save it in Apache Cassandra.
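A minimal sketch of that laziness: the filter below only extends the execution plan, and nothing is computed until the count() action runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# A transformation: Spark only records it in the execution plan.
even = spark.range(1_000_000).filter("id % 2 = 0")

# An action: only now is the plan executed and a result materialized.
print(even.count())  # 500000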

Environment

This is the ideal moment to use Docker, or more precisely Docker Compose. We will run all the databases and a Jupyter notebook with Apache Spark.

# Use root/example as user/password credentials
version: '3.1'

services:
  notebook:
    image: jupyter/all-spark-notebook
    ports:
      - 8888:8888
      - 4040:4040
    volumes:
      - ./work:/home/jovyan/work

  cassandra:
    image: 'bitnami/cassandra:latest'

  mongo:
    image: mongo
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example

  mysql:
    image: mysql:5.7
    environment:
      MYSQL_DATABASE: 'school'
      MYSQL_USER: 'user'
      MYSQL_PASSWORD: 'password'
      MYSQL_ROOT_PASSWORD: 'password'
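Save the file as docker-compose.yml and start everything with docker-compose up -d. The Jupyter URL with its access token can be read from the notebook container's logs (docker-compose logs notebook), and the service names (mongo, mysql, cassandra) work as hostnames inside the Compose network.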

Adding data to MongoDB

We need some data, so I wrote a simple script in Python. Let's assume that the students' data lives in Mongo.
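A minimal version of such a seeding script, assuming pymongo (pip install pymongo in the notebook container) and the root/example credentials from the compose file; the school database and students collection are names assumed here for illustration:

from pymongo import MongoClient

# Root credentials from the compose file; the root user
# authenticates against the admin database by default.
client = MongoClient("mongodb://root:example@mongo:27017/")

# "school" and "students" are assumed names for this sketch.
students = client["school"]["students"]

students.insert_many([
    {"student_id": 1, "name": "Alice", "city": "Warsaw"},
    {"student_id": 2, "name": "Bob", "city": "Krakow"},
    {"student_id": 3, "name": "Carol", "city": "Gdansk"},
])
print(students.count_documents({}))  # 3

Run it from a container on the Compose network (the notebook is the natural place), where the mongo service name resolves.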

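Once the data is in place, the combine-and-save step can be sketched roughly as follows. The connector versions, the grades table on the MySQL side, and the school keyspace with a reports table in Cassandra are all assumptions made for illustration, not details from the original setup.

from pyspark.sql import SparkSession

# Connector versions are assumptions; match them to the Spark and
# Scala versions shipped in the jupyter/all-spark-notebook image.
packages = ",".join([
    "mysql:mysql-connector-java:8.0.28",
    "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1",
    "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0",
])

spark = (
    SparkSession.builder
    .appName("mysql-mongo-to-cassandra")
    .config("spark.jars.packages", packages)
    # Service names from the compose file double as hostnames.
    .config("spark.cassandra.connection.host", "cassandra")
    # bitnami/cassandra defaults to cassandra/cassandra credentials.
    .config("spark.cassandra.auth.username", "cassandra")
    .config("spark.cassandra.auth.password", "cassandra")
    .getOrCreate()
)

# A hypothetical "grades" table on the MySQL side.
grades = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql:3306/school")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "grades")
    .option("user", "user")
    .option("password", "password")
    .load()
)

# Students seeded into MongoDB by the script above.
students = (
    spark.read.format("mongo")
    .option("uri", "mongodb://root:example@mongo:27017/school.students?authSource=admin")
    .load()
)

# Still lazy: the join only extends the plan.
report = students.join(grades, "student_id")

# The write is the action that materializes the whole pipeline.
# The keyspace and table must already exist in Cassandra.
(
    report.write.format("org.apache.spark.sql.cassandra")
    .options(table="reports", keyspace="school")
    .mode("append")
    .save()
)

Because everything above the final save() is lazy, Spark reads from MySQL and MongoDB only when the write to Cassandra is triggered, which ties back to the point from the introduction.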