PySpark ETL from MySQL and MongoDB to Cassandra

In Apache Spark/PySpark we work with abstractions, and the actual processing is done only when we want to materialize the result of an operation. To connect to different databases and file systems we mostly use ready-made libraries (connectors). In this story you will learn how to combine data from MySQL and MongoDB and then save it in Apache Cassandra.
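As a quick illustration of that laziness (a minimal sketch, not part of the setup below): transformations such as filter only build an execution plan, and nothing runs until an action such as count is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-example").getOrCreate()

df = spark.range(1_000_000)        # no job runs yet
even = df.filter(df.id % 2 == 0)   # still just a plan (transformation)
print(even.count())                # the action triggers the actual computation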

Environment

This is the ideal moment to use Docker, or more precisely Docker Compose. We will run all the databases plus Jupyter with Apache Spark using the docker-compose.yml below.

# Use root/example as user/password credentials
version: '3.1'

services:
  notebook:
    image: jupyter/all-spark-notebook
    ports:
      - 8888:8888
      - 4040:4040
    volumes:
      - ./work:/home/jovyan/work

  cassandra:
    image: 'bitnami/cassandra:latest'

  mongo:
    image: mongo
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example

  mysql:
    image: mysql:5.7
    environment:
      MYSQL_DATABASE: 'school'
      MYSQL_USER: 'user'
      MYSQL_PASSWORD: 'password'
      MYSQL_ROOT_PASSWORD: 'password'
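
Save the file as docker-compose.yml and start the whole stack in the background (the file name and detached mode are just my choice here):

docker-compose up -d

Jupyter prints a login token in its logs (docker-compose logs notebook); the notebook is then available at http://localhost:8888 and the Spark UI at http://localhost:4040.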

Adding data to MongoDB

We need some data, so I wrote a simple script in Python to load it. Let's assume that the students' data lives in Mongo.
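The original script is not reproduced here, but a minimal sketch could look like the following (the school database, the students collection and the document fields are my assumptions, and pymongo has to be installed in the notebook container, e.g. with pip install pymongo):

from pymongo import MongoClient

# Connect to the "mongo" service from docker-compose using the root/example credentials.
# This host name resolves inside the Compose network (e.g. from the notebook container);
# from your host machine you would first have to publish port 27017.
client = MongoClient("mongodb://root:example@mongo:27017/")

students = client["school"]["students"]

# Insert a few sample student documents (hypothetical fields).
students.insert_many([
    {"student_id": 1, "first_name": "Anna",  "last_name": "Kowalska",   "city": "Warsaw"},
    {"student_id": 2, "first_name": "Jan",   "last_name": "Nowak",      "city": "Krakow"},
    {"student_id": 3, "first_name": "Piotr", "last_name": "Wisniewski", "city": "Gdansk"},
])

print(students.count_documents({}))  # quick sanity check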
