Apache Spark is one of the most mature frameworks you can use for your Machine Learning applications. It makes it easy to develop ML algorithms in data scientists' favorite prototyping environment: Jupyter Notebooks.

The Main Differences between MapReduce/HDFS & Apache Spark

MapReduce requires files to be stored in the Hadoop Distributed File System (HDFS); Spark has no such requirement, and because it works on data in memory it is dramatically faster than MapReduce. Spark instead stores data in Resilient Distributed Datasets, aka RDDs. RDDs are also where Spark shines really brightly: on top of them, Spark's DataFrame API closely resembles pandas DataFrames, so you can harness the mighty power of pandas-like data preprocessing and transformation at cluster scale, as in the sketch below.
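For instance, here is a minimal PySpark sketch (the column names and toy data are my own, not from the article) showing how a Spark DataFrame can be built in memory, transformed with pandas-like operations, and dropped down to the underlying RDD:

```python
from pyspark.sql import SparkSession

# A local SparkSession -- no Hadoop/HDFS installation required
spark = SparkSession.builder.appName("pandas-like-demo").getOrCreate()

# Build a DataFrame directly in memory
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# pandas-like preprocessing: filter rows, then aggregate
df.filter(df.age > 30).groupBy().avg("age").show()

# The lower-level RDD behind the DataFrame is always accessible
print(df.rdd.take(2))
```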

The Main Installation Choices for Using Spark

  1. You can launch an AWS Elastic MapReduce (EMR) cluster and use Zeppelin Notebooks, but this is a paid service and you have to deal with creating an AWS account.
  2. Databricks Ecosystem → Has its own file system and DataFrame syntax. It is a service started by one of Spark's founders; however, it results in vendor lock-in, and it is also not free.
  3. You can install it locally on Ubuntu or Windows too, but this process is quite cumbersome.

How to Install Spark in Your Local Environment with Docker

Disclaimer: I am not going to dive into the details of installing Docker on your computer here; many tutorials for that are available online. The command embedded below instantly gives you a Jupyter notebook environment with all the bells & whistles, ready to go! I can't do enough justice to being able to start a full development environment with a single command. I therefore highly recommend that anyone who wants to take their Data Science & Engineering skills to the next level learn, at the very least, how to effectively use Docker Hub images.
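The exact command from the original post isn't reproduced here, but a typical one-liner using the jupyter/pyspark-notebook image from Docker Hub gives you the described setup (the port and volume mapping are my assumptions; adjust to taste):

```bash
# Pulls the image on first run, then serves Jupyter on localhost:8888;
# the -v flag persists notebooks from the container into your current folder
docker run -it --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
```

The container logs print a tokenized URL (http://127.0.0.1:8888/?token=...); open it in a browser and you are ready to prototype. As a quick smoke test of the MLlib side, a notebook cell like the following (the toy data and names are mine, not the article's) fits on one screen:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-smoke-test").getOrCreate()

# Toy dataset of (x, y) pairs
df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])

# MLlib estimators expect features packed into a single vector column
train = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

model = LinearRegression(featuresCol="features", labelCol="y").fit(train)
print(model.coefficients, model.intercept)
```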

