Apache Spark is the most developed library that you can utilize for many of your Machine Learning applications. It provides the users with the ease of developing ML-based algorithms in data scientist’s favorite scientific prototyping environment Jupyter Notebooks.
The Main Differences between MapReduce HDFS & Apache Spark
MapReduce needs files to be stored in a Hadoop Distributed File System, Spark doesn’t need that and it is really really fast compared to MapReduce. In contrast, Spark uses Resilient Distributed Datasets aka RDDs to store data which highly resembles Pandas dataframes. RDDs are also where Spark shines really brightly because you can harness the mighty power of the pandas-like data preprocessing and transformation through them.
The Main Installation Choices for Using Spark
How to Install Spark on you local environment with Docker
Disclaimer, I am not going to dive into installation details of docker on your computer here for which many tutorials are available online. This command(embedded below) instantly gives you a Jupyter notebook environment with all the bells & whistles ready to go! I can’t do enough justice to be able to start a development environment with a single command. Therefore I highly highly recommend anyone who wants to take their Data Science & Engineering skills to the next level to learn at the very least how to effectively utilize Docker Hub images.
#artificial-intelligence #python #machine-learning #apache-spark #docker