Big data is now part of our lives, and most companies that collect data must process it at scale to extract meaningful insights. While complex neural networks can be wonderfully accurate on big data sets, they are not always the best choice: when predictions must be fast and efficient despite high problem complexity, we need a scalable machine learning solution.

Apache Spark comes with SparkML, a library of built-in machine learning algorithms that are optimised for parallel processing and hence very time-efficient on big data. In this article, we will walk through a simple SparkML pipeline for cleaning, processing, and generating predictions on a big data set.


We will take the weather data of JFK airport and try several built-in classifiers in SparkML. The data set contains columns such as wind speed, humidity, and station pressure, and we will try to classify the wind direction based on the other inputs.

Let's begin by cleaning the data set using Spark. Please note, I will leave a link to my GitHub repo for this code so you don't have to copy it from here; however, I will explain the code in this article.

