Spark’s inventors chose Scala to write its low-level modules. In Data Science and Machine Learning with Scala and Spark (Episode 01/03), we covered the basics of the Scala programming language using a Google Colab environment. In this article, we learn about the Spark ecosystem and its higher-level API for Scala users. As before, we use Spark 3.0.0 and Google Colab to practice some code snippets.

What is Apache Spark?

According to Apache Spark and Delta Lake Under the Hood:

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open-source engine for this task, making it the de facto tool for any developer or data scientist interested in big data. Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and scale up to big data processing at incredibly large scale.
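To make this concrete, the sketch below shows the typical entry point into Spark's higher-level API for Scala users: creating a SparkSession and querying a small DataFrame. It assumes a local Spark 3.0.0 installation (the version used in this series); the object name and sample data are illustrative, not from the original article.

```scala
// Minimal sketch of Spark's higher-level (DataFrame) API in Scala,
// assuming Spark 3.0.0 is on the classpath.
import org.apache.spark.sql.SparkSession

object SparkHello {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark on all cores of the local machine,
    // which is how Spark runs inside a Colab notebook (no cluster needed).
    val spark = SparkSession.builder()
      .appName("SparkHello")
      .master("local[*]")
      .getOrCreate()

    // Enables the toDF and $"col" syntax used below.
    import spark.implicits._

    // A tiny DataFrame built from an in-memory sequence (illustrative data).
    val df = Seq(("Scala", 2004), ("Spark", 2014)).toDF("name", "year")

    // A simple transformation and action on the DataFrame.
    df.filter($"year" > 2010).show()

    spark.stop()
  }
}
```

The same code scales unchanged from this local mode to a real cluster: only the `master` setting (or the cluster manager configuration) changes, which is the "laptop to thousands of servers" property the quotation describes.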
