Pure Functions and Lazy Evaluations — The Crux of Distributed Data Computations

Apache Spark has become one of the most widely used tools in the Big Data world today. It can run standalone and exposes APIs for Python, Scala, Java, and other languages. It can be used to query datasets, and one of the most compelling parts of its architecture is the ability to analyze real-time streaming data without explicitly storing it first. Spark is written in Scala and was designed as a distributed cluster-computing framework. From resource management and multithreading to task distribution and actually executing the logic, Spark handles everything under the hood. From an end-user perspective, it is an analysis tool into which huge amounts of data can be fed and from which the required analyses can be drawn within minutes. But how does Spark achieve this? What are the core principles of using Spark to work with large datasets?
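To make the title concrete, here is a minimal sketch in Spark's Scala API, assuming a local Spark installation; the `squared` helper and the app/object names are illustrative, not from the original article. It shows the two ideas at play: pure functions (no side effects, so Spark can re-run them on any partition, on any node) and lazy evaluation (transformations such as `map` and `filter` only record a lineage, and nothing executes until an action such as `count` is called).

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: running locally for illustration; on a cluster the
    // master would be set by the deployment, not hard-coded.
    val spark = SparkSession.builder()
      .appName("lazy-eval-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A pure function: its output depends only on its input and it has
    // no side effects, so Spark can safely re-execute it anywhere.
    def squared(x: Int): Int = x * x

    val numbers = sc.parallelize(1 to 1000000)

    // Transformations are lazy: nothing is computed here. Spark only
    // records the lineage (parallelize -> map -> filter).
    val evenSquares = numbers.map(squared).filter(_ % 2 == 0)

    // An action triggers evaluation: the whole pipeline runs now,
    // distributed across the available executors.
    println(evenSquares.count())

    spark.stop()
  }
}
```

Because the lineage is only a plan until the action fires, Spark is free to pipeline the `map` and `filter` steps and to recompute lost partitions by replaying the pure functions, which is what makes this model practical for distributed data.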

To ramp up on the basics of Spark, its architecture, and its role in the Big Data and Cloud world, refer to the story linked below.

#spark #data-science #data #big-data #functional-programming

The Ultimate Guide to Functional Programming for Big Data