Spark is defined by its creators as a fast and general engine for large-scale data processing.
The fast part means that it is faster than previous approaches to working with Big Data, such as classical MapReduce. The secret to its speed is that Spark runs in memory (RAM), which makes processing much faster than on disk.
The general part means that it can be used for many things, like running distributed SQL, creating data pipelines, ingesting data into a database, running Machine Learning algorithms, working with graphs or data streams, and much more.
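To make the "general" part concrete, here is a minimal sketch of Spark's DataFrame and SQL APIs in PySpark. It assumes PySpark is installed (e.g. via pip); the table name, column names, and sample data are purely illustrative.

```python
from pyspark.sql import SparkSession

# The SparkSession is the entry point for DataFrame and SQL workloads.
spark = SparkSession.builder.appName("spark-intro-example").getOrCreate()

# Build a small in-memory DataFrame; in practice this could come from
# Parquet files, a JDBC source, Kafka, and so on.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with plain SQL;
# Spark distributes the execution across the cluster.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```

The same session object also exposes streaming, graph, and machine-learning workloads, which is what makes Spark a single engine for such different tasks.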


Introducing a new PySpark library: owl-data-sanitizer