The process of extracting, transforming and loading data from disparate sources (ETL) has become critical in recent years with the growth of data science applications. Data availability, timeliness, accuracy and consistency are also key requirements at the beginning of any data project.

Even though there are guidelines, there is no one-size-fits-all architecture for building ETL data pipelines. The right design depends on multiple factors, such as the type of data, its frequency and volume, and the expertise of the people who will maintain the pipelines. Data pipelines need to be reliable and scalable, but also relatively straightforward for data engineers and data scientists to extend with new sources and to change the underlying data structures.

There is a myriad of tools that can be used for ETL, but Spark is probably one of the most widely used data processing platforms due to its speed at handling large data volumes. In addition to data processing, Spark has libraries for machine learning, streaming and data analytics, among others, so it is a great platform for implementing end-to-end data projects. It also supports Python (PySpark) and R (SparkR, sparklyr), which are the most used programming languages for data science.
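To make this concrete, below is a minimal PySpark sketch of the extract-transform-load flow with a Delta Lake sink. It assumes a Spark environment with the delta-spark package configured; the paths, table layout and column names are hypothetical placeholders, not part of the framework described in this article.

```python
# Minimal ETL sketch: extract raw CSV files, apply simple transformations,
# and load the result as a Delta table.
# Assumes Spark with the delta-spark package available; all paths and
# column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("notebook-etl-example")
    # Enable the Delta Lake extensions (requires delta-spark on the cluster)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Extract: read raw data from a landing zone.
raw = spark.read.option("header", True).csv("/mnt/landing/sales/*.csv")

# Transform: cast types, parse dates, drop rows missing key fields.
clean = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("sale_date", F.to_date("sale_date", "yyyy-MM-dd"))
       .dropna(subset=["sale_date", "amount"])
)

# Load: write the curated data as a Delta table, partitioned by date.
(
    clean.write.format("delta")
         .mode("overwrite")
         .partitionBy("sale_date")
         .save("/mnt/curated/sales")
)
```

The same notebook could then be parameterised and scheduled, which is the kind of reuse a notebook-based ETL framework aims to make easy.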

#data-science #data-engineering #etl #delta-lake #spark

Building a notebook-based ETL framework with Spark and Delta Lake