"Dynamic Data Pipelining with Luigi" - Trey Hakanson (PyOhio 2019)

As the scale of modern data has grown, so too has the need for tooling that can handle it. Databases have had to become more horizontally scalable, less centralized, and more fault tolerant to meet the expectations of modern users. Data warehousing and data engineering at this scale are relatively new disciplines, and engineers are still hard at work solving the core problems of the field. One problem of particular interest is dynamic data pipelining and workflow management. Ingesting large amounts of data, transforming streams dynamically into a standardized format, and tracking checkpoints and dependencies so that every prerequisite is met before a given task begins are all difficult problems. This talk will describe how these problems can be solved using Luigi, Spotify’s robust tool for constructing complex data pipelines and workflows.

Luigi allows complex pipelines to be described programmatically, with each task able to depend on several upstream tasks and feed several downstream ones. This makes it suitable for a wide variety of batch jobs, and the optional centralized scheduler makes it easy to monitor job progress across data warehouses. In addition, Luigi’s robust checkpoint system allows a pipeline to be resumed from the point at which it failed. Each task is well-defined, specifying its required inputs and resulting outputs, so creating or editing pipelines is a breeze.
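The task model described above can be sketched in a few lines of plain Python. This is a standard-library toy, not Luigi's implementation: the `Task` base class, the `build` function, the `Extract` and `Transform` tasks, and the file names are all hypothetical, and only the method names `requires`, `output`, and `run` are borrowed from the way Luigi tasks are written. It assumes that a task counts as complete when its output file exists, which is what lets a rerun skip finished work.

```python
import os
import tempfile


class Task:
    """Toy stand-in for a pipeline task (hypothetical, not luigi.Task)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        # Upstream tasks that must be complete before this one runs.
        return []

    def output(self):
        # Path of the file this task produces.
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # Assumption: the output file doubles as the checkpoint.
        return os.path.exists(self.output())


class Extract(Task):
    def output(self):
        return os.path.join(self.workdir, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class Transform(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "sorted.txt")

    def run(self):
        with open(self.requires()[0].output()) as f:
            values = sorted(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write("\n".join(str(v) for v in values) + "\n")


def build(task, log):
    """Run dependencies first, skipping any task whose output already exists."""
    if task.complete():
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append(type(task).__name__)


def demo():
    """Build the pipeline, lose the final output, then build it again."""
    with tempfile.TemporaryDirectory() as workdir:
        first, second = [], []
        build(Transform(workdir), first)
        os.remove(Transform(workdir).output())  # simulate a failed last step
        build(Transform(workdir), second)
        return first, second
```

Calling `demo()` runs both tasks on the first build and only `Transform` on the second, because the output of `Extract` is still on disk. A production tool also has to write outputs atomically so that a half-written file is never mistaken for a checkpoint; the sketch skips that for brevity.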

As data volumes grow, so does the list of challenges that tooling must handle. Whether the job is reporting, bulk ingestion, or ETL, it is important to maintain flexibility and ensure proper monitoring. Luigi provides a robust toolkit for a wide variety of data pipelining tasks and can be easily integrated into existing workflows.

#big data
