What is Persistent ETL and Why Does it Matter?

If you’ve made it to this blog, you’ve probably heard the term “persistent” thrown around with ETL and are curious about what the two really mean together. Extract, Transform, Load (ETL) is the generic concept of taking data from one or more systems and placing it in another system, often in a different format. Persistence is just a fancy word for storing data. Simply put, persistent ETL is adding a storage mechanism to an ETL process. That pretty much covers the what, but the why is much more interesting…

ETL processes have been around forever. They are a necessity for organizations that want to view data across multiple systems. This is all well and good, but what happens if that ETL process gets out of sync? What happens when the ETL process crashes? What about when one of the end systems updates? These are all very real possibilities when working with data storage and retrieval systems. Adding persistence to these processes can help ease or remove many of these concerns.

With a persistent ETL process, a full historical view is recorded as the data is moved. Direct copies of the source data are saved, and depending on the implementation, the output can be saved as well. Holding a full history of transactions enables developers and architects to investigate and troubleshoot issues as they arise. As someone who has done his fair share of troubleshooting, I can assure you that it’s easier to diagnose an issue if you can recreate or walk through the steps that caused it. Additionally, capturing history creates audit capabilities, a key requirement for most organizations.
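To make the idea concrete, here is a minimal sketch of recording history as data moves. It assumes a SQLite file as the storage mechanism, and the table names (`raw_history`, `output_history`), the `run_id` label, and the `transform` step are all hypothetical stand-ins for illustration, not the design of any particular ETL tool.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the (hypothetical) history tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_history ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " run_id TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS output_history ("
        " raw_id INTEGER NOT NULL REFERENCES raw_history(id),"
        " payload TEXT NOT NULL)"
    )
    return conn


def transform(record):
    """Illustrative transformation: tidy up the name field."""
    return {"id": record["id"], "name": record["name"].strip().title()}


def run_etl(conn, run_id, source_records):
    """Save a direct copy of each source record, then save its output."""
    loaded = []
    for record in source_records:
        cur = conn.execute(
            "INSERT INTO raw_history (run_id, payload) VALUES (?, ?)",
            (run_id, json.dumps(record)),
        )
        result = transform(record)
        conn.execute(
            "INSERT INTO output_history (raw_id, payload) VALUES (?, ?)",
            (cur.lastrowid, json.dumps(result)),
        )
        loaded.append(result)
    conn.commit()
    return loaded


def replay(conn, run_id):
    """Walk back through a past run by re-transforming its saved copies."""
    rows = conn.execute(
        "SELECT payload FROM raw_history WHERE run_id = ? ORDER BY id",
        (run_id,),
    )
    return [transform(json.loads(payload)) for (payload,) in rows]
```

Because the source copies are stored untouched, `replay(conn, "run-1")` can reproduce what a past run did without going back to the source system, and the link from `output_history.raw_id` to `raw_history.id` gives a simple audit trail from each output to the input that produced it.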

If set up properly, a persistent ETL tool could serve as the source of record for auditing all data in an organization.

One major concern with adding persistence to an ETL process is performance. While it may not seem intuitive, adding persistence can often improve performance. Many standard ETL tools rely on in-memory transformations, which limits the quantity and complexity of records that can be migrated at any one time. By persisting data instead, the system can execute the process incrementally. Multiple extracts, step-by-step transformations, and multiple loads all become possible because intermediate results are stored rather than held in memory. As long as the system is tuned properly, adding persistence should provide the means to improve both functionality and performance.
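Here is a rough sketch of what executing the process incrementally might look like. It assumes a SQLite staging table with a status column, and every name in it (`staging`, the status values, the doubling "transformation", the dict standing in for the target system) is hypothetical. The point is only that each step works on a small batch and records its progress, so nothing depends on holding the whole dataset in memory.

```python
import sqlite3


def open_staging(path=":memory:"):
    """Create a (hypothetical) staging table that tracks each row's progress."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging ("
        " id INTEGER PRIMARY KEY,"
        " value INTEGER NOT NULL,"
        " result INTEGER,"
        " status TEXT NOT NULL DEFAULT 'extracted')"
    )
    return conn


def extract(conn, rows):
    """Persist extracted (id, value) rows; rows already staged are skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO staging (id, value) VALUES (?, ?)", rows
    )
    conn.commit()


def transform_batch(conn, batch_size):
    """Transform at most batch_size pending rows and record the results."""
    rows = conn.execute(
        "SELECT id, value FROM staging WHERE status = 'extracted'"
        " ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, value in rows:
        conn.execute(
            "UPDATE staging SET result = ?, status = 'transformed' WHERE id = ?",
            (value * 2, row_id),  # illustrative transformation only
        )
    conn.commit()
    return len(rows)


def load_batch(conn, target, batch_size):
    """Load at most batch_size transformed rows into the target mapping."""
    rows = conn.execute(
        "SELECT id, result FROM staging WHERE status = 'transformed'"
        " ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, result in rows:
        target[row_id] = result
        conn.execute(
            "UPDATE staging SET status = 'loaded' WHERE id = ?", (row_id,)
        )
    conn.commit()
    return len(rows)
```

Calling `transform_batch` and `load_batch` in a loop until they return 0 drains the staging table a batch at a time. If the process crashes partway through, the status column shows where it stopped, so the next run picks up the remaining rows instead of starting over.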


dzone.com
