Not so far ago, the approach taken to table design in source systems (application databases) used to be — we don’t care about ETL. Figure it out, we’ll concentrate on building the application. The last couple of years have been great for the development of ETL methodologies with a lot of open-source tools coming in from some of the big tech companies like Airbnb, LinkedIn, Google, Facebook and so on. And with cloud going mainstream, providers like Azure, Google and Microsoft have made sure that they build upon and support all the open source technologies in the data engineering space.

I have been a part of many ETL projects, some of which have failed miserably and the rest have succeeded. There are many ways an ETL project can go wrong. We’ll talk about one of the most important aspects today — table design in the source system.

ETL pipelines are as good as the source systems they’re built upon.

This statement holds completely true irrespective of the effort one puts in the T layer of the ETL pipeline. The transform layer is usually misunderstood as the layer which fixes everything that is wrong with your application and the data generated by the application. That is absolutely untrue. Without further ago, let’s look at the bare minimum that you should take into account while designing tables which are going to be ETL’d to a target system —

#data-science #sql #data-engineering #etl #pipeline

Table Design Best Practices for ETL
1.10 GEEK