Databricks PySpark Type 2 SCD Function for Azure Synapse Analytics

Slowly Changing Dimensions (SCD) is a commonly used dimensional modelling technique used in data warehousing to capture the changing data within the dimension (Image 1) over time. The three most commonly used SCD Types are 0, 1, 2.The majority of DW/BI projects have type 2 dimensions where a change to an attribute causes the current dimension record to be end dated and a new record created allowing for a complete history of the data changes. See example belowData Before

After

I used this in my latest project which was with an electrical distribution company with the aim to track the approved Australian Energy Regulator Tariff changes over time.Today I’m going to share with you have to how to create a reusable PySpark function that can be reused across Databricks workflows with minimal effort.Type 2 SCD PySpark FunctionBefore we start writing code we must understand the Databricks Azure Synapse Analytics connector. It supports read/write operations and accepts valid SQL statements in pre-action or post-action operations before or after writing to the table. Therefore to create this function the code must form the valid SQL statement that it passes to the connector.Prerequisite

Azure Synapse Staging and Destination tables must be created for optimal performance and storage

Functionality

The function accepts an input DataframeWhen lookup columns are passed to the function they are used to join the staging and destination tables. The lookup columns provide the ability to only close off rows coming in from the Input Dataframe that are currently open in the destination table with an Effective End Date of 9999–12–31 with the current date-timeIf no lookup column(s) are passed no join condition is used and all current records are closed off and all the new input records are the open recordsThe function is capable of handling different input different Date\DateTime formats and will output a uniform DateTime format agreed with the business. Example below:

#azure-synapse-analytics #databricks #azure #pyspark

pub.towardsai.net

Databricks PySpark Type 2 SCD Function for Azure Synapse Analytics