An upsert is an RDBMS feature that lets the author of a DML statement either insert a row or, if the row already exists, update that existing row instead. From my experience building multiple Azure Data Platforms, I have developed reusable ELT functions that I can take from project to project, one of which is an Azure SQL upsert function. Today I'm going to show you how to create an Azure SQL upsert function using PySpark that can be reused across Databricks workflows with minimal effort.

Basic Upsert Logic

  1. Two tables are created: one staging table and one target table.
  2. Data is loaded into the staging table.
  3. The tables are joined on lookup columns and/or a delta column to identify matches.
  4. If a record in the staging table exists in the target table, that record is updated in the target table.
  5. If a record in the staging table does not exist in the target table, it is inserted into the target table (see the MERGE sketch below).
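To make the flow concrete, here is a minimal sketch of the kind of T-SQL MERGE that implements these steps, held in a Python string since we will later hand it to PyODBC. The table and column names (dbo.staging, dbo.target, customer_id, name, updated_at) are hypothetical:

```python
# Hypothetical tables/columns: dbo.staging, dbo.target, customer_id, name, updated_at.
merge_sql = """
MERGE dbo.target AS tgt
USING dbo.staging AS src
    ON tgt.customer_id = src.customer_id          -- lookup column(s)
WHEN MATCHED AND src.updated_at > tgt.updated_at  -- optional delta check
    THEN UPDATE SET
        tgt.name       = src.name,
        tgt.updated_at = src.updated_at
WHEN NOT MATCHED BY TARGET
    THEN INSERT (customer_id, name, updated_at)
         VALUES (src.customer_id, src.name, src.updated_at);
"""
```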

Azure SQL Upsert PySpark Function

Functionality

  • An input DataFrame is written to a staging table on Azure SQL.
  • The function accepts a parameter for multiple lookup columns and/or an optional delta column used to join the staging and target tables.
  • If a delta column is passed to the function, it updates a record in the target table only when the staging table record is newer than the target table record.
  • The function dynamically reads the DataFrame's columns to build the update and insert clauses of the SQL MERGE statement, as sketched below.
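A sketch of how such a function might dynamically assemble the MERGE statement is shown below; the function and parameter names (build_merge_statement, lookup_columns, delta_column) are my own illustrative choices, not a definitive implementation:

```python
from typing import List, Optional
from pyspark.sql import DataFrame

def build_merge_statement(df: DataFrame,
                          staging_table: str,
                          target_table: str,
                          lookup_columns: List[str],
                          delta_column: Optional[str] = None) -> str:
    """Build a T-SQL MERGE from the DataFrame's columns (hypothetical helper)."""
    columns = df.columns

    # Join staging to target on the lookup columns.
    on_clause = " AND ".join(f"tgt.{c} = src.{c}" for c in lookup_columns)

    # Only update when the staging record is newer, if a delta column is given.
    matched_clause = "WHEN MATCHED"
    if delta_column:
        matched_clause += f" AND src.{delta_column} > tgt.{delta_column}"

    # Update every non-key column; insert every column.
    update_set = ", ".join(f"tgt.{c} = src.{c}"
                           for c in columns if c not in lookup_columns)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"src.{c}" for c in columns)

    return (f"MERGE {target_table} AS tgt "
            f"USING {staging_table} AS src ON {on_clause} "
            f"{matched_clause} THEN UPDATE SET {update_set} "
            f"WHEN NOT MATCHED BY TARGET THEN INSERT ({insert_cols}) "
            f"VALUES ({insert_vals});")
```

Excluding the lookup columns from the UPDATE SET clause avoids rewriting the join keys; everything else, including the delta column, is refreshed from the staging row.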

Before writing code, it is critical to understand the Spark Azure SQL Database connector. The connector does not support running pre- or post-write SQL statements around a table write, so the MERGE cannot be triggered by the write itself. For this reason, we write the DataFrame to the staging table and then pass the generated SQL MERGE statement to a PyODBC connection to execute the upsert.
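Putting the two pieces together, a minimal sketch of that flow might look like the following: write the DataFrame to staging over JDBC, then run the generated MERGE through a PyODBC connection. The name upsert_to_azure_sql is hypothetical, and it reuses the build_merge_statement helper sketched above:

```python
import pyodbc
from pyspark.sql import DataFrame

def upsert_to_azure_sql(df: DataFrame, jdbc_url: str, user: str, password: str,
                        odbc_conn_str: str, staging_table: str, target_table: str,
                        lookup_columns, delta_column=None):
    # 1. Overwrite the staging table on Azure SQL with the incoming DataFrame.
    (df.write
       .format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", staging_table)
       .option("user", user)
       .option("password", password)
       .mode("overwrite")
       .save())

    # 2. The connector offers no post-write hook, so execute the MERGE
    #    ourselves over an ODBC connection.
    merge_sql = build_merge_statement(df, staging_table, target_table,
                                      lookup_columns, delta_column)
    conn = pyodbc.connect(odbc_conn_str)
    try:
        conn.cursor().execute(merge_sql)
        conn.commit()
    finally:
        conn.close()
```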
