How to connect Databricks to your Azure Data Lake

1. Introduction

TL;DR: Clone this git project, set the parameters and run 0_script.sh to deploy 1 ADLSgen2 hub and N Databricks spokes

A data lake is a centralized repository of data that allows enterprises to create business value from data. Azure Databricks is a popular tool to analyze data and build data pipelines. This blog discusses how Azure Databricks can be connected to an ADLSgen2 storage account in a secure and scalable way. The following points are key:

Defense in depth: ADLSgen2 contains sensitive data and shall be secured using private endpoints and Azure AD (disabling access keys). Databricks can only access ADLSgen2 using private link and Azure AD
Access control: Business units typically have their own Databricks workspace. Multiple workspaces shall be granted access to ADLSgen2 File Systems using Role Based Access Control (RBAC)
Hub/spoke architecture: Only one hub network can access the ADLSgen2 account using private link. Databricks spoke networks peer to the hub network to simplify networking

2. Setup ADLSgen2 with N Databricks workspaces

In the remainder of this blog, the project is deployed using the following steps:

2.1 Prerequisites
2.2 Create 1 ADLSgen2 account with hub network
2.3 Create N Databricks workspaces with spoke networks
2.4 Connect Databricks with ADLSgen2 account using private link
2.5 Mount storage account with Databricks

2.1 Prerequisites

The following resources are required in this tutorial:

Azure Account
Azure DevOps or Ubuntu terminal to run shell scripts
Azure CLI

Finally, clone the project below or add the repository in your Azure DevOps project.

This repository contains 5 scripts, in which 0_script.sh triggers the other 4 scripts that do the actual deployment. It also contains a params_template.sh file. Substitute the variable values in this file with your own values and rename the file to params.sh. You are then ready to run the scripts. The 4 scripts are discussed in the remainder of this blog.

2.2 Create 1 ADLSgen2 account with hub network

In script 1_deploy_resources_1_hub.sh the following steps are executed:

Create an ADLSgen2 account with hierarchical namespace enabled. This makes it possible to create folders and to apply fine-grained access control
Create a VNET and add a private endpoint to the storage account
Create a private DNS zone for the private endpoint, such that the storage account name resolves to the private IP of the endpoint
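
The steps above can be sketched as a list of Azure CLI invocations. This is a minimal sketch and not the content of 1_deploy_resources_1_hub.sh: the function name, the resource names, the address ranges and the SKU are assumptions made for illustration. The code only builds the commands and does not run them. Registering the IP of the private endpoint in the DNS zone is not shown.

```python
from typing import List

# Standard private DNS zone name for the dfs (ADLSgen2) endpoint of a storage account
PRIVATE_DNS_ZONE = "privatelink.dfs.core.windows.net"


def hub_commands(rg: str, location: str, account: str, account_id: str,
                 vnet: str, subnet: str) -> List[List[str]]:
    """Build the az CLI calls for the hub, in execution order (hypothetical helper)."""
    return [
        # 1. Storage account with hierarchical namespace; access keys disabled
        ["az", "storage", "account", "create",
         "--name", account, "--resource-group", rg, "--location", location,
         "--sku", "Standard_LRS", "--kind", "StorageV2",
         "--enable-hierarchical-namespace", "true",
         "--allow-shared-key-access", "false"],
        # 2. Hub VNET with one subnet (address ranges are assumptions)
        ["az", "network", "vnet", "create",
         "--name", vnet, "--resource-group", rg, "--location", location,
         "--address-prefixes", "10.0.0.0/16",
         "--subnet-name", subnet, "--subnet-prefixes", "10.0.0.0/24"],
        # 3. Private endpoint on the dfs sub-resource of the storage account
        ["az", "network", "private-endpoint", "create",
         "--name", f"{account}-pe", "--resource-group", rg, "--location", location,
         "--vnet-name", vnet, "--subnet", subnet,
         "--private-connection-resource-id", account_id,
         "--group-id", "dfs", "--connection-name", f"{account}-pe-conn"],
        # 4. Private DNS zone in which the endpoint is resolved
        ["az", "network", "private-dns", "zone", "create",
         "--resource-group", rg, "--name", PRIVATE_DNS_ZONE],
    ]
```

Each list can be printed with shlex.join(cmd) for review or passed to subprocess.run(cmd, check=True) once you are logged in with the Azure CLI.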

2.3 Create N Databricks workspaces with spoke networks

In script 2_deploy_resources_N_spokes.sh the following steps are executed:

Create N Databricks workspaces. Workspaces are deployed in their own VNET and possibly in different subscriptions. Databricks is deployed such that clusters only have a private IP.
For each Databricks workspace, create a service principal. Grant the service principal access rights to its own File System in the storage account
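
To illustrate the access control step: in Azure RBAC, a role can be assigned on the scope of a single File System (container) rather than on the whole storage account. The sketch below builds that scope and the matching Azure CLI call. The helper names are hypothetical, and the role "Storage Blob Data Contributor" is an assumption, since the blog does not state which role the script grants.

```python
from typing import List


def filesystem_scope(subscription_id: str, resource_group: str,
                     account: str, filesystem: str) -> str:
    """Resource ID of one File System (container) in a storage account."""
    return (f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.Storage/storageAccounts/{account}"
            f"/blobServices/default/containers/{filesystem}")


def role_assignment_command(principal_id: str, scope: str,
                            role: str = "Storage Blob Data Contributor") -> List[str]:
    """az CLI call that grants `role` to the service principal on `scope` only."""
    return ["az", "role", "assignment", "create",
            "--assignee", principal_id, "--role", role, "--scope", scope]


# Example with placeholder values: one File System per business unit
scope = filesystem_scope("<subscription-id>", "rg-hub", "mylake", "bu1")
command = role_assignment_command("<service-principal-id>", scope)
```

Because the scope ends at the container level, the service principal of one workspace gets no rights on the File Systems of the other workspaces.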

2.4 Connect Databricks with ADLSgen2 account using private link

In script 3_configure_network_N_spokes.sh the following steps are executed:

Create a peering for each Databricks spoke VNET to the hub VNET of the storage account
Vice versa, create a peering from the hub VNET to each Databricks spoke VNET
Add all Databricks VNETs to the private DNS zone such that the private endpoint of the storage account can be used in Databricks notebooks
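
The three steps above can be sketched as follows: per spoke, two peerings (one in each direction) and one link to the private DNS zone. This is a sketch under assumptions and not the content of 3_configure_network_N_spokes.sh: the Vnet class, the function name and the peering and link names are made up for illustration, and the DNS zone is assumed to live in the resource group of the hub. The code only builds the commands. For spokes in another subscription, each call would also need the matching --subscription argument.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Vnet:
    name: str
    resource_group: str
    resource_id: str  # full Azure resource ID, needed across resource groups


def network_commands(hub: Vnet, spokes: List[Vnet],
                     zone: str = "privatelink.dfs.core.windows.net") -> List[List[str]]:
    """Build the az CLI calls that connect every spoke to the hub."""
    cmds: List[List[str]] = []
    for spoke in spokes:
        # Peering from the spoke VNET to the hub VNET
        cmds.append(["az", "network", "vnet", "peering", "create",
                     "--name", f"{spoke.name}-to-{hub.name}",
                     "--resource-group", spoke.resource_group,
                     "--vnet-name", spoke.name,
                     "--remote-vnet", hub.resource_id,
                     "--allow-vnet-access"])
        # Peering from the hub VNET back to the spoke VNET
        cmds.append(["az", "network", "vnet", "peering", "create",
                     "--name", f"{hub.name}-to-{spoke.name}",
                     "--resource-group", hub.resource_group,
                     "--vnet-name", hub.name,
                     "--remote-vnet", spoke.resource_id,
                     "--allow-vnet-access"])
        # Link the spoke VNET to the private DNS zone of the storage account
        cmds.append(["az", "network", "private-dns", "link", "vnet", "create",
                     "--resource-group", hub.resource_group,
                     "--zone-name", zone,
                     "--name", f"{spoke.name}-link",
                     "--virtual-network", spoke.resource_id,
                     "--registration-enabled", "false"])
    return cmds
```

Note that the spokes are only peered to the hub and not to each other, which is what keeps the networking simple when N grows.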

2.5 Mount storage account with Databricks

In script 4_mount_storage_N_spokes.sh the following steps are executed:

For each Databricks workspace, add the mount notebooks to the workspace using the Databricks REST API
For each Databricks workspace, store the credentials of the service principal in a Databricks-backed secret scope
Create a cluster and run the notebook on the cluster. The notebook will fetch the service principal credentials from the secret scope and mount its own File System in the storage account using the private endpoint of the storage account, see also the screenshot below
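
A mount notebook for this setup could look roughly like the sketch below. It is not the notebook from the repository: the helper names, the secret scope and key names and the mount point are assumptions. The configuration keys are the OAuth settings of the ABFS driver that Databricks uses for service principal authentication. The dbutils calls are left as comments because dbutils only exists inside a Databricks notebook.

```python
from typing import Dict


def oauth_mount_configs(client_id: str, client_secret: str, tenant_id: str) -> Dict[str, str]:
    """Spark configuration for mounting ADLSgen2 with a service principal (OAuth)."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_source(filesystem: str, account: str) -> str:
    """abfss URI of one File System; resolves to the private endpoint via the DNS zone."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/"


# Inside a Databricks notebook (scope and key names are hypothetical):
# configs = oauth_mount_configs(
#     client_id=dbutils.secrets.get(scope="<scope>", key="<client-id-key>"),
#     client_secret=dbutils.secrets.get(scope="<scope>", key="<client-secret-key>"),
#     tenant_id="<tenant-id>",
# )
# dbutils.fs.mount(
#     source=mount_source("<file-system>", "<storage-account>"),
#     mount_point="/mnt/<file-system>",
#     extra_configs=configs,
# )
```

The notebook never contains the credentials themselves. It only refers to the secret scope, so the same notebook can be imported into every workspace.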
