1. Introduction
TL;DR: Clone this git project, set the parameters, and run 0_script.sh to deploy 1 ADLSgen2 hub and N Databricks spokes
A data lake is a centralized repository of data that allows enterprises to create business value from data. Azure Databricks is a popular tool to analyze data and build data pipelines. This blog discusses how Azure Databricks can be connected to an ADLSgen2 storage account in a secure and scalable way. The following aspects are key:
- Defense in depth: ADLSgen2 contains sensitive data and shall be secured using private endpoints and Azure AD authentication (access keys disabled). Databricks can only access ADLSgen2 over private link using Azure AD
- Access control: Business units typically have their own Databricks workspace. Multiple workspaces shall be granted access to ADLSgen2 File Systems using role-based access control (RBAC)
- Hub/spoke architecture: Only the hub network can access the ADLSgen2 account over private link. The Databricks spoke networks peer with the hub network, which simplifies the networking
See also the picture below:
In the remainder of this blog post, the project is explained in more detail. In the next chapter, the project is deployed.
2. Setup ADLSgen2 with N Databricks workspaces
In the remainder of this blog, the project is deployed using the following steps:
- 2.1 Prerequisites
- 2.2 Create 1 ADLSgen2 account with hub network
- 2.3 Create N Databricks workspaces with spoke networks
- 2.4 Connect Databricks with ADLSgen2 account using private link
- 2.5 Mount storage account with Databricks
2.1 Prerequisites
The following resources are required in this tutorial:
Finally, clone the project below or add the repository to your Azure DevOps project.
This repository contains 5 scripts, in which 0_script.sh triggers the other 4 scripts that perform the actual deployment. It also contains a params_template.sh file. The variable values in this file need to be substituted with your own values, after which the file is renamed to params.sh. You are then ready to run the scripts. The 4 scripts are discussed in the remainder of this blog.
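Preparing the parameter file could look like the sketch below. Note that the variable names and values here are hypothetical illustrations; use the names that params_template.sh actually defines.

```shell
# Hypothetical illustration: the actual variable names in
# params_template.sh may differ from the ones shown here.
cat > params_template.sh <<'EOF'
SUBSCRIPTION_ID="<your-subscription-id>"
RESOURCE_GROUP="<your-resource-group>"
NUMBER_OF_SPOKES="<number-of-spokes>"
EOF

# Substitute the placeholders with your own values and save as params.sh
sed -e 's|<your-subscription-id>|00000000-0000-0000-0000-000000000000|' \
    -e 's|<your-resource-group>|blog-dbrpvtlnk-rg|' \
    -e 's|<number-of-spokes>|2|' \
    params_template.sh > params.sh

# The scripts can then source params.sh to pick up the values
source params.sh
echo "$RESOURCE_GROUP has $NUMBER_OF_SPOKES spokes"
```

After params.sh is in place, running 0_script.sh kicks off the full deployment.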
2.2 Create 1 ADLSgen2 account with hub network
In script 1_deploy_resources_1_hub.sh the following steps are executed:
- Create an ADLSgen2 account with the hierarchical namespace enabled. This allows folders to be created and fine-grained access control to be applied
- Create a VNET and add a private endpoint to the storage account
- Create a private DNS zone in which the private endpoint of the storage account is registered
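The steps above can be sketched with Azure CLI commands like the following. This is a minimal illustration, not the script itself; all resource names, address ranges, and the region are made up.

```shell
# Illustrative sketch of the hub deployment (names are hypothetical)
RG=blog-hub-rg
LOC=westeurope

# 1. ADLSgen2 account: --hns true enables the hierarchical namespace
az storage account create -n blogadls2hub -g $RG -l $LOC \
  --sku Standard_LRS --kind StorageV2 --hns true

# 2. Hub VNET with a subnet to host the private endpoint
az network vnet create -g $RG -n hub-vnet --address-prefix 10.0.0.0/16 \
  --subnet-name pe-subnet --subnet-prefix 10.0.1.0/24

# 3. Private endpoint targeting the storage account's dfs sub-resource
STORAGE_ID=$(az storage account show -n blogadls2hub -g $RG --query id -o tsv)
az network private-endpoint create -g $RG -n adls-pe \
  --vnet-name hub-vnet --subnet pe-subnet \
  --private-connection-resource-id $STORAGE_ID \
  --group-id dfs --connection-name adls-pe-conn

# 4. Private DNS zone so <account>.dfs.core.windows.net resolves
#    to the private endpoint IP instead of the public endpoint
az network private-dns zone create -g $RG -n privatelink.dfs.core.windows.net
```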
2.3 Create N Databricks workspaces with spoke networks
In script 2_deploy_resources_N_spokes.sh the following steps are executed:
- Create N Databricks workspaces. Each workspace is deployed in its own VNET and possibly in a different subscription. Databricks is deployed such that clusters only have private IPs.
- For each Databricks workspace, create a service principal. Grant service principal access rights to its own File System in the Storage account
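For a single spoke, these steps can be sketched as follows. Again, this is an illustration under assumed names (workspace, service principal, and container names are made up), not the script's exact commands.

```shell
# Illustrative sketch of one spoke deployment (names are hypothetical)
RG=blog-spoke1-rg

# Databricks workspace whose clusters only get private IPs
az databricks workspace create -g $RG -n blog-dbr-spoke1 -l westeurope \
  --sku premium --enable-no-public-ip

# Service principal dedicated to this workspace
SP_APP_ID=$(az ad sp create-for-rbac -n blog-sp-spoke1 --query appId -o tsv)

# Grant it data access to its own File System (container) only,
# not to the whole storage account
SCOPE="/subscriptions/<sub-id>/resourceGroups/blog-hub-rg/providers/Microsoft.Storage/storageAccounts/blogadls2hub/blobServices/default/containers/spoke1-fs"
az role assignment create --assignee $SP_APP_ID \
  --role "Storage Blob Data Contributor" --scope $SCOPE
```

Scoping the role assignment to the container level is what enforces that each workspace can only read and write its own File System.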
2.4 Connect Databricks with ADLSgen2 account using private link
In script 3_configure_network_N_spokes.sh the following steps are executed:
- Create a peering for each Databricks spoke VNET to the hub VNET of the storage account
- Vice versa, create a peering from the hub VNET to each Databricks spoke VNET
- Add all Databricks VNETs to the private DNS zone such that the private endpoint of the storage account can be resolved from Databricks notebooks
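For one spoke, the peering and DNS steps can be sketched as below (a minimal illustration with hypothetical resource names):

```shell
# Illustrative sketch: peer one spoke with the hub (names are hypothetical)
HUB_VNET_ID=$(az network vnet show -g blog-hub-rg -n hub-vnet \
  --query id -o tsv)
SPOKE_VNET_ID=$(az network vnet show -g blog-spoke1-rg -n spoke1-vnet \
  --query id -o tsv)

# Spoke -> hub peering
az network vnet peering create -g blog-spoke1-rg -n spoke1-to-hub \
  --vnet-name spoke1-vnet --remote-vnet $HUB_VNET_ID --allow-vnet-access

# Hub -> spoke peering
az network vnet peering create -g blog-hub-rg -n hub-to-spoke1 \
  --vnet-name hub-vnet --remote-vnet $SPOKE_VNET_ID --allow-vnet-access

# Link the spoke VNET to the private DNS zone so that notebooks resolve
# <account>.dfs.core.windows.net to the private endpoint IP
az network private-dns link vnet create -g blog-hub-rg \
  --zone-name privatelink.dfs.core.windows.net -n spoke1-link \
  --virtual-network $SPOKE_VNET_ID --registration-enabled false
```

Without the DNS link, traffic from the spoke would resolve the storage account to its public endpoint and bypass the private link.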
2.5 Mount storage account with Databricks
In script 4_mount_storage_N_spokes.sh the following steps are executed:
- For each Databricks workspace, add the mount notebook to the workspace using the Databricks REST API
- For each Databricks workspace, store the credentials of its service principal in a Databricks-backed secret scope
- Create a cluster and run the notebook on it. The notebook fetches the service principal credentials from the secret scope and mounts its own File System in the storage account using the private endpoint of the storage account, see also the screenshot below
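The first two steps can be sketched with the Databricks CLI as below. The workspace URL, token, scope name, and notebook path are hypothetical placeholders.

```shell
# Illustrative sketch using the Databricks CLI (values are hypothetical)
export DATABRICKS_HOST=https://adb-1234567890123456.7.azuredatabricks.net
export DATABRICKS_TOKEN=<personal-access-token>

# Databricks-backed secret scope holding the service principal credentials
databricks secrets create-scope --scope adls-sp
databricks secrets put --scope adls-sp --key sp-client-id \
  --string-value "<app-id>"
databricks secrets put --scope adls-sp --key sp-client-secret \
  --string-value "<client-secret>"

# Import the mount notebook into the workspace
databricks workspace import mount_adls.py /Shared/mount_adls \
  -l PYTHON -f SOURCE
```

The notebook can then read the credentials with dbutils.secrets.get and mount the File System, so that no secret ever appears in notebook code.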
#databricks #data-science #azure