Notebooks are the primary runtime on Databricks, from data science exploration to ETL and ML in production. This emphasis on notebooks calls for a change in how we think about production-quality code. We have to set aside our hesitancy about messy notebooks and ask ourselves: How do we move notebooks into our production pipelines? How do we run unit and integration tests on notebooks? Can we treat notebooks as artifacts of a DevOps pipeline?

Databricks notebooks as first-class citizens

When choosing Databricks as the compute platform, your best option is to also run notebooks in your production environment. This decision is driven by the platform's overwhelming support for the notebook runtime compared to classic Python scripting. We argue that one should fully embrace the notebook approach and choose the best methods to test and deploy notebooks in a production environment. In this blog, we use Azure DevOps pipelines for notebook (unit and integration) testing on transient Databricks clusters and for notebook artifact registration.

Notebooks: the entry point of a Python package

Notebooks can live in isolation, but we prefer them as part of a Git repository with the following structure. It contains a notebooks directory for Databricks notebooks checked in as Source files, a Python package (‘my_model’) with functionality to be imported in the notebooks, a tests directory with unit tests for the Python package, an Azure DevOps pipeline definition, and a cluster-config.json to configure our transient Databricks clusters. Additionally, we use Poetry for Python dependency management and packaging, based on the pyproject.toml specification.

notebooks/
    run_model.py      ## Databricks notebook checked in as .py file
my_model/
    preprocessing.py  ## Python module imported in the notebook
tests/
azure-pipelines.yml
cluster-config.json
pyproject.toml
...
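
For reference, a minimal cluster-config.json for such a transient cluster might look like the sketch below; the Spark version, node type and worker count are placeholder values, not a prescription.

{
  "spark_version": "11.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 1
}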

Notebooks can be committed to a Git repository either by linking a Git repository to the notebook in the Databricks Workspace or by manually exporting the notebook as a Source File. In both cases, the notebook is available in the repository as a Python file with Databricks markup commands. The notebook entry point of our repository is shown below. Notice that it installs and imports the Python package ‘my_model’, built from the containing repository; the package versioning will be worked out in detail later. Any notebook logic is captured in the main function. After executing the main function, dbutils.notebook.exit() is called, which signals successful completion and allows a result value to be returned to the caller.
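
As a minimal sketch, the entry-point notebook (notebooks/run_model.py, exported as a Source file) could look roughly as follows; the input table, the preprocess helper and the returned value are illustrative assumptions, and the exact install line depends on the versioning scheme discussed later.

# Databricks notebook source
# MAGIC %pip install my_model

# COMMAND ----------

from my_model.preprocessing import preprocess  # package built from this repository


def main():
    # All notebook logic lives in main(); `spark` is provided by the Databricks runtime.
    df = spark.read.table("my_input_table")  # hypothetical input table
    return preprocess(df).count()


# COMMAND ----------

result = main()
# Signal successful completion and return a value to the caller (e.g. an orchestrating job).
dbutils.notebook.exit(str(result))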
