Introduction

This post briefly describes the features and capabilities of the Azure Databricks managed service. As a Machine Learning engineer, adding this tool to your arsenal can take your development skills to new heights.

Let’s first understand what this Databricks as-a-service offering by Microsoft Azure is.

Azure Databricks is a fully managed platform service from Microsoft Azure: in a nutshell, a Big Data and Machine Learning platform. It is the result of a joint effort between Microsoft and the team that started Apache Spark, delivering a single platform for Big Data processing and Machine Learning.

Azure Databricks lets you as a data engineer run large-scale Spark workloads on the massively scalable compute of Azure, achieving unparalleled performance and cost-efficiency in the cloud through auto-scaling, caching, indexing, and query optimization.
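To make the auto-scaling point concrete, here is a minimal sketch of a cluster specification as you might send it to the Databricks Clusters API (`POST /api/2.0/clusters/create`); the cluster name, runtime version, and VM size below are illustrative placeholders, not recommendations:

```python
# A sketch of a cluster spec for the Databricks Clusters API.
# All concrete values below are illustrative placeholders.
cluster_spec = {
    "cluster_name": "ml-autoscaling-demo",   # hypothetical name
    "spark_version": "13.3.x-scala2.12",     # example runtime label
    "node_type_id": "Standard_DS3_v2",       # example Azure VM size
    "autoscale": {
        "min_workers": 2,  # scale down to this when the cluster is idle
        "max_workers": 8,  # scale up to this under heavy Spark workloads
    },
}
```

With an `autoscale` block instead of a fixed `num_workers`, Databricks adds and removes workers between the two bounds as the Spark workload demands.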

Azure Databricks

Databricks was founded by the original creators of Apache Spark, and the company also created Delta Lake and MLflow. At its heart is Spark, a unified processing engine that can analyze big data using SQL, machine learning, graph processing, or real-time stream analysis.
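As a quick illustration of that unified engine, here is a minimal PySpark sketch (with made-up sample data) showing a DataFrame registered as a view and then queried with plain SQL in the same session:

```python
from pyspark.sql import SparkSession

# On Azure Databricks a SparkSession already exists as `spark`;
# getOrCreate() simply returns it.
spark = SparkSession.builder.getOrCreate()

# Made-up sample data: the same engine runs DataFrame and SQL code.
sales = spark.createDataFrame(
    [("books", 12.0), ("games", 30.0), ("books", 8.5)],
    ["category", "amount"],
)
sales.createOrReplaceTempView("sales")

# Plain SQL over the temporary view, executed by the same Spark engine.
spark.sql(
    "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
).show()
```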

Azure Databricks architecture (custom image)

The core of the Azure Databricks architecture is the Databricks runtime engine, which bundles an optimized Spark distribution, Delta Lake, and Databricks I/O (DBIO) as an optimized data-access layer. This core engine delivers massive processing power for data science workloads. It also integrates natively with other Azure data services, such as Azure Data Factory and Synapse Analytics, and offers ML runtime environments with frameworks such as TensorFlow and PyTorch. Notebooks can be integrated with MLflow and the Azure Machine Learning service.
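The following sketch shows what two of those pieces look like from a notebook, assuming a Databricks-style environment where Delta Lake and MLflow are preinstalled; the storage path is hypothetical:

```python
from pyspark.sql import SparkSession
import mlflow

spark = SparkSession.builder.getOrCreate()

# Hypothetical DBFS path used only for this demo.
delta_path = "/tmp/demo/events"

# Delta Lake write/read: format("delta") adds ACID transactions and
# versioned ("time travel") data on top of plain Parquet files.
spark.range(100).write.format("delta").mode("overwrite").save(delta_path)
events = spark.read.format("delta").load(delta_path)
print(events.count())

# MLflow autologging: subsequent training runs with supported libraries
# (scikit-learn, TensorFlow, PyTorch, ...) log parameters, metrics, and
# models to the workspace's built-in tracking server.
mlflow.autolog()
```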

Under the Hood — It’s Spark

The Azure Databricks service is a notebook-oriented Apache Spark-as-a-service workspace environment. It provides the analytics engine for large-scale data processing and machine learning and can, in a true sense, handle the high volume, high velocity, and variety of Big Data. Apache Spark clusters are groups of computers that are treated as a single computer and handle the execution of commands issued from notebooks. Each cluster has a *driver* that distributes *tasks* to its *executors*, which process them in their available slots. The driver assigns each executor a task along with a partition of the data; a job is divided into stages that execute in sequence, and the result of each stage is sent back to the driver for consolidation. That’s the gist of the Spark processing architecture, sketched in code below.
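A minimal PySpark sketch of that flow; the partition count and the grouping column are arbitrary choices for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stage-demo").getOrCreate()

# The driver turns this plan into tasks, one per partition, and hands
# them to executor slots: eight partitions -> eight parallel tasks.
df = spark.range(0, 1_000_000, step=1, numPartitions=8)

# groupBy forces a shuffle, so Spark splits the job into two stages:
# partial counts within each partition, then final aggregation after
# the data is redistributed by key.
counts = df.withColumn("bucket", F.col("id") % 10).groupBy("bucket").count()

# show() is an action: it triggers the job, and each stage's results
# are consolidated back at the driver.
counts.show()
```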

#machine-learning #databricks #azure
