In this article, we will show how to use Azure Data Factory to transform a specific data set.

Data Transformation Overview

Azure Data Factory supports various data transformation activities, including the following (a short sketch of how such activities are declared in a pipeline definition follows the list):

  • Mapping data flow activity: Visually designed data transformation that allows you to build graphical data transformation logic without needing to be an expert developer. The mapping data flow is executed as an activity within the Azure Data Factory pipeline on a fully managed, scaled-out Spark cluster provided by ADF
  • Wrangling data flow activity: A code-free data preparation activity that integrates with Power Query Online in order to make the Power Query M functions available for data wrangling using Spark execution
  • HDInsight Hive activity: Allows you to run Hive queries on your own or on-demand HDInsight cluster
  • HDInsight Pig activity: Allows you to run Pig queries on your own or on-demand HDInsight cluster
  • HDInsight MapReduce activity: Allows you to run MapReduce programs on your own or on-demand HDInsight cluster
  • HDInsight Streaming activity: Allows you to run Hadoop Streaming programs on your own or on-demand HDInsight cluster
  • HDInsight Spark activity: Allows you to run Spark programs on your own HDInsight cluster
  • Machine Learning activities: Allows you to use a published Azure Machine Learning web service for predictive analytics
  • Stored procedure activity: Allows you to execute a stored procedure in Azure relational data stores such as Azure SQL Database and Azure SQL Data Warehouse
  • Data Lake Analytics U-SQL activity: Allows you to run a U-SQL script on an Azure Data Lake Analytics cluster
  • Databricks Notebook activity: Allows you to run a Databricks notebook in your Azure Databricks workspace
  • Databricks Jar activity: Allows you to run a Spark Jar in your Azure Databricks cluster
  • Databricks Python activity: Allows you to run a Python file in your Azure Databricks cluster
  • Custom activity: Allows you to define your own data transformation logic in Azure Data Factory
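
As a rough illustration, the snippet below sketches how two of these activities (an HDInsight Hive activity and a Databricks Notebook activity) might be declared inside a pipeline definition, written here as a Python dictionary that mirrors the pipeline JSON. The pipeline, linked service, and path names are hypothetical placeholders, and the exact schema should be verified against the ADF documentation.

```python
# Sketch of a pipeline definition containing two transformation activities,
# expressed as a Python dict that mirrors the ADF pipeline JSON.
# Linked service names and file paths below are hypothetical placeholders.
pipeline_definition = {
    "name": "TransformDemoPipeline",  # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                # HDInsight Hive activity: runs a Hive script on an HDInsight cluster
                "name": "RunHiveScript",
                "type": "HDInsightHive",
                "linkedServiceName": {
                    "referenceName": "MyHDInsightLinkedService",  # hypothetical
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "scriptPath": "scripts/transform.hql",        # hypothetical
                    "scriptLinkedService": {
                        "referenceName": "MyBlobStorageLinkedService",
                        "type": "LinkedServiceReference"
                    }
                }
            },
            {
                # Databricks Notebook activity: runs a notebook in an Azure Databricks workspace
                "name": "RunDatabricksNotebook",
                "type": "DatabricksNotebook",
                "linkedServiceName": {
                    "referenceName": "MyDatabricksLinkedService",  # hypothetical
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "notebookPath": "/Shared/transform-users"      # hypothetical
                }
            }
        ]
    }
}
```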

Compute environments

Azure Data Factory supports two types of compute environment for executing transformation activities. With an on-demand compute environment, the environment is fully managed by the Data Factory: a cluster is created to execute the transformation activity and is removed automatically when the activity completes. The second approach is a bring-your-own environment, in which you register an existing compute environment that you manage yourself as a linked service, and the Data Factory uses it to run the activity.
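
As a hedged sketch of the difference, the dictionaries below mirror what an on-demand HDInsight linked service and a bring-your-own HDInsight linked service might look like in JSON form; all names, sizes, and placeholder values are hypothetical, and the exact properties should be checked against the ADF documentation.

```python
# Sketch of an on-demand HDInsight linked service definition (fully managed by
# Data Factory): the cluster is provisioned before the activity runs and torn
# down after the time-to-live expires. Names and sizes are hypothetical.
on_demand_hdinsight = {
    "name": "OnDemandHDInsightLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "hadoop",
            "clusterSize": 4,                 # number of worker nodes (hypothetical)
            "timeToLive": "00:15:00",         # idle time before the cluster is deleted
            "hostSubscriptionId": "<subscription-id>",
            "clusterResourceGroup": "<resource-group>",
            "tenant": "<tenant-id>",
            "linkedServiceName": {            # storage used by the transient cluster
                "referenceName": "MyBlobStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}

# Bring-your-own: the same activity can instead point at an existing cluster
# that you manage yourself, registered as a regular HDInsight linked service.
byo_hdinsight = {
    "name": "MyOwnHDInsightLinkedService",
    "properties": {
        "type": "HDInsight",
        "typeProperties": {
            "clusterUri": "https://<cluster-name>.azurehdinsight.net",
            "userName": "<cluster-user>",
            "linkedServiceName": {
                "referenceName": "MyBlobStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}
```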

Transform Data with Mapping Data Flows

A Mapping Data Flow can be created on its own or as an activity within an Azure Data Factory pipeline. In this demo, in order to test the Data Flow activity execution, we will create a new pipeline and add a Data Flow activity to be executed inside that pipeline.
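
Before walking through the portal steps, here is a rough programmatic alternative: a minimal Python sketch that deploys a pipeline containing a Data Flow activity through the Azure management REST API. The subscription, resource group, factory, pipeline, and data flow names are hypothetical, and the ExecuteDataFlow activity properties are an assumption to be verified against the ADF REST API reference.

```python
# Rough sketch: create a pipeline containing a Data Flow activity by calling
# the ADF management REST API (an alternative to the portal walkthrough below).
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<subscription-id>"   # hypothetical placeholders
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "DataFlowDemoPipeline"

pipeline_body = {
    "properties": {
        "activities": [
            {
                "name": "AverageUserAge",
                "type": "ExecuteDataFlow",  # assumed activity type for mapping data flows
                "typeProperties": {
                    "dataflow": {
                        "referenceName": "AverageUserAgeDataFlow",  # hypothetical data flow name
                        "type": "DataFlowReference"
                    }
                }
            }
        ]
    }
}

# Acquire an ARM token and PUT the pipeline definition into the factory.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory_name}/pipelines/{pipeline_name}?api-version=2018-06-01"
)
response = requests.put(url, json=pipeline_body, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
```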

First, open Azure Data Factory from the Azure portal and click on the Author & Monitor option. From the opened Data Factory, click on the Author button, then click on the plus sign to add a new pipeline, as shown below:

From the pipeline design window, provide a unique name for the pipeline, drag and drop the Data Flow activity onto the design surface, then click on the Data Flow activity, as shown below:

In the New Mapping Data Flow window, choose to add a Data Flow, then click OK, as shown below:

The displayed Mapping Data Flow authoring canvas consists of two main parts: the graph that displays the transformation stream, and the configuration panel that shows the settings specific to the currently selected transformation.

The purpose of this Data Flow activity is to read data from an Azure SQL Database table, calculate the average value of the users' ages, and then save the result to another Azure SQL Database table.
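
For reference, the logic this data flow implements is roughly equivalent to the following Python sketch using pyodbc; the connection details and the table and column names (dbo.Users.age, dbo.UserAgeSummary.avg_age) are hypothetical placeholders standing in for the demo tables.

```python
# Equivalent logic to the data flow, sketched with pyodbc against hypothetical tables.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<server-name>.database.windows.net;"
    "Database=<database-name>;"
    "Uid=<user>;Pwd=<password>;"
)
cursor = conn.cursor()

# Read the source table and compute the average age.
cursor.execute("SELECT AVG(CAST(age AS FLOAT)) FROM dbo.Users")
avg_age = cursor.fetchone()[0]

# Save the result to the destination table.
cursor.execute("INSERT INTO dbo.UserAgeSummary (avg_age) VALUES (?)", avg_age)
conn.commit()
conn.close()
```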
