Processing file sets with U-SQL in Azure Data Lake Analytics

Processing file sets with U-SQL in Azure Data Lake Analytics

In this tutorial, you will learn to process file sets with U-SQL in Azure Data Lake Analytics.

This article will help you process file sets with U-SQL in Azure Data Lake Analytics.

Introduction

In the previous articles, we learned how to create database objects in Azure Data Lake Analytics using U-SQL. We also looked at some of the frequently used DDL and DML commands. So far, we have learned different aspects of Azure Data Lake Analytics and U-SQL using a single file stored in Azure Data Lake Storage as the source data. In real-world scenarios, often there are tens to thousands of files that are stored on the data lake storage, and the same must be processed in parallel based on specific criteria. Considering Azure Data Lake Storage forms the storage layer of a data lake in Azure, massive volumes of data hosted in different types of files are expected to be stored and processed. We will understand how to process this data using U-SQL in this article.

Processing file sets using U-SQL

In the previous parts of this Azure Data Lake Analytics series, we installed Visual Studio Data Tools and set up a sample U-SQL application that has a few sample scripts as well as sample data. To proceed with the rest of the steps explained in this article, one would need this setup on their local machine. It’s assumed that this setup is in place.

Navigate to the solution explorer window and expand the TestData as shown below. You would be able to see the sample data files. The files starting with the term “vehicle” are the series of files that resemble the types of files that typically exist in production environments. These files are classified by numbers and dates. We already installed sample data on the Azure Data Lake Storage account in the previous parts of this series, so the data visible below exists in the same folder hierarchy on our Azure Data Lake account as well.

If you open any one of the files, you would be able to see the schema and data as shown below. These files are in CSV format, and contains a few fields and a reasonable number of records as well.

If you open the same file using Azure Data Lake Explorer, you would be able to preview the data in a more formatted manner. Navigate back to the solution explorer window and open the script titled Ambulance-3-1-FileSets as shown below. This script contains the U-SQL code that would process the data in these vehicle-related files.

Let’s attempt to understand this U-SQL script step-by-step.

  • The first step sets the path where these files are stored on the Azure Data Lake Storage account
  • In the next step, we are forming the path to the file set from which we intend to extract data. We intend to be able to select the files based on the desired criteria. So, we need to specify the metadata in the file using a regular expression. This expression is defined by literals – “vehicle”, virtual streamset fields “vid” where we want to specify criteria, and regular expressions like “{*}”. The virtual fields can be thought of as parameters
  • In the next step, we are extracting data from the vehicle files. Here the FROM clause used the variable the has the path to those files

In the next step, we are filtering and selecting the extracted data based on a filter criterion. We have specified predicate to select where vid is equal to filter files based on a vid and date range. If you analyze the names of files, you would find four files that have the value of “vehicle1” as the name of the file. So only those files will be selected. And then the date range also will be applied.

azure sql azure

What is Geek Coin

What is GeekCash, Geek Token

Best Visual Studio Code Themes of 2021

Bootstrap 5 Tutorial - Bootstrap 5 Crash Course for Beginners

Nest.JS Tutorial for Beginners

Hello Vue 3: A First Look at Vue 3 and the Composition API

Introduction to Structured Query Language SQL pdf

SQL stands for Structured Query Language. SQL is a scripting language expected to store, control, and inquiry information put away in social databases. The main manifestation of SQL showed up in 1974, when a gathering in IBM built up the principal model of a social database. The primary business social database was discharged by Relational Software later turning out to be Oracle.

How to set up Azure Data Sync between Azure SQL databases and on-premises SQL Server

In this article, you learn how to set up Azure Data Sync services. In addition, you will also learn how to create and set up a data sync group between Azure SQL database and on-premises SQL Server.

Demo: Configure Azure SQL with Azure CLI | Azure SQL for beginners (Ep. 17)

In this video, see how to configure your Azure SQL connectivity leveraging the Azure CLI and PowerShell notebooks in Azure Data Studio. For the full Azure SQ...

Sourcing data from Azure SQL Database in Azure Machine Learning

In this tutorial, we will show how to source data from Azure SQL Database to use in a Machine Learning workflow.

What is Azure SQL? | Azure SQL for beginners (Ep. 3)

Azure SQL is composed of Azure SQL Database, Azure SQL Managed Instance, and SQL Server in Azure VM. Learn about the key differentiators between them. For th...