Best of Crypto


Azure Data Factory pipelines: Filling in the gaps

Azure Data Factory is a cloud-based data orchestration tool that many ETL developers began using instead of SSIS. In this article, Rodney Landrum recalls a Data Factory project where he had to depend on another service, Azure Logic Apps, to fill in for some missing functionality.

Though the term was not in the vernacular then the way it is today, I have been a full- or part-time “Data Engineer” my entire career. I have been quite comfortable with Microsoft ETL tools like SSIS for many years, dating back to the DTS days. My comfort with SSIS came from years of trial and error and experimentation, as well as from adhering to the best practices put forth and tested by many of my colleagues in the SQL Server field. It was, and still is, a widely used and well-documented ETL platform. With the release of Azure Data Factory several years ago, though it was not touted as an SSIS replacement, many data engineers started working with and documenting this code-free or low-code orchestration experience, and I was one of them.

As with any technology, only with knowledge and experience will you be able to take advantage of all its key benefits, and by the same token uncover its severe limitations. On a recent assignment to build a complex logical data workflow in Azure Data Factory, one that ironically had less “data” and more “flow” to engineer, I discovered benefits and limitations not only in the tool itself but also in the documentation, which provided arcane and incomplete guidance at best. Some of the knowledge I needed was intrinsic to Azure Logic Apps, which I grant I had done very little with until this project, but it played a pivotal role as an activity called from the pipeline. I want to share a few pieces of this project here in hopes of bolstering, however modestly, the available sources for quick insight into advanced challenges with ADF and, to a lesser extent, Azure Logic Apps.

Specifically, I was asked to create a pipeline-driven workflow that sends approval emails with a file attachment and waits for the recipients to approve, reject or ignore the email. If the approvers do not respond in the time frame defined by several variables, such as time of day and type of file, a reminder email must be sent. Again, the recipients can approve, reject or ignore the reminder. Finally, a third email is sent to yet another approver with the same options. Ultimately, the process either copies the approved file to a secure FTP site, once both of the initial two recipients or the final recipient approve the file, or sends an email to the business saying the file was rejected. It may sound simple enough, even in a flow diagram; however, there were several head-scratchers and frustrated, lengthy ceiling stares that I might easily have avoided with a bit of foreknowledge.

The following are the four challenges I had to overcome to call the project a success:

When sending an approval email from Azure Logic Apps, which is initiated via a Webhook activity from the ADF pipeline, how do I force a response by a specific time of day? For example, if the initial emails must be approved or rejected by 9:00 AM, and the email is triggered at 8:26 AM (itself a variable start time), how do I force the email to return control to the pipeline in 34 minutes?
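The deadline arithmetic itself is straightforward; the awkward part is handing the result to the Webhook activity and Logic App as a timeout. As a minimal sketch (in Python rather than ADF expression syntax, with the function name and roll-to-next-day behavior being my own illustration), the calculation looks like this:

```python
# Minutes remaining until a hard cutoff such as 9:00 AM; this value would then
# be passed along as the approval timeout. Illustrative only.
from datetime import datetime, timedelta

def minutes_until_cutoff(now: datetime, cutoff_hour: int = 9, cutoff_minute: int = 0) -> int:
    cutoff = now.replace(hour=cutoff_hour, minute=cutoff_minute, second=0, microsecond=0)
    if cutoff <= now:                   # already past today's cutoff
        cutoff += timedelta(days=1)     # roll over to the next day's cutoff
    return int((cutoff - now).total_seconds() // 60)

# Triggered at 8:26 AM with a 9:00 AM cutoff -> 34 minutes to respond
print(minutes_until_cutoff(datetime(2020, 10, 20, 8, 26)))  # 34
```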

The second challenge came with the Webhook activity itself. The Logic App needed to return status values to the calling pipeline. There was some minimal documentation explaining that a callbackURI had to be called with an HTTP POST from within the Logic App, but what I found lacking was how to actually pass values back.
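The pattern that eventually worked for me is sketched below, written in Python as a stand-in for the Logic App's HTTP action rather than the designer itself: POST a JSON body to the callBackUri that ADF supplies, and read those values from the Webhook activity's output in the pipeline. The field names are my own illustrations, not anything mandated by ADF.

```python
# A sketch of the callback the Logic App performs via its HTTP action, shown
# in Python for clarity. callback_uri is the callBackUri value ADF passes to
# the Logic App in the Webhook activity's request body; the status/approver
# fields are illustrative assumptions.
import requests

def complete_webhook(callback_uri: str, status: str, approver: str) -> None:
    payload = {"status": status, "approver": approver}
    # The JSON posted back is what the pipeline can read from the Webhook
    # activity's output, e.g. @activity('Send Approval').output.status
    response = requests.post(callback_uri, json=payload, timeout=30)
    response.raise_for_status()
```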

The third challenge was processing a rejection. The logic stated that if either of the initial approvers rejected the file, the pipeline needed to stop further processing immediately and notify the business so a secondary file could be created and run through the workflow again. If the two initial approval emails were set to time out after 34 minutes with no response (following the example above) and one of the approvers rejected the file in 3 minutes, the pipeline could not dilly-dally for another 31 minutes spinning cycles waiting for the other approver.

Finally, each step in the process needed to be written to a logging table in an Azure SQL Database. That was not too difficult, as it was a simple matter of passing dynamic values to a parameterized stored procedure. However, the number of times this needed to happen brought an unexpected consequence to my attention.
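In the pipeline this was a Stored Procedure activity, but the shape of the call is easier to see outside ADF. A rough equivalent in Python with pyodbc, where the procedure name dbo.LogPipelineStep, its parameters, and the connection details are hypothetical stand-ins for the real logging procedure, would be:

```python
# Hypothetical equivalent of the pipeline's logging step: call a parameterized
# stored procedure in Azure SQL Database with dynamic values from the run.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=PipelineLogs;Uid=loguser;Pwd=<password>;Encrypt=yes;"
)
with conn:  # commits on success
    conn.execute(
        "EXEC dbo.LogPipelineStep @PipelineRunId = ?, @StepName = ?, @Status = ?",
        ("00000000-0000-0000-0000-000000000000", "Send Approval Email", "Approved"),
    )
```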

#bi #cloud development #homepage #sql prompt #data-science

iOS App Dev

Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

How to Debug a Pipeline in Azure Data Factory

In the previous article, How to schedule Azure Data Factory pipeline executions using Triggers, we discussed the three main types of Azure Data Factory triggers and how to configure them and then use them to schedule a pipeline.

In this article, we will see how to use the Azure Data Factory debug feature to test the pipeline activities during the development stage.

Why debug

When developing complex, multi-stage Azure Data Factory pipelines, it becomes harder to test the functionality and performance of the pipeline as one block. Instead, it is highly recommended to test such pipelines as you develop each stage, so that you can make sure each stage works as expected and returns the correct result with the best performance, before publishing the changes to the data factory.

Take into consideration that debugging any pipeline activity will execute that activity and perform the action configured in it. For example, if the activity is a copy activity from an Azure Storage account to an Azure SQL Database, the data will actually be copied; the only difference is that, in debug mode, the pipeline execution logs are written to the pipeline Output tab only and are not shown under the pipeline runs on the Monitor page.

#azure #sql azure #azure data factory #pipeline

Gerhard Brink

Getting Started With Data Lakes

Frameworks for Efficient Enterprise Analytics

The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.

This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.

Introduction

As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).


This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.

#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management

Kennith Kuhic

Azure Machine Learning Notebook Code Run as a Pipeline — Automate Using Azure Data Factory

The ability to run notebook code as a pipeline

Prerequisite

  • Azure account
  • Azure Machine Learning workspace
  • Create a compute instance
  • Create a compute cluster named cpu-cluster
  • Select a Standard D series VM size
  • Create a train file to train the model
  • Create a pipeline file to run the training as a pipeline

Steps

Create the train file as train.py (a minimal sketch follows the list below)

  • Create a directory ./train_src
  • Create a train.py inside it
  • It should be a plain Python file, not a notebook
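A minimal sketch of what train.py might contain, under the assumption that the pipeline step wires in a tabular Titanic dataset as a named input called titanic; the column names and model choice are my own illustration, not the article's actual code:

```python
# ./train_src/train.py -- illustrative training script, not the original.
import argparse

from azureml.core import Run
from sklearn.linear_model import LogisticRegression

parser = argparse.ArgumentParser()
parser.add_argument("--output", type=str, default="./outputs")
args = parser.parse_args()

# The dataset passed to the step as a named input is available on the run context
run = Run.get_context()
df = run.input_datasets["titanic"].to_pandas_dataframe()

X = df[["Pclass", "Age", "Fare"]].fillna(0)  # assumed feature columns
y = df["Survived"]                           # assumed label column

model = LogisticRegression(max_iter=1000).fit(X, y)
run.log("train_accuracy", model.score(X, y))
```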

Create the pipeline code (a consolidated sketch of these steps follows the list below)

  • Load the workspace config

  • Get the default store information

  • Create compute cluster

  • Load the package dependencies

  • Load the data set

  • Set the dataset as an input

  • Set up the output (optional)

  • I am only creating a single step

  • Set up the pipeline config and assign the step

  • Validate the pipeline

  • Now time to submit the pipeline

  • Wait for the pipeline to finish

  • Now let's publish the pipeline

  • Every publish will create a REST endpoint

  • I logged into Azure ML Studio

  • Go to Pipelines in the left menu

  • Click on the pipeline endpoint

  • You should see a pipeline — Published_Titanic_Pipeline_Notebook

  • Click Submit and see if the pipeline runs

  • Now go to ADF or Synapse Integrate

  • Create a New pipeline

  • Name it AzureMLPipelinetest

  • Drag and drop the Azure Machine Learning activity (it can only run published pipelines)

  • Create a new source (linked service) for Azure Machine Learning using a service principal account
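Putting the list above together, a consolidated sketch of the pipeline code using the Azure ML SDK v1 (azureml-core, azureml-pipeline-core, azureml-pipeline-steps) might look like the following. The dataset name, VM size, package list and experiment name are assumptions; the published pipeline name matches the one referenced above, and the ADF/Synapse steps at the end of the list remain designer work, so they are not shown.

```python
# Illustrative end-to-end pipeline script, not the article's exact code.
from azureml.core import Workspace, Dataset, RunConfiguration
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.conda_dependencies import CondaDependencies
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

# Load the workspace config and the default datastore
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Create (or reuse) the cpu-cluster compute target; VM size is an assumption
try:
    compute_target = ComputeTarget(workspace=ws, name="cpu-cluster")
except Exception:
    config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_v2", max_nodes=2)
    compute_target = ComputeTarget.create(ws, "cpu-cluster", config)
    compute_target.wait_for_completion(show_output=True)

# Load the package dependencies for the run
run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["scikit-learn", "pandas"]
)

# Load the dataset and set it as a named input; the dataset name is assumed
titanic_ds = Dataset.get_by_name(ws, name="titanic")
output_data = PipelineData("train_output", datastore=datastore)  # optional output

# Single training step
train_step = PythonScriptStep(
    name="train",
    source_directory="./train_src",
    script_name="train.py",
    arguments=["--output", output_data],
    inputs=[titanic_ds.as_named_input("titanic")],
    outputs=[output_data],
    compute_target=compute_target,
    runconfig=run_config,
    allow_reuse=False,
)

# Set up the pipeline, validate, submit, and wait for it to finish
pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline.validate()
run = pipeline.submit(experiment_name="titanic-train")
run.wait_for_completion(show_output=True)

# Publish the pipeline; every publish creates a REST endpoint
published = pipeline.publish(
    name="Published_Titanic_Pipeline_Notebook",
    description="Titanic training pipeline published from a notebook",
)
print(published.endpoint)
```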

#data-factory #machine-learning #azure-ai #azure-machine-learning #azure data factory

Cyrus Kreiger

How Has COVID-19 Impacted Data Science?

The COVID-19 pandemic disrupted supply chains and brought economies around the world to a standstill. In turn, businesses need access to accurate, timely data more than ever before. As a result, the demand for data analytics is skyrocketing as businesses try to navigate an uncertain future. However, the sudden surge in demand comes with its own set of challenges.

Here is how the COVID-19 pandemic is affecting the data industry and how enterprises can prepare for the data challenges to come in 2021 and beyond.

#big data #data #data analysis #data security #data integration #etl #data warehouse #data breach #elt