Submitting spark job in Azure HDInsight through Apache Livy

Apache Livy is designed to make it easy to interact with any remote cluster running Spark, synchronously or asynchronously, through a REST interface, without requiring much control over the cluster. You don’t need an edge node to access Spark. In addition, the Livy server itself manages long-running Spark contexts instead of running them inside the Spark server. This blog is about how to access Spark running on an HDInsight cluster from your local machine.

Livy offers two modes of execution —

▹ Interactive mode - a REPL-style session, analogous to the spark-shell, pySpark, and SparkR REPLs.

▹ Batch mode - analogous to spark-submit: submits an application to the cluster, with no interaction in the middle of its run-time.
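Under the hood, the two modes map onto Livy's two REST endpoints: interactive sessions are created with POST /sessions and batch jobs with POST /batches. A minimal sketch of the two request bodies (the cluster name and file path below are illustrative, not from a real cluster):

```python
import json

LIVY_URL = "https://mycluster.azurehdinsight.net/livy"  # illustrative cluster name

# Interactive mode: start a PySpark REPL session (POST {LIVY_URL}/sessions)
interactive_payload = {"kind": "pyspark"}

# Batch mode: submit a whole application, like spark-submit (POST {LIVY_URL}/batches)
batch_payload = {
    "file": "abfss://jobs@mystorage.dfs.core.windows.net/main.py",  # illustrative path
    "args": ["--date", "2021-01-01"],
}

print(json.dumps(interactive_payload))
print(json.dumps(batch_payload))
```

Both payloads are posted as JSON with HTTP basic authentication against the cluster credentials.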

Creating HDInsight cluster

Follow the official documentation to create an HDInsight cluster along with an ADLS Gen2 storage account. Apart from serving as the cluster's integrated storage, it can also hold the packages/jars that the Spark job would use. By default, the HDInsight cluster runs Livy on port 8998.

To trigger a job on the cluster, you would need a couple of parameters —

  1. The public endpoint of the cluster (https://{myclustername}.azurehdinsight.net)
  2. Cluster credentials

Just make sure the Livy server is up and running, via the Ambari dashboard.
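You can also verify the endpoint from your local machine: Livy answers a GET on /sessions. A sketch that only builds the authenticated request, without actually sending it (cluster name and credentials are illustrative):

```python
import base64
import urllib.request

def livy_request(cluster, username, password, path="/sessions"):
    """Build a basic-auth request against the cluster's public Livy endpoint."""
    url = f"https://{cluster}.azurehdinsight.net/livy{path}"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

req = livy_request("mycluster", "admin", "secret")  # illustrative values
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would return the list of active sessions if Livy is reachable.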

Ambari dashboard showing livy server running

Livy-submitting the job

To start with, pip install the open-source package ‘livy-submit’, which uses Livy’s interactive mode to connect to the cluster.

pip install livy-submit

To submit a job, make a call with the required parameters.

livy_submit --livy-url https://{clustername}.azurehdinsight.net/livy \
  -u <cluster_username> \
  -p <cluster_password> \
  -s <local path of python file>

Under the hood, this library issues REST API requests to the cluster, passing along the provided parameters.
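The CLI call above is therefore roughly equivalent to a POST against Livy's /batches endpoint. A sketch of that request, prepared but deliberately not sent (cluster name, credentials, and file path are illustrative):

```python
import base64
import json
import urllib.request

cluster, user, password = "mycluster", "admin", "secret"  # illustrative values
url = f"https://{cluster}.azurehdinsight.net/livy/batches"
payload = {"file": "abfss://jobs@mystorage.dfs.core.windows.net/job.py"}  # illustrative path
token = base64.b64encode(f"{user}:{password}".encode()).decode()

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    },
    method="POST",
)
# urllib.request.urlopen(req) would actually submit the batch; omitted here.
print(req.get_method(), req.full_url)
```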

Submitting a job with package dependencies

A typical job has multiple Python files as package dependencies. With spark-submit, this is handled by passing an archive of the packages. The same archival procedure can be followed here as well, and it works fine when running in YARN mode. The other way is to zip the files and access them from the ADLS storage integrated with the cluster.

Zip the packages and make them available in the ADLS Gen2 container. To import them, pass the zip location to the “--py-files” argument.
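In Livy's REST API this corresponds to the `pyFiles` field of the batch payload, which takes the abfss:// URI of the zipped dependencies (the container, storage account, and paths below are illustrative):

```python
# Batch payload with zipped dependencies pulled from the ADLS Gen2 container
payload = {
    "file": "abfss://jobs@mystorage.dfs.core.windows.net/main.py",        # entry point
    "pyFiles": ["abfss://jobs@mystorage.dfs.core.windows.net/deps.zip"],  # zipped packages
}
print(payload)
```

Spark unpacks the zip onto the executors' Python path, so `import`s of the bundled modules resolve at run-time.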
