Distributed Deep Learning with Ansible, AWS and PyTorch Lightning

How to automate and scale your deep learning experiments with Ansible, AWS cloud infrastructure, and the PyTorch Lightning library.

Let’s say you are a deep learning practitioner, but you don’t have an in-house GPU cluster or a machine learning platform at your disposal. Nobody has trained their models on a CPU for almost a decade. Even worse, with models and datasets getting bigger, you have to deal with distributed deep learning and scale your training in a model-parallel and/or data-parallel regime. What can we do about it?

We can follow the modern cloud paradigm and use GPU-as-a-service: allocate the necessary infrastructure dynamically on demand and release it once you have finished. This works well, but here is where the main complexity lies. Modern deep learning frameworks like PyTorch Lightning or Horovod make data-parallel distributed training easy nowadays; the most annoying and time-consuming part is creating a proper environment, because we often have to do it manually. Even services that hide a lot of infrastructure complexity from you, like Google Colab, still require some manual work.

I’m a strong believer that repetitive routine work is your enemy. Why? Here is my list of personal concerns:

  1. Reproducibility of results. Have you ever heard of the so-called human factor? We are error-prone creatures, and we are not good at memorizing things in great detail. The more human work a process involves, the harder it will be to reproduce in the future.
  2. Mental distractions. Deep learning is an empirical endeavor, and your progress relies on your ability to iterate quickly and test as many hypotheses as you can. Anything that distracts you from your main task (training and evaluating your models, or analyzing the data) negatively affects the success of the overall process.
  3. Effectiveness. Computers do many things far faster than we humans do. When you have to repeat the same slow procedure over and over, it all adds up.

Routine is your enemy

In this article, I’ll describe how you can automate the way you conduct your deep learning experiments.

Automate your Deep Learning experiments

The three main ideas of this article are:

  1. Utilize cloud-based infrastructure to dynamically allocate resources for your training purposes;
  2. Use a DevOps automation toolset to eliminate the manual work of setting up the experiment environment;
  3. Write your training procedure in a modern deep learning framework that supports data-parallel distributed training out of the box.

AWS EC2, Ansible and PyTorch Lightning. Image by author

To implement these ideas we will use AWS cloud infrastructure, the Ansible automation tool, and the PyTorch Lightning deep learning library.

Our work will be divided into two parts. In this article, we will build a minimal working example that:

  • Automatically creates and destroys EC2 instances for our deep learning cluster;
  • Establishes the connectivity between them that PyTorch and PyTorch Lightning need for distributed training;
  • Creates a local SSH config file to enable connection to the cluster;
  • Creates a Python virtual environment and installs all library dependencies for the experiment;
  • Provides a submit script to run distributed data-parallel workloads on the created cluster (a sketch of such a workload follows this list).
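
To make that last point concrete, here is a minimal sketch of the kind of training script such a submit script would launch. The toy model and the exact Trainer flags are illustrative assumptions (Lightning’s distributed options have changed names across versions), not the final code of this series:

```python
# minimal_ddp_sketch.py -- illustrative only; names and Trainer flags are
# assumptions and may need adjusting to your PyTorch Lightning version.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyRegressor(pl.LightningModule):
    """A toy model: the point here is the Trainer flags, not the network."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    # Synthetic data keeps the sketch self-contained.
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = DataLoader(dataset, batch_size=64, num_workers=2)

    # Lightning spawns the worker processes itself; the master address,
    # port, and node rank come from environment variables that the
    # submit script is responsible for setting on each node.
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=1,       # GPUs per node
        num_nodes=2,     # EC2 instances in the cluster
        strategy="ddp",  # data-parallel distributed training
        max_epochs=5,
    )
    trainer.fit(TinyRegressor(), loader)
```

Under a data-parallel strategy each node sees a different shard of every batch and gradients are averaged across nodes after each step, so a script like this scales out without further code changes.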

In the next article, we will add more features and build a fully automated environment for distributed learning experiments.

Now, let’s take a brief look at the chosen technology stack.

What is AWS EC2?

Amazon Elastic Compute Cloud (EC2) is a core AWS service that lets you manage virtual machines in Amazon data centers. With it you can dynamically create and destroy machines, either manually via the AWS Console or programmatically via the API provided by the AWS SDK.
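
For instance, with boto3 (the AWS SDK for Python), creating and destroying an instance takes only a few calls. The following is a minimal sketch; the region, AMI ID, and key pair name are placeholders you would replace with your own values:

```python
# ec2_sketch.py -- a minimal sketch; region, ImageId and KeyName are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch one GPU instance (g4dn.xlarge carries a single NVIDIA T4).
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # replace with a Deep Learning AMI in your region
    InstanceType="g4dn.xlarge",
    KeyName="my-ssh-key",             # replace with your EC2 key pair name
    MinCount=1,
    MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
instance.reload()  # refresh attributes such as the public IP address
print(f"running: {instance.id} at {instance.public_ip_address}")

# ... run the experiment ...

# Release the resources once finished, so you only pay for what you use.
instance.terminate()
```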

As of this writing, AWS offers a range of GPU-enabled instances for our purposes, with one or multiple GPUs per instance and a choice of NVIDIA GPUs: GRID K520, Tesla M60, K80, T4, and V100. See the official site for a full list.

What is Ansible?

Ansible is a tool for software and infrastructure provisioning and configuration management. With Ansible you can provision a whole cluster of remote servers, install software on them, and monitor them.

It is an open-source project written in Python. It takes a declarative approach: you define the desired system state in ordinary YAML files, and Ansible executes the actions needed to reach it. This declarative nature also means that most of the instructions you define are idempotent: running them more than once causes no undesirable side effects.

One of Ansible’s distinctive features is that it is agentless, i.e. it requires no agent software on the managed nodes. It operates solely over the SSH protocol, so the only thing you need to ensure is SSH connectivity between the control host, where you run Ansible commands, and the inventory hosts you want to manage.
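
Playbooks themselves are ordinary YAML files, as noted above. To keep this article’s examples in Python, here is a hedged sketch that drives a tiny one-task playbook through the ansible-runner package, an optional Python wrapper around Ansible; the inventory host, user, and task are placeholders, not the playbook we build later:

```python
# ansible_sketch.py -- assumes `pip install ansible ansible-runner`;
# the inventory host and the task below are placeholders.
import pathlib
import tempfile

import ansible_runner

# A declarative playbook: we describe the desired state ("this directory
# exists"), and Ansible works out the actions needed to reach it.
PLAYBOOK = """
- hosts: all
  tasks:
    - name: Ensure the experiment directory exists
      file:
        path: /home/ubuntu/experiments
        state: directory
"""

with tempfile.TemporaryDirectory() as private_dir:
    project = pathlib.Path(private_dir, "project")
    project.mkdir()
    (project / "playbook.yml").write_text(PLAYBOOK)

    # Ansible reaches the node over plain SSH -- no agent needed on it.
    result = ansible_runner.run(
        private_data_dir=private_dir,
        playbook="playbook.yml",
        inventory="1.2.3.4 ansible_user=ubuntu",  # placeholder host
    )
    print(result.status)  # "successful" or "failed"
```

Because the `file` task is idempotent, running this playbook a second time reports the directory as already present instead of recreating it.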

#deep-learning #ansible #pytorch #data-science #distributed-deep-learning
