Edward Jackson

Automated Machine Learning on the Cloud in Python

Two trends have recently become apparent in data science:

  1. Data analysis and model training is done using cloud resources
  2. Machine learning pipelines are algorithmically developed and optimized

This article will cover a brief introduction to these topics and show how to implement them, using Google Colaboratory to do automated machine learning on the cloud in Python.

Cloud Computing using Google Colab

Originally, all computing was done on a mainframe. You logged in via a terminal, and connected to a central machine where users simultaneously shared a single large computer. Then, along came microprocessors and the personal computer revolution and everyone got their own machine. Laptops and desktops work fine for routine tasks, but with the recent increase in size of datasets and computing power needed to run machine learning models, taking advantage of cloud resources is a necessity for data science.

Cloud computing in general refers to the “delivery of computing services over the Internet”. This covers a wide range of services, from databases to servers to software, but in this article we will run a simple data science workload on the cloud in the form of a Jupyter Notebook. We will use the relatively new Google Colaboratory service: online Jupyter Notebooks in Python which run on Google’s servers, can be accessed from anywhere with an internet connection, are free to use, and are shareable like any Google Doc.

Google Colab has made the process of using cloud computing a breeze. In the past, I spent dozens of hours configuring an Amazon EC2 instance so I could run a Jupyter Notebook on the cloud and had to pay by the hour! Fortunately, last year, Google announced you can now run Jupyter Notebooks on their Colab servers for up to 12 hours at a time completely free. (If that’s not enough, Google recently began letting users add an NVIDIA Tesla K80 GPU to the notebooks). The best part is these notebooks come pre-installed with most data science packages, and more can be easily added, so you don’t have to worry about the technical details of getting set up on your own machine.

To use Colab, all you need is an internet connection and a Google account. If you just want an introduction, head to colab.research.google.com and create a new notebook, or explore the tutorial Google has developed (called Hello, Colaboratory). To follow along with this article, get the notebook here. Sign into your Google account, open the notebook in Colaboratory, click File > Save a copy in Drive, and you will then have your own version to edit and run.

Data science is becoming increasingly accessible with the wealth of resources online, and the Colab project has significantly lowered the barrier to cloud computing. For those who have done prior work in Jupyter Notebooks, it’s a completely natural transition, and for those who haven’t, it’s a great opportunity to get started with this commonly used data science tool!

Automated Machine Learning using TPOT

Automated machine learning (abbreviated auto-ml) aims to algorithmically design and optimize a machine learning pipeline for a particular problem. In this context, the machine learning pipeline consists of:

  1. Feature preprocessing: imputation, scaling, and constructing new features
  2. Feature selection: dimensionality reduction and keeping only the most useful features
  3. Model selection: evaluating many different machine learning models
  4. Hyperparameter tuning: finding the optimal settings for the chosen model

A minimal hand-built version of such a pipeline is sketched below for contrast.
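This sketch assumes a recent version of scikit-learn (where SimpleImputer replaces the older Imputer class that appears later in this article); the choice of scaler, the number of selected features, and the model hyperparameters are illustrative assumptions, not values from the original notebook.

# A hand-built pipeline: every choice below (imputation strategy, scaler,
# number of selected features, model, hyperparameters) is a decision that
# auto-ml tools such as TPOT try to make for us automatically.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.ensemble import GradientBoostingRegressor

manual_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),   # fill in missing values
    ('scale', MinMaxScaler()),                      # scale features to [0, 1]
    ('select', SelectKBest(f_regression, k=20)),    # keep the 20 "best" features
    ('model', GradientBoostingRegressor(n_estimators=100, max_depth=5)),
])

# manual_pipeline.fit(training_features, training_targets)
# predictions = manual_pipeline.predict(testing_features)

Every one of these decisions could be made differently, which is exactly the search space that auto-ml explores.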

There are an almost infinite number of ways these steps can be combined, and the optimal solution will change for every problem! Designing a machine learning pipeline can be a time-consuming and frustrating process, and at the end, you will never know if the solution you developed is even close to optimal. Auto-ml can help by evaluating thousands of possible pipelines to try and find the best (or a near-optimal) solution for a particular problem.

It’s important to remember that machine learning is only one part of the data science process, and automated machine learning is not meant to replace the data scientist. Instead, auto-ml is meant to free the data scientist so she can work on more valuable aspects of the process, such as gathering data or interpreting a model.

There are a number of auto-ml tools (H2O, auto-sklearn, Google Cloud AutoML), and we will focus on TPOT: the Tree-based Pipeline Optimization Tool developed by Randy Olson. TPOT (your “data-science assistant”) uses genetic programming to find the best machine learning pipeline.

Interlude: Genetic Programming

To use TPOT, it’s not really necessary to know the details of genetic programming, so you can skip this section. For those who are curious, at a high level, genetic programming for machine learning works as follows:

  1. Start with an initial population of randomly generated machine learning pipelines, say 100
  2. Train each pipeline and evaluate its fitness, for example its cross-validation score on the problem
  3. Select the fittest pipelines to “reproduce”, creating new pipelines by combining and randomly mutating the best performers
  4. Repeat for a set number of generations, with the overall population of pipelines improving over time (a toy version of this loop is sketched in the code below)
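To make the idea concrete, here is a deliberately tiny, runnable toy version of the loop (mutation only, no crossover). The “pipeline” here is just a dictionary of choices and the fitness function is made up; this is not TPOT’s implementation, only an illustration of the search process.

import random

# Toy "pipeline": a choice of model family and one hyperparameter.
MODELS = ['linear', 'tree', 'boosting']

def random_pipeline():
    return {'model': random.choice(MODELS), 'depth': random.randint(1, 10)}

def fitness(pipeline):
    # Stand-in for a cross-validation score; we pretend deeper boosting models do best.
    bonus = {'linear': 0.0, 'tree': 0.5, 'boosting': 1.0}[pipeline['model']]
    return bonus + pipeline['depth'] / 10 + random.gauss(0, 0.05)

def mutate(pipeline):
    # Randomly change either the model family or the hyperparameter.
    child = dict(pipeline)
    if random.random() < 0.5:
        child['model'] = random.choice(MODELS)
    else:
        child['depth'] = max(1, min(10, child['depth'] + random.choice([-1, 1])))
    return child

population = [random_pipeline() for _ in range(20)]         # 1. initial population
for generation in range(30):                                # 4. repeat for many generations
    ranked = sorted(population, key=fitness, reverse=True)  # 2. evaluate fitness
    parents = ranked[:5]                                     # 3. keep the fittest...
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]  # ...and breed them

print(sorted(population, key=fitness, reverse=True)[0])      # best "pipeline" found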

(For more details on genetic programming, check out this short article.)

The primary benefit of genetic programming for building machine learning models is exploration. Even a human with no time constraints will not be able to try out all combinations of preprocessing, models, and hyperparameters because of limited knowledge and imagination. Genetic programming does not display an initial bias towards any particular sequence of machine learning steps, and with each generation, new pipelines are evaluated. Furthermore, the fitness function means that the most promising areas of the search space are explored more thoroughly than poorer-performing areas.

Putting it together: Automated Machine Learning on the Cloud

With the background in place, we can now walk through using TPOT in a Google Colab notebook to automatically design a machine learning pipeline. (Follow along with the notebook here).

Our task is a supervised regression problem: given New York City energy data, we want to predict the Energy Star Score of a building. In a previous series of articles (part one, part two, part three, code on GitHub), we built a complete machine learning solution for this problem. Using manual feature engineering, dimensionality reduction, model selection, and hyperparameter tuning, we designed a Gradient Boosting Regressor model that achieved a mean absolute error of 9.06 points (on a scale from 1–100) on the test set.

The data contains several dozen continuous numeric variables (such as energy use and area of the building) and two one-hot encoded categorical variables (borough and building type) for a total of 82 features.

The Energy Star Score is the target for regression. All of the missing values have been encoded as np.nan, and no feature preprocessing has been done to the data.
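As a rough sketch of what reading in the data might look like, assuming a CSV file and a target column named score (the file name, column name, and split parameters are illustrative assumptions, not details from the original notebook):

# Read in the data and separate the features from the target
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('nyc_energy_data.csv')   # hypothetical file name

features = data.drop(columns = 'score')
targets = data['score']

# Split into training and testing sets
training_features, testing_features, training_targets, testing_targets = train_test_split(
    features, targets, test_size = 0.3, random_state = 42)

print(training_features.shape, testing_features.shape)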

To get started, we first need to make sure TPOT is installed in the Google Colab environment. Most data science packages are already installed, but we can add any new ones using system commands (preceded with a ! in Jupyter):

!pip install TPOT
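Once the install completes, a quick, optional sanity check confirms the package can be imported:

# Verify the installation by importing TPOT and printing its version
import tpot
print(tpot.__version__)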

After reading in the data, we would normally fill in the missing values (imputation) and normalize the features to a range (scaling). However, in addition to feature engineering, model selection, and hyperparameter tuning, TPOT will automatically impute the missing values and do feature scaling! So, our next step is to create the TPOT optimizer:

# Import the optimizer class
from tpot import TPOTRegressor

# Create a tpot optimizer with parameters
tpot = TPOTRegressor(scoring = 'neg_mean_absolute_error',
                     max_time_mins = 480,
                     n_jobs = -1,
                     verbosity = 2,
                     cv = 5)

The default parameters for TPOT optimizers will evaluate 100 generations of pipelines, each with a population size of 100, for a total of 10,000 pipelines. Using 10-fold cross validation, this represents 100,000 training runs! Even though we are using Google’s resources, we do not have unlimited time for training. To avoid running out of time on the Colab server (we get a maximum of 12 hours of continuous run time), we will set a limit of 8 hours (480 minutes) for evaluation. TPOT is designed to be run for days, but we can still get good results from a few hours of optimization.

We set the following parameters in the call to the optimizer:

  • scoring = 'neg_mean_absolute_error': Our regression performance metric
  • max_time_mins = 480: Limit evaluation to 8 hours
  • n_jobs = -1: Use all available cores on the machine
  • verbosity = 2: Show a limited amount of information while training
  • cv = 5: Use 5-fold cross validation (default is 10)

There are other parameters that control details of the genetic programming method, but leaving them at the default works well for most cases. (If you want to play around with the parameters, check out the documentation.)
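For example, if we did want to set the genetic-programming parameters ourselves rather than relying on a time limit, the optimizer could be created with explicit values. The numbers below are illustrative (several are simply the defaults), not tuned recommendations; in this article we stick with the time-limited optimizer created above.

# Explicitly setting the genetic programming parameters (illustrative values)
from tpot import TPOTRegressor

tpot_custom = TPOTRegressor(generations = 100,       # iterations of the optimization loop
                            population_size = 100,   # pipelines kept in each generation
                            offspring_size = 100,    # new pipelines created each generation
                            mutation_rate = 0.9,     # chance a pipeline is randomly altered
                            crossover_rate = 0.1,    # chance two pipelines are "bred" together
                            scoring = 'neg_mean_absolute_error',
                            cv = 5,
                            n_jobs = -1,
                            verbosity = 2)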

The syntax for TPOT optimizers is designed to be identical to that for Scikit-Learn models, so we can train the optimizer using the .fit method.

# Fit the tpot optimizer on the training data
tpot.fit(training_features, training_targets)

During training, we get some information displayed, including the best pipeline score after each generation.

Due to the time limit, our model was only able to get through 15 generations. With a population size of 100, this still represents 1,500 different individual pipelines that were evaluated, quite a few more than we could have tried by hand!

Once the model has trained, we can see the optimal pipeline using tpot.fitted_pipeline_. We can also save the model to a Python script:

# Export the pipeline as a python script file
tpot.export('tpot_exported_pipeline.py')

Since we are in a Google Colab notebook, to get the pipeline onto a local machine from the server, we have to use the Google Colab library:

# Import file management
from google.colab import files

# Download the pipeline for local use
files.download('tpot_exported_pipeline.py')

We can then open the file (available here) and look at the completed pipeline:

# Preprocessing steps (imports from the exported script are omitted here for brevity)
imputer = Imputer(strategy="median")
imputer.fit(training_features)
training_features = imputer.transform(training_features)
testing_features = imputer.transform(testing_features)

# Final pipeline from TPOT
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LassoLarsCV(normalize=True)),
    GradientBoostingRegressor(alpha=0.95, learning_rate=0.1, loss="lad", 
                              max_depth=7, max_features=0.75, 
                              min_samples_leaf=3, min_samples_split=18, 
                              n_estimators=100, subsample=0.60)
)

We see that the optimizer imputed the missing values for us and built a complete model pipeline! The final estimator is a stacked model, meaning that it uses two machine learning algorithms ([LassoLarsCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsCV.html) and [GradientBoostingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)), the second of which is trained on the predictions of the first. (If you run the notebook again, you may get a different model because the optimization process is stochastic.) This is a complex method that I probably would not have been able to develop on my own!
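For completeness, the exported script also fits this pipeline and generates predictions. A sketch of that usage (variable names assume the same training/testing split as above; not copied verbatim from the exported file):

# Fit the exported pipeline on the training data and predict on the test set
exported_pipeline.fit(training_features, training_targets)
results = exported_pipeline.predict(testing_features)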

Now, the moment of truth: performance on the testing set. To find the mean absolute error, we can use the .score method:

# Evaluate the final model
print(tpot.score(testing_features, testing_targets))

8.642
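As a cross-check (not in the original notebook), we could also compute the mean absolute error directly from the optimizer’s predictions with Scikit-Learn:

# Compute the mean absolute error directly from predictions
from sklearn.metrics import mean_absolute_error

predictions = tpot.predict(testing_features)
print(mean_absolute_error(testing_targets, predictions))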

In the series of articles where we developed a solution manually, after many hours of development, we built a Gradient Boosting Regressor model that achieved a mean absolute error of 9.06. Automated machine learning improved on that performance (8.642 versus 9.06 mean absolute error) with a drastic reduction in development time.

From here, we can use the optimized pipeline and try to further refine the solution, or we can move on to other important phases of the data science pipeline. If we use this as our final model, we could try to interpret the model (such as by using LIME: Local Interpretable Model-Agnostic Explanations) or write a well-documented report.

Conclusions

In this post, we got a brief introduction to both the capabilities of the cloud and automated machine learning. With only a Google account and an internet connection, we can use Google Colab to develop, run, and share machine learning or data science workloads. Using TPOT, we can automatically develop an optimized machine learning pipeline with feature preprocessing, model selection, and hyperparameter tuning. Moreover, we saw that auto-ml will not replace the data scientist, but it will allow her to spend more time on higher-value parts of the workflow.

While being an early adopter does not always pay off, in this case, TPOT is mature enough to be easy to use and relatively issue-free, yet also new enough that learning it will put you ahead of the curve. With that in mind, find a machine learning problem (perhaps through Kaggle) and try to solve it! Running automated machine learning in a notebook on Google Colab feels like the future, and with such a low barrier to entry, there’s never been a better time to get started!

#machine-learning #python #data-science
