Supercharging Hyperparameter Tuning with Dask

Dask improves scikit-learn parameter search speed by over 100x, and Spark by over 40x

Photo by Spencer Davis on Unsplash

Disclaimer: I’m a Senior Data Scientist at Saturn Cloud — we make enterprise data science fast and easy with Python and Dask.


Hyperparameter tuning is a crucial, and often painful, part of building machine learning models. Squeezing out each bit of performance from your model can mean the difference of millions of dollars in ad revenue, or life and death for patients in healthcare models. Even if your model takes only one minute to train, you can end up waiting hours for a grid search to complete (think a 10x10 grid with cross-validation; the quick arithmetic after the list below makes this concrete). Every time you wait for a search to finish, you break an iteration cycle and lengthen the time it takes to produce value with your model. In short:

  • Faster runtime means more iterations to improve accuracy before your deadline
  • Faster runtime means quicker delivery so you can tackle another project
  • Both bullet points mean driving value to the bottom line of your organization
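
To put a number on that waiting time, here is the back-of-the-envelope arithmetic; the 10x10 grid and the one-minute model come from the paragraph above, while the 5-fold cross-validation is an assumption:

    # Sequential cost of an exhaustive grid search
    grid_points = 10 * 10      # a 10x10 grid: 10 values for each of 2 hyperparameters
    cv_folds = 5               # assumed 5-fold cross-validation
    minutes_per_fit = 1        # the one-minute model described above

    total_minutes = grid_points * cv_folds * minutes_per_fit
    print(total_minutes)                 # 500 minutes
    print(round(total_minutes / 60, 1))  # roughly 8.3 hours of sequential training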

In this post, we will show how you can improve the speed of your hyperparameter search by over 100x by replacing a few lines of your scikit-learn pipeline with Dask code on Saturn Cloud. This turns a traditionally overnight parameter search into a matter of seconds. We also try a comparable grid search with Apache Spark, which requires significantly more code changes while still being much slower than Dask.
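
The post’s full code is not reproduced here, but a minimal sketch of one common pattern, pointing scikit-learn’s joblib machinery at a Dask cluster, looks like this. The local Client, the toy dataset, and the parameter grid are all assumptions; on Saturn Cloud the Client would instead connect to a multi-node cluster:

    from dask.distributed import Client
    from joblib import parallel_backend
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    client = Client()  # connect to a Dask cluster; local worker processes by default

    X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)

    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_depth": [5, 10, None],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=5, n_jobs=-1)

    # The only change from a plain scikit-learn script: run the fit inside
    # a Dask-backed joblib context so the candidate models train on the cluster.
    with parallel_backend("dask"):
        search.fit(X, y)

    print(search.best_params_)

The dask-ml library also provides drop-in replacements for GridSearchCV and RandomizedSearchCV if you would rather let Dask drive the search itself.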

First, what is Dask?

Dask is a flexible and robust parallel computing framework built in, and for, Python. It works with common data structures such as Arrays and DataFrames, but it can also parallelize complex operations that do not fit nicely into those. In fact, the parallel Arrays and DataFrames are collections of familiar numpy and pandas objects with matching APIs, so data scientists do not need to learn entirely new frameworks to execute their code on big data.
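
As a minimal sketch of those matching APIs, a Dask array accepts the same calls you would make on a numpy array, with an explicit .compute() to trigger the parallel work:

    import dask.array as da

    # A 100,000 x 100 array stored as ten 10,000-row numpy chunks
    x = da.random.random((100_000, 100), chunks=(10_000, 100))

    # Same API as numpy; nothing executes until .compute() is called
    column_means = x.mean(axis=0)
    print(column_means.compute())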

#dask #apache-spark #data-science #hyperparameter-tuning #scikit-learn #apache

Jerad Bailey

Hyperparameters Tuning Using GridSearchCV And RandomizedSearchCV

While building a machine learning model, we always define two things: the model parameters and the model hyperparameters of the predictive algorithm. Model parameters are internal to the model, and their values are computed automatically from the data, like the support vectors in a support vector machine. Hyperparameters, on the other hand, are the ones the programmer can manipulate to improve the performance of the model, like the learning rate of a deep learning model. They govern how the algorithm learns and are specified up front, before training begins.
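
The distinction is easy to see in code. In the minimal sketch below (scikit-learn’s SVC on a synthetic dataset, both chosen here purely for illustration), C and kernel are hyperparameters set by the programmer, while the support vectors are model parameters learned from the data:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Hyperparameters: chosen by the programmer before fitting
    model = SVC(C=1.0, kernel="rbf")
    model.fit(X, y)

    # Model parameters: learned automatically from the data during fitting
    print(model.support_vectors_.shape)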

In this article, we will explore hyperparameter tuning. We will see what the different parts of hyperparameter tuning are and how it is done using two different approaches, GridSearchCV and RandomizedSearchCV. For this experiment, we will use the Boston Housing dataset, which can be downloaded from Kaggle. We will first build the model using default parameters, then build the same model using a hyperparameter tuning approach, and finally compare the performance of the two models.

What We Will Learn From This Article?

  1. What is Hyperparameter Tuning?
  2. What steps should we follow to do Hyperparameter Tuning?
  3. Implementation of Regression Model
  4. Implementation of Model using GridSearchCV
  5. Implementation of Model using RandomizedSearchCV
  6. Comparison of Different Models

1. What Is Hyperparameter Tuning?

Hyperparameter tuning is the process of choosing the set of hyperparameters we pass to a machine learning model when we build it. These parameters are defined by us and can be changed at the programmer’s discretion; the learning algorithm never learns them from the data. We tune them so that the model performs as well as possible, that is, so that performance is highest and the error rate is lowest. A typical set of hyperparameters for a random forest classifier is defined as shown below; different combinations are tried and the results are checked.
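
The article’s original grid is not reproduced here; the sketch below is a representative example of such a grid together with the two search strategies, with all of the specific values being assumptions:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)

    # A representative hyperparameter grid for a random forest classifier
    param_grid = {
        "n_estimators": [50, 100, 200],
        "max_depth": [4, 8, None],
        "min_samples_split": [2, 5, 10],
    }

    # GridSearchCV tries every combination exhaustively (3 x 3 x 3 = 27 candidates)
    grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                               param_grid, cv=5, scoring="accuracy")
    grid_search.fit(X, y)
    print(grid_search.best_params_, grid_search.best_score_)

    # RandomizedSearchCV samples a fixed number of combinations from the same space
    random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                       param_grid, n_iter=10, cv=5,
                                       scoring="accuracy", random_state=42)
    random_search.fit(X, y)
    print(random_search.best_params_, random_search.best_score_)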

#developers corner #hyperparameter tuning #hyperparameters #machine learning #parameter tuning

Paula Hall

Making Pandas fast with Dask parallel computing

So you, my dear Python enthusiast, have been learning Pandas and Matplotlib for a while and have written some super cool code to analyze and visualize your data. You are ready to run the script that reads a huge file, and all of a sudden your laptop starts making an ugly noise and burning like hell. Sounds familiar?

Well, I have some good news for you: this issue doesn’t need to happen anymore, and no, you don’t need to upgrade your laptop or your server.

Introducing Dask:

Dask is a flexible library for parallel computing with Python. It provides multi-core and distributed parallel execution on larger-than-memory datasets. It figures out how to break up large computations and route parts of them efficiently onto distributed hardware.

A massive cluster is not always the right choice

Today’s laptops and workstations are surprisingly powerful and, if used correctly, can handle datasets and computations for which we previously depended on clusters. A modern laptop has a multi-core CPU, 32GB of RAM, and flash-based storage that can stream through data several times faster than the HDDs or SSDs of even a year or two ago.

As a result, Dask can empower analysts to manipulate 100GB+ datasets on their laptop or 1TB+ datasets on a workstation without bothering with the cluster at all.

The project has been a massive plus for the Python machine learning ecosystem because it democratizes big data analysis. Not only can it save you money on bigger servers, it also copies the Pandas API, so you can run your Pandas script after changing very few lines of code.
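
A minimal sketch of that “very few lines” claim, assuming a folder of CSV files with category and amount columns (the file glob and the column names are placeholders):

    # Before: pandas loads everything into memory at once
    # import pandas as pd
    # df = pd.read_csv("transactions.csv")
    # result = df.groupby("category")["amount"].mean()

    # After: Dask keeps the same API but builds a lazy, parallel task graph
    import dask.dataframe as dd

    df = dd.read_csv("transactions-*.csv")            # placeholder glob of CSV files
    result = df.groupby("category")["amount"].mean()  # no work happens yet
    print(result.compute())                           # executes across all cores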

#making pandas fast with dask parallel computing #dask parallel computing #pandas #pandas fast #dask #dask parallel

Zakary Goyette

Hyperparameter tuning with Keras and Ray Tune

In my previous article, I explained how to build a small and nimble image classifier and what the advantages are of having variable input dimensions in a convolutional neural network. However, after going through the model-building code and training routine, one can ask questions such as:

  1. How to choose the number of layers in a neural network?
  2. How to choose the optimal number of units/filters in each layer?
  3. What would be the best data augmentation strategy for my dataset?
  4. What batch size and learning rate would be appropriate?

Building or training a neural network involves figuring out the answers to the above questions. You may have an intuition for CNNs, for example, that as we go deeper the number of filters in each layer should increase, since the neural network learns to extract more and more complex features built on the simpler features extracted in the earlier layers. However, there might be a more optimal model (for your dataset) with fewer parameters that outperforms the model you designed based on your intuition.

In this article, I’ll explain what these parameters are and how they affect the training of a machine learning model. I’ll explain how machine learning engineers choose these parameters and how we can automate this process using a simple mathematical concept. I’ll start with the same model architecture from my previous article and modify it to make most of the training and architectural parameters tunable.
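
The article’s own Keras and Ray Tune code is not shown here. As a minimal sketch of the pattern it describes, a Ray Tune trainable reads its hyperparameters from a config dict and reports a score back to the search; the toy objective below stands in for a real Keras training loop, and the search space values are assumptions:

    from ray import tune

    def objective(config):
        # Stand-in for building and training a Keras model with these settings;
        # this toy score simply peaks near lr=0.01 and 32 filters.
        score = -((config["lr"] - 0.01) ** 2) - ((config["filters"] - 32) ** 2) / 1000
        return {"score": score}  # reported to Tune as the trial's result

    analysis = tune.run(
        objective,
        config={
            "lr": tune.loguniform(1e-4, 1e-1),     # sampled on a log scale
            "filters": tune.choice([16, 32, 64]),  # categorical choice
        },
        num_samples=20,                            # number of trials
    )

    print(analysis.get_best_config(metric="score", mode="max"))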

#data-science #machine-learning #deep-learning #hyperparameter-tuning #bayesian-optimization

Rusty Shanahan

Fine Tuning XGBoost model

Tuning a model is the way to supercharge it and increase its performance. Let us look at an example that compares an untuned XGBoost model with a tuned XGBoost model based on their RMSE scores. Later, you will learn what the individual hyperparameters in XGBoost mean.

Below is the code example for an XGBoost model with untuned parameters:

    # Importing necessary libraries
    import pandas as pd
    import numpy as np
    import xgboost as xgb

    # Load the data
    house = pd.read_csv("ames_housing_trimmed_pricessed.csv")
    X, y = house.iloc[:, :-1], house.iloc[:, -1]

    # Convert it into a DMatrix, XGBoost's internal data structure
    house_dmatrix = xgb.DMatrix(data=X, label=y)

    # Parameter configuration (newer XGBoost versions name "reg:linear" "reg:squarederror")
    param_untuned = {"objective": "reg:linear"}

    # 4-fold cross-validation with RMSE as the evaluation metric
    cv_untuned_rmse = xgb.cv(dtrain=house_dmatrix, params=param_untuned, nfold=4,
                             metrics="rmse", as_pandas=True, seed=123)
    print("RMSE Untuned: %f" % cv_untuned_rmse["test-rmse-mean"].iloc[-1])

Output: 34624.229980

Now let us look at the RMSE value when the parameters are tuned to some extent:

    # Importing necessary libraries
    import pandas as pd
    import numpy as np
    import xgboost as xgb

    # Load the data
    house = pd.read_csv("ames_housing_trimmed_pricessed.csv")
    X, y = house.iloc[:, :-1], house.iloc[:, -1]

    # Convert it into a DMatrix
    house_dmatrix = xgb.DMatrix(data=X, label=y)

    # Parameter configuration: tuned tree depth, learning rate and column subsampling
    param_tuned = {"objective": "reg:linear", "colsample_bytree": 0.3,
                   "learning_rate": 0.1, "max_depth": 5}

    # Same 4-fold cross-validation, now run for 200 boosting rounds
    cv_tuned_rmse = xgb.cv(dtrain=house_dmatrix, params=param_tuned, nfold=4,
                           num_boost_round=200, metrics="rmse", as_pandas=True, seed=123)
    print("RMSE Tuned: %f" % cv_tuned_rmse["test-rmse-mean"].iloc[-1])

Output: 29812.683594

It can be seen that there is a reduction of around 14% in the RMSE score once the parameters are tuned.
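
For reference, the relative improvement can be checked directly from the two cross-validated scores above:

    untuned_rmse = 34624.229980
    tuned_rmse = 29812.683594

    reduction = (untuned_rmse - tuned_rmse) / untuned_rmse
    print(f"{reduction:.1%}")  # 13.9%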

#machine-learning #hyperparameter #artificial-intelligence #hyperparameter-tuning #xgboost #deep learning