1594583160

When it comes to Machine Learning (or even life, for that matter), there is no free lunch. As Data Scientists, we must test all plausible algorithms on the data at hand to identify the champion algorithm. Picking the right algorithm is not enough, though: we must also choose the right configuration of the algorithm for a dataset by tuning its hyper-parameters. Data scientists competing in Kaggle competitions often come up with winning solutions using ensembles of advanced machine learning algorithms. One particular model that is typically part of such ensembles is Gradient Boosting Machines (GBMs). Gradient boosting is a machine learning method for regression and classification problems which employs an “ensemble” of weak prediction models (typically decision trees) to produce a powerful “committee.”

Sometimes even your all-time favorite algorithm falls short. Estimating the uncertainty in the predictions of a machine learning model is crucial for production deployments in real-world areas such as healthcare, actuarial science, weather forecasting, or engineering. Not only do we want our models to make accurate predictions, but we also want a correct estimate of uncertainty along with each prediction. When model predictions are part of an automated decision-making workflow or production line, predictive uncertainty estimates are important for determining manual fallback alternatives or for human inspection and intervention.

Probabilistic prediction (or probabilistic forecasting), the approach where the model outputs a full probability distribution over the entire outcome space, is a natural way to quantify those uncertainties. The Stanford ML Group recently published a new algorithm, called NGBoost, in their paper.

#python #artificial-intelligence #machine-learning #deep-learning

1596962880

The reign of the Gradient Boosters was almost complete in the land of tabular data. In most real-world applications as well as competitions, there was hardly a solution that did not include at least one model from the gradient boosting family. But as the machine learning community matured and machine learning applications came into wider use, the need for uncertainty estimates became important. For classification, the output from Gradient Boosting was already in a form that lets you understand the confidence of the model in its prediction. But for regression problems, this wasn’t the case. The model spat out a number and told us this was its prediction. How do you get uncertainty estimates from a point prediction? And this problem was not unique to Gradient Boosting algorithms; it applied to almost all the major ML algorithms. This is the problem that the new kid on the block — NGBoost — seeks to tackle.

If you’ve not read the previous parts of the series, I strongly advise you to read up, at least the first one where I talk about the Gradient Boosting algorithm, because I am going to take it as a given that you already know what Gradient Boosting is. I would also strongly suggest reading Part VI(A) so that you have a better understanding of what Natural Gradients are.

The key innovation in NGBoost is the use of Natural Gradients instead of regular gradients in the boosting algorithm. And by adopting this probabilistic route, it models a full probability distribution over the outcome space, conditioned on the covariates.

The paper modularizes its approach into three components:

- Base Learner
- Parametric Distribution
- Scoring Rule

As in any boosting technique, there are base learners which are combined together to form the complete model. NGBoost doesn’t make any assumptions here and states that the base learners can be any simple model. The implementation supports Decision Trees and Ridge Regression as base learners out of the box, but you can replace them with any other scikit-learn-style model just as easily.

Here, we are not training a model to predict the outcome as a point estimate; instead, we are predicting a full probability distribution. And every distribution is parametrized by a few parameters. For example, the normal distribution is parametrized by its mean and standard deviation; you don’t need anything else to define it. So, if we train the model to predict these parameters instead of the point estimate, we will have a full probability distribution as the prediction.
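The point that two numbers fully determine a Normal distribution can be shown with `scipy` (the specific mean and standard deviation below are just example values):

```python
from scipy.stats import norm

# A Normal distribution is fully specified by two parameters: mean and std.
mu, sigma = 100.0, 15.0
dist = norm(loc=mu, scale=sigma)

# With just those two numbers we can answer any probabilistic question:
print(dist.pdf(100.0))                    # density at the mean
print(dist.cdf(115.0) - dist.cdf(85.0))   # P(85 < Y < 115), about 0.68
print(dist.interval(0.95))                # central 95% prediction interval
```

So a model that outputs `(mu, sigma)` per sample implicitly outputs prediction intervals, tail probabilities, and densities — everything a point estimate cannot provide.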

Any machine learning system works toward a learning objective, and more often than not, that is the task of minimizing some loss. In point prediction, the predictions are compared with the data using a loss function. The scoring rule is the analogue from the probabilistic regression world: it compares the estimated probability distribution with the observed data.

A proper scoring rule, *S*, takes as input a forecasted probability distribution, *P*, and one observation *y* (the outcome), and assigns a score *S(P, y)* to the forecast such that the true distribution of the outcomes gets the best score in expectation.

The most commonly used proper scoring rule is the logarithmic score *L*, which, when minimized, yields the MLE

which is nothing but the log likelihood that we have seen in so many places. And the scoring rule is parametrized by θ because that is what we are predicting as part of the machine learning model.
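The connection between the logarithmic score and maximum likelihood can be checked numerically. A minimal sketch, assuming a Normal forecast distribution and a small made-up sample:

```python
import numpy as np
from scipy.stats import norm

# Logarithmic score for a Normal forecast: S(P_theta, y) = -log p_theta(y).
# Minimizing its sum over the observations is exactly maximum likelihood.
y = np.array([2.1, 1.9, 2.4, 2.0])

def log_score(mu, sigma, y):
    return -norm(loc=mu, scale=sigma).logpdf(y).sum()

# The sample mean and std (the MLE for a Normal) give a lower total score
# than an arbitrary parameter choice:
print(log_score(y.mean(), y.std(), y))   # score at the MLE parameters
print(log_score(0.0, 1.0, y))            # higher (worse) score
```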

Another example is CRPS (Continuous Ranked Probability Score). While the logarithmic score or the log likelihood generalizes Mean Squared Error to a probabilistic space, CRPS does the same for Mean Absolute Error.
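For a Normal forecast, CRPS has a well-known closed form, which makes the MAE analogy easy to see in code (the function below is a standard textbook formula, not from the NGBoost implementation):

```python
import numpy as np
from scipy.stats import norm

def crps_normal(mu, sigma, y):
    """Closed-form CRPS for a Normal(mu, sigma) forecast at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1)
                    + 2 * norm.pdf(z)
                    - 1 / np.sqrt(np.pi))

# As sigma -> 0 the forecast collapses to a point, and CRPS approaches the
# absolute error |y - mu| — which is why CRPS generalizes MAE:
print(crps_normal(0.0, 1e-6, 3.0))   # close to 3.0, the absolute error
print(crps_normal(0.0, 1.0, 0.0))    # small score for a well-centred forecast
```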

In the last part of the series, we saw what the Natural Gradient is. In that discussion, we talked about KL Divergences, because traditionally Natural Gradients were defined on the MLE scoring rule. But the paper proposes a generalization of the concept and provides a way to extend it to the CRPS scoring rule as well: it generalizes the KL Divergence to a general divergence and provides the derivations for the CRPS scoring rule.

#machine-learning #ngboost #gradient-boosting #the-gradient-boosters #deep-learning

1593766336

The Boosting Algorithm is one of the most powerful learning ideas introduced in the last twenty years. Gradient Boosting is a supervised machine learning algorithm used for classification and regression problems. It is an ensemble technique which combines multiple weak learners to produce a strong model.

Gradient Boosting relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. The key idea is to set the target outcomes for the next model based on the errors of the previous models, so that adding it reduces those errors. Gradient Boosting is one of several boosting algorithms (others include AdaBoost, XGBoost, etc.).

**Input requirement for Gradient Boosting:**

- A Loss Function to optimize.
- A weak learner to make prediction(Generally Decision tree).
- An additive model to add weak learners to minimize the loss function.

The loss function tells us how well the algorithm models the data set. In simple terms, it measures the difference between the actual values and the predicted values.
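The three ingredients above can be sketched in a few lines. This is a minimal from-scratch version for squared-error loss, not a production implementation; the function names and hyper-parameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=50, lr=0.1, depth=2):
    """Minimal gradient boosting for squared-error loss."""
    pred = np.full(len(y), y.mean())   # additive model starts at the mean
    trees = []
    for _ in range(n_trees):
        residuals = y - pred           # negative gradient of 0.5*(y - pred)^2
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)  # weak learner
        pred += lr * tree.predict(X)   # add the weak learner, shrunk by lr
        trees.append(tree)
    return y.mean(), trees

def predict(base, trees, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)

# Tiny demo on synthetic data: boosting drives the training error down.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)
base, trees = gradient_boost(X, y)
print(np.mean((y - predict(base, trees, X)) ** 2))
```

Each pass fits a weak learner to the residuals, i.e. to the negative gradient of the loss, which is exactly the "gradient descent in function space" view of the algorithm.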

**Regression Loss functions:**

- L1 loss or Mean Absolute Errors (MAE)
- L2 Loss or Mean Square Error(MSE)
- Quadratic Loss

**Binary Classification Loss Functions:**

- Binary Cross Entropy Loss
- Hinge Loss

A gradient descent procedure is used to minimize the loss when adding trees.

Weak learners are models used sequentially, each reducing the error left by the previous ones, to return a strong model in the end.

Decision trees are used as weak learner in gradient boosting algorithm.

In gradient boosting, decision trees are added one at a time (in sequence), and existing trees in the model are not changed.

This is our data set. Here Age, Sqft., and Location are the independent variables and Price is the dependent (target) variable.

**Step 1**: Calculate the average/mean of the target variable.

**Step 2**: Calculate the residuals for each sample.

**Step 3**: Construct a decision tree. We build a tree with the goal of predicting the residuals.

If there are more residuals than leaf nodes (here there are 6 residuals), some residuals will end up in the same leaf. When this happens, we compute their average and place that value in the leaf.

After this, the tree looks like this.

**Step 4**: Predict the target label using all the trees within the ensemble.

Each sample passes through the decision nodes of the newly formed tree until it reaches a given leaf. The residual value in that leaf is used to predict the house price.

**Calculation above for the residual values (-338) and (-208) from Step 2**

In the same way, we calculate the **Predicted Price** for the other values.

**Note:** We have initially taken 0.1 as learning rate.

**Step 5**: Compute the new residuals

**when the Price is 350 and 480, respectively.**
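The five steps above can be run end to end in code. The original table is only shown as an image, so the feature values below are hypothetical; the prices are chosen to sum so that the mean is 688, which reproduces the residuals −338 and −208 for the 350 and 480 rows mentioned in the text. The learning rate of 0.1 matches the note above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 6-row house-price table (Age, Sqft); prices chosen so the
# mean is 688, matching the residuals -338 and -208 from the walkthrough.
X = np.array([[25, 900], [30, 1100], [35, 1500],
              [40, 1800], [45, 2200], [50, 2500]])
price = np.array([350, 480, 620, 700, 858, 1120], dtype=float)

# Step 1: initial prediction = mean of the target
pred = np.full(6, price.mean())          # 688 for every row

# Step 2: residuals
residuals = price - pred
print(residuals)                         # [-338. -208.  -68.   12.  170.  432.]

# Step 3: small tree fit to the residuals; with fewer leaves than samples,
# residuals sharing a leaf are replaced by their average.
tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, residuals)

# Step 4: update the predictions, shrunk by the 0.1 learning rate
pred = pred + 0.1 * tree.predict(X)

# Step 5: the new residuals are smaller (in squared sum) than the old ones
new_residuals = price - pred
print(np.sum(new_residuals ** 2) < np.sum(residuals ** 2))
```

Repeating Steps 2–5 adds one tree per round, each chipping away at what the previous trees could not explain.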

#gradient-boosting #data-science #boosting #algorithms

1599348960

Boosting is a very popular ensemble technique in which we combine many weak learners to transform them into a strong learner. Boosting is a sequential operation in which we build weak learners in series, each dependent on the previous one in a progressive manner, i.e., weak learner *m* depends on the output of weak learner *m−1*. The weak learners used in boosting have high bias and low variance. In a nutshell, boosting can be expressed as: boosting = weak learners + additive combining.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees (Wikipedia definition).

#data-science #gradient-boosting #ensemble-learning #boosting #machine-learning

1596003660

In order to understand the Gradient Boosting Algorithm, an effort has been made to implement it from first principles, using **PyTorch** to perform the necessary optimizations (minimizing the loss function) and to calculate the residuals (partial derivatives of the loss with respect to the predictions), and the **decision tree regressor from sklearn** to create the regression decision trees.

A line of trees

The outline of the algorithm can be seen in the following figure:

Steps of the algorithm

We will try to implement this step by step and also try to understand why the steps the algorithm takes make sense along the way.

What is crucial is that the algorithm tries to minimize a loss function, be it square distance in the case of regression or binary cross entropy in the case of classification.

The loss function’s parameters are the predictions we make for the training examples: **L(prediction_for_train_point_1, prediction_for_train_point_2, …, prediction_for_train_point_m)**.
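The post's central trick — letting autograd compute the residuals as negative partial derivatives of the loss with respect to the predictions — can be sketched as follows. The data and tree depth are illustrative assumptions, not the article's exact setup:

```python
import numpy as np
import torch
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data (placeholder for the article's dataset).
X = np.random.RandomState(0).uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0])

# Current ensemble prediction, one entry per training point, as a leaf tensor
# so autograd can differentiate the loss with respect to it.
pred = torch.full((100,), float(y.mean()), dtype=torch.float64,
                  requires_grad=True)

loss = 0.5 * ((torch.from_numpy(y) - pred) ** 2).sum()   # squared-error loss
loss.backward()

# Residuals = negative partial derivatives of the loss w.r.t. the predictions.
# For squared error this recovers exactly y - pred.
residuals = -pred.grad.numpy()

# Fit the next weak learner to the residuals and take a small additive step.
tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
new_pred = pred.detach().numpy() + 0.1 * tree.predict(X)
```

The appeal of the autograd route is that swapping in a different differentiable loss requires changing only the `loss` line; the residual computation stays identical.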

#pytorch #gradient-boosting #machine-learning #deep-learning

1598604000

Different regression models, i.e. Linear Regression, Decision Tree Regression, Gradient Boosted Regression, and Random Forest Regression, were used. Their performance was compared using the R² score. Based on these performance scores, the better-performing model was suggested for predicting house prices.

First, the data was divided into the independent variables X and the dependent variable y. The independent variables X were used to predict the target variable y. The price, id, and date columns were dropped from the new_df dataframe to create the variable X, and the price column from new_df was used to create the variable y. Different metrics were used to evaluate the performance of the regression models, such as Mean squared error, Root mean squared error, R-squared score, Mean absolute deviation, and Mean absolute percent error; Root mean squared error and the R-squared score were the primary evaluation metrics. In order to save the metrics of the models, a data frame named metrics was created. Next, the data was split into a training and a testing set: 80% of the randomly selected data was kept as the training set and 20% as the testing set. The model was learned using the 80% of the data, and the remaining 20% testing data was used as an unseen future dataset to predict the house price.

The Linear Regression model was built using the default parameters and fitted on the training dataset. The X_test data was used to make predictions with the model. Then, the Mean squared error (MSE), Root mean squared error (RMSE), R-squared score (r2_score), Mean absolute deviation (MAD), and Mean absolute percent error (MAPE) were calculated.
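A sketch of this fit-and-evaluate loop is below. The article's house-price dataframe (new_df) isn't available here, so a bundled sklearn dataset stands in; the 80/20 random split mirrors the setup described:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in data; the article's new_df dataframe is not reproduced here.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 80/20 random split

model = LinearRegression().fit(X_train, y_train)   # default parameters
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100  # percent error
print(rmse, r2)
```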

The backward elimination method of feature selection was used. Feature selection is the process of selecting a subset of relevant features that may improve the performance of the model. First, the worst attribute was removed from the feature set: the date_sold_month column, because it has a very weak correlation with the price of the house. Then, year_built_decade_mapped was removed. Finally, a univariate feature selection class called *SelectKBest* from the sklearn library was tried. Below are the correlation coefficients for the different features.
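A minimal sketch of the *SelectKBest* step mentioned above — again on a bundled dataset, since the original dataframe isn't available, and with an arbitrary choice of k=5:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Univariate feature selection: score each feature against the target
# independently, then keep the k best-scoring ones.
X, y = load_diabetes(return_X_y=True)          # 442 samples, 10 features
selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)

X_new = selector.transform(X)                  # only the 5 selected columns
print(X_new.shape)
print(selector.get_support())                  # boolean mask over columns
```

Unlike backward elimination, which considers features jointly by repeatedly refitting the model, `SelectKBest` scores each feature on its own, so the two methods can disagree on which features to keep.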

#gradient-boosting #machine-learning-models #random-forest-regressor #decision-tree-regressor #natural-hazards